## **Species Prediction model demo**

This demo showcases the Species Prediction models and how they work, referencing the original scripts that define each function.

The demo uses static query results from the original database, which has the following [schema](../data/schema.md).

We start by importing the relevant libraries.

In [1]:
import pandas as pd
import json
import sys
import warnings
warnings.filterwarnings("ignore")

sys.path.append("..")
from scripts.process import *
from scripts.predict import *

In [2]:
field = pd.read_csv('../data/new_field.csv')

with open('../data/det.json', 'r') as file:
    det_table = json.load(file)



The `det_table` loaded above contains the deterministic features provided by the biologists.

In [3]:
det = pd.DataFrame(det_table)
det.head()

| | species | eye_size | snout_shape | parr_marks | parr_marks_length | spotting_density | fin_type | parr_marks_spacing | spotting_characteristic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ck | large | pointy | slightly faded | long | medium | anal fin | wider than interspaces | circle |
| 1 | co | large | short and blunt | slightly faded | long | medium | anal fin | narrower than interspaces | circle |
| 2 | cm | medium | | faded | short | medium | caudal fin | | variable |
| 3 | pink | medium | | | | | caudal fin | half | |
| 4 | so | very large | | slightly faded | irregular | | caudal fin | variable | row |


The `field` table, loaded earlier from `new_field.csv`, comes from the following query:

```sql
SELECT watershed, 
       river,
       site, 
       method, 
       local, 
       water_temp_start, 
       fork_length_mm, 
       species
FROM field
WHERE species IN ('ck', 'co', 'cm', 'so', 'stl', 'ct', 'rbt')
```
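
To try the query without access to the original database, the sketch below runs it against an in-memory SQLite table. The table contents are made-up rows (the `site a` row in particular is hypothetical), and SQLite is only a stand-in: the source does not say which database engine the project uses.

```python
import sqlite3

# In-memory stand-in for the original database (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE field (
        watershed TEXT, river TEXT, site TEXT, method TEXT, local TEXT,
        water_temp_start REAL, fork_length_mm REAL, species TEXT
    )
    """
)

# Toy rows: two in-scope species and one ('pink') outside the IN list.
rows = [
    ("nanaimo", "nanaimo", "jack point", "beach seine", "marine", 10.6, 80.0, "ck"),
    ("nanaimo", "nanaimo", "jack point", "beach seine", "marine", 13.3, None, "ck"),
    ("cowichan", "cowichan", "site a", "beach seine", "marine", 12.0, 70.0, "pink"),
]
conn.executemany("INSERT INTO field VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

query = """
SELECT watershed, river, site, method, local,
       water_temp_start, fork_length_mm, species
FROM field
WHERE species IN ('ck', 'co', 'cm', 'so', 'stl', 'ct', 'rbt')
"""
result = conn.execute(query).fetchall()
species_returned = sorted({row[-1] for row in result})
print(len(result), species_returned)
```

Only the rows whose species is in the `IN` list come back, so the toy `pink` row is filtered out.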

In [4]:
field.head()

| | tag_id_long | watershed | river | site | method | local | water_temp_start | fork_length_mm | species |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 989.001038884511 | nanaimo | nanaimo | jack point | beach seine | marine | 10.6 | 80.0 | ck |
| 1 | 989.001038885629 | nanaimo | nanaimo | jack point | beach seine | marine | 13.3 | | ck |
| 2 | 989.001038888882 | nanaimo | nanaimo | jack point | beach seine | marine | 14.1 | 76.0 | ck |
| 3 | 989.001038889013 | nanaimo | nanaimo | jack point | beach seine | marine | 10.6 | 76.0 | ck |
| 4 | 989.001038888642 | nanaimo | nanaimo | jack point | beach seine | marine | 10.6 | 85.0 | ck |


The following cell calls the `processing` function from `process.py`. This Python file can be found [here](../scripts/process.py).

In [17]:
processed_data = processing(data=field, det_data=det_table)

In [18]:
processed_data.head()

| | tag_id_long | water_temp_start | fork_length_mm | species | watershed_black creek | watershed_chemainus | watershed_cowichan | watershed_englishman | watershed_koksilah | watershed_nanaimo | ... | parr_marks_spacing_NA | parr_marks_spacing_half | parr_marks_spacing_narrower than interspaces | parr_marks_spacing_variable | parr_marks_spacing_wider than interspaces | spotting_characteristic_NA | spotting_characteristic_circle | spotting_characteristic_irregular | spotting_characteristic_row | spotting_characteristic_variable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 989.001038884511 | 10.6 | 80.0 | ck | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 989.001038885629 | 13.3 | | ck | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 989.001038888882 | 14.1 | 76.0 | ck | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 989.001038889013 | 10.6 | 76.0 | ck | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 989.001038888642 | 10.6 | 85.0 | ck | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
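
The column names in this output (`watershed_nanaimo`, `parr_marks_spacing_NA`, and so on) suggest that `processing` joins the deterministic features onto each field record by species and then one-hot encodes the categorical columns. The sketch below reproduces that pattern on toy data. It is an assumption about the approach, not the code in `process.py`, and the toy frames and column choices are illustrative only.

```python
import pandas as pd

# Toy field records and deterministic features (illustrative values only).
field_toy = pd.DataFrame({
    "tag_id_long": [1, 2, 3],
    "watershed": ["nanaimo", "cowichan", "nanaimo"],
    "fork_length_mm": [80.0, None, 76.0],
    "species": ["ck", "cm", "ck"],
})
det_toy = pd.DataFrame({
    "species": ["ck", "cm"],
    "eye_size": ["large", "medium"],
    "snout_shape": ["pointy", None],  # missing trait, as for 'cm' in the det table
})

# Attach each species' traits to its field records.
merged = field_toy.merge(det_toy, on="species", how="left")

# Assumed step: label missing traits "NA" so they get their own indicator column.
trait_cols = ["eye_size", "snout_shape"]
merged[trait_cols] = merged[trait_cols].fillna("NA")

# One-hot encode the categorical columns; numeric columns are left untouched.
encoded = pd.get_dummies(merged, columns=["watershed"] + trait_cols, dtype=int)
print(encoded.columns.tolist())
```

Each category becomes a 0/1 column named `<column>_<value>`, which matches the naming seen in `processed_data`.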


The following cells call the `preprocess_data`, `voting_classifier_deterministic`, `voting_classifier_probabilistic`, and `voting_classifier` functions from `predict.py`. This Python file can be found [here](../scripts/predict.py).

The `preprocess_data` function separates the deterministic data from the data queried from the database, because the deterministic models take different inputs than the probabilistic models.

In [19]:
det_data,prob_data = preprocess_data(processed_data)

In [20]:
prob_data.head()

| | tag_id_long | species | water_temp_start | fork_length_mm | watershed_cowichan | watershed_englishman | watershed_nanaimo | watershed_puntledge | river_center creek | river_cowichan | ... | snout_shape_short and rounded | parr_marks_NA | parr_marks_faded | parr_marks_slightly faded | parr_marks_length_long | parr_marks_length_short | spotting_density_high | spotting_density_medium | fin_type_anal fin | fin_type_caudal fin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 989.001038884511 | ck | 10.6 | 80.0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 989.001038885629 | ck | 13.3 | | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 989.001038888882 | ck | 14.1 | 76.0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 989.001038889013 | ck | 10.6 | 76.0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 989.001038888642 | ck | 10.6 | 85.0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |


In [21]:
det_data.head()

| | tag_id_long | species | eye_size_large | eye_size_medium | eye_size_small | eye_size_very large | snout_shape_NA | snout_shape_long and pointy | snout_shape_pointy | snout_shape_short and blunt | ... | parr_marks_spacing_NA | parr_marks_spacing_half | parr_marks_spacing_narrower than interspaces | parr_marks_spacing_variable | parr_marks_spacing_wider than interspaces | spotting_characteristic_NA | spotting_characteristic_circle | spotting_characteristic_irregular | spotting_characteristic_row | spotting_characteristic_variable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 989.001038884511 | ck | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 989.001038885629 | ck | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 989.001038888882 | ck | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 989.001038889013 | ck | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 989.001038888642 | ck | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
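
As a rough illustration of this kind of split, the sketch below separates a toy processed frame into two inputs by column-name prefix. The prefix list is hypothetical, and the clean partition is a simplification: in the outputs above, `snout_shape_*` columns appear in both `prob_data` and `det_data`, so the real `preprocess_data` evidently shares some trait columns between the two.

```python
import pandas as pd

# Toy stand-in for processed_data (illustrative values only).
processed_toy = pd.DataFrame({
    "tag_id_long": [1, 2],
    "species": ["ck", "co"],
    "water_temp_start": [10.6, 13.3],
    "fork_length_mm": [80.0, 76.0],
    "watershed_nanaimo": [1, 0],
    "eye_size_large": [1, 1],
    "parr_marks_spacing_half": [0, 0],
})

ID_COLS = ["tag_id_long", "species"]
# Hypothetical prefixes marking the trait columns sent to the deterministic branch.
DET_PREFIXES = ("eye_size_", "parr_marks_spacing_")

# str.startswith accepts a tuple, so one pass finds every trait column.
det_cols = [c for c in processed_toy.columns if c.startswith(DET_PREFIXES)]

det_toy_data = processed_toy[ID_COLS + det_cols]
prob_toy_data = processed_toy.drop(columns=det_cols)

print(det_toy_data.columns.tolist())
print(prob_toy_data.columns.tolist())
```

Both frames keep `tag_id_long`, which is what allows the branch predictions to be joined back together later.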


Before moving on, it helps to look at a diagram of the ensemble that these models form.

![img1](../img/species_model_diagram.png)

The following two functions call the models in the deterministic branch and the probabilistic branch, respectively, and each returns its branch's predictions.

In [22]:
det_results = voting_classifier_deterministic(det_data)

In [12]:
prob_results = voting_classifier_probabilistic(prob_data)

In [14]:
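def voting_classifier(det_results,prob_results):
    """Combine the predictions of both branches into one prediction per row."""
    # The left merge keeps every row of det_results; tags that are absent
    # from prob_results get missing values in the probabilistic columns.
    df = det_results.merge(prob_results,on='tag_id_long',how='left')
    df.columns = ['tag_id_long','pred_1','pred_2','pred_3']

    ensemble_pred = []
    for row in range(len(df)):
        prediction = [df.iloc[row]['pred_1'],
                      df.iloc[row]['pred_2'],
                      df.iloc[row]['pred_3']]
        # A single 'pink' or 'so' vote overrides the majority.
        if 'pink' in prediction:
            ensemble_pred.append('pink')
        elif 'so' in prediction:
            ensemble_pred.append('so')
        else:
            # Otherwise take the most frequent label among the three votes.
            # Ties are broken by set iteration order, which is arbitrary.
            ensemble_pred.append(max(set(prediction), key=prediction.count))

    df['prediction'] = ensemble_pred

    return df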

In [15]:
df = voting_classifier(det_results,prob_results)
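
The voting rule can be tried on its own, without DataFrames. The sketch below applies the same priority order (`pink`, then `so`, then the majority label) to plain lists of three votes. The `vote` helper is hypothetical and is a variation rather than a copy of `voting_classifier`: it uses `collections.Counter`, which breaks ties in favour of the first vote listed instead of relying on set iteration order.

```python
import math
from collections import Counter


def vote(prediction):
    """Return the ensemble label for one list of per-model predictions."""
    # A single 'pink' or 'so' vote overrides the majority.
    if "pink" in prediction:
        return "pink"
    if "so" in prediction:
        return "so"
    # Most frequent label; on a tie, the label that appears first wins.
    return Counter(prediction).most_common(1)[0][0]


majority = vote(["ck", "ck", "co"])   # plain majority
override = vote(["ck", "co", "pink"])  # one 'pink' vote is enough
tie = vote(["ck", "co", "stl"])        # three-way tie, first vote wins

# Two missing votes outnumber one real vote, so the result is missing too.
missing = vote(["ck", math.nan, math.nan])

print(majority, override, tie, missing)
```

The last case is consistent with a row whose `pred_2` and `pred_3` are both missing ending up with an empty `prediction`, even though `pred_1` holds a label.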

In [16]:
df

| | tag_id_long | pred_1 | pred_2 | pred_3 | prediction |
|---|---|---|---|---|---|
| 0 | 989.001038884511 | ck | ck | ck | ck |
| 1 | 989.001038884511 | ck | ck | ck | ck |
| 2 | 989.001038885629 | ck | | | |
| 3 | 989.001038888882 | ck | ck | ck | ck |
| 4 | 989.001038889013 | ck | ck | ck | ck |
| ... | ... | ... | ... | ... | ... |
| 63326 | 989.001039718510 | stl | stl | stl | stl |
| 63327 | 989.001039718574 | stl | stl | stl | stl |
| 63328 | 989.001039718554 | stl | stl | stl | stl |
| 63329 | 989.001039718564 | stl | stl | stl | stl |
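
Two things stand out in this output: rows 0 and 1 share a `tag_id_long`, and row 2 has no probabilistic votes. A left merge can produce both effects, as the toy sketch below shows. Whether these are the actual causes in the real data is not established here. The frame contents and the column names `det_pred`, `prob_pred_1`, and `prob_pred_2` are hypothetical, and the shapes (one deterministic vote, two probabilistic votes) are inferred from the four column names assigned in `voting_classifier`.

```python
import pandas as pd

# Toy branch outputs (illustrative only).
det_toy_results = pd.DataFrame({
    "tag_id_long": ["a", "b", "c"],
    "det_pred": ["ck", "ck", "co"],
})
prob_toy_results = pd.DataFrame({
    "tag_id_long": ["a", "a", "c"],  # 'a' is duplicated, 'b' is absent
    "prob_pred_1": ["ck", "ck", "co"],
    "prob_pred_2": ["ck", "ck", "ck"],
})

# indicator=True adds a '_merge' column saying where each row came from.
merged = det_toy_results.merge(
    prob_toy_results, on="tag_id_long", how="left", indicator=True
)

# Tags with no match on the right side end up with missing votes.
unmatched = merged.loc[merged["_merge"] == "left_only", "tag_id_long"].tolist()

# Tags duplicated on the right side multiply the matching left rows.
has_duplicates = bool(prob_toy_results["tag_id_long"].duplicated().any())

print(merged)
print(unmatched, has_duplicates)
```

Running the same two checks on the real `det_results` and `prob_results` would show whether duplicated or unmatched tags explain the rows above.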
