# Random Forest
A Random Forest Classifier is a tree-based ensemble method that uses a collection of decision trees to determine the class of each input observation. Each tree is trained on a bootstrap sample of the rows and considers only a random subset of features at each split; the forest then combines the predictions of all its trees. This helps reduce overfitting compared with a single tree, because no one feature or sample dominates every tree. We felt this was our next best choice after Logistic Regression because decision trees make no assumptions about the distribution of the data and are quick to train. Though this was not our production model, we spent a good amount of time searching through hyperparameters to find an optimum fit. Our results with this model showed overfitting, with low predictive power on unseen data.
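
To make the mechanism concrete, here is a minimal hand-rolled sketch of the two sources of randomness (bootstrap rows and per-split feature subsets) followed by averaging. It uses synthetic data from `make_classification`, which is an assumption for illustration and not our dataset; the tree count of 25 is also arbitrary. `RandomForestClassifier` does all of this internally, so the sketch is for intuition only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the project's dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.RandomState(0)
trees = []
for i in range(25):
    # Bootstrap: draw rows with replacement so each tree sees a different sample.
    rows = rng.randint(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider only a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_demo[rows], y_demo[rows]))

# Aggregate: average the per-tree class probabilities, then pick the likeliest class.
avg_proba = np.mean([t.predict_proba(X_demo) for t in trees], axis=0)
forest_pred = avg_proba.argmax(axis=1)
```

Averaging probabilities rather than raw features is what lets individual, noisy trees cancel out one another's errors.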


In [17]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier

# Importing the Time Series split assets

In [18]:
X_train = pd.read_pickle('../assets/X_train.pkl')
X_test = pd.read_pickle('../assets/X_test.pkl')
y_train = pd.read_pickle('../assets/y_train.pkl')
y_test = pd.read_pickle('../assets/y_test.pkl')

In [19]:
X_train.head()

| | Day_length | Tmax | Tmin | Tavg | ResultSpeed | ResultDir | AvgSpeed | Sunset | Sunrise | Heat | ... | ResultDir_exp | PrecipTotal_exp | WetBulb_exp | Species_CULEX ERRATICUS | Species_CULEX PIPIENS | Species_CULEX PIPIENS/RESTUANS | Species_CULEX RESTUANS | Species_CULEX SALINARIUS | Species_CULEX TARSALIS | Species_CULEX TERRITANS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53760 | 88.0 | 62.5 | 75.5 | 5.8 | 17.0 | 6.95 | 1917 | 421 | 0.0 | ... | 17.0 | 0.0 | 65.5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 53760 | 88.0 | 62.5 | 75.5 | 5.8 | 17.0 | 6.95 | 1917 | 421 | 0.0 | ... | 17.0 | 0.0 | 65.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 53760 | 88.0 | 62.5 | 75.5 | 5.8 | 17.0 | 6.95 | 1917 | 421 | 0.0 | ... | 17.0 | 0.0 | 65.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 53760 | 88.0 | 62.5 | 75.5 | 5.8 | 17.0 | 6.95 | 1917 | 421 | 0.0 | ... | 17.0 | 0.0 | 65.5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 53760 | 88.0 | 62.5 | 75.5 | 5.8 | 17.0 | 6.95 | 1917 | 421 | 0.0 | ... | 17.0 | 0.0 | 65.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |


# Running a pipeline
The pipeline chains a standard scaler with a Random Forest Classifier.

In [20]:
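pipe = Pipeline([
    ('ss', StandardScaler()),  # standardize features (not strictly required for tree models)
    ('rfc', RandomForestClassifier(n_jobs=3, random_state=42)),  # forest fit with 3 parallel jobs
])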

# Setting the range of hyperparameters 

In [21]:
param_grid = {
    'rfc__n_estimators': [100, 500, 1000],      # number of trees in the forest
    'rfc__min_samples_split': [2, 7, 10, 20],   # minimum samples required to split a node
}
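
The size of this search follows directly from the grid: every combination of the two lists is one candidate, and each candidate is fit once per cross-validation fold. A small sketch of that arithmetic using scikit-learn's `ParameterGrid`; the fold count of 3 is taken from the fit log later in this notebook, and the variable names `candidates` and `total_fits` are ours for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'rfc__n_estimators': [100, 500, 1000],
    'rfc__min_samples_split': [2, 7, 10, 20],
}

# Expand the grid into every hyperparameter combination.
candidates = list(ParameterGrid(param_grid))

n_splits = 3  # folds reported in the fit log ("Fitting 3 folds ...")
total_fits = len(candidates) * n_splits

print(len(candidates), total_fits)  # 12 36
```

The `rfc__` prefix routes each parameter to the pipeline step named `rfc`.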

# Grid searching with a time series split
We used roc_auc as our scoring metric, paired with a time series split so that each validation fold comes after the data used to train on it.
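
To illustrate what a time series split does, the sketch below runs `TimeSeriesSplit` on a toy array of 12 time-ordered rows (a hypothetical stand-in, not our training data) and prints the indices in each fold. The training window grows with each fold, and validation indices always come after training indices, so the model is never validated on observations older than those it was trained on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered rows (hypothetical toy data).
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

for train_idx, test_idx in folds:
    # Training indices always precede validation indices.
    print("train:", train_idx, "validate:", test_idx)
```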

In [22]:
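# n_splits=3 is explicit to match the 3 folds in the fit log; newer scikit-learn versions default to 5
gs = GridSearchCV(pipe, param_grid=param_grid, verbose=1, scoring='roc_auc', cv=TimeSeriesSplit(n_splits=3))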

In [23]:
gs.fit(X_train,y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:   47.9s finished


GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('rfc', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0....timators=10, n_jobs=3,
            oob_score=False, random_state=42, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'rfc__n_estimators': [100, 500, 1000], 'rfc__min_samples_split': [2, 7, 10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=1)

# Scoring
This model routinely overfit our data: ROC AUC was about 0.99 on the training set but only about 0.75 on the test set.

In [24]:
gs.score(X_train,y_train)

0.9889989913432868

In [25]:
gs.score(X_test,y_test)

0.7533411993440045

In [26]:
X_train_preds = gs.predict(X_train)

In [27]:
preds = pd.DataFrame({
    "preds":X_train_preds,
    "truth":y_train
})


In [28]:
preds.sum()

preds     37
truth    261
dtype: int64
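
The model predicts far fewer positives on the training set (37) than actually occur (261). One possible reason, offered as an assumption rather than a diagnosis, is that `predict` labels a row positive only when the averaged class probability exceeds 0.5, while roc_auc depends only on how the probabilities rank the rows. The sketch below uses synthetic imbalanced data (not our dataset) and a hypothetical cut-off of 0.25 to show how the number of predicted positives changes with the threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Synthetic imbalanced data: roughly 5% positives (hypothetical stand-in).
X_imb, y_imb = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

clf = RandomForestClassifier(n_estimators=100, min_samples_split=20, random_state=42)
clf.fit(X_imb, y_imb)

# Probability of the positive class for each row.
proba = clf.predict_proba(X_imb)[:, 1]

default_pred = clf.predict(X_imb)            # implicit 0.5 cut-off
lowered_pred = (proba >= 0.25).astype(int)   # hypothetical lower cut-off

print("positives at 0.5: ", default_pred.sum())
print("positives at 0.25:", lowered_pred.sum())
print("recall at 0.5: ", recall_score(y_imb, default_pred))
print("recall at 0.25:", recall_score(y_imb, lowered_pred))
```

With the fitted search from this notebook, `gs.predict_proba(X_train)[:, 1]` would play the role of `proba`.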

In [29]:
gs.best_params_

{'rfc__min_samples_split': 20, 'rfc__n_estimators': 1000}

In [30]:
# with open('../assets/random_forest_model_0924_1247.pkl','wb+') as f:
#     pickle.dump(gs,f)

# Feature importances
The top 7 features account for roughly 75% of the total importance in our model; each remaining feature contributes less than 2%.

In [31]:
# Label each importance from the refit best forest with its column name
feat_importances = pd.DataFrame(gs.best_estimator_.named_steps['rfc'].feature_importances_, index=X_train.columns, columns=['importance'])

In [32]:
feat_importances.sort_values('importance', ascending=False)

| | importance |
|---|---|
| Heat_exp | 0.123869 |
| Longitude | 0.11653 |
| PrecipTotal_exp | 0.113432 |
| WetBulb_exp | 0.113345 |
| Day_length_exp | 0.100534 |
| ResultDir_exp | 0.096082 |
| Latitude | 0.086649 |
| Sunrise | 0.016541 |
| ResultDir | 0.015922 |
| ResultSpeed_shift | 0.01583 |
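
As a quick check on how concentrated the importances are, the sketch below takes the ten values shown in the output above and computes their running total. Because scikit-learn normalizes `feature_importances_` to sum to 1 across all features, the running total can be read as a share of the whole model. The name `top_importances` is ours for illustration; on the real frame, `feat_importances['importance'].sort_values(ascending=False).cumsum()` would give the same view.

```python
import pandas as pd

# The ten largest importances, copied from the output above.
top_importances = pd.Series({
    "Heat_exp": 0.123869,
    "Longitude": 0.11653,
    "PrecipTotal_exp": 0.113432,
    "WetBulb_exp": 0.113345,
    "Day_length_exp": 0.100534,
    "ResultDir_exp": 0.096082,
    "Latitude": 0.086649,
    "Sunrise": 0.016541,
    "ResultDir": 0.015922,
    "ResultSpeed_shift": 0.01583,
})

# Running total of importance, largest feature first.
cumulative = top_importances.sort_values(ascending=False).cumsum()
top7_share = float(cumulative.iloc[6])

print(round(top7_share, 3))  # 0.75
```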
