### 3. Preprocessing & Modelling

The goal here is to use the datasets provided to build a model that predicts the presence of West Nile Virus. The model is intended to support the City of Chicago's decisions on pesticide spraying.

### Contents:
- [Import Libraries](#Import-Libraries)
- [Import Data](#Import-Data)
- [Data prepared for Modelling](#Data-prepared-for-Modelling)
- [Modelling](#Modelling)
- [Models Evaluation & Next Steps](#Models-Evaluation-&-Next-Steps)

### Import Libraries

We import the necessary libraries used in analysis.

In [None]:
# import libraries

# maths
import numpy as np
import pandas as pd

# visual
#from matplotlib_venn import venn2
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# modelling
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix,accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.utils import resample, shuffle
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier,RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier

# Others
import warnings
warnings.filterwarnings("ignore")
from IPython.display import Image

### Import Data

We import the datasets with standardised file paths. 

In [None]:
# file paths

input_path = '../data/2_input/'
clean_path = '../data/3_clean/'
output_path = '../data/4_output/'

image_path = '../images/'

The datasets imported here have already been cleaned for null values. 

In [None]:
# import clean data

df_train = pd.read_csv(clean_path + 'train_clean.csv')
df_test = pd.read_csv(clean_path + 'test_clean.csv')
df_weather = pd.read_csv(clean_path + 'weather_clean.csv')

In [None]:
print('Size of train dataset: {}'.format(df_train.shape))
print('Size of test dataset: {}'.format(df_test.shape))
print('Size of weather dataset: {}'.format(df_weather.shape))

### Data prepared for Modelling

We combine train and test data to prepare the datasets for modelling.

In [None]:
#Drops id column from test not in train
test = df_test.drop('id', axis=1)
#Drops nummosquitos and wnvpresent columns from train not in test
train = df_train.drop(['nummosquitos', 'wnvpresent'], axis=1)

#Combines train and test datasets
combined_train_test = pd.concat([test,train])

print('Size of train/test dataset: {}'.format(combined_train_test.shape))

The weather dataset gives us information on weather conditions from 2007 to 2014, during the months of the virus tests. It includes data from two weather stations:

<br>Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT Lat: 41.995 Lon: -87.933 Elev: 662 ft. above sea level
<br>Station 2: CHICAGO MIDWAY INTL ARPT Lat: 41.786 Lon: -87.752 Elev: 612 ft. above sea level

We split the data by station and combine the two stations side by side (across columns), so that each date appears as a single row and merging does not duplicate rows in the main dataset. We then merge the result with the train/test dataset to add the weather conditions on each virus test date.

In [None]:
#Splits weather data by Station
only_station_1 = df_weather[df_weather['station'] == 1].reset_index(drop=True)
only_station_2 = df_weather[df_weather['station'] == 2].reset_index(drop=True)

#Renames Station 2 data for differentiation
only_station_2.columns = [str(col) + '_2' for col in only_station_2.columns]

In [None]:
# Combine weather data from both stations across columns and drop Station columns
parallel_weather = pd.concat([only_station_1,only_station_2], axis=1).drop(['station','station_2'], axis=1)

In [None]:
#Print combined weather data
parallel_weather.head()
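
The column-wise `pd.concat` above pairs the two stations by row position, which assumes both stations report exactly the same dates in the same order. A minimal sketch of a key-based alternative follows, using a small made-up frame; the `tavg` column, its values and the `toy_` names are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy weather data; the 'tavg' column and its values are hypothetical.
toy_weather = pd.DataFrame({
    'station': [1, 2, 1, 2],
    'year':    [2007, 2007, 2007, 2007],
    'month':   [5, 5, 5, 5],
    'day':     [1, 1, 2, 2],
    'tavg':    [67, 68, 51, 52],
})

keys = ['year', 'month', 'day']
s1 = toy_weather[toy_weather['station'] == 1].drop(columns='station')
s2 = toy_weather[toy_weather['station'] == 2].drop(columns='station')

# Merge on the date keys so rows are matched by date, not by position.
# Station 2 columns get the '_2' suffix; the key columns are kept once.
toy_parallel = s1.merge(s2, on=keys, how='inner', suffixes=('', '_2'))
print(toy_parallel)
```

Matching on the date keys would also avoid carrying duplicate `year_2`, `month_2` and `day_2` columns into the merged dataset.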

In [None]:
#Combines weather data with train and test dataset
all_dataset = combined_train_test.merge(parallel_weather, how='left', on=['year','month','day'])

print('Size of train/test dataset with weather data: {}'.format(all_dataset.shape))

In [None]:
#Prints train/test dataset with weather information
all_dataset.head()

For the categorical features, we use one-hot encoding to convert them to numeric columns, dropping the first dummy column of each feature because it is implied by the remaining ones. We also drop the original columns with non-numeric values.

In [None]:
#Converts categorical data into numeric
df_get_dum = pd.concat([all_dataset, pd.get_dummies(all_dataset[['species', 'street', 'trap']],drop_first=True)], axis=1)
df_get_dum.drop(['species', 'street', 'trap'], inplace =True, axis=1)

print('Size of train/test dataset with weather data(One Hot Encoded): {}'.format(df_get_dum.shape))
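
To illustrate why the first dummy column can be dropped, here is a small sketch on a made-up column (the category values `A`, `B`, `C` are placeholders, not real species names). A row with all remaining dummies set to zero can only belong to the dropped category.

```python
import pandas as pd

# Toy frame; the category values are illustrative only.
toy = pd.DataFrame({'species': ['A', 'B', 'C', 'A']})

full = pd.get_dummies(toy['species'], prefix='species')
reduced = pd.get_dummies(toy['species'], prefix='species', drop_first=True)

print(list(full.columns))     # one column per category
print(list(reduced.columns))  # first category dropped

# Rows of category 'A' have no dummy set in the reduced encoding.
print(reduced[toy['species'] == 'A'])
```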

We split the data back into separate train and test datasets (the train data covers the odd years and the test data the even years) and use only train for training the model.

In [None]:
#Splits out train dataset using year
train = df_get_dum[df_get_dum['year']%2!=0]
train.reset_index(inplace=True, drop=True)

#Re-attaching original wnvpresent column
wnv = pd.Series(df_train['wnvpresent'])
train_with_wnv = pd.concat([train , wnv], axis=1)

print('Size of processed train data: {}'.format(train_with_wnv.shape))

In [None]:
#Splits out test dataset using year
test = df_get_dum.loc[df_get_dum['year']%2==0]
print('Size of processed test data: {}'.format(test.shape))

We also note that the data is imbalanced. Out of the 8475 rows in our training dataset, only 457 (~5%) data points represent the virus present class while 8018 represent virus not present.

In [None]:
#Size of training data
train_with_wnv.shape[0]

In [None]:
#Representation of classes
train_with_wnv.wnvpresent.value_counts()

To handle the imbalanced data, we oversample the minority class, i.e. the data points where WNV is present. While undersampling is an option, undersampling the majority class to match the minority class would leave only 914 data points (2 × 457) and discard most of the data, so it is not ideal.

We first split the data by class, then resample the minority class with replacement until it has 8018 data points, and finally combine the upsampled minority class with the original majority class.

In [None]:
#Splits data by presence of wnv
majority_class = train_with_wnv[train_with_wnv['wnvpresent']==0]
minority_class = train_with_wnv[train_with_wnv['wnvpresent']==1]

In [None]:
#Resamples minority class with replacement
minority_upsampled = resample( minority_class, replace=True, n_samples=majority_class.shape[0], random_state=42)

#Combine new minority class dataset with original majority class dataset
train_resampled = pd.concat([minority_upsampled,majority_class])

#Checks class representation
train_resampled.wnvpresent.value_counts()
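
One caveat worth noting: when oversampling happens before a later train/test split, copies of the same minority row can land in both subsets, which tends to inflate held-out scores. The sketch below shows an alternative ordering on synthetic data (hold out first, then oversample only the training part). The `toy_` names and the `feature` column are hypothetical, and this is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the real training set.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'feature': rng.normal(size=200),
    'wnvpresent': [1] * 10 + [0] * 190,
})

# 1. Hold out a test subset first, preserving the class ratio.
toy_train, toy_holdout = train_test_split(
    toy, test_size=0.33, random_state=42, stratify=toy['wnvpresent'])

# 2. Oversample the minority class within the training subset only.
majority = toy_train[toy_train['wnvpresent'] == 0]
minority = toy_train[toy_train['wnvpresent'] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
toy_train_balanced = pd.concat([majority, minority_up])

# The training part is balanced; the hold-out keeps the original ratio
# and shares no rows with the training part.
print(toy_train_balanced['wnvpresent'].value_counts())
print(toy_holdout['wnvpresent'].value_counts())
```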

We shuffle the dataset to inject randomness.

In [None]:
#Shuffles dataset
df = shuffle(train_resampled, random_state=42)
df.reset_index(drop=True, inplace=True)

# Print resampled, reshuffled new dataset
df.head()

### Modelling

We use all the features in the dataset to fit classification models, with `wnvpresent` as the target.

In [None]:
X = df.drop(['wnvpresent'], axis=1)

In [None]:
y = df.wnvpresent

Our baseline predicts 0 (virus not present) for every data point. Since the original data is imbalanced, predicting the majority class is the simplest possible model.

In [None]:
baseline_pred = np.zeros(y.shape[0])

In [None]:
roc_auc_score(y,baseline_pred)

A ROC AUC score of 0.5 means the baseline has no ability to separate the classes.
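
As a quick check of this property, the sketch below scores two constant predictions against a made-up label vector (`y_toy` is illustrative only). Any constant prediction ranks all samples equally, so the class balance does not change the result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels; any mix containing both classes behaves the same way.
y_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0])

# A constant prediction gives every sample the same score, so the
# ROC curve is the diagonal and the area under it is 0.5.
auc_all_zero = roc_auc_score(y_toy, np.zeros(len(y_toy)))
auc_all_one = roc_auc_score(y_toy, np.ones(len(y_toy)))
print(auc_all_zero, auc_all_one)
```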

The data is split into random train and test subsets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

We fit the following classifiers and evaluate their performance using the ROC AUC score:
- Logistic Regression
- K-Nearest Neighbours Classifier
- Decision Tree Classifier
- Random Forest Classifier
- AdaBoost Classifier
- Gradient Boosting Classifier

In [None]:
estimators = {
    'Lr': LogisticRegression(),
    'Knn': KNeighborsClassifier(n_neighbors=5),
    'Dtree': DecisionTreeClassifier(),
    'Rf': RandomForestClassifier(),
    'Adaboost': AdaBoostClassifier(),
    'GradientBoost': GradientBoostingClassifier()
}.items()

Here we use a pipeline to scale the data before fitting to the classifiers. 

In [None]:
for k,v in estimators:
    pipe = Pipeline([
        ('sc', StandardScaler()),
        (k,v)])
    model = pipe.fit(X_train,y_train)
    # accuracy on the training subset
    train_score = round(model.score(X_train,y_train),5)
    # spread of ROC AUC across 3 cross-validation folds
    cv_std = round(cross_val_score(model,X,y,cv=3, scoring='roc_auc').std(),5)
    print('{} Model Score: {}, Cross-Validation Standard Deviation: {}'.format(k, train_score, cv_std))

The model score above is the accuracy on the training subset (`X_train`), so it shows how well each model fits data it has already seen. Cross-validation was performed on the whole training dataset using ROC AUC, and the standard deviation across folds indicates how consistently each model generalises to data it was not trained on.

Based on the model score and the standard deviation of the cross-validation scores, the best classifiers are the decision tree and the random forest. The decision tree scored 1.0 because an unconstrained tree keeps splitting until it classifies every training point correctly.

As such, we tune the hyperparameters of the decision tree and the random forest.
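
To make the difference between the two measurements concrete, the sketch below fits an unconstrained decision tree on synthetic, deliberately noisy data (generated with `make_classification`; the data and the `_toy` names are assumptions for illustration). Training accuracy reaches 1.0 while cross-validated ROC AUC stays lower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), standing in for real features.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, flip_y=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Accuracy on the same data the tree was fitted on.
train_acc = tree.score(X_toy, y_toy)

# ROC AUC on folds the tree did not see during fitting.
cv_auc = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_toy, y_toy,
    cv=3, scoring='roc_auc')

print('training accuracy:', train_acc)
print('cross-validated ROC AUC:', cv_auc.mean())
```

A perfect training score is therefore a sign of memorisation rather than evidence of a good model, which is why the cross-validation results carry more weight.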

<b>Tuning hyperparameters through GridSearch</b>

In [None]:
params = {

    'Dtree': {
        # 'sqrt' replaces 'auto', which newer scikit-learn versions no longer accept
        'Dtree__max_features': ['sqrt', 'log2', None, 50, 100, 200],
        'Dtree__max_depth': [None, 5, 10, 15],
        'Dtree__min_samples_split': np.linspace(0.1, 0.5, 5)
    },
    'Rf': {
        'Rf__n_estimators': [10, 20, 50, 100],
        'Rf__max_depth': [None, 5, 10, 15],
        # 'sqrt' replaces 'auto', which newer scikit-learn versions no longer accept
        'Rf__max_features': ['sqrt', 'log2', None, 50, 100, 200],
        'Rf__min_samples_split': np.linspace(0.1, 0.5, 5)
    }
}
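
The keys in the grid above follow scikit-learn's `<step name>__<parameter>` convention, which is how a pipeline routes a setting to the right step. A minimal sketch follows (`toy_pipe` is a hypothetical name used only here):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

toy_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('Dtree', DecisionTreeClassifier()),
])

# 'Dtree__max_depth' targets the max_depth parameter of the 'Dtree' step.
toy_pipe.set_params(Dtree__max_depth=5)
print(toy_pipe.named_steps['Dtree'].max_depth)

# All tunable keys, including nested ones, are listed by get_params().
print('Dtree__max_depth' in toy_pipe.get_params())
```

The prefix must match the step name exactly, including case, so a step named `'rf'` would need keys starting with `rf__` rather than `Rf__`.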

In [None]:
models = []
parameters = []
best_score = []
roc_auc = []

shortlist_estimators = {
    'Dtree': DecisionTreeClassifier(),
    'Rf': RandomForestClassifier(),
}.items()

for k,v in shortlist_estimators:
    pipe = Pipeline([
        ('sc', StandardScaler()),
        (k,v)])
    
    param = params[k]
    
    gridsearch = GridSearchCV(
        estimator=pipe,
        param_grid=param,
        verbose=1,
        cv= 3,
        n_jobs=-1,
        return_train_score= True,
        scoring = 'roc_auc'
    )

    gridsearch.fit(X_train, y_train)
    
    model = gridsearch.best_estimator_
    cv_score = gridsearch.cv_results_
    best_params = gridsearch.best_params_

    # predict class-1 probabilities so ROC AUC is computed on scores, not hard labels
    y_pred = model.predict_proba(X_test)[:, 1]
    
    # print results
    print("Model: ", k)
    print("Best parameters:", best_params)
    print("Best score:", gridsearch.best_score_)
    print("AUC/ROC test:", roc_auc_score(y_test,y_pred))
    display(pd.DataFrame(cv_score, columns = cv_score.keys()))
    
    
    # append info to list
    models.append(k)
    best_score.append(gridsearch.best_score_)
    parameters.append(best_params)
    roc_auc.append(roc_auc_score(y_test,y_pred))

In [None]:
# print summary of results
summary = pd.DataFrame({
    'model': models,
    'parameters': parameters,
    'best score': best_score,
    'roc/auc test': roc_auc
})
# None shows full column contents (-1 is no longer accepted by recent pandas versions)
pd.set_option('display.max_colwidth', None)
summary

#### Prediction on Kaggle test set

The best parameters from the grid search were used for prediction on the Kaggle test set.

In [None]:
# prediction for decision tree
pipe = Pipeline([
        ('sc', StandardScaler()),
        ('Dtree', DecisionTreeClassifier(min_samples_split=0.1))])
    
model = pipe.fit(X_train,y_train)
pred = model.predict_proba(test)
pred = pd.DataFrame(pred, columns=[0,1])

submission = pd.DataFrame()
submission['WnvPresent'] = pred[1]
submission['Id'] = submission.index + 1
submission[['Id', 'WnvPresent']].to_csv(output_path+'submission_dtree.csv', index = False)

In [None]:
# prediction for random forest
pipe = Pipeline([
        ('sc', StandardScaler()),
        ('rf', RandomForestClassifier(max_features='log2', min_samples_split=0.1, n_estimators=50))])

model = pipe.fit(X_train,y_train)
pred = model.predict_proba(test)
pred = pd.DataFrame(pred, columns=[0,1])

submission = pd.DataFrame()
submission['WnvPresent'] = pred[1]
submission['Id'] = submission.index + 1
submission[['Id', 'WnvPresent']].to_csv(output_path+'submission_rf.csv', index = False)
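
The submissions above take column `1` of `predict_proba` as the probability that the virus is present. That relies on the columns following the order of the fitted model's `classes_` attribute. A small sketch on made-up data (`X_toy`, `y_toy` and `clf` are illustrative names) shows how the column could be looked up explicitly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny separable toy problem.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Columns of predict_proba follow the order of clf.classes_.
pos_col = list(clf.classes_).index(1)
proba_pos = clf.predict_proba(X_toy)[:, pos_col]
print(clf.classes_, proba_pos)
```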

### Models Evaluation & Next Steps

We used both the Decision Tree and Random Forest models to predict on the Kaggle test set. The results obtained were as follows.

In [None]:
Image(filename= image_path + 'submission_dtree_rf.PNG')

The gap between the Kaggle scores and the training scores indicated that our models were overfitted.

Therefore, we explored further ways to better prepare the data for modelling (refer to the other notebooks):
1. Regression to predict number of mosquitos - The number of mosquitos was not provided in the Kaggle test set and was not considered in the previous model. We first predict the number of mosquitos, then use it when predicting the probability of the virus.
2. Oversampling via SMOTE

Random Forest Classification was used in the exploration as Decision Tree is prone to overfitting.