### 3. Preprocessing & Modelling - SMOTE

Here we explore another oversampling method, the Synthetic Minority Oversampling Technique (SMOTE), which creates synthetic data points by interpolating between each minority-class point and its nearest minority-class neighbours. Enough synthetic points are added to balance the classes in the data.

### Contents:
- [Import Libraries](#Import-Libraries)
- [Import Data](#Import-Data)
- [Data prepared for Modelling](#Data-prepared-for-Modelling)
- [Modelling](#Modelling)
- [Models Evaluation & Next Steps](#Models-Evaluation-&-Next-Steps)

### Import Libraries

We import the necessary libraries used in analysis.

In [None]:
# import libraries

# maths
import numpy as np
import pandas as pd

# visual
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# modelling
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Others
import warnings
warnings.filterwarnings("ignore")
from IPython.display import Image

### Import Data

We define standardised file paths for importing the datasets.

In [None]:
# file paths

input_path = '../data/2_input/'
clean_path = '../data/3_clean/'
output_path = '../data/4_output/'

image_path = '../images/'

The datasets imported here have already been cleaned of null values.

In [None]:
# import clean data

df_train = pd.read_csv(clean_path + 'train_clean.csv')
df_test = pd.read_csv(clean_path + 'test_clean.csv')
df_weather = pd.read_csv(clean_path + 'weather_clean.csv')

In [None]:
print('Size of train dataset: {}'.format(df_train.shape))
print('Size of test dataset: {}'.format(df_test.shape))
print('Size of weather dataset: {}'.format(df_weather.shape))

### Data prepared for Modelling

We combine train and test data to prepare the datasets for modelling.

In [None]:
# drop the id column, which exists in test but not in train
test = df_test.drop('id', axis=1)
# drop nummosquitos and wnvpresent, which exist in train but not in test
train = df_train.drop(['nummosquitos', 'wnvpresent'], axis=1)

# combine the test and train datasets (test rows first)
combined_train_test = pd.concat([test,train])

print('Size of train/test dataset: {}'.format(combined_train_test.shape))

The weather dataset gives us information on weather conditions from 2007 to 2014, during the months of the virus tests. It includes data from two weather stations:

<br>Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT Lat: 41.995 Lon: -87.933 Elev: 662 ft. above sea level
<br>Station 2: CHICAGO MIDWAY INTL ARPT Lat: 41.786 Lon: -87.752 Elev: 612 ft. above sea level

We split the data by station and use only the data from Station 1 here, because Station 2 had many null values that we imputed from Station 1 data during cleaning.

We then merge the Station 1 weather data with the train/test dataset to add the weather conditions measured at Station 1 on the date of each virus test.

In [None]:
#Splits weather data by Station
only_station_1 = df_weather[df_weather['station'] == 1].reset_index(drop=True)

In [None]:
#Using weather data only from Station 1
all_dataset = combined_train_test.merge(only_station_1, how='left', on=['year','month','day'])
print('Size of train/test dataset with Station 1 weather data: {}'.format(all_dataset.shape))
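
The date-keyed left merge can be illustrated on a toy example. This is a minimal sketch: the two small frames and their values are made up, and `validate='many_to_one'` is an optional pandas safeguard that is not used in the cell above. It raises an error if the weather table holds more than one row per date, which would otherwise silently duplicate trap records.

```python
import pandas as pd

# hypothetical trap records (two traps checked on the same date)
traps = pd.DataFrame({
    'year': [2007, 2007, 2009],
    'month': [5, 5, 6],
    'day': [29, 29, 2],
    'trap': ['T002', 'T007', 'T002'],
})

# hypothetical daily readings from a single weather station
weather = pd.DataFrame({
    'year': [2007, 2009],
    'month': [5, 6],
    'day': [29, 2],
    'tavg': [74, 68],
})

# every trap row is kept and receives the weather reading for its date
merged = traps.merge(
    weather,
    how='left',
    on=['year', 'month', 'day'],
    validate='many_to_one',  # fail loudly if a date appears twice in weather
)
print(merged)
```

Because the merge is a left join, the number of rows should match the train/test dataset before and after merging, which is what the printed shape lets us check.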

In [None]:
#Print combined data
all_dataset.head()

For the categorical data, we use One Hot Encoding to convert them to numerical data. We drop the first dummy column of each categorical feature, since it is implied by the remaining columns and therefore redundant. We also drop the original columns with non-numeric values.

In [None]:
#Converts categorical data into numeric
df_get_dum = pd.concat([all_dataset, pd.get_dummies(all_dataset[['species', 'street', 'trap']],drop_first=True)], axis=1)
df_get_dum.drop(['species', 'street', 'trap'], inplace =True, axis=1)

print('Size of train/test dataset with weather data(One Hot Encoded): {}'.format(df_get_dum.shape))
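
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame with one categorical column. The labels `A`, `B` and `C` are placeholders, not values from the dataset.

```python
import pandas as pd

# toy frame: one categorical column with three levels, one numeric column
toy = pd.DataFrame({
    'species': ['A', 'B', 'A', 'C'],
    'year': [2007, 2007, 2009, 2009],
})

# encode the categorical column and drop the original text column
encoded = pd.concat(
    [toy.drop(columns='species'),
     pd.get_dummies(toy[['species']], drop_first=True)],
    axis=1,
)

# the first level ('A') has no column of its own: a row belongs to 'A'
# exactly when both remaining dummy columns are zero
print(list(encoded.columns))
```

Three levels therefore produce two dummy columns, so the encoded dataset grows by one column fewer than the number of distinct values in each categorical feature.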

We split the data back into separate train and test datasets and use only train for training the model.

In [None]:
# split out the train dataset by year (train rows fall in odd years)
train = df_get_dum[df_get_dum['year']%2!=0]
train.reset_index(inplace=True, drop=True)

print('Size of processed train data: {}'.format(train.shape))

In [None]:
# split out the test dataset by year (test rows fall in even years)
test = df_get_dum.loc[df_get_dum['year']%2==0]
print('Size of processed test data: {}'.format(test.shape))

### Modelling

We use all the features in the dataset to fit classification models, with `wnvpresent` as the target.

In [None]:
X = train

In [None]:
y = df_train.wnvpresent

We conduct oversampling on the minority class using SMOTE.

In [None]:
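from imblearn.over_sampling import SMOTE

# fix the seed so the synthetic points are reproducible between runs
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)

# check class representation after resampling
pd.Series(y_resampled).value_counts()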
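
The interpolation idea behind SMOTE can be sketched in a few lines. This is an illustrative simplification and not the `imblearn` implementation: the function name `smote_sketch`, its parameters and the toy points are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # ask for k + 1 neighbours: with distinct points, column 0 is the point itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    # for every new point, pick a random minority row and one of its k neighbours
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]

    # place the new point at a random position on the segment joining the two
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


# toy minority class: six made-up points in two dimensions
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                     [1.1, 1.3], [0.9, 0.8], [1.3, 1.1]])
synthetic = smote_sketch(minority, n_new=10, k=3)
print(synthetic.shape)
```

Each synthetic row lies on a line segment between two real minority rows, so the new points stay inside the region already occupied by the minority class.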

The data is split into random train and test subsets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.33, random_state=42)
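
Splitting after resampling means the hold-out subset also contains synthetic points, some of them interpolated from rows that ended up in the training subset, so hold-out scores may be optimistic. An alternative ordering, sketched below under stated assumptions, splits first and oversamples only the training part. The data is synthetic, and plain random oversampling with `sklearn.utils.resample` stands in for SMOTE to keep the example dependency-free; it is not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# synthetic imbalanced data standing in for the notebook's X and y
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.95, 0.05],
                                   random_state=42)

# 1. split first, stratifying so both parts keep the original class ratio
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_toy, y_toy, test_size=0.33, stratify=y_toy, random_state=42)

# 2. oversample the minority class in the training part only
minority_rows = X_tr[y_tr == 1]
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = resample(minority_rows, replace=True, n_samples=n_extra,
                 random_state=42)

X_tr_bal = np.vstack([X_tr, extra])
y_tr_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=y_tr.dtype)])

print(np.bincount(y_tr_bal))  # training classes are now balanced
print(np.bincount(y_ho))      # hold-out keeps the original imbalance
```

With this ordering the hold-out subset contains only real observations at the original class ratio, which is closer to what the Kaggle test set looks like.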

We fit a Random Forest Classifier and tune its hyperparameters with GridSearch.

<b>Tuning hyperparameters through GridSearch</b>

In [None]:
params = {
    'n_estimators' : [10, 50, 100],
    'max_depth' : [3,9,15,20],
    'min_samples_split': np.linspace(0.1, 0.5, 5),
    'min_samples_leaf' : np.linspace(0.1, 0.5, 5),
    'max_features' : (20, 50, 200, None)
}

In [None]:
parameters = []
roc_auc = []

gridsearch = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=params,
    verbose=1,
    cv= 3,
    n_jobs=-1,
    return_train_score= True,
    scoring = 'roc_auc'
)

gridsearch.fit(X_train, y_train)

model = gridsearch.best_estimator_
cv_score = gridsearch.cv_results_
best_params = gridsearch.best_params_

# predicted class probabilities on the hold-out split
y_pred = pd.DataFrame(model.predict_proba(X_test), columns=['0','1'])
test_auc = roc_auc_score(y_test, y_pred['1'])

# print results
print("Best parameters:", best_params)
print("Best score:", gridsearch.best_score_)
print("AUC/ROC test:", test_auc)
pd.set_option('display.max_rows', 750)
display(pd.DataFrame(cv_score))

# append info to list
parameters.append(best_params)
roc_auc.append(test_auc)
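
The same search pattern can be tried quickly on a small problem before committing to the full grid. The sketch below uses synthetic data and a deliberately tiny, made-up grid; only the structure mirrors the cell above. Note that scikit-learn interprets float values of `min_samples_leaf` and `min_samples_split` as fractions of the training rows, so a value of 0.1 requires every leaf to hold at least 10% of the samples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the resampled training data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_toy, y_toy, test_size=0.33,
                                          random_state=0)

# a 2 x 2 grid keeps the example fast (4 candidates x 3 folds)
small_grid = {
    'n_estimators': [10, 30],
    'min_samples_leaf': [0.05, 0.2],  # floats are fractions of the rows
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    scoring='roc_auc',
    cv=3,
)
search.fit(X_tr, y_tr)

# column 1 of predict_proba is the probability of the positive class
holdout_auc = roc_auc_score(y_ho, search.predict_proba(X_ho)[:, 1])
print(search.best_params_, round(search.best_score_, 3), round(holdout_auc, 3))
```

Fixing `random_state` on the classifier makes repeated searches return the same best parameters, which the notebook's cell does not guarantee.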

In [None]:
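# pair each feature with its importance in the tuned forest
fi = pd.DataFrame({
    'feature':X.columns,
    'importance': model.feature_importances_
})

# `fig` holds the 20 most important features (a DataFrame, reused below)
fig = fi.sort_values('importance', ascending=False).iloc[:20]
fig.plot(kind='barh', figsize=(10,6))
plt.yticks(range(len(fig)),fig['feature'])
plt.show()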
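
Ranking features by importance can also be done with a labelled `Series`, which avoids setting the tick labels by hand. This is a minimal sketch on synthetic data; the feature names and forest settings are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data with named columns standing in for the real features
X_arr, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# index the importances by feature name, then sort from most to least important
importances = (
    pd.Series(forest.feature_importances_, index=X_toy.columns)
    .sort_values(ascending=False)
)
top3 = importances.head(3)
print(top3)
```

The importances of a fitted forest sum to one, so each value can be read as that feature's share of the total impurity reduction.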

In [None]:
# csv for kaggle submission
features = list(X.columns)
test = test[features]

pred = pd.DataFrame(model.predict_proba(test), columns=['0','1']) 
submission = pd.DataFrame()
submission['WnvPresent'] = pred['1']
submission['Id'] = submission.index + 1
submission[['Id', 'WnvPresent']].to_csv(output_path+'submission_SMOTE.csv', index = False)

In [None]:
features1 = list(fig.head(12).feature)
features1 

In [None]:
X1 = X[features1]
X_train = pd.DataFrame(X_train, columns=X.columns)
X_train1 = X_train[features1]
X_test = pd.DataFrame(X_test, columns=X.columns)
X_test1 = X_test[features1]
 
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('rf', RandomForestClassifier(max_depth=9,
                                  min_samples_leaf=0.1,
                                  min_samples_split=0.1,
                                  n_estimators=100))
])

model = pipe.fit(X_train1, y_train)
score = model.score(X_test1, y_test)  # mean accuracy on the hold-out split

# cross-validate once so the mean and std dev come from the same folds
cv_scores = cross_val_score(model, X1, y, cv=3, scoring='roc_auc')

# print results
print("Model score:", score)
print("Cross validation scores mean:", round(cv_scores.mean(), 5))
print("Cross validation scores std dev:", round(cv_scores.std(), 5))

In [None]:
# csv for kaggle submission
test = test[features1]

pred = pd.DataFrame(model.predict_proba(test), columns=['0','1']) 
submission = pd.DataFrame()
submission['WnvPresent'] = pred['1']
submission['Id'] = submission.index + 1
submission[['Id', 'WnvPresent']].to_csv(output_path+'submission_SMOTE1.csv', index = False)

### Models Evaluation & Next Steps

The results from Kaggle are as follows:

In [None]:
Image(filename= image_path + 'submission_rfsmote1.PNG')

In [None]:
Image(filename= image_path + 'submission_rfSMOTE.PNG')

The results are comparable with those of the earlier minority-class oversampling method when all features were used. The Kaggle scores decreased after restricting the model to the features selected through feature importance.

SMOTE fits the real-life scenario, where the presence of the virus is expected to cluster in the vicinity of an infected area.

The test score showed much lower variance.

In [None]:
prediction = model.predict(X_test1)
cm = confusion_matrix(y_test, prediction)  # rows are actual, columns are predicted: [[tn, fp], [fn, tp]]
pd.DataFrame(data=cm, columns=['predicted no wnv', 'predicted wnv'], index=['actual no wnv', 'actual wnv'])
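
The four cells of the confusion matrix can be unpacked to compute recall and precision for the virus-present class. The labels below are made up for illustration and are not results from this model.

```python
from sklearn.metrics import confusion_matrix

# hypothetical actual and predicted labels (1 = virus present)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# for binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)      # share of actual positives that were caught
precision = tp / (tp + fp)   # share of predicted positives that were right
print(tn, fp, fn, tp)        # 4 1 1 2
```

For a rare outcome such as virus presence, recall on the positive class is usually the more informative of the two, because missed positives are the costly error.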