# Assignment 4 - Data Set Description
The questions below relate to the data files for the contest 'DengAI: Predicting Disease Spread', published at the following website:
https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/data/

Anyone can join the contest. For details on contest submissions, visit the following webpage:
https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/submissions/
Ranking near the top of the leaderboard is a good way to showcase your machine learning skills.

Problem description:
Your goal is to predict the total_cases label for each (city, year, weekofyear) in the test set. There are two cities, San Juan and Iquitos, with test data for each city spanning 5 and 3 years respectively. You will make one submission that contains predictions for both cities. The data for each city have been concatenated along with a city column indicating the source: sj for San Juan and iq for Iquitos. The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data. Throughout, missing values have been filled as NaNs.

Assignment:
The goal is achieved through four successive assignments (1, 2, 3 and 4), all using the same dataset.


The features in this dataset
You are provided the following set of information on a (year, weekofyear) timescale:

(Where appropriate, units are provided as a _unit suffix on the feature name.)

City and date indicators

    city – City abbreviations: sj for San Juan and iq for Iquitos
    week_start_date – Date given in yyyy-mm-dd format

NOAA's GHCN daily climate data weather station measurements

    station_max_temp_c – Maximum temperature
    station_min_temp_c – Minimum temperature
    station_avg_temp_c – Average temperature
    station_precip_mm – Total precipitation
    station_diur_temp_rng_c – Diurnal temperature range
    
PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)

    precipitation_amt_mm – Total precipitation

NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)

    reanalysis_sat_precip_amt_mm – Total precipitation
    reanalysis_dew_point_temp_k – Mean dew point temperature
    reanalysis_air_temp_k – Mean air temperature
    reanalysis_relative_humidity_percent – Mean relative humidity
    reanalysis_specific_humidity_g_per_kg – Mean specific humidity
    reanalysis_precip_amt_kg_per_m2 – Total precipitation
    reanalysis_max_air_temp_k – Maximum air temperature
    reanalysis_min_air_temp_k – Minimum air temperature
    reanalysis_avg_temp_k – Average air temperature
    reanalysis_tdtr_k – Diurnal temperature range

Satellite vegetation: NOAA's CDR Normalized Difference Vegetation Index (NDVI) measurements (0.5x0.5 degree scale)

    ndvi_se – Pixel southeast of city centroid
    ndvi_sw – Pixel southwest of city centroid
    ndvi_ne – Pixel northeast of city centroid
    ndvi_nw – Pixel northwest of city centroid

# Assignment 4 - Questions
Use the merged data frame from Assignments 1, 2 and 3 for this assignment.

This assignment focuses on data preprocessing and model building. Continue with the datasets loaded in Assignments 1, 2 and 3 (or reload them with the same steps and create the merged data frame). In this assignment you need to use a neural network.

1. Load the data (both features and label data set as before)
2. Preprocess the data - briefly comment if any special preprocessing is adopted to suit Neural Network
3. Optional: Build a Neural Network Multi-Layer Perceptron Regressor model (you can use sklearn neural network MLP Regressor)
4. Optional: Evaluate the model and compare it with the previous three assignments
5. Add a new column called 'above_average' with value 1 or 0: 1 if total_cases > median of total_cases, otherwise 0
6. Build a Neural Network MLP Classifier on the 'above_average' column with an 80/20 train/test split
7. Explain the meaning of Precision, Recall and F1-Score and why these are used to evaluate Classification models (instead of using Accuracy as a metric). Evaluate the classifier using Precision, Recall and F1 score values

Submit the .ipynb and .html files (and, optionally, submission.csv if you built the MLP regressor).


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# Load the features CSV into the first dataframe
df1 = pd.read_csv('dengue_features_train.csv')

In [2]:
df2 = pd.read_csv('dengue_labels_train.csv')

# Merging the two dataframes
df3 = pd.merge(df1, df2, how='outer', on=['city','year','weekofyear'])
df3.head()

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,...,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm,total_cases
0,sj,1990,18,30-04-1990,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,...,73.365714,12.42,14.012857,2.628571,25.442857,6.9,29.4,20.0,16.0,4
1,sj,1990,19,07-05-1990,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,...,77.368571,22.82,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6,5
2,sj,1990,20,14-05-1990,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,...,82.052857,34.54,16.848571,2.3,26.714286,6.485714,32.2,22.8,41.4,4
3,sj,1990,21,21-05-1990,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,...,80.337143,15.36,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0,3
4,sj,1990,22,28-05-1990,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,...,80.46,7.52,17.21,3.014286,28.942857,9.371429,35.0,23.9,5.8,6


In [4]:
df3.city = df3.city.astype('category')
df3.year = df3.year.astype('category')

df3.dtypes

city                                     category
year                                     category
weekofyear                                  int64
week_start_date                            object
ndvi_ne                                   float64
ndvi_nw                                   float64
ndvi_se                                   float64
ndvi_sw                                   float64
precipitation_amt_mm                      float64
reanalysis_air_temp_k                     float64
reanalysis_avg_temp_k                     float64
reanalysis_dew_point_temp_k               float64
reanalysis_max_air_temp_k                 float64
reanalysis_min_air_temp_k                 float64
reanalysis_precip_amt_kg_per_m2           float64
reanalysis_relative_humidity_percent      float64
reanalysis_sat_precip_amt_mm              float64
reanalysis_specific_humidity_g_per_kg     float64
reanalysis_tdtr_k                         float64
station_avg_temp_c                        float64


In [5]:
# Dropping columns 'week_start_date', 'reanalysis_sat_precip_amt_mm' and 'reanalysis_avg_temp_k'
df3= df3.drop(['week_start_date', 'reanalysis_sat_precip_amt_mm', 'reanalysis_avg_temp_k'], axis = 1)
df3.head()

Unnamed: 0,city,year,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,reanalysis_dew_point_temp_k,...,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm,total_cases
0,sj,1990,18,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,292.414286,...,32.0,73.365714,14.012857,2.628571,25.442857,6.9,29.4,20.0,16.0,4
1,sj,1990,19,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,293.951429,...,17.94,77.368571,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6,5
2,sj,1990,20,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,295.434286,...,26.1,82.052857,16.848571,2.3,26.714286,6.485714,32.2,22.8,41.4,4
3,sj,1990,21,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,295.31,...,13.9,80.337143,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0,3
4,sj,1990,22,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,295.821429,...,12.2,80.46,17.21,3.014286,28.942857,9.371429,35.0,23.9,5.8,6


In [6]:
# Handling the NaN values

#Counting NAN values in merged dataframe
null_columns = df3.columns[df3.isna().any()]
print("Number of null values: %d"  %df3[null_columns].isna().sum().sum())

#Using forward fill to remove NAN values
#(DataFrame.ffill is equivalent to fillna(method='ffill'), which is deprecated in recent pandas)
df3.ffill(inplace=True)

#Displaying to verify the absence of NAN values
print("Null values after forward fill: %d"  %df3[null_columns].isna().sum().sum())

Number of null values: 525
Null values after forward fill: 0
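
The forward fill above runs over the concatenated frame, so any gap at the start of the Iquitos rows would be filled from the last San Juan row. A minimal sketch, on a made-up toy frame (all values are hypothetical), of how a per-city fill differs from the plain one:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) laid out like the merged data:
# San Juan rows first, then Iquitos rows.
toy = pd.DataFrame({
    "city": ["sj", "sj", "sj", "iq", "iq", "iq"],
    "ndvi_ne": [0.10, np.nan, 0.30, np.nan, 0.50, np.nan],
})

# Plain forward fill: the first iq row inherits the last sj value (0.30).
plain = toy["ndvi_ne"].ffill()

# Group-wise forward fill: values never cross the city boundary, so the
# first iq row stays NaN and needs a separate decision (e.g. a back fill).
per_city = toy.groupby("city")["ndvi_ne"].ffill()

print(pd.DataFrame({"city": toy["city"], "plain": plain, "per_city": per_city}))
```

Whether this matters for the real data depends on where its gaps fall; the sketch only shows the mechanism.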


In [7]:
# Abbreviating the column names

df3.columns = df3.columns.str.replace('station','stn')
df3.columns = df3.columns.str.replace('reanalysis','re_an')
df3.columns = df3.columns.str.replace('humidity','hd')
df3.columns = df3.columns.str.replace('precipitation','precip')

In [8]:
#Importing library to encode categorical data
from sklearn.preprocessing import LabelEncoder 
# creating instance of label-encoder
le = LabelEncoder()

In [9]:
le_df = df3[['city', 'year']].apply(le.fit_transform)
le_df

Unnamed: 0,city,year
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
5,1,0
6,1,0
7,1,0
8,1,0
9,1,0


In [10]:
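# Dataframe of only numerical columns
# Note: total_cases is still part of num_df, so it is scaled and carried into X below.
# The classifier label is derived from total_cases, so keeping it as a feature leaks the label.
num_df = df3.drop(['city','year'], axis=1)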

In [11]:
#Importing library to standardize numerical features
#StandardScaler standardizes each column (zero mean, unit variance) rather than normalizing it to a fixed range;
#I am using it as I do not know what kind of distribution these columns of varying scales have
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [12]:
scaled_df = pd.DataFrame(scaler.fit_transform(num_df), columns = num_df.columns)
scaled_df

Unnamed: 0,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precip_amt_mm,re_an_air_temp_k,re_an_dew_point_temp_k,re_an_max_air_temp_k,re_an_min_air_temp_k,re_an_precip_amt_kg_per_m2,re_an_relative_hd_percent,re_an_specific_hd_g_per_kg,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_precip_mm,total_cases
0,-0.566356,-0.062619,-0.203670,-0.055424,-0.291644,-0.760139,-0.826384,-1.854064,-1.119924,0.070466,-0.187274,-1.230763,-1.772476,-0.641063,-1.356941,-0.558712,-1.552648,-1.338708,-0.489876,-0.474407
1,-0.499753,0.278949,0.118034,-0.541096,-0.556484,-0.522502,-0.357358,-0.847423,-0.779569,0.265892,-0.511878,-0.671592,-0.890359,-0.713680,-0.363964,-0.807790,-0.379236,0.063840,-0.646168,-0.451461
2,-0.433150,-0.715063,0.375662,-0.610427,-0.372705,-0.254704,0.061303,0.123667,-0.903335,0.617658,-0.323488,-0.017231,0.066811,-0.733851,-0.363964,-0.753935,-0.124146,0.446353,0.046589,-0.474407
3,-0.366547,-0.019050,0.978910,0.335437,0.405659,-0.692961,0.212399,0.042275,-0.624862,0.500402,-0.605150,-0.256904,-0.047160,-0.697543,0.227360,-0.619298,0.437051,0.765113,-0.743323,-0.497353
4,-0.299943,0.468869,1.122261,0.653287,0.542732,-0.872102,0.602730,0.377199,-0.470156,0.695828,-0.644398,-0.239741,0.301239,-0.532139,1.376536,0.605899,1.304356,1.147626,-0.705306,-0.428515
5,-0.233340,0.468869,0.391419,0.695155,-0.242265,-0.825032,0.684574,0.396845,-0.315449,0.930338,-0.314484,-0.319167,0.303093,-0.790330,0.729427,-0.538516,0.998248,1.147626,-0.001989,-0.520298
6,-0.166737,-0.132665,-0.295078,0.033145,0.099134,-0.964415,0.373988,0.406200,-0.655804,0.773998,-0.034899,-0.024614,0.316992,-0.806467,0.182732,-0.619298,-0.124146,0.765113,-0.200523,-0.474407
7,-0.100134,-0.424406,-0.464924,-0.687441,-0.825227,2.409112,0.656243,0.842162,-0.872393,1.047594,-0.233448,0.167563,0.798820,-0.939597,0.930254,-0.188460,0.743159,0.446353,-0.382160,-0.451461
8,-0.033530,-0.208128,0.151501,-1.035635,-0.938059,-0.602476,0.646800,0.742059,-0.408273,0.773998,-0.060064,0.082750,0.677436,-0.850844,0.896783,-0.329829,0.743159,0.446353,-0.382160,-0.336732
9,0.033073,-0.208128,-0.054532,-0.563598,0.006932,-0.714668,1.069658,0.920747,-0.346390,1.164849,-0.270387,-0.124994,0.859975,-0.814535,0.896783,-0.740472,0.743159,1.466387,-0.804573,-0.428515


In [13]:
# Concatenating the label-encoded columns (city, year) with the scaled numerical columns

X = pd.concat([le_df, scaled_df], axis=1, join='outer')
X

Unnamed: 0,city,year,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precip_amt_mm,re_an_air_temp_k,re_an_dew_point_temp_k,...,re_an_precip_amt_kg_per_m2,re_an_relative_hd_percent,re_an_specific_hd_g_per_kg,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_precip_mm,total_cases
0,1,0,-0.566356,-0.062619,-0.203670,-0.055424,-0.291644,-0.760139,-0.826384,-1.854064,...,-0.187274,-1.230763,-1.772476,-0.641063,-1.356941,-0.558712,-1.552648,-1.338708,-0.489876,-0.474407
1,1,0,-0.499753,0.278949,0.118034,-0.541096,-0.556484,-0.522502,-0.357358,-0.847423,...,-0.511878,-0.671592,-0.890359,-0.713680,-0.363964,-0.807790,-0.379236,0.063840,-0.646168,-0.451461
2,1,0,-0.433150,-0.715063,0.375662,-0.610427,-0.372705,-0.254704,0.061303,0.123667,...,-0.323488,-0.017231,0.066811,-0.733851,-0.363964,-0.753935,-0.124146,0.446353,0.046589,-0.474407
3,1,0,-0.366547,-0.019050,0.978910,0.335437,0.405659,-0.692961,0.212399,0.042275,...,-0.605150,-0.256904,-0.047160,-0.697543,0.227360,-0.619298,0.437051,0.765113,-0.743323,-0.497353
4,1,0,-0.299943,0.468869,1.122261,0.653287,0.542732,-0.872102,0.602730,0.377199,...,-0.644398,-0.239741,0.301239,-0.532139,1.376536,0.605899,1.304356,1.147626,-0.705306,-0.428515
5,1,0,-0.233340,0.468869,0.391419,0.695155,-0.242265,-0.825032,0.684574,0.396845,...,-0.314484,-0.319167,0.303093,-0.790330,0.729427,-0.538516,0.998248,1.147626,-0.001989,-0.520298
6,1,0,-0.166737,-0.132665,-0.295078,0.033145,0.099134,-0.964415,0.373988,0.406200,...,-0.034899,-0.024614,0.316992,-0.806467,0.182732,-0.619298,-0.124146,0.765113,-0.200523,-0.474407
7,1,0,-0.100134,-0.424406,-0.464924,-0.687441,-0.825227,2.409112,0.656243,0.842162,...,-0.233448,0.167563,0.798820,-0.939597,0.930254,-0.188460,0.743159,0.446353,-0.382160,-0.451461
8,1,0,-0.033530,-0.208128,0.151501,-1.035635,-0.938059,-0.602476,0.646800,0.742059,...,-0.060064,0.082750,0.677436,-0.850844,0.896783,-0.329829,0.743159,0.446353,-0.382160,-0.336732
9,1,0,0.033073,-0.208128,-0.054532,-0.563598,0.006932,-0.714668,1.069658,0.920747,...,-0.270387,-0.124994,0.859975,-0.814535,0.896783,-0.740472,0.743159,1.466387,-0.804573,-0.428515
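
The X shown above still contains the scaled total_cases column, which the 'above_average' label of question 5 is derived from, and the scaler was fitted on all rows before any train/test split. A minimal sketch, on synthetic data with hypothetical values, of one way to keep the label source out of the features and fit the scaler on the training split only. The pipeline, the two feature columns, and the small hidden layer are illustrative assumptions, not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the merged frame (hypothetical values, not DengAI data).
demo = pd.DataFrame({
    "stn_avg_temp_c": rng.normal(27, 2, 200),
    "precip_amt_mm": rng.gamma(2.0, 20.0, 200),
    "total_cases": rng.poisson(15, 200),
})
demo["above_average"] = (demo["total_cases"] > demo["total_cases"].median()).astype(int)

# Drop both the label and the column it was derived from.
features = demo.drop(columns=["total_cases", "above_average"])
labels = demo["above_average"]

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=15
)

# The pipeline fits the scaler on the training split only, then reuses
# those statistics when transforming the test split.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
model.fit(X_tr, y_tr)

print("Feature columns:", list(features.columns))
print("Test accuracy: %.3f" % model.score(X_te, y_te))
```

Because the synthetic features are random noise, the accuracy printed here carries no meaning; the point is only the order of operations.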


In [14]:
int(df3['total_cases'].median())

12

In [15]:
df3['above_average'] = np.where((df3['total_cases']<=int(df3['total_cases'].median())), 0, 1)

In [16]:
# Putting the target value in a separate dataframe

y = df3[['above_average']]
y

Unnamed: 0,above_average
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


In [17]:
# preprocessing
from sklearn.model_selection import train_test_split
# Train test split - 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

In [18]:
from sklearn.neural_network import MLPClassifier
# One hidden layer of 200 units; hidden_layer_sizes expects a tuple, hence the trailing comma
cnn = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(200,), random_state=0, max_iter=500)

In [19]:
import time
start = time.time()

cnn.fit(X_train, np.array(y_train)[:,0])

end = time.time()
print("Time taken: %f" % (end-start))

Time taken: 4.413988


In [21]:
# Evaluating on the test split
from sklearn.metrics import accuracy_score

y_pred = cnn.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: %f" % acc)

Accuracy: 0.989726


In [22]:
from sklearn.metrics import f1_score
print('F1 score: ', f1_score(y_test, y_pred))
print('With Average as None: ', f1_score(y_test, y_pred, average=None))
print('With Average as Macro: ', f1_score(y_test, y_pred, average='macro'))
print('With Average as Micro: ', f1_score(y_test, y_pred, average='micro'))
print('With Average as Weighted: ', f1_score(y_test, y_pred, average='weighted'))


F1 score:  0.990228013029316
With Average as None:  [0.98916968 0.99022801]
With Average as Macro:  0.9896988440597843
With Average as Micro:  0.9897260273972602
With Average as Weighted:  0.9897278396197586


In [23]:
import sklearn.metrics as metrics
# classification_report lists classes in sorted label order (0 first, then 1)
target = ['Below average 0', 'Above average 1']
print(metrics.classification_report(y_test, y_pred, target_names=target))

                 precision    recall  f1-score   support

Below average 0       0.99      0.99      0.99       138
Above average 1       0.99      0.99      0.99       154

       accuracy                           0.99       292
      macro avg       0.99      0.99      0.99       292
   weighted avg       0.99      0.99      0.99       292
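
classification_report and confusion_matrix both order classes by sorted label value, so any display names have to list the class-0 name first. A small sketch on made-up labels (not the notebook's predictions) showing that ordering and how the four confusion-matrix counts are read off:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, both in the order
# given by `labels`: [[TN, FP], [FN, TP]] for labels 0 and 1.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))

# target_names must follow the same order as `labels`.
print(classification_report(
    y_true_demo, y_pred_demo,
    labels=[0, 1],
    target_names=["Below average 0", "Above average 1"],
))
```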



Precision measures how precise the model's positive predictions are: out of all the cases predicted positive, how many are actually positive.
Precision is a good measure to use when the cost of a False Positive is high.

Precision = True Positive / (True Positive + False Positive)

Recall measures how many of the actual positives the model captures by labeling them as positive (True Positive).
Recall is the metric to use for selecting the best model when there is a high cost associated with a False Negative.

Recall = True Positive / (True Positive + False Negative)

F1 is the harmonic mean of Precision and Recall. The F1 Score is needed when we want to seek a balance between Precision and Recall.

F1 = 2 * Precision * Recall / (Precision + Recall)

When dealing with classification problems like this one, we are attempting to predict a binary outcome: is it fraud or not? Will this person default on their loan or not? Accuracy only reports the overall ratio of correct predictions, which can look high even when one class is mostly misclassified, for example when the classes are imbalanced. So, in addition to this overall ratio, we care about the number of predictions that were falsely classified as positive and falsely classified as negative, especially given the context of what we are trying to predict.
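
To make the three formulas concrete, here is a small worked sketch in plain Python. The counts are hypothetical and deliberately imbalanced (they are not taken from the DengAI results), and precision_recall_f1 is a helper name introduced only for this illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical imbalanced test set: 1000 cases, only 50 actual positives.
# The model finds 10 of them, misses 40, and raises 5 false alarms.
tp, fn, fp = 10, 40, 5
tn = 1000 - tp - fn - fp  # 945 true negatives

accuracy = (tp + tn) / 1000
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print("accuracy : %.3f" % accuracy)   # 0.955
print("precision: %.3f" % precision)  # 0.667
print("recall   : %.3f" % recall)     # 0.200
print("f1       : %.3f" % f1)         # 0.308
```

In this made-up case accuracy is 95.5% even though the model misses 80% of the actual positives, which recall and F1 expose.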