
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 4: Kaggle Competition: West Nile Virus Prediction


_Group 6: Wilson, Joey, GimPei_

---

## Context and Problem Statement

West Nile virus is most commonly spread to humans through infected mosquitos. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever to serious neurological illnesses that can result in death.

In 2002, the first human cases of West Nile virus were reported in Chicago. By 2004 the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program that is still in effect today. 

Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations.

The intensity and duration of West Nile Virus (WNV) outbreaks vary greatly from year to year, which makes it difficult to predict outbreaks and to contain the spread of WNV.

Given weather, location, testing, and spraying data, the goal of this project is to build a **classification model** to **predict outbreaks of WNV in mosquitos**. This will help the City of Chicago and CDPH allocate resources more efficiently and effectively towards preventing transmission of this potentially deadly virus.


## Executive Summary

### Contents

- [Data Collection](#Data-Collection)
- [Data Cleaning and Exploratory Data Analysis (EDA)](#Data-Cleaning-and-Exploratory-Data-Analysis-(EDA))
- [Preprocessing](#Preprocessing)
- [Modelling](#Modelling)
- [Model Evaluation](#Model-Evaluation)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)
- [Sources](#Sources)


### Data Collection
The datasets were downloaded from Kaggle and uploaded to the **assets** directory of our GitHub repository. From there they are imported for data cleaning, EDA, and modelling.

### Data Cleaning and Exploratory Data Analysis (EDA)
Cleaning involved imputing missing values with the mean, converting `Date` to the `datetime` datatype for time series exploration, and dropping unwanted predictor variables that do not appear in the test dataset.
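The cleaning steps can be sketched with pandas on toy frames. Apart from `Date` and `WnvPresent`, the column names (`Tavg`, `NumMosquitos`) and all values are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

# Toy stand-ins for the train and test sets; columns other than
# `Date` and `WnvPresent` are assumed for illustration.
train = pd.DataFrame({
    "Date": ["2007-05-29", "2007-06-05", "2007-06-12"],
    "Tavg": [74.0, None, 68.0],
    "NumMosquitos": [1, 4, 2],
    "WnvPresent": [0, 0, 1],
})
test = pd.DataFrame({
    "Date": ["2008-06-11", "2008-06-17"],
    "Tavg": [70.0, 72.0],
})

# Convert `Date` from string to datetime for time series exploration.
train["Date"] = pd.to_datetime(train["Date"])
test["Date"] = pd.to_datetime(test["Date"])

# Impute missing numeric values with the column mean.
train["Tavg"] = train["Tavg"].fillna(train["Tavg"].mean())

# Drop predictors that are absent from the test set (the target is kept).
extra_cols = [c for c in train.columns
              if c not in test.columns and c != "WnvPresent"]
train = train.drop(columns=extra_cols)
```

Deriving the columns to drop from the test set's columns, rather than hard-coding them, keeps the train and test feature sets aligned if either file changes.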

EDA was carried out on the train and weather datasets. The train dataset is highly imbalanced: WNV is present in only about **5%** of the observations, and the remaining observations show no WNV infection.

|WnvPresent|Normalized Counts|
|---|---|
|1|0.052446|
|0|0.947554|
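Normalized counts like these can be obtained with `value_counts(normalize=True)`. The sketch below uses a toy target column rather than the real train set, so its proportions differ from the table.

```python
import pandas as pd

# Toy target column with 1 positive in 20 rows (illustrative only).
wnv = pd.Series([0] * 19 + [1], name="WnvPresent")

# normalize=True returns class proportions instead of raw counts.
class_share = wnv.value_counts(normalize=True)
print(class_share)
```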

The probability of WNV infection increases with the number of mosquitos. The number of mosquitos peaks in June and declines as the summer progresses. The number of mosquitos detected also differs from year to year, with the highest count recorded in 2007.

Of the 6 mosquito species detected, only 3 are prevalent, and 2 species (**Culex Pipiens, Culex Restuans**) account for a high number of WNV infections.
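A per-species comparison of this kind can be sketched with a pandas `groupby`. The rows below are made up, the `Species` column name is an assumption, and `OTHER` is a placeholder label rather than a species from the data.

```python
import pandas as pd

# Toy trap records; values are illustrative, not taken from the dataset.
traps = pd.DataFrame({
    "Species": ["CULEX PIPIENS", "CULEX PIPIENS",
                "CULEX RESTUANS", "CULEX RESTUANS",
                "OTHER", "OTHER"],
    "WnvPresent": [1, 0, 1, 0, 0, 0],
})

# Count observations, WNV-positive observations, and the positive rate
# for each species.
species_summary = (
    traps.groupby("Species")["WnvPresent"]
    .agg(observations="count", wnv_positive="sum", wnv_rate="mean")
    .sort_values("wnv_positive", ascending=False)
)
print(species_summary)
```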

Higher temperature, lower rainfall, and lower wind speed tend to coincide with more WNV infections.


### Preprocessing
- Treating the class imbalance by upsampling the minority class (`WnvPresent = 1`) through `bootstrapping` (random sampling with replacement).
- `Feature engineering` on the weather dataset, which includes parsing the date feature into year, month, day, day of year, and work week. We also explore interactions among the weather features.
- Encoding categorical variables as indicator variables using `get_dummies` so that they become numerical predictors and can be included in modelling.
- Preparing the `X features` (predictors/variables) and `y-target` (`WnvPresent`).
- `Train test split` of the data.
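The steps above can be sketched on toy data as follows. The `Species` and `Tavg` columns, all values, and the split parameters are assumptions for illustration, and interaction features are omitted. Note that this sketch splits first and upsamples only the training fold, which differs from the order listed above; the intent is that bootstrapped copies of a row cannot land in both the training and the held-out data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data; `Species` and `Tavg` are illustrative column names.
df = pd.DataFrame({
    "Date": pd.date_range("2007-06-01", periods=20, freq="7D"),
    "Species": ["CULEX PIPIENS", "CULEX RESTUANS"] * 10,
    "Tavg": [60 + i for i in range(20)],
    "WnvPresent": [0] * 16 + [1] * 4,
})

# Parse the date into calendar features.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["DayOfYear"] = df["Date"].dt.dayofyear
df["Week"] = df["Date"].dt.isocalendar().week.astype(int)

# Encode the categorical column as indicator variables.
X = pd.get_dummies(df.drop(columns=["Date", "WnvPresent"]), columns=["Species"])
y = df["WnvPresent"]

# Stratified split keeps the class ratio similar in both folds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Bootstrap the minority class (sampling with replacement) up to the
# majority class size, in the training fold only.
train_fold = X_train.assign(WnvPresent=y_train)
majority = train_fold[train_fold["WnvPresent"] == 0]
minority = train_fold[train_fold["WnvPresent"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])

X_train_bal = train_balanced.drop(columns="WnvPresent")
y_train_bal = train_balanced["WnvPresent"]
```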

### Modelling
This is a supervised binary (2 categories) classification machine learning problem.
Two models were selected:
1. Random Forest
2. XGBoost (eXtreme Gradient Boosting)

**Baseline Accuracy**:

In EDA, we observed that the classes are imbalanced. We still use the majority class as the null model, which gives a naive baseline accuracy of **94.8%**.
We then build a basic model (default settings, without hyperparameter optimization) as our baseline model.
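The null model always predicts the majority class, so its accuracy equals the majority class share. A minimal sketch, using toy label counts chosen to mirror the 94.8% figure (the helper name `baseline_accuracy` is hypothetical):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Toy labels: 948 negatives and 52 positives (illustrative counts).
labels = [0] * 948 + [1] * 52
majority_label, accuracy = baseline_accuracy(labels)
print(majority_label, accuracy)
```

With a baseline this high, accuracy alone says little about how well the positive class is detected, which is why recall and roc_auc are also tracked.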

Model optimization was done using GridSearchCV to identify the optimal hyperparameters, which were then built into the classification models.
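A minimal sketch of a grid search with scikit-learn's `GridSearchCV` and a random forest. The synthetic data, the parameter grid, the `roc_auc` scoring choice, and `cv=3` are all assumptions for illustration; the project's actual grid is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the engineered features.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Hypothetical grid; every combination is scored by cross-validated ROC AUC.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# The best estimator is refit on the whole training fold by default.
best_model = search.best_estimator_
print(search.best_params_, search.best_score_)
```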

### Model Evaluation
Metrics used to evaluate: **accuracy, recall, roc_auc**

**Recall** (sensitivity) is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. Intuitively, recall is the ability of the classifier to find all the positive samples, which in this project means predicting WNV infection (WnvPresent = 1).

Reducing fn is therefore important, as we want to detect WNV infection whenever it is present. A false negative (fn) means the model predicts NO WNV infection when WNV is actually present.
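As a worked example of the recall formula, the sketch below counts tp and fn directly from toy labels and predictions (the values are made up for illustration):

```python
def recall(y_true, y_pred):
    """Recall = tp / (tp + fn) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("recall is undefined without positive samples")
    return tp / (tp + fn)

# Toy case: 4 traps truly carry WNV, and the model misses one of them.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
score = recall(y_true, y_pred)  # tp = 3, fn = 1
print(score)
```

The false positive in the seventh position does not affect recall; only missed positives lower it.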

The **AUC - ROC** curve, on the other hand, is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes, in our case WNV present (1) or NOT present (0). The higher the AUC, the better the model is at distinguishing mosquitos with WNV infection from those without.
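One way to read AUC as separability: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on toy scores; `pairwise_auc` is a hypothetical helper written for illustration, and in practice a library routine such as scikit-learn's `roc_auc_score` would be used instead of this quadratic loop.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities of WNV presence.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)  # 3 of 4 pairs are ranked correctly
print(auc)
```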


### Conclusions and Recommendations
**Results summary and conclusion:**

|Classifier Model|Accuracy|roc_auc|Recall|Kaggle roc_auc score|
|---|---|---|---|---|
|RandomForest w GridSearchCV|81.5%|81.5%|x%|65%|
|RandomForest w feature engineering|98.3%|98.5%|99.7%|70%|
|XGBoost|98.3%|98.1%|x%|76%|

Two of the three models (RandomForest with feature engineering, and XGBoost) outperform the baseline accuracy of 94.8%, each reaching above 98%. RandomForest with GridSearchCV alone, at 81.5%, does not.
The two stronger models also have high roc_auc scores, and RandomForest with feature engineering has a high recall score as well. This means these models are able to **separate the WNV infection from no infection** quite well and, where recall is reported, have relatively **low fn** (low type II error).

Among the models, **XGBoost** has the highest Kaggle score on the unlabeled test dataset. Our team therefore proposes using this model to predict the presence of WNV.

**Recommendations:**


### Sources
1. [West Nile Virus Prediction](https://www.kaggle.com/c/predict-west-nile-virus)