![](https://www.phdmc.org/images/articles/2018/20180813-mosquitoes.jpg)
# Project 4 - Fighting the West Nile Virus in Chicago

# Problem Statement


**Background**  
The West Nile Virus has been raging through the city of Chicago. According to the CDC, the virus is the leading cause of mosquito-borne disease in the United States and is transmitted by the bite of an infected mosquito. The Department of Public Health has made data on mosquito infestation available through its surveillance and control system. This data will help determine where resources can be allocated most effectively.

**Hence, our purpose is to find the most cost-effective way of stopping the spread of the West Nile Virus by mosquitoes.**


# Executive Summary
The West Nile Virus has been spreading across the city of Chicago. The proliferation of the disease can be halted with the tools at our disposal. The team's objective is to use the data shared by the Department of Public Health to derive a plan for spraying pesticides in the areas most at risk, on the premise that spraying will kill off the mosquitoes.  
Hence, it is the duty of the team to propose which areas are most in need.

**Data Science Process**

* Data extraction  
* Data cleaning
* EDA
 * Data visualisation
 * Feature engineering
* Model selection  
 * Develop baseline
 * Data preparation
 * Model evaluation
 * Model selection

* Model optimisation
 * Adjust hyperparameters
 * Re-evaluate models

* Model testing and recommendations
 * Test optimised models
 * Evaluation and conclusion

* Conclusions and Recommendations
 * Cost Benefit Analysis
 * Findings and takeaways
 * Further opportunities

# Contents

* [Import Libraries](#Import-Libraries)
* [Load Datasets](#Load-Datasets)
* [Data Cleaning - Train](#Data-Cleaning---Train)
 * [Convert Date column to datetime](#Convert-Date-column-to-datetime)
 * [Check for duplicates](#Check-for-duplicates)
 * [Create dummies for species column](#Create-dummies-for-species-column)
 * [Dividing the dataset for merging](#Dividing-the-dataset-for-merging)
* [Data Cleaning - Test](#Data-Cleaning---Test)
* [Export Datasets](#Export-Datasets)

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# Load Datasets

In [2]:
#load datasets
train = pd.read_csv('../assets/train.csv')
test = pd.read_csv('../assets/test.csv')

# Data Cleaning - Train

In [3]:
#shape of train data
print(train.shape)

#examine first 5 rows of train data
train.head()

(10506, 12)


Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    10506 non-null  object 
 1   Address                 10506 non-null  object 
 2   Species                 10506 non-null  object 
 3   Block                   10506 non-null  int64  
 4   Street                  10506 non-null  object 
 5   Trap                    10506 non-null  object 
 6   AddressNumberAndStreet  10506 non-null  object 
 7   Latitude                10506 non-null  float64
 8   Longitude               10506 non-null  float64
 9   AddressAccuracy         10506 non-null  int64  
 10  NumMosquitos            10506 non-null  int64  
 11  WnvPresent              10506 non-null  int64  
dtypes: float64(2), int64(4), object(6)
memory usage: 985.1+ KB


### Drop columns that are not required

We will drop the below columns as they are irrelevant to our prediction:
* Address
* Block
* AddressNumberAndStreet
* AddressAccuracy

In [5]:
#drop columns
train.drop(['Address', 'Block', 'AddressNumberAndStreet', 'AddressAccuracy'],axis=1,inplace=True)
train.head()

Unnamed: 0,Date,Species,Street,Trap,Latitude,Longitude,NumMosquitos,WnvPresent
0,2007-05-29,CULEX PIPIENS/RESTUANS,N OAK PARK AVE,T002,41.95469,-87.800991,1,0
1,2007-05-29,CULEX RESTUANS,N OAK PARK AVE,T002,41.95469,-87.800991,1,0
2,2007-05-29,CULEX RESTUANS,N MANDELL AVE,T007,41.994991,-87.769279,1,0
3,2007-05-29,CULEX PIPIENS/RESTUANS,W FOSTER AVE,T015,41.974089,-87.824812,1,0
4,2007-05-29,CULEX RESTUANS,W FOSTER AVE,T015,41.974089,-87.824812,4,0


### Convert Date column to datetime

In [6]:
train['Date'] = pd.to_datetime(train['Date'])
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          10506 non-null  datetime64[ns]
 1   Species       10506 non-null  object        
 2   Street        10506 non-null  object        
 3   Trap          10506 non-null  object        
 4   Latitude      10506 non-null  float64       
 5   Longitude     10506 non-null  float64       
 6   NumMosquitos  10506 non-null  int64         
 7   WnvPresent    10506 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 656.8+ KB
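
As a hedged aside, the conversion can also be given an explicit format so that a malformed date raises an error instead of being parsed by inference. The sketch below uses toy strings in the same `YYYY-MM-DD` layout as the rows shown above. The format string is an assumption based on those rows, not something the dataset documentation confirms.

```python
import pandas as pd

# Toy dates in the same layout as the Date column (illustrative values only)
dates = pd.Series(["2007-05-29", "2008-06-11"])

# An explicit format makes the expected layout part of the code;
# a string that does not match raises a ValueError by default
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Once the column is datetime, the .dt accessor exposes date parts
years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
print(years, months)
```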


### Check for duplicates

In [7]:
#drop all duplicates if any
train.drop_duplicates(inplace = True)

#check shape of df
train.info()
print(train.shape)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9693 entries, 0 to 10505
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          9693 non-null   datetime64[ns]
 1   Species       9693 non-null   object        
 2   Street        9693 non-null   object        
 3   Trap          9693 non-null   object        
 4   Latitude      9693 non-null   float64       
 5   Longitude     9693 non-null   float64       
 6   NumMosquitos  9693 non-null   int64         
 7   WnvPresent    9693 non-null   int64         
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 681.5+ KB
(9693, 8)


A total of 813 duplicates were dropped
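
The figure of 813 is the difference between the row counts before and after the drop (10506 - 9693). A minimal sketch of how the count could be checked before any rows are removed, using a small toy frame rather than the real data (the frame and its values are illustrative assumptions):

```python
import pandas as pd

# Toy frame standing in for the train data (values are illustrative only)
toy = pd.DataFrame({
    "Date": ["2007-05-29", "2007-05-29", "2007-05-29"],
    "Trap": ["T002", "T002", "T007"],
    "Species": ["CULEX RESTUANS", "CULEX RESTUANS", "CULEX RESTUANS"],
    "NumMosquitos": [1, 1, 1],
})

# duplicated() flags every row that repeats an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() without inplace returns a new frame and leaves toy untouched
deduped = toy.drop_duplicates()
rows_removed = len(toy) - len(deduped)

print(n_duplicates, rows_removed)
```

Counting first makes it possible to inspect the flagged rows (for example with `toy[toy.duplicated(keep=False)]`) before deciding whether they should be dropped.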

### Create dummies for species column

In [8]:
#unique values in species column
train['Species'].unique()

array(['CULEX PIPIENS/RESTUANS', 'CULEX RESTUANS', 'CULEX PIPIENS',
       'CULEX SALINARIUS', 'CULEX TERRITANS', 'CULEX TARSALIS',
       'CULEX ERRATICUS'], dtype=object)

In [9]:
species_dummy = pd.get_dummies(train['Species'], drop_first=False)
train = pd.concat([train, species_dummy], axis=1)
train.head()

Unnamed: 0,Date,Species,Street,Trap,Latitude,Longitude,NumMosquitos,WnvPresent,CULEX ERRATICUS,CULEX PIPIENS,CULEX PIPIENS/RESTUANS,CULEX RESTUANS,CULEX SALINARIUS,CULEX TARSALIS,CULEX TERRITANS
0,2007-05-29,CULEX PIPIENS/RESTUANS,N OAK PARK AVE,T002,41.95469,-87.800991,1,0,0,0,1,0,0,0,0
1,2007-05-29,CULEX RESTUANS,N OAK PARK AVE,T002,41.95469,-87.800991,1,0,0,0,0,1,0,0,0
2,2007-05-29,CULEX RESTUANS,N MANDELL AVE,T007,41.994991,-87.769279,1,0,0,0,0,1,0,0,0
3,2007-05-29,CULEX PIPIENS/RESTUANS,W FOSTER AVE,T015,41.974089,-87.824812,1,0,0,0,1,0,0,0,0
4,2007-05-29,CULEX RESTUANS,W FOSTER AVE,T015,41.974089,-87.824812,4,0,0,0,0,1,0,0,0


### Dividing the dataset for merging

We will divide the city of Chicago into 2 separate areas, using the latitudes (north/south) of the stations as reference:   
* Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT Lat: 41.995 Lon: -87.933  
* Station 2: CHICAGO MIDWAY INTL ARPT Lat: 41.786 Lon: -87.752  

By finding the midpoint of the latitudes of the 2 stations, the **top half** of the city will be labelled as **Station 1 (O'Hare Airport)**, and the **bottom half** of the city will be labelled as **Station 2 (Midway Intl Airport)**. 

### Midpoint of the 2 stations

In [10]:
#latitude midpoint
midpoint = (41.995+41.786)/2
midpoint

41.8905

In [11]:
def label_station(i):
    if i >= midpoint:
        return 1
    else:
        return 2

train['Station'] = train['Latitude'].apply(label_station)
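
The same labelling rule can also be written as a single vectorised expression. The sketch below is an alternative, not the notebook's method: it uses `np.where` on a toy frame with two latitudes taken from the rows shown earlier and one hypothetical southern latitude added for illustration.

```python
import numpy as np
import pandas as pd

# Station latitudes quoted in the text above
OHARE_LAT = 41.995   # Station 1
MIDWAY_LAT = 41.786  # Station 2
midpoint = (OHARE_LAT + MIDWAY_LAT) / 2

# Toy latitudes: the first two appear in train.head(); 41.70 is a hypothetical southern trap
toy = pd.DataFrame({"Latitude": [41.95469, 41.994991, 41.70]})

# At or above the midpoint -> Station 1, below it -> Station 2
toy["Station"] = np.where(toy["Latitude"] >= midpoint, 1, 2)
print(toy["Station"].tolist())
```

On a frame of this size the two approaches behave the same. The vectorised form avoids a Python-level function call per row, which may matter more on the larger test set.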

In [12]:
train.head()

Unnamed: 0,Date,Species,Street,Trap,Latitude,Longitude,NumMosquitos,WnvPresent,CULEX ERRATICUS,CULEX PIPIENS,CULEX PIPIENS/RESTUANS,CULEX RESTUANS,CULEX SALINARIUS,CULEX TARSALIS,CULEX TERRITANS,Station
0,2007-05-29,CULEX PIPIENS/RESTUANS,N OAK PARK AVE,T002,41.95469,-87.800991,1,0,0,0,1,0,0,0,0,1
1,2007-05-29,CULEX RESTUANS,N OAK PARK AVE,T002,41.95469,-87.800991,1,0,0,0,0,1,0,0,0,1
2,2007-05-29,CULEX RESTUANS,N MANDELL AVE,T007,41.994991,-87.769279,1,0,0,0,0,1,0,0,0,1
3,2007-05-29,CULEX PIPIENS/RESTUANS,W FOSTER AVE,T015,41.974089,-87.824812,1,0,0,0,1,0,0,0,0,1
4,2007-05-29,CULEX RESTUANS,W FOSTER AVE,T015,41.974089,-87.824812,4,0,0,0,0,1,0,0,0,1


# Data Cleaning - Test

In [13]:
#shape of test data
print(test.shape)

#examine first 5 rows of test data
test.head()

(116293, 11)


Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
3,4,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX SALINARIUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
4,5,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TERRITANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9


In [14]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Id                      116293 non-null  int64  
 1   Date                    116293 non-null  object 
 2   Address                 116293 non-null  object 
 3   Species                 116293 non-null  object 
 4   Block                   116293 non-null  int64  
 5   Street                  116293 non-null  object 
 6   Trap                    116293 non-null  object 
 7   AddressNumberAndStreet  116293 non-null  object 
 8   Latitude                116293 non-null  float64
 9   Longitude               116293 non-null  float64
 10  AddressAccuracy         116293 non-null  int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 9.8+ MB


### Drop columns that are not required

We will drop the below columns as they are irrelevant to our prediction:
* Id
* Address
* Block
* AddressNumberAndStreet
* AddressAccuracy

In [15]:
#drop columns
test.drop(['Id', 'Address', 'Block', 'AddressNumberAndStreet', 'AddressAccuracy'],axis=1,inplace=True)
test.head()

Unnamed: 0,Date,Species,Street,Trap,Latitude,Longitude
0,2008-06-11,CULEX PIPIENS/RESTUANS,N OAK PARK AVE,T002,41.95469,-87.800991
1,2008-06-11,CULEX RESTUANS,N OAK PARK AVE,T002,41.95469,-87.800991
2,2008-06-11,CULEX PIPIENS,N OAK PARK AVE,T002,41.95469,-87.800991
3,2008-06-11,CULEX SALINARIUS,N OAK PARK AVE,T002,41.95469,-87.800991
4,2008-06-11,CULEX TERRITANS,N OAK PARK AVE,T002,41.95469,-87.800991


### Convert Date column to datetime

In [16]:
test['Date'] = pd.to_datetime(test['Date'])
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Date       116293 non-null  datetime64[ns]
 1   Species    116293 non-null  object        
 2   Street     116293 non-null  object        
 3   Trap       116293 non-null  object        
 4   Latitude   116293 non-null  float64       
 5   Longitude  116293 non-null  float64       
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 5.3+ MB


### Create dummies for species column

In [17]:
#unique values in species column
test['Species'].unique()

array(['CULEX PIPIENS/RESTUANS', 'CULEX RESTUANS', 'CULEX PIPIENS',
       'CULEX SALINARIUS', 'CULEX TERRITANS', 'CULEX TARSALIS',
       'UNSPECIFIED CULEX', 'CULEX ERRATICUS'], dtype=object)

In [18]:
test_species_dummy = pd.get_dummies(test['Species'], drop_first=False)
test = pd.concat([test, test_species_dummy], axis=1)
test.drop("Species", axis=1, inplace=True)
test.head()

Unnamed: 0,Date,Street,Trap,Latitude,Longitude,CULEX ERRATICUS,CULEX PIPIENS,CULEX PIPIENS/RESTUANS,CULEX RESTUANS,CULEX SALINARIUS,CULEX TARSALIS,CULEX TERRITANS,UNSPECIFIED CULEX
0,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,0,1,0,0,0,0,0
1,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,0,0,1,0,0,0,0
2,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,1,0,0,0,0,0,0
3,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,0,0,0,1,0,0,0
4,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,0,0,0,0,0,1,0
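
Note that the test set contains an `UNSPECIFIED CULEX` category that does not appear in train, so the two dummy-encoded frames end up with different columns. One possible way to line them up, sketched on toy frames, is to reindex the test dummies to the train columns. Whether to drop the extra category or keep it is a modelling decision that this notebook does not settle.

```python
import pandas as pd

# Toy frames (illustrative): test has a category that train never saw
train_toy = pd.DataFrame({"Species": ["CULEX PIPIENS", "CULEX RESTUANS"]})
test_toy = pd.DataFrame({"Species": ["CULEX PIPIENS", "UNSPECIFIED CULEX"]})

train_dummies = pd.get_dummies(train_toy["Species"])
test_dummies = pd.get_dummies(test_toy["Species"])

# Reindex to train's columns: categories missing from test are filled with 0,
# categories absent from train (here UNSPECIFIED CULEX) are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```

With this choice, a row whose species was `UNSPECIFIED CULEX` is encoded as all zeros across the species columns.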


### Midpoint of the 2 stations

In [19]:
test['Station'] = test['Latitude'].apply(label_station)
test.head()

Unnamed: 0,Date,Street,Trap,Latitude,Longitude,CULEX ERRATICUS,CULEX PIPIENS,CULEX PIPIENS/RESTUANS,CULEX RESTUANS,CULEX SALINARIUS,CULEX TARSALIS,CULEX TERRITANS,UNSPECIFIED CULEX,Station
0,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,0,1,0,0,0,0,0,1
1,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,0,0,1,0,0,0,0,1
2,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,1,0,0,0,0,0,0,1
3,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,0,0,0,1,0,0,0,1
4,2008-06-11,N OAK PARK AVE,T002,41.95469,-87.800991,0,0,0,0,0,0,1,0,1


In [20]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 14 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   Date                    116293 non-null  datetime64[ns]
 1   Street                  116293 non-null  object        
 2   Trap                    116293 non-null  object        
 3   Latitude                116293 non-null  float64       
 4   Longitude               116293 non-null  float64       
 5   CULEX ERRATICUS         116293 non-null  uint8         
 6   CULEX PIPIENS           116293 non-null  uint8         
 7   CULEX PIPIENS/RESTUANS  116293 non-null  uint8         
 8   CULEX RESTUANS          116293 non-null  uint8         
 9   CULEX SALINARIUS        116293 non-null  uint8         
 10  CULEX TARSALIS          116293 non-null  uint8         
 11  CULEX TERRITANS         116293 non-null  uint8         
 12  UNSPECIFIED CULEX       116293

# Export Datasets

In [21]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9693 entries, 0 to 10505
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Date                    9693 non-null   datetime64[ns]
 1   Species                 9693 non-null   object        
 2   Street                  9693 non-null   object        
 3   Trap                    9693 non-null   object        
 4   Latitude                9693 non-null   float64       
 5   Longitude               9693 non-null   float64       
 6   NumMosquitos            9693 non-null   int64         
 7   WnvPresent              9693 non-null   int64         
 8   CULEX ERRATICUS         9693 non-null   uint8         
 9   CULEX PIPIENS           9693 non-null   uint8         
 10  CULEX PIPIENS/RESTUANS  9693 non-null   uint8         
 11  CULEX RESTUANS          9693 non-null   uint8         
 12  CULEX SALINARIUS        9693 non-null   uint8  

In [22]:
#export clean train dataset
train.to_csv("../dataset/train_clean.csv", index = False)

In [23]:
#export clean test dataset
test.to_csv("../dataset/test_clean.csv", index = False)
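
CSV files do not store dtypes, so the `Date` column will not come back as a datetime when the cleaned files are reloaded unless it is parsed again. A minimal sketch of the round trip, using an in-memory buffer and a toy frame in place of the real files:

```python
import io

import pandas as pd

# Toy frame with a datetime column (illustrative values only)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2008-06-11"]),
    "Station": [1, 2],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain reload: Date comes back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload with parse_dates: Date is converted back to datetime
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain["Date"].dtype, parsed["Date"].dtype)
```

In a later notebook the same `parse_dates=["Date"]` argument could be passed when reading `train_clean.csv` and `test_clean.csv`.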