# Problem Statement

According to the Centers for Disease Control and Prevention (CDC), West Nile Virus (WNV) is a viral infection transmitted to humans through the bite of an infected mosquito. Mosquitoes become infected when they feed on infected birds. About 1 in 5 people who become infected develop a fever along with other symptoms such as body aches, rashes, and diarrhea. About 1 in 150 develop a serious illness that affects the central nervous system.

Our project aims to address the following problem statements:

1. Predict the presence of West Nile Virus given weather, location, testing, and spray data.
2. Identify the main predictors of West Nile Virus presence.
3. Recommend best practices for reducing human contraction of the virus.

# Background

In 2015, Kaggle launched a competition with $40,000 in prize money to predict West Nile Virus in mosquitoes across the city of Chicago. Although the competition has ended, we are building our own model on the same problem set as part of our Machine Learning curriculum with General Assembly.

# Data Dictionary

| Feature Name | Data Type | Description |
| ----| ---- | ---- |
| date | datetime64 | YYYY-MM-DD |
| nummosquitos | int | number of mosquitos caught in the trap |
| tavg | float | daily average temperature (F) |
| tmax | float | daily maximum temperature (F)|
| tmin | float | daily minimum temperature (F)|
| preciptotal | float | total precipitation (inches) |
| avgspeed | int | average wind speed (mph) |
| resultspeed | int | resultant wind speed taking into account direction (mph) |
| resultdir | int | direction of wind (deg) |
| dewpoint | float | dewpoint temperature (F)|
| latitude | float | latitude from Geocoder|
| longitude | float | longitude from Geocoder|
| stnpressure | float | average daily station pressure|
| sealevel | float | average daily sealevel pressure|
| daylightmins | int | total number of minutes of daylight|
| year | int | year|
| month | int | month|
| day | int | day|
| wnvpresent | int | dummified value representing presence of WNV.  1 = Present, 0 = Not Present|
| species_X | int | dummified value representing presence of species_X in trap. 1 = Present, 0 = Not Present|
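
The `species_X` columns in the table above are indicator columns, one per mosquito species. The notebook section shown here does not include the encoding step, so the following is only a minimal sketch of one way such columns could be produced, assuming `pd.get_dummies` with a `species` prefix; the toy frame `sample` and the names `dummies` and `encoded` are illustrative and not taken from the project.

```python
import pandas as pd

# Toy frame using species labels that appear in the raw training data.
sample = pd.DataFrame({
    "species": ["CULEX PIPIENS/RESTUANS", "CULEX RESTUANS", "CULEX RESTUANS"],
    "wnvpresent": [0, 0, 0],
})

# One indicator column per species value (assumed approach, not the project's code).
# dtype=int yields 1/0 instead of True/False.
dummies = pd.get_dummies(sample["species"], prefix="species", dtype=int)

# Replace the original text column with the indicator columns.
encoded = pd.concat([sample.drop(columns="species"), dummies], axis=1)
print(encoded.columns.tolist())
```

Each resulting column is named `species_` followed by the original label, which matches the `species_X` pattern in the data dictionary.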

# 01_DataCleaning

#### Data cleaning

The objective of this notebook is to clean the data and engineer features for further exploration and analysis.

1. Initial exploration of data (shape, nulls)
2. Concat data
3. Impute or drop null values
4. Drop unnecessary features
5. Engineer additional features
6. Export data

# 1.1 Train Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import datetime
import seaborn as sns

In [2]:
# Loading datasets

In [3]:
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

In [4]:
#define function to convert column headers to lowercase
def lowercase_cols(columns):
    return [column.lower() for column in columns]
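
As a quick check of the helper above, the sketch below applies `lowercase_cols` to an empty frame that carries a few of the raw column names. The frames `demo` and `alt` are illustrative only. The last two lines show pandas' built-in string accessor, which should give the same result without a custom function.

```python
import pandas as pd


def lowercase_cols(columns):
    # Same helper as in the notebook: lowercase every column label.
    return [column.lower() for column in columns]


# Empty frame with a few of the raw column names (illustrative only).
demo = pd.DataFrame(columns=["Date", "NumMosquitos", "WnvPresent"])
demo.columns = lowercase_cols(demo.columns)
print(list(demo.columns))

# Alternative using the Index string accessor.
alt = pd.DataFrame(columns=["Date", "NumMosquitos", "WnvPresent"])
alt.columns = alt.columns.str.lower()
print(list(alt.columns))
```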

## Train

In [5]:
train.head()

|  | Date | Address | Species | Block | Street | Trap | AddressNumberAndStreet | Latitude | Longitude | AddressAccuracy | NumMosquitos | WnvPresent |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 2007-05-29 | 4100 North Oak Park Avenue, Chicago, IL 60634,... | CULEX PIPIENS/RESTUANS | 41 | N OAK PARK AVE | T002 | 4100 N OAK PARK AVE, Chicago, IL | 41.95469 | -87.800991 | 9 | 1 | 0 |
| 1 | 2007-05-29 | 4100 North Oak Park Avenue, Chicago, IL 60634,... | CULEX RESTUANS | 41 | N OAK PARK AVE | T002 | 4100 N OAK PARK AVE, Chicago, IL | 41.95469 | -87.800991 | 9 | 1 | 0 |
| 2 | 2007-05-29 | 6200 North Mandell Avenue, Chicago, IL 60646, USA | CULEX RESTUANS | 62 | N MANDELL AVE | T007 | 6200 N MANDELL AVE, Chicago, IL | 41.994991 | -87.769279 | 9 | 1 | 0 |
| 3 | 2007-05-29 | 7900 West Foster Avenue, Chicago, IL 60656, USA | CULEX PIPIENS/RESTUANS | 79 | W FOSTER AVE | T015 | 7900 W FOSTER AVE, Chicago, IL | 41.974089 | -87.824812 | 8 | 1 | 0 |
| 4 | 2007-05-29 | 7900 West Foster Avenue, Chicago, IL 60656, USA | CULEX RESTUANS | 79 | W FOSTER AVE | T015 | 7900 W FOSTER AVE, Chicago, IL | 41.974089 | -87.824812 | 8 | 4 | 0 |


In [6]:
#check for null values
train.isnull().sum()

Date                      0
Address                   0
Species                   0
Block                     0
Street                    0
Trap                      0
AddressNumberAndStreet    0
Latitude                  0
Longitude                 0
AddressAccuracy           0
NumMosquitos              0
WnvPresent                0
dtype: int64

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    10506 non-null  object 
 1   Address                 10506 non-null  object 
 2   Species                 10506 non-null  object 
 3   Block                   10506 non-null  int64  
 4   Street                  10506 non-null  object 
 5   Trap                    10506 non-null  object 
 6   AddressNumberAndStreet  10506 non-null  object 
 7   Latitude                10506 non-null  float64
 8   Longitude               10506 non-null  float64
 9   AddressAccuracy         10506 non-null  int64  
 10  NumMosquitos            10506 non-null  int64  
 11  WnvPresent              10506 non-null  int64  
dtypes: float64(2), int64(4), object(6)
memory usage: 985.1+ KB


In [8]:
#change Date from object (string) to datetime

train['Date'] = pd.to_datetime(train['Date'])
test['Date'] = pd.to_datetime(test['Date'])
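
The conversion above matters because `train.info()` reports `Date` as `object`, and the `.dt` accessor used later only works on datetime columns. A minimal sketch of the dtype change follows, using a two-row toy frame named `dates` (illustrative only). The explicit `format` argument is an optional addition, assuming the `YYYY-MM-DD` layout listed in the data dictionary.

```python
import pandas as pd

# Toy frame with dates stored as strings, as in the raw CSV.
dates = pd.DataFrame({"Date": ["2007-05-29", "2007-06-05"]})
dtype_before = dates["Date"].dtype

# Parse the strings; an explicit format makes unexpected layouts fail loudly.
dates["Date"] = pd.to_datetime(dates["Date"], format="%Y-%m-%d")
dtype_after = dates["Date"].dtype

print(dtype_before, "->", dtype_after)
print(dates["Date"].dt.month.tolist())
```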

In [9]:
#drop address columns (location is still captured by Block, Street, Trap, Latitude and Longitude)
train.drop(columns = ['AddressNumberAndStreet','Address','AddressAccuracy'], inplace = True)
test.drop(columns = ['AddressNumberAndStreet','Address','AddressAccuracy'], inplace = True)

In [10]:
train.head()

|  | Date | Species | Block | Street | Trap | Latitude | Longitude | NumMosquitos | WnvPresent |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 2007-05-29 | CULEX PIPIENS/RESTUANS | 41 | N OAK PARK AVE | T002 | 41.95469 | -87.800991 | 1 | 0 |
| 1 | 2007-05-29 | CULEX RESTUANS | 41 | N OAK PARK AVE | T002 | 41.95469 | -87.800991 | 1 | 0 |
| 2 | 2007-05-29 | CULEX RESTUANS | 62 | N MANDELL AVE | T007 | 41.994991 | -87.769279 | 1 | 0 |
| 3 | 2007-05-29 | CULEX PIPIENS/RESTUANS | 79 | W FOSTER AVE | T015 | 41.974089 | -87.824812 | 1 | 0 |
| 4 | 2007-05-29 | CULEX RESTUANS | 79 | W FOSTER AVE | T015 | 41.974089 | -87.824812 | 4 | 0 |


In [11]:
#convert column headers to lowercase
train.columns = lowercase_cols(train.columns)
test.columns = lowercase_cols(test.columns)

In [12]:
train.head()

|  | date | species | block | street | trap | latitude | longitude | nummosquitos | wnvpresent |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 2007-05-29 | CULEX PIPIENS/RESTUANS | 41 | N OAK PARK AVE | T002 | 41.95469 | -87.800991 | 1 | 0 |
| 1 | 2007-05-29 | CULEX RESTUANS | 41 | N OAK PARK AVE | T002 | 41.95469 | -87.800991 | 1 | 0 |
| 2 | 2007-05-29 | CULEX RESTUANS | 62 | N MANDELL AVE | T007 | 41.994991 | -87.769279 | 1 | 0 |
| 3 | 2007-05-29 | CULEX PIPIENS/RESTUANS | 79 | W FOSTER AVE | T015 | 41.974089 | -87.824812 | 1 | 0 |
| 4 | 2007-05-29 | CULEX RESTUANS | 79 | W FOSTER AVE | T015 | 41.974089 | -87.824812 | 4 | 0 |


In [13]:
#create individual columns for day, month, year for test and train dataset
train['year'] = train['date'].dt.year
train['month'] = train['date'].dt.month
train['day'] = train['date'].dt.day

test['year'] = test['date'].dt.year
test['month'] = test['date'].dt.month
test['day'] = test['date'].dt.day

In [14]:
train.head()

|  | date | species | block | street | trap | latitude | longitude | nummosquitos | wnvpresent | year | month | day |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 2007-05-29 | CULEX PIPIENS/RESTUANS | 41 | N OAK PARK AVE | T002 | 41.95469 | -87.800991 | 1 | 0 | 2007 | 5 | 29 |
| 1 | 2007-05-29 | CULEX RESTUANS | 41 | N OAK PARK AVE | T002 | 41.95469 | -87.800991 | 1 | 0 | 2007 | 5 | 29 |
| 2 | 2007-05-29 | CULEX RESTUANS | 62 | N MANDELL AVE | T007 | 41.994991 | -87.769279 | 1 | 0 | 2007 | 5 | 29 |
| 3 | 2007-05-29 | CULEX PIPIENS/RESTUANS | 79 | W FOSTER AVE | T015 | 41.974089 | -87.824812 | 1 | 0 | 2007 | 5 | 29 |
| 4 | 2007-05-29 | CULEX RESTUANS | 79 | W FOSTER AVE | T015 | 41.974089 | -87.824812 | 4 | 0 | 2007 | 5 | 29 |


In [15]:
#export cleaned train and test data
train.to_csv('../data/clean_train.csv',index=False)
test.to_csv('../data/clean_test.csv',index=False)

In [16]:
train.shape

(10506, 12)
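
The notebook applies each cleaning step to `train` and `test` separately. As an optional refactoring sketch, the same steps could be wrapped in one function so both frames are guaranteed to get identical treatment. The helper name `clean_trap_data`, the constant `ADDRESS_COLS`, and the one-row frame `raw` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Columns the notebook drops from both train and test.
ADDRESS_COLS = ["AddressNumberAndStreet", "Address", "AddressAccuracy"]


def clean_trap_data(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the notebook's cleaning steps to one frame."""
    out = df.copy()                                   # leave the input untouched
    out["Date"] = pd.to_datetime(out["Date"])         # object -> datetime
    out = out.drop(columns=ADDRESS_COLS)              # drop address columns
    out.columns = [column.lower() for column in out.columns]
    out["year"] = out["date"].dt.year                 # split date into parts
    out["month"] = out["date"].dt.month
    out["day"] = out["date"].dt.day
    return out


# One-row stand-in for ../data/train.csv, copied from a record shown earlier.
raw = pd.DataFrame({
    "Date": ["2007-05-29"],
    "Address": ["6200 North Mandell Avenue, Chicago, IL 60646, USA"],
    "Species": ["CULEX RESTUANS"],
    "Block": [62],
    "Street": ["N MANDELL AVE"],
    "Trap": ["T007"],
    "AddressNumberAndStreet": ["6200 N MANDELL AVE, Chicago, IL"],
    "Latitude": [41.994991],
    "Longitude": [-87.769279],
    "AddressAccuracy": [9],
    "NumMosquitos": [1],
    "WnvPresent": [0],
})

cleaned = clean_trap_data(raw)
print(cleaned.columns.tolist())
print(cleaned.shape)
```

With the real files, the calls would be `clean_trap_data(train)` and `clean_trap_data(test)`, assuming both contain the three address columns, as the notebook's `drop` calls imply.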