# Capstone Project: Criminal Case Database

### Overall Contents:
- Background
- Data Cleaning
- [Exploratory Data Analysis](#3.-Exploratory-Data-Analysis) **(In this notebook)**
- Modeling 1 Logistic Regression
- Modeling 2 k-Nearest Neighbours
- Modeling 3 Random Forest
- Evaluation
- Conclusion and Recommendation

### Data Dictionary

The dataset covers weather, trap locations, West Nile virus testing and spraying in the City of Chicago. The data were obtained from [Kaggle](https://www.kaggle.com/c/predict-west-nile-virus/data).

The datasets used for this analysis are as follows:

* train_df (2007, 2009, 2011, 2013)
* spray_df (2011 to 2013)
* weather_df (2007 to 2014)
* test_df (2008, 2010, 2012, 2014)

|Feature|Type|Dataset|Description|
|:---|:---|:---|:---|
|**date**|*object*|train_df<br>test_df|The date that the West Nile Virus test is performed|
|**species**|*object*|train_df<br>test_df|The species of mosquitos|
|**trap**|*object*|train_df<br>test_df|Id of the trap|
|**addressnumberandstreet**|*object*|train_df<br>test_df|Approximate address returned from GeoCoder|
|**latitude**|*float64*|train_df<br>test_df|Latitude returned from GeoCoder|
|**longitude**|*float64*|train_df<br>test_df|Longitude returned from GeoCoder|
|**nummosquitos**|*int64*|train_df|Number of mosquitoes caught in this trap|
|**wnvpresent**|*int64*|train_df|Whether West Nile Virus was present in these mosquitos.<br>1 means West Nile Virus is present, and 0 means not present|
|**station**|*int64*|weather_df|Weather stations<br>Station 1 is located at Chicago O'Hare International Airport<br>Station 2 is located at Chicago Midway Intl Arpt|
|**date**|*object*|weather_df|The date the weather information was collected|
|**tmax**|*int64*|weather_df|Maximum temperature (&deg;F)|
|**tmin**|*int64*|weather_df|Minimum temperature (&deg;F)|
|**tavg**|*int32*|weather_df|Average temperature (&deg;F)|
|**dewpoint**|*int64*|weather_df|Average dew point (&deg;F)|
|**wetbulb**|*int32*|weather_df|Average wet bulb (&deg;F)|
|**heat**|*int32*|weather_df|Heating degree days (Base 65&deg;F)|
|**cool**|*int32*|weather_df|Cooling degree days (Base 65&deg;F)|
|**codesum**|*object*|weather_df|Weather phenomena for significant weather types|
|**preciptotal**|*float64*|weather_df|Total precipitation (Inches and Hundredths)|
|**stnpressure**|*float64*|weather_df|Average station pressure (Inches of Hg)|
|**sealevel**|*float64*|weather_df|Average sea level pressure (Inches of Hg)|
|**resultspeed**|*float64*|weather_df|Resultant wind speed (miles per hour)|
|**resultdir**|*int64*|weather_df|Resultant wind direction (tens of whole degrees)|
|**avgspeed**|*float64*|weather_df|Average wind speed|
|**date**|*object*|spray_df|The date of the spray|
|**latitude**|*float64*|spray_df|Latitude of the spray|
|**longitude**|*float64*|spray_df|Longitude of the spray|

## 3. Exploratory Data Analysis

### 3.1 Libraries Import

In [1]:
# Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import clear_output

%config InlineBackend.figure_format = 'retina'
%matplotlib inline 
# Maximum column width and maximum number of rows to display
pd.options.display.max_colwidth = 400
pd.options.display.max_rows = 400

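# Background map of Chicago (OpenStreetMap) used for the location plots
mapdata = np.loadtxt("../assets/mapdata_copyright_openstreetmap_contributors.txt")
# Height-to-width ratio of the map array, used to keep the plot proportions
aspect = mapdata.shape[0] / mapdata.shape[1]
# Bounding box of the map: (min longitude, max longitude, min latitude, max latitude)
lon_lat_box = (-88, -87.5, 41.6, 42.1)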

### 3.2 Data Import

In [2]:
# Import data of train, weather, spray and test from csv
train_df = pd.read_csv('../data/train_df_clean.csv')
test_df = pd.read_csv("../data/test_df_clean.csv")
weather_df = pd.read_csv('../data/weather_df_clean.csv')
spray_df = pd.read_csv('../data/spray_df_clean.csv')

### 3.3 Feature Engineering

### 3.3.1 Change to datetime

In [3]:
train_df.date = pd.to_datetime(train_df.date)
test_df.date = pd.to_datetime(test_df.date)
spray_df.date = pd.to_datetime(spray_df.date)
weather_df.date = pd.to_datetime(weather_df.date)
print(f"train_df dtype is {train_df.dtypes.date}")
print(f"test_df dtype is {test_df.dtypes.date}")
print(f"spray_df dtype is {spray_df.dtypes.date}")
print(f"weather_df dtype is {weather_df.dtypes.date}")

train_df dtype is datetime64[ns]
test_df dtype is datetime64[ns]
spray_df dtype is datetime64[ns]
weather_df dtype is datetime64[ns]
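
Once `date` is stored as `datetime64[ns]`, calendar features can be derived through the `.dt` accessor. The following is a minimal sketch on a small stand-in frame (`demo_df`) rather than the project data; the column names `year`, `month` and `week` are illustrative assumptions and not necessarily the names used elsewhere in this project.

```python
import pandas as pd

# Stand-in for train_df (made-up dates, for illustration only)
demo_df = pd.DataFrame({"date": ["2007-05-29", "2009-07-15", "2013-08-01"]})
demo_df["date"] = pd.to_datetime(demo_df["date"])

# Hypothetical feature names, derived with the .dt accessor
demo_df["year"] = demo_df["date"].dt.year
demo_df["month"] = demo_df["date"].dt.month
demo_df["week"] = demo_df["date"].dt.isocalendar().week.astype(int)

print(demo_df)
```

Features such as month or week make it possible to aggregate mosquito counts by time of year.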


### 3.4 Analysis - Weather with number of mosquitos

#### 3.4.1 Correlations of weather with number of mosquitos

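**Analysis: Based on the heatmap, there does not appear to be much correlation between the weather features and the number of mosquitos. However, scientific research shows that weather does affect mosquito populations, so a weak correlation here should not be read as the absence of a relationship.**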
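
One possible reason for a weak heatmap signal, assuming the heatmap uses the default Pearson coefficient, is that Pearson correlation only measures linear association. The sketch below uses purely synthetic numbers (not the project data) in which counts peak at a mid-range temperature: the Pearson coefficient is essentially zero even though the dependence is strong. The bin edges and labels are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: counts peak at a mid-range temperature
tavg = np.arange(55, 96)                  # 55..95 degrees F, symmetric around 75
nummosquitos = 400 - (tavg - 75) ** 2     # inverted-U relationship
toy_df = pd.DataFrame({"tavg": tavg, "nummosquitos": nummosquitos})

# Pearson correlation is (numerically) zero for this symmetric inverted-U shape
pearson = toy_df["tavg"].corr(toy_df["nummosquitos"])
print(f"Pearson correlation: {pearson:.3f}")

# The relationship becomes visible once temperatures are binned
temp_bins = pd.cut(toy_df["tavg"], bins=[54, 65, 85, 95],
                   labels=["cool", "moderate", "hot"])
binned = toy_df.groupby(temp_bins, observed=True)["nummosquitos"].mean()
print(binned)
```

Binned means or scatter plots per feature are therefore a useful complement to a correlation heatmap.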

#### 3.4.2 Relationships of weather features with number of mosquitos

#### 3.4.2.1 Define functions

#### 3.4.2.2 Weather features and the number of mosquitos in 2007

### 3.5 Analysis - Mosquitos and trap locations in Chicago

#### 3.5.1 Define functions

In [26]:
def plot_barplot(dataframe, x_data, y_data, hue_data, title, x_label, y_label, legend_label, orient, legend_location):
    """Plot a barplot from the given dataframe and labels.
    orient is 'h' or 'v'. legend_location is a matplotlib legend location code (0-10).
    hue_data can be None, in which case no legend is drawn."""
    plt.figure(figsize = (10,7))
    sns.barplot(x = x_data, y = y_data, hue = hue_data, data = dataframe, orient = orient)
    plt.xlabel(x_label, fontsize = 16)
    plt.ylabel(y_label, fontsize = 16)
    plt.xticks(fontsize = 14)
    plt.yticks(fontsize = 14)
    # A legend is only meaningful when the bars are split by hue
    if hue_data is not None:
        plt.legend(title = legend_label, prop = {'size' : 14}, title_fontsize = 14, loc = legend_location)
    plt.title((title + "\n"), weight = 'bold', fontsize = 20)
    plt.show()

In [27]:
def wnv_normalizer(dataframe):
    """
    Input: dataframe with `species`, `wnvpresent` and `nummosquitos`
    Output: dataframe with `nummosquitos` normalized according to species
    """
    # DataFrame.append was removed in pandas 2.0, so work on a copy instead of
    # building the result row by row
    new_dataframe = dataframe.copy().reset_index(drop = True)
    # Total number of mosquitos of each species, aligned to every row
    species_total = new_dataframe.groupby('species')['nummosquitos'].transform('sum')
    # Each row becomes its share of the total for its species
    new_dataframe['nummosquitos'] = new_dataframe['nummosquitos'] / species_total
    return new_dataframe
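
The per-species normalization can be checked on a small frame. This is a minimal sketch with made-up counts; `normalize_by_species` is an illustrative, vectorized stand-in written for this example (using `groupby(...).transform('sum')`), so that the snippet runs on its own.

```python
import pandas as pd

def normalize_by_species(dataframe):
    """Illustrative helper: divide nummosquitos by the total for its species."""
    out = dataframe.copy()
    totals = out.groupby("species")["nummosquitos"].transform("sum")
    out["nummosquitos"] = out["nummosquitos"] / totals
    return out

# Made-up counts, for illustration only
toy_df = pd.DataFrame({
    "species": ["culex pipiens", "culex pipiens", "culex restuans"],
    "wnvpresent": [0, 1, 0],
    "nummosquitos": [30, 10, 5],
})

normalized = normalize_by_species(toy_df)
print(normalized)
```

After normalization, the values within each species sum to 1, which makes the share of WNV-positive mosquitos comparable across species with very different population sizes.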

#### 3.5.2 Mosquitos, their species and presence of West Nile virus from 2007 to 2013

**Summary**
* The majority of mosquitos caught in the traps, both with and without West Nile virus, belong to *Culex pipiens* or *Culex restuans*.
* The high proportion of mosquitos recorded in Chicago in 2007 could be due to the higher number of trap locations that year compared with other years. The number of mosquitos with West Nile virus was high in 2007 and 2013.
* The locations with an exceptionally high number of mosquitos are 1200 s doty avenue (2007) and 1000 w ohare airport (2011 and 2013).
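
Yearly totals and the busiest location per year, as summarized above, can be tabulated with `groupby`. The sketch below runs on a made-up frame: the counts are invented, and the result column names `total_mosquitos` and `wnv_positive_rows` are illustrative.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow the data dictionary
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2007-08-01", "2007-08-01", "2011-07-15", "2013-08-10"]),
    "addressnumberandstreet": ["1200 s doty avenue", "other address (made up)",
                               "1000 w ohare airport", "1000 w ohare airport"],
    "nummosquitos": [500, 20, 300, 250],
    "wnvpresent": [1, 0, 0, 1],
})
toy_df["year"] = toy_df["date"].dt.year

# Total mosquitos and number of WNV-positive rows per year
yearly = toy_df.groupby("year").agg(
    total_mosquitos=("nummosquitos", "sum"),
    wnv_positive_rows=("wnvpresent", "sum"),
)
print(yearly)

# Address with the highest mosquito count in each year
per_location = toy_df.groupby(["year", "addressnumberandstreet"])["nummosquitos"].sum()
top_location = per_location.groupby(level="year").idxmax()
print(top_location)
```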

### 3.6 Analysis - Mosquito locations and spraying in Chicago

#### 3.6.1 Locations of mosquitos with West Nile virus in Chicago from 2007 to 2013

### 3.8. Summary

1. We found a relationship between the number of mosquitos and weather features such as precipitation, relative humidity and temperature, as well as some weather phenomena, for example rain, drizzle and thunderstorms.
2. The mosquito populations peaked between July and August each year. This coincided with the time of year when it became warmer and there was more rain and humidity.
3. Two species of mosquitos in particular had a much higher population than the others: *Culex pipiens* and *Culex restuans*.
4. There was a higher mosquito population in certain locations, but these were not the locations where spraying had previously been carried out.

These will likely be the key features for our model in predicting the probability of WNV presence for given locations and time periods.
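
The mismatch between mosquito hotspots and sprayed areas (point 4) could be quantified as the distance from each trap to the nearest spray point. The sketch below is one possible approach and not the method used in this notebook: it assumes the haversine formula with a mean Earth radius of 6371 km, and all coordinates are made up.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Made-up coordinates, roughly inside the Chicago bounding box
trap_lat = np.array([41.95, 41.70])
trap_lon = np.array([-87.80, -87.60])
spray_lat = np.array([41.95, 41.96])
spray_lon = np.array([-87.80, -87.79])

# Pairwise distances via broadcasting: rows are traps, columns are spray points
dist_km = haversine_km(trap_lat[:, None], trap_lon[:, None],
                       spray_lat[None, :], spray_lon[None, :])
nearest_spray_km = dist_km.min(axis=1)
print(nearest_spray_km)
```

A trap with a large `nearest_spray_km` value and a high mosquito count would be a location that spraying did not cover.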

## Exporting Data

In [38]:
# train_df.to_csv("../data/train_df_model.csv", index = False)
# test_df.to_csv("../data/test_df_model.csv", index = False)
