In [2]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import seaborn as sns


# Statistics & Public Health
West Nile Virus (WNV) is a viral illness spread largely by mosquitoes. A person becomes infected when bitten by an infected mosquito.
The city of Chicago, Illinois, tracks mosquito populations and WNV prevalence using a series of traps placed around the city.
The captured specimens are then tested, which lets the city monitor the state of WNV spread.
#### Dataset
Mosquito tracking data from 2007 to 2019.
#### Data Dictionary
- Year (int64): Year in which the WNV test was performed
- Week (int64): Week in which the WNV test was performed
- Address Block (string): Block-level address of the trap location
- Block (int64): Block number of the address
- Trap (string): ID of the trap. Some traps are "satellite traps", set up near (usually within 6 blocks of) an established trap to enhance surveillance efforts. Satellite trap IDs carry a letter suffix. For example, T220A is a satellite trap to T220.
- Trap type (string): Type of trap
- Date (string): Date and time at which the WNV test was performed. Not all locations are tested at all times, and a record exists only when a particular mosquito species is found at a given trap at a given time.
- Mosquito number (int64): Number of mosquitoes caught in the trap. When the number of mosquitoes exceeds 50, the result is split into additional records (rows), so that each row is capped at 50.
- Mosquito ID (string): ID of the mosquito species
- WNV Present (string): Whether West Nile Virus was present in these mosquitoes
- Species (string): Mosquito species
- Lat (float64): Latitude of the trap
- Lon (float64): Longitude of the trap
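
Because counts above 50 are split across several rows, a single row's mosquito count is not necessarily the full catch for a trap visit. The sketch below shows one way the split rows could be summed back together. The toy rows are hypothetical (not taken from the dataset), and it assumes that trap, date, and species together identify one collection event, which the data dictionary does not guarantee.

```python
import pandas as pd

# Hypothetical toy rows: 120 mosquitoes of one species from one trap visit
# appear as three rows, each capped at 50.
toy = pd.DataFrame({
    "Trap": ["T001", "T001", "T001", "T002"],
    "Date": ["2019-08-01", "2019-08-01", "2019-08-01", "2019-08-01"],
    "Species": ["CULEX PIPIENS"] * 3 + ["CULEX RESTUANS"],
    "Mosquito number": [50, 50, 20, 7],
})

# Assumption: (Trap, Date, Species) identifies a single collection event.
# Sum the split rows back into one total per event.
totals = (
    toy.groupby(["Trap", "Date", "Species"], as_index=False)["Mosquito number"]
       .sum()
)
print(totals)
```

Note that the WNV test result may differ between the split rows of one event, so it is left out of the grouping here and would need separate handling.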

#### Importing data

In [8]:
df = pd.read_csv("mosquito_data.csv")
df.sample(10)

Unnamed: 0,Year,Week,Address Block,Block,Trap,Trap type,Date,Mosquito number,Mosquito ID,WNV Present,Species,Lat,Lon
4318,2015,34,100XX W OHARE AIRPORT,100,T912,GRAVID,2015-08-27 00:08:00,3,Pip,negative,CULEX PIPIENS,,
17010,2012,30,70XX N MOSELLE AVE,70,T008,GRAVID,2012-07-27 00:07:00,32,Res,positive,CULEX RESTUANS,42.008001,-87.778234
2717,2017,29,12XX W GREENLEAF AVE,12,T018,GRAVID,2017-07-20 00:07:00,1,Ter,negative,CULEX TERRITANS,42.010529,-87.660845
14509,2007,32,122XX S STONY ISLAND AVE,122,T104,GRAVID,2007-08-15 00:08:00,3,Pip,negative,CULEX PIPIENS,41.672376,-87.575477
3607,2016,32,77XX S EBERHART AVE,77,T080,GRAVID,2016-08-11 00:08:00,19,Res,negative,CULEX RESTUANS,41.754148,-87.612721
12292,2009,27,131XX S TORRENCE AVE,131,T200,GRAVID,2009-07-13 00:07:00,2,Pip,negative,CULEX PIPIENS,41.656677,-87.559441
2725,2017,29,17XX W 95TH ST,17,T094,GRAVID,2017-07-20 00:07:00,9,Res,negative,CULEX RESTUANS,41.721288,-87.665236
15010,2007,30,42XX N RICHMOND ST,42,T146,GRAVID,2007-08-01 02:08:26,3,Res,negative,CULEX RESTUANS,41.958006,-87.702181
8424,2012,33,50XX S UNION AVE,50,T082,GRAVID,2012-08-17 00:08:00,1,Res,negative,CULEX RESTUANS,41.802359,-87.643076
2319,2017,36,100XX W OHARE,100,T904,GRAVID,2017-09-08 00:09:00,4,Res,negative,CULEX RESTUANS,,


#### Exploring the initial information about the Data Set

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18495 entries, 0 to 18494
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Year             18495 non-null  int64  
 1   Week             18495 non-null  int64  
 2   Address Block    18495 non-null  object 
 3   Block            18495 non-null  int64  
 4   Trap             18495 non-null  object 
 5   Trap type        18495 non-null  object 
 6   Date             18495 non-null  object 
 7   Mosquito number  18495 non-null  int64  
 8   Mosquito ID      18495 non-null  object 
 9   WNV Present      18495 non-null  object 
 10  Species          18495 non-null  object 
 11  Lat              15571 non-null  float64
 12  Lon              15571 non-null  float64
dtypes: float64(2), int64(4), object(7)
memory usage: 1.8+ MB


In [7]:
df.describe().round(2)

Unnamed: 0,Year,Week,Block,Mosquito number,Lat,Lon
count,18495.0,18495.0,18495.0,18495.0,15571.0,15571.0
mean,2012.91,31.0,54.31,10.88,41.84,-87.69
std,3.73,4.33,36.71,13.48,0.11,0.08
min,2007.0,20.0,1.0,1.0,41.64,-87.85
25%,2010.0,28.0,22.0,2.0,41.74,-87.75
50%,2013.0,31.0,51.0,5.0,41.85,-87.69
75%,2016.0,34.0,89.0,14.0,41.95,-87.64
max,2019.0,40.0,132.0,50.0,42.02,-87.53


#### Data preparation and cleaning
* **Formatting** - Is our data presented in a way that makes sense? Are the variable types correct?
* **Validity** - Are there any values that seem incorrect or nonsensical?
* **Duplicate or redundant data** - Do we have duplicate rows or columns? Do we have columns that provide redundant information given what is contained in other columns?
* **Missing data** - Are there any rows or columns that have blank, `np.nan`, or otherwise missing data? Should they be dropped or replaced?
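
The four checks above map onto a handful of pandas calls. The following is a minimal sketch on a hypothetical toy frame that mimics a few of the dataset's columns. The values are invented for illustration, and the 1 to 50 validity range is an assumption based on the cap described in the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (invented values) mimicking a few dataset columns.
toy = pd.DataFrame({
    "Date": ["2019-08-01 00:08:00", "2019-08-01 00:08:00", "2019-08-08 00:08:00"],
    "Trap": ["T001", "T001", "T002"],
    "Mosquito number": [3, 3, 12],
    "Lat": [41.9, 41.9, np.nan],
})

# Formatting: parse the date strings into a datetime dtype.
toy["Date"] = pd.to_datetime(toy["Date"])

# Validity: counts are assumed to fall between 1 and the documented cap of 50.
valid_counts = toy["Mosquito number"].between(1, 50).all()

# Duplicates: number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Missing data: number of missing values in each column.
missing_per_column = toy.isna().sum()

print(toy.dtypes)
print(valid_counts, n_duplicates)
print(missing_per_column)
```

An exact duplicate row is not automatically an error in this dataset, since catches above 50 are split into several rows that can look identical, so dropping duplicates should be a deliberate choice.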

### Part 1 - Basic Data Wrangling
- What is the shape of the dataframe?
- Convert the 'Date' column to have a datetime format.
- Pick two numeric and two categorical columns: What data are they storing? How are they distributed?
- Are there any columns that contain duplicate information? If so, remove the redundant columns.
- Are there any null values in the dataframe? If so, deal with them appropriately.