# 01 Cleaning: Train Data

Description: Ensure that the `train` dataset is clean and ready for EDA with the combined dataset, as well as for feature engineering.

In [2]:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
df = pd.read_csv('../data/train.csv', parse_dates=['Date'])
df_t = pd.read_csv('../data/test.csv', parse_dates=['Date'])

In [14]:
df[df['Date'] > '2013-01-21'].to_csv()

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
8114,2013-06-07,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.954690,-87.800991,9,19,0
8115,2013-06-07,"1100 Roosevelt Road, Chicago, IL 60608, USA",CULEX RESTUANS,11,W ROOSEVELT,T048,"1100 W ROOSEVELT, Chicago, IL",41.867108,-87.654224,8,2,0
8116,2013-06-07,"2200 North Cannon Drive, Chicago, IL 60614, USA",CULEX RESTUANS,22,N CANNON DR,T054,"2200 N CANNON DR, Chicago, IL",41.921965,-87.632085,8,1,0
8117,2013-06-07,"1700 West 95th Street, Chicago, IL 60643, USA",CULEX RESTUANS,17,W 95TH ST,T094,"1700 W 95TH ST, Chicago, IL",41.720848,-87.666014,9,4,0
8118,2013-06-07,"8900 South Carpenter Street, Chicago, IL 60620...",CULEX PIPIENS/RESTUANS,89,S CARPENTER ST,T159,"8900 S CARPENTER ST, Chicago, IL",41.732984,-87.649642,8,4,0
8119,2013-06-07,"5800 North Western Avenue, Chicago, IL 60659, USA",CULEX RESTUANS,58,N WESTERN AVE,T028,"5800 N WESTERN AVE, Chicago, IL",41.986921,-87.689778,9,4,0
8120,2013-06-07,"5000 South Central Avenue, Chicago, IL 60638, USA",CULEX RESTUANS,50,S CENTRAL AVE,T031,"5000 S CENTRAL AVE, Chicago, IL",41.801498,-87.763416,9,1,0
8121,2013-06-07,"1400 North Sacramento Avenue, Chicago, IL 6062...",CULEX TERRITANS,14,N HUMBOLDT DR,T033,"1400 N HUMBOLDT DR, Chicago, IL",41.906638,-87.701431,9,1,0
8122,2013-06-07,"South Vincennes Avenue, Chicago, IL, USA",CULEX PIPIENS/RESTUANS,10,S VINCENNES,T089,"1000 S VINCENNES, Chicago, IL",41.723195,-87.649970,5,2,0
8123,2013-06-07,"South Vincennes Avenue, Chicago, IL, USA",CULEX RESTUANS,10,S VINCENNES,T089,"1000 S VINCENNES, Chicago, IL",41.723195,-87.649970,5,1,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 12 columns):
Date                      10506 non-null datetime64[ns]
Address                   10506 non-null object
Species                   10506 non-null object
Block                     10506 non-null int64
Street                    10506 non-null object
Trap                      10506 non-null object
AddressNumberAndStreet    10506 non-null object
Latitude                  10506 non-null float64
Longitude                 10506 non-null float64
AddressAccuracy           10506 non-null int64
NumMosquitos              10506 non-null int64
WnvPresent                10506 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(4), object(5)
memory usage: 985.0+ KB


In [4]:
df.head()

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0


There are no null values, and no features need to be cast to `float` or `int`. Given that, we don't expect much cleaning to be necessary. We'll still look at the distribution of values in each column (via `value_counts()`) to make sure there aren't any wildly out-of-the-ordinary values.
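As a compact alternative to reading `df.info()` by eye, the same checks can be sketched as a small helper. This is a minimal sketch: `summarize_columns` and the `toy` frame are hypothetical names introduced here for illustration, and the toy rows are made up rather than taken from `train.csv`.

```python
import pandas as pd

def summarize_columns(frame):
    """Return dtype, null count, and distinct-value count for each column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_null': frame.isnull().sum(),
        'n_unique': frame.nunique(),
    })

# Hypothetical stand-in for `df`; in the notebook we would pass the loaded train data.
toy = pd.DataFrame({
    'Species': ['CULEX PIPIENS', 'CULEX RESTUANS', 'CULEX PIPIENS'],
    'NumMosquitos': [1, 4, 50],
    'WnvPresent': [0, 0, 1],
})

summary = summarize_columns(toy)
print(summary)
```

A column with a non-zero `n_null`, or an `object` dtype where numbers are expected, would be a candidate for cleaning.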

### Counts of WNV Presence & Mosquito Species

In [5]:
df['WnvPresent'].value_counts()

0    9955
1     551
Name: WnvPresent, dtype: int64

In [7]:
df['Species'].value_counts()

# plt.hist(df['Species'])

CULEX PIPIENS/RESTUANS    4752
CULEX RESTUANS            2740
CULEX PIPIENS             2699
CULEX TERRITANS            222
CULEX SALINARIUS            86
CULEX TARSALIS               6
CULEX ERRATICUS              1
Name: Species, dtype: int64

In [5]:
df['Species'][df['WnvPresent'] == 1].value_counts()

CULEX PIPIENS/RESTUANS    262
CULEX PIPIENS             240
CULEX RESTUANS             49
Name: Species, dtype: int64

In [8]:
# Total observations vs. observations from the three most common species categories
df['Species'].value_counts().sum(), 4752 + 2740 + 2699

(10506, 10191)

**Looking into Mosquito Species**

Now that we have the observation counts for each species in our training set, we're bringing in some outside information to better understand these mosquitos and which species may be more likely to carry West Nile Virus.

Culex (from Wikipedia): Culex is a genus of mosquitoes, several species of which serve as vectors of one or more important diseases of birds, humans, and other animals. The diseases they vector include arbovirus infections such as West Nile virus, Japanese encephalitis, or St. Louis encephalitis, but also filariasis and avian malaria. They occur worldwide except for the extreme northern parts of the temperate zone, and are the most common form of mosquito encountered in some major US cities such as Los Angeles. 

Species Bionomics (from http://www.wrbu.org/index.html):

- **CULEX RESTUANS**: The larvae are found in a wide variety of aquatic habitats, such as ditches, pools in streams, woodland pools, and artificial containers. The females are regarded as troublesome biters by some observers, although others say that they rarely bite man. (Carpenter and LaCasse 1955:290)
- **CULEX PIPIENS**:  Larvae are found in numerous and variable breeding places ranging from highly polluted cesspits to clear water pools and containers. This species usually breeds in stagnant water in either shaded or unshaded situations. Females readily attack man both indoors and outdoors (Harbach 1988).
  - Medical Importance:  *It has been found naturally infected with Sindbis virus and West Nile viruses in Israel, West Nile and Rift Valley Fever in Egypt, and is a primary vector of periodic Bancroftian filariasis (Harbach 1988).*
- **CULEX TERRITANS**: Description- http://vectorbio.rutgers.edu/outreach/species/terr.htm
- **CULEX SALINARIUS**: Description- http://vectorbio.rutgers.edu/outreach/species/sp11a.htm
- **CULEX TARSALIS**: The larvae are found in clear or foul water in a variety of habitats including ditches, irrigation systems, ground pools, marshes, pools in stream beds, rain barrels, hoofprints, and ornamental pools. Foul water in corrals and around slaughter yards appear to be favorite larval habitats in many localities. Cx. tarsalis are biters, attacking at dusk and after dark, and readily entering dwellings for blood meals. Domestic and wild birds seem to be the preferred hosts, and man, cows, and horses are generally incidental hosts. (Carpenter and LaCasse 1955:296)
  - Medical Importance: Culex tarsalis is believed to be the chief vector of western equine encephalitis virus under natural conditions. The virus has been isolated from wild-caught C. tarsalis on several occasions in areas in which the disease was both epidemic and epizootic. The viruses of both St. Louis and California encephalitis have been isolated from this mosquito. (Carpenter and LaCasse 1955:296) It is also a vector of West Nile Virus (Hayes et al. 2005).
- **CULEX ERRATICUS**: The larvae have been found in semipermanent and permanent pools including ditches, floodwater areas, grassy pools, streams, and occasionally in bilge water of boats and other artificial collections of water. (Carpenter and LaCasse 1955:315305)

**Analysis**

Culex Restuans, Culex Pipiens, and the mixed Pipiens/Restuans category together make up roughly 97% of the observations (10,191 of 10,506) and 100% of the WNV-positive observations.

- While there is a roughly equivalent number of pure Restuans and pure Pipiens observations (2,740 and 2,699), there is a large discrepancy between the share of pure Pipiens observations with WNV (240 of 2,699, about 8.9%) and the share of pure Restuans observations with WNV (49 of 2,740, about 1.8%).
  - Given that, it's likely that most of the 262 WNV observations in the dual Pipiens/Restuans category come from Culex Pipiens.

That leads us to the first takeaway in our dataset:

**TAKEAWAY 1:** *In order for us to make successful predictions, our model will need to be able to distinguish between Culex Pipiens and the remaining mosquitos, which are less likely to carry WNV.*
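The per-species WNV shares discussed in the analysis can be computed directly rather than by hand. The sketch below assumes only the `Species` and `WnvPresent` columns shown earlier. The `toy` frame is a made-up stand-in so the example runs on its own; in the notebook we would group `df` instead.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook this would be `df` itself.
toy = pd.DataFrame({
    'Species': ['CULEX PIPIENS'] * 4 + ['CULEX RESTUANS'] * 4 + ['CULEX TERRITANS'] * 2,
    'WnvPresent': [1, 0, 0, 0,  0, 0, 0, 0,  0, 0],
})

# The mean of a 0/1 flag is the share of positive rows in each group.
wnv_by_species = (
    toy.groupby('Species')['WnvPresent']
       .agg(observations='count', positives='sum', positive_rate='mean')
       .sort_values('positive_rate', ascending=False)
)
print(wnv_by_species)
```

Sorting by `positive_rate` puts the species categories most associated with WNV at the top, which makes the gap between categories easy to see.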

### Looking at the Spread of Traps

It turns out that the mosquito count in a given observation caps out at 50, so a trap that collects more than 50 mosquitos is recorded as multiple observations. We'll have to merge these observations before training our model.

In [6]:
df['Block'].value_counts().head(10)

10    1722
11     736
12     605
22     500
13     345
37     330
17     305
42     300
70     295
52     277
Name: Block, dtype: int64

In [7]:
df['Trap'].value_counts().head(10)

T900    750
T115    542
T138    314
T002    185
T135    183
T054    163
T128    160
T151    156
T212    152
T090    151
Name: Trap, dtype: int64

In [8]:
df['NumMosquitos'].value_counts().head(10)

1     2307
2     1300
50    1019
3      896
4      593
5      489
6      398
7      326
8      244
9      237
Name: NumMosquitos, dtype: int64
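The merge of capped observations mentioned at the top of this section could look roughly like the sketch below. It assumes that rows sharing the same `Date`, `Trap`, and `Species` belong to one trap check; that choice of keys is our assumption and is not confirmed by the data description. The `toy` rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows: one trap check split across three rows by the 50-count cap,
# plus a separate check a week later.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2007-08-01'] * 3 + ['2007-08-08']),
    'Trap': ['T115', 'T115', 'T115', 'T115'],
    'Species': ['CULEX PIPIENS'] * 4,
    'NumMosquitos': [50, 50, 12, 7],
    'WnvPresent': [0, 1, 0, 0],
})

# Assumed grouping keys for "the same observation".
keys = ['Date', 'Trap', 'Species']

merged = (
    toy.groupby(keys, as_index=False)
       .agg(NumMosquitos=('NumMosquitos', 'sum'),   # total count across split rows
            WnvPresent=('WnvPresent', 'max'))       # positive if any split row was positive
)
print(merged)
```

Taking the `max` of `WnvPresent` treats a merged observation as positive when any of its split rows tested positive. Whether that is the right rule depends on how the labels were assigned, so it is worth revisiting during feature engineering.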

### Examining Test Data

In [15]:
df_t.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 11 columns):
Id                        116293 non-null int64
Date                      116293 non-null datetime64[ns]
Address                   116293 non-null object
Species                   116293 non-null object
Block                     116293 non-null int64
Street                    116293 non-null object
Trap                      116293 non-null object
AddressNumberAndStreet    116293 non-null object
Latitude                  116293 non-null float64
Longitude                 116293 non-null float64
AddressAccuracy           116293 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(3), object(5)
memory usage: 9.8+ MB


In [10]:
df_t.head(10)

Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
3,4,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX SALINARIUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
4,5,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TERRITANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
5,6,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TARSALIS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
6,7,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",UNSPECIFIED CULEX,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
7,8,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX ERRATICUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
8,9,2008-06-11,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX PIPIENS/RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9
9,10,2008-06-11,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9


**Analysis:**

It looks like the test data consists of an observation for **each** species of mosquito at **each** trap. This further underscores the need for our model to differentiate between species of mosquito.
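The impression from `head(10)` can be checked more systematically by counting distinct species per trap and date. The sketch below uses a made-up `toy_test` frame so it runs on its own; in the notebook the same grouping would be applied to `df_t`. Treating a `Date` and `Trap` pair as one trap visit is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for `df_t`: T002 lists three species, T007 only two.
toy_test = pd.DataFrame({
    'Date': pd.to_datetime(['2008-06-11'] * 5),
    'Trap': ['T002', 'T002', 'T002', 'T007', 'T007'],
    'Species': ['CULEX PIPIENS', 'CULEX RESTUANS', 'CULEX TERRITANS',
                'CULEX PIPIENS', 'CULEX RESTUANS'],
})

all_species = set(toy_test['Species'])

# Number of distinct species recorded for each (Date, Trap) visit.
species_per_visit = toy_test.groupby(['Date', 'Trap'])['Species'].nunique()

# Visits that do not list every species seen in the data.
incomplete = species_per_visit[species_per_visit < len(all_species)]
print(incomplete)
```

An empty `incomplete` on the real test set would support the reading that every species is listed for every trap visit. Any rows that do appear would point to visits where that pattern does not hold.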

In [23]:
df.to_csv('../data/train_clean.csv')