## Feature Interactivity and Engineering

We'll use this notebook to explore feature interactions and feature engineering. Below, we load the required libraries and read in the cleaned CSVs produced by the original EDA notebook.

In [11]:
import pandas as pd
import seaborn as sns
pd.core.common.is_list_like = pd.api.types.is_list_like
import pandas_datareader.data as web
from datetime import datetime
import matplotlib.pyplot as plt
#from geopy.distance import geodesic
import time
from sklearn.preprocessing import LabelEncoder

In [12]:
train = pd.read_csv('./data/trainclean.csv')
test = pd.read_csv('./data/testclean.csv')


#### There is missing data in the dataset.
Many missing or non-numeric values are entered as M, - and T (sometimes with leading spaces) in the dataset. This code sets them all to -1 to distinguish them from the other data.

In [13]:
# replace some missing values and T with -1
train = train.replace('M', -1)
train = train.replace('-', -1)
train = train.replace('T', -1)
train = train.replace(' T', -1)
train = train.replace('  T', -1)

In [14]:
# replace some missing values and T with -1
test = test.replace('M', -1)
test = test.replace('-', -1)
test = test.replace('T', -1)
test = test.replace(' T', -1)
test = test.replace('  T', -1)
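The two cells above apply the same five replacements to train and test. As a minimal sketch, the same idea can be written as one helper that takes a list of sentinel strings. The toy frame and the name `replace_sentinels` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few weather columns.
# dtype=object mimics mixed string columns as read from a CSV.
toy = pd.DataFrame(
    {
        "preciptotal_x": ["0.10", "  T", "M", "0.00"],
        "depart_x": ["3", "-", "M", "-2"],
    },
    dtype=object,
)

# Sentinel strings named in the text above.
SENTINELS = ["M", "-", "T", " T", "  T"]


def replace_sentinels(df, fill=-1):
    """Return a copy of df where every cell equal to a sentinel becomes `fill`."""
    # DataFrame.replace matches whole cell values, so "-2" is left alone.
    return df.replace(SENTINELS, fill)


cleaned = replace_sentinels(toy)
print(cleaned)
```

Because the match is on whole cell values, a legitimate negative number stored as the string "-2" is not touched by the "-" sentinel.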

In [15]:
#train.drop(columns = ['codesum_x', 'codesum_y'], inplace=True)

## Encoding Trap, Block, and Street

Below, we use LabelEncoder to turn the values in the trap, block, and street columns into integer codes that can be used in computations. LabelEncoder is a preprocessing class that encodes labels with values between 0 and n_classes-1. We fit each encoder on the combined train and test values so that both sets share the same mapping.

In [16]:
lbl = LabelEncoder()
lbl.fit(list(train['trap'].values) + list(test['trap'].values))
train['trap'] = lbl.transform(train['trap'].values)
test['trap'] = lbl.transform(test['trap'].values)

In [17]:
lbl.fit(list(train['block'].values) + list(test['block'].values))
train['block'] = lbl.transform(train['block'].values)
test['block'] = lbl.transform(test['block'].values)

In [18]:
lbl.fit(list(train['street'].values) + list(test['street'].values))
train['street'] = lbl.transform(train['street'].values)
test['street'] = lbl.transform(test['street'].values)
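To make the encoding step concrete, here is a small sketch on toy data. The frames `toy_train` and `toy_test` and their values are made up for illustration. It fits one encoder per column on the combined values, as the cells above do, and shows that the resulting codes follow sorted label order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the real notebook uses train and test.
toy_train = pd.DataFrame({
    "trap": ["T002", "T007", "T002"],
    "street": ["N OAK PARK AVE", "W OHARE", "W OHARE"],
})
toy_test = pd.DataFrame({
    "trap": ["T007", "T015"],
    "street": ["W OHARE", "S DOTY AVE"],
})

for col in ["trap", "street"]:
    encoder = LabelEncoder()
    # Fit on train and test together so labels seen only in test get a code too.
    encoder.fit(pd.concat([toy_train[col], toy_test[col]]))
    toy_train[col] = encoder.transform(toy_train[col])
    toy_test[col] = encoder.transform(toy_test[col])

print(toy_train)
print(toy_test)
```

The codes are assigned in sorted order of the labels, so they carry no geographic meaning on their own. That is worth keeping in mind when these columns are multiplied with other features later in the notebook.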

In [19]:
#This is just a shortcut for now. If we rerun the original EDA doc, this row will kick an error out.
#test['distance'] = test['distance.1']

In [20]:
#test.drop(columns = 'distance.1', inplace=True)

## Distance and Species

Since vertical transmission of disease (the transmission of disease from a parent mosquito to its offspring) is prevalent among mosquitoes, we create interaction variables between each mosquito species and distance.

In [21]:
train['dist_species_culex_pipiens'] = train['species_culex_pipiens'] * train['distance']
test['dist_species_culex_pipiens'] = test['species_culex_pipiens'] * test['distance']

In [22]:
train['dist_species_culex_pipiens/restuans'] = train['species_culex_pipiens/restuans'] * train['distance']
test['dist_species_culex_pipiens/restuans'] = test['species_culex_pipiens/restuans'] * test['distance']

In [23]:
train['dist_species_culex_restuans'] = train['species_culex_restuans'] * train['distance']
test['dist_species_culex_restuans'] = test['species_culex_restuans'] * test['distance']

In [24]:
train['dist_species_culex_salinarius'] = train['species_culex_salinarius'] * train['distance']
test['dist_species_culex_salinarius'] = test['species_culex_salinarius'] * test['distance']

In [25]:
train['dist_species_culex_tarsalis'] = train['species_culex_tarsalis'] * train['distance']
test['dist_species_culex_tarsalis'] = test['species_culex_tarsalis'] * test['distance']

In [26]:
train['dist_species_culex_territans'] = train['species_culex_territans'] * train['distance']
test['dist_species_culex_territans'] = test['species_culex_territans'] * test['distance']

In [27]:
test['species_unspecified_culex_dist'] = test['species_unspecified_culex'] * test['distance']
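The cells above repeat one pattern: multiply a 0/1 species dummy by distance. A minimal sketch of the same pattern as a loop is shown below on a made-up frame. The `dist_` prefix follows the naming used above for most species, and the toy values are assumptions.

```python
import pandas as pd

# Hypothetical toy frame with two species dummies and a distance column.
toy = pd.DataFrame({
    "species_culex_pipiens": [1, 0, 0],
    "species_culex_restuans": [0, 1, 0],
    "distance": [2.5, 4.0, 1.0],
})

# Collect the dummy columns first so the loop does not pick up its own output.
species_cols = [c for c in toy.columns if c.startswith("species_")]

for col in species_cols:
    # The product equals distance for rows of that species and 0 elsewhere.
    toy["dist_" + col] = toy[col] * toy["distance"]

print(toy)
```

Running the same loop over both train and test would also keep the column names aligned between the two sets.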

## Interaction with Wetbulb

Below, we create interaction terms that pair temperature, trap, and block with wet-bulb temperature (the wetbulb columns), which we use as a stand-in for precipitation and moisture. The water columns seem like they would be similarly indicative of disease movement, but most of the water values are missing from the initial dataset, so we won't use them as part of an interaction term.

**Temperature and Wetbulb**

In [28]:
train['wetbulb_x'] = train['wetbulb_x'].astype(int) #Might want to put this back into original EDA 

In [29]:
# Use the same _x inputs in train and test so the feature means the same thing in both
train['temp_and_precip_x'] = train['tmax_x'] * train['wetbulb_x']
test['temp_and_precip_x'] = test['tmax_x'] * test['wetbulb_x']

**Trap and Wetbulb**

In [30]:
# Same column name and inputs in train and test so the feature lines up at modeling time
train['trap_and_precip_x'] = train['trap'] * train['wetbulb_x']
test['trap_and_precip_x'] = test['trap'] * test['wetbulb_x']

**Block and Wetbulb**

In [31]:
train['block_and_precip_x'] = train['block'] * train['wetbulb_x'] 
test['block_and_precip_x'] = test['block'] * test['wetbulb_x'] 

**Trap and Temperature**

Since maximum temperature and trap both rank as significant in the feature importances of our AdaBoost model, we create trap and temperature interaction variables here.

In [32]:
train['trap_and_temp_x'] = train['trap'] * train['tmax_x']
test['trap_and_temp_x'] = test['trap'] * test['tmax_x']

In [33]:
train['trap_and_temp_y'] = train['trap'] * train['tmax_y']
test['trap_and_temp_y'] = test['trap'] * test['tmax_y']
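With this many pairwise products written by hand, it is easy for train and test to end up with different column names or inputs. One way to guard against that, sketched below under assumed names (`PAIRS`, `add_products`) and toy data, is to declare each product once and apply the same specification to both sets.

```python
import pandas as pd

# Hypothetical specification: new column name -> (left input, right input).
PAIRS = {
    "temp_and_precip_x": ("tmax_x", "wetbulb_x"),
    "trap_and_temp_x": ("trap", "tmax_x"),
}


def add_products(df, pairs):
    """Return a copy of df with one product column per entry in `pairs`."""
    out = df.copy()
    for name, (left, right) in pairs.items():
        out[name] = out[left] * out[right]
    return out


# Toy stand-ins for train and test.
toy_train = pd.DataFrame({"trap": [0, 1], "tmax_x": [80, 90], "wetbulb_x": [60, 70]})
toy_test = pd.DataFrame({"trap": [2], "tmax_x": [85], "wetbulb_x": [65]})

toy_train = add_products(toy_train, PAIRS)
toy_test = add_products(toy_test, PAIRS)

# Both frames now carry every engineered column under the same name.
shared = set(toy_train.columns) & set(toy_test.columns)
print(sorted(set(PAIRS) - shared))  # empty list when nothing is missing
```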

## Sealevel Binary Variable Creation by Quartile

We called .describe() on our sealevel variables with the idea that mosquitoes gravitate toward water and that water will pool in areas with low sealevel. We then engineered a term called "low_sealevel" that flags only the observations in the lowest quartile of sealevel, that is, those at or below the 25% value reported for the training data.

In [34]:
train['sealevel_x'].describe()

count    8475.000000
mean       29.967791
std         0.119335
min        29.600000
25%        29.890000
50%        29.980000
75%        30.050000
max        30.330000
Name: sealevel_x, dtype: float64

In [35]:
test['sealevel_x'].describe() 

count    116293.000000
mean         29.985536
std           0.122456
min          29.710000
25%          29.890000
50%          29.990000
75%          30.070000
max          30.330000
Name: sealevel_x, dtype: float64

In [36]:
# Flag each row at or below the 25% value from describe() (1 = low, 0 = otherwise)
train['low_sealevel_x'] = (train['sealevel_x'] <= 29.89).astype(int)

In [37]:
# Apply the same training cutoff to each row of test
test['low_sealevel_x'] = (test['sealevel_x'] <= 29.89).astype(int)

In [38]:
train['sealevel_y'].describe()

count    8475.000000
mean       29.955738
std         0.121189
min        29.590000
25%        29.870000
50%        29.960000
75%        30.050000
max        30.330000
Name: sealevel_y, dtype: float64

In [39]:
test['sealevel_y'].describe()

count    116293.000000
mean         29.970243
std           0.123908
min          29.690000
25%          29.870000
50%          29.980000
75%          30.050000
max          30.310000
Name: sealevel_y, dtype: float64

In [40]:
# Flag each row at or below the 25% value from describe() (1 = low, 0 = otherwise)
train['low_sealevel_y'] = (train['sealevel_y'] <= 29.87).astype(int)

In [41]:
# Apply the same training cutoff to each row of test
test['low_sealevel_y'] = (test['sealevel_y'] <= 29.87).astype(int)
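The cutoffs 29.89 and 29.87 used above match the 25% rows of the .describe() output for the training data. As a sketch, and assuming the lowest quartile is the intended cutoff, the threshold can be computed from train instead of typed in, then applied row by row to both sets. The toy values and the helper name `flag_low` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy values standing in for the sealevel_x column.
toy_train = pd.DataFrame({"sealevel_x": [29.60, 29.90, 30.00, 30.30]})
toy_test = pd.DataFrame({"sealevel_x": [29.71, 29.99, 30.07]})

# Derive the cutoff from the training data only.
cutoff = toy_train["sealevel_x"].quantile(0.25)


def flag_low(df, column, cutoff):
    """Return a 0/1 Series: 1 where df[column] is at or below the cutoff."""
    # The comparison is evaluated per row, so each row gets its own flag.
    return (df[column] <= cutoff).astype(int)


toy_train["low_sealevel_x"] = flag_low(toy_train, "sealevel_x", cutoff)
toy_test["low_sealevel_x"] = flag_low(toy_test, "sealevel_x", cutoff)

print(toy_train)
print(toy_test)
```

Taking the cutoff from train and reusing it on test keeps the definition of "low" identical across both sets.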

## Interaction with Month

In visualizing the data in Tableau, we can see that West Nile cases become more frequent in midsummer, with fewer cases in the early and late summer, so we thought that distance, among other important features, might interact with month.

In [42]:
train['month_and_dist'] = train['month'] * train['distance']
test['month_and_dist'] = test['month'] * test['distance']

**Sealevel and Month Interaction**

Since month may be a predictor of West Nile cases and sealevel pressure may indicate areas in which water may pool, we thought that month and sealevel might interact.

In [43]:
train['month_and_sealevel_x'] = train['month'] * train['low_sealevel_x']
test['month_and_sealevel_x'] = test['month'] * test['low_sealevel_x']

In [44]:
train['month_and_sealevel_y'] = train['month'] * train['low_sealevel_y']
test['month_and_sealevel_y'] = test['month'] * test['low_sealevel_y']

**Month, Distance, and Temperature**

In [45]:
train['month_dist_temp_x'] = train['month'] * train['distance'] * train['tavg_x']
test['month_dist_temp_x'] = test['month'] * test['distance'] * test['tavg_x']

In [46]:
train['month_dist_temp_y'] = train['month'] * train['distance'] * train['tavg_y']
test['month_dist_temp_y'] = test['month'] * test['distance'] * test['tavg_y']

In [47]:
train['month_dist_maxtemp_x'] = train['month'] * train['distance'] * train['tmax_x']
test['month_dist_maxtemp_x'] = test['month'] * test['distance'] * test['tmax_x']

In [48]:
train['month_dist_maxtemp_y'] = train['month'] * train['distance'] * train['tmax_y']
test['month_dist_maxtemp_y'] = test['month'] * test['distance'] * test['tmax_y']

**Month, Distance, and Sealevel**

In [49]:
train['month_dist_sealevel_x'] = train['month'] * train['distance'] * train['low_sealevel_x']
test['month_dist_sealevel_x'] = test['month'] * test['distance'] * test['low_sealevel_x']

In [50]:
train['month_dist_sealevel_y'] = train['month'] * train['distance'] * train['low_sealevel_y']
test['month_dist_sealevel_y'] = test['month'] * test['distance'] * test['low_sealevel_y']

**Saving the Next CSV for Analysis Notebook**

In [51]:
train.to_csv('./data/trainw.csv')
test.to_csv('./data/testw.csv')
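One detail to be aware of when saving: by default, to_csv also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows the difference on a toy frame using an in-memory buffer. Whether the analysis notebook expects that extra column is not stated here, so treat `index=False` as an option rather than a required change.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for train.
toy = pd.DataFrame({"month": [6, 7], "distance": [1.5, 2.0]})

# Default behavior: the row index is written as the first CSV column.
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
with_index = pd.read_csv(buf_default)

# index=False: only the data columns are written.
buf_clean = io.StringIO()
toy.to_csv(buf_clean, index=False)
buf_clean.seek(0)
without_index = pd.read_csv(buf_clean)

print(list(with_index.columns))     # ['Unnamed: 0', 'month', 'distance']
print(list(without_index.columns))  # ['month', 'distance']
```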

In [52]:
train.columns

Index(['index', 'date', 'address', 'block', 'street', 'trap',
       'addressnumberandstreet', 'latitude', 'longitude', 'addressaccuracy',
       'nummosquitos', 'wnvpresent', 'year', 'month', 'day', 'tmax_x',
       'tmin_x', 'tavg_x', 'depart_x', 'dewpoint_x', 'wetbulb_x', 'heat_x',
       'cool_x', 'sunrise_x', 'sunset_x', 'codesum_x', 'depth_x', 'water1_x',
       'snowfall_x', 'preciptotal_x', 'stnpressure_x', 'sealevel_x',
       'resultspeed_x', 'resultdir_x', 'avgspeed_x', 'tmax_y', 'tmin_y',
       'tavg_y', 'depart_y', 'dewpoint_y', 'wetbulb_y', 'heat_y', 'cool_y',
       'sunrise_y', 'sunset_y', 'codesum_y', 'depth_y', 'water1_y',
       'snowfall_y', 'preciptotal_y', 'stnpressure_y', 'sealevel_y',
       'resultspeed_y', 'resultdir_y', 'avgspeed_y', 'species_culex_pipiens',
       'species_culex_pipiens/restuans', 'species_culex_restuans',
       'species_culex_salinarius', 'species_culex_tarsalis',
       'species_culex_territans', 'distance', 'distance_3', 'distance_5',
 

In [53]:
test.columns

Index(['index', 'id', 'date', 'address', 'block', 'street', 'trap',
       'addressnumberandstreet', 'latitude', 'longitude', 'addressaccuracy',
       'year', 'month', 'day', 'tmax_x', 'tmin_x', 'tavg_x', 'depart_x',
       'dewpoint_x', 'wetbulb_x', 'heat_x', 'cool_x', 'sunrise_x', 'sunset_x',
       'codesum_x', 'depth_x', 'water1_x', 'snowfall_x', 'preciptotal_x',
       'stnpressure_x', 'sealevel_x', 'resultspeed_x', 'resultdir_x',
       'avgspeed_x', 'tmax_y', 'tmin_y', 'tavg_y', 'depart_y', 'dewpoint_y',
       'wetbulb_y', 'heat_y', 'cool_y', 'sunrise_y', 'sunset_y', 'codesum_y',
       'depth_y', 'water1_y', 'snowfall_y', 'preciptotal_y', 'stnpressure_y',
       'sealevel_y', 'resultspeed_y', 'resultdir_y', 'avgspeed_y',
       'species_culex_pipiens', 'species_culex_pipiens/restuans',
       'species_culex_restuans', 'species_culex_salinarius',
       'species_culex_tarsalis', 'species_culex_territans',
       'species_unspecified_culex', 'distance', 'distance_3', 'distance_