## Feature Interactivity and Engineering

This notebook covers feature interactions and feature engineering. Below, we load the required libraries and read in the cleaned CSVs produced by the original EDA notebook.

In [1]:
import pandas as pd
import seaborn as sns
pd.core.common.is_list_like = pd.api.types.is_list_like
import pandas_datareader.data as web
from datetime import datetime
import matplotlib.pyplot as plt
#from geopy.distance import geodesic
import time
from sklearn.preprocessing import LabelEncoder

In [2]:
train = pd.read_csv('./data/trainclean.csv')
test = pd.read_csv('./data/testclean.csv')



#### There is missing data in the dataset.
Many missing values are entered as M, - and T in the dataset. This code sets them all to -1 to distinguish them from other data.

In [3]:
# replace some missing values and T with -1
train = train.replace('M', -1)
train = train.replace('-', -1)
train = train.replace('T', -1)
train = train.replace(' T', -1)
train = train.replace('  T', -1)

In [4]:
# replace some missing values and T with -1
test = test.replace('M', -1)
test = test.replace('-', -1)
test = test.replace('T', -1)
test = test.replace(' T', -1)
test = test.replace('  T', -1)
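
The replacements in the two cells above could also be written as one helper applied to both frames. The following is a minimal sketch rather than the notebook's own code: replace_sentinels and SENTINELS are hypothetical names, the demo frame is made up, and unlike the exact-match calls above it strips any amount of surrounding whitespace before comparing, which is an assumption about how the padded T values should be treated.

```python
import pandas as pd

# Codes named in the text above; the set name is hypothetical.
SENTINELS = {"M", "-", "T"}


def replace_sentinels(df, fill=-1):
    """Return a new frame with sentinel strings replaced by `fill`."""
    def clean(value):
        # Only strings can be sentinels; numbers pass through untouched.
        if isinstance(value, str) and value.strip() in SENTINELS:
            return fill
        return value

    # Apply the element-wise cleaner column by column.
    return df.apply(lambda col: col.map(clean))


# Made-up rows purely for illustration.
demo = pd.DataFrame({
    "preciptotal": ["0.10", "  T", "M"],
    "depth": ["-", "0", " T"],
})
cleaned = replace_sentinels(demo)
print(cleaned)
```

With a helper like this, train and test would go through the same call, so the two cells cannot drift apart.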

In [5]:
#train.drop(columns = ['codesum_x', 'codesum_y'], inplace=True)

## Encoding Trap, Block, and Street

Below, we use LabelEncoder to turn the values in the trap, block, and street columns into integer codes that the models can consume. LabelEncoder is a preprocessing method that encodes labels with values between 0 and n_classes-1. The encoder is fit on the combined train and test values so that both sets share the same codes.

In [6]:
lbl = LabelEncoder()
lbl.fit(list(train['trap'].values) + list(test['trap'].values))
train['trap'] = lbl.transform(train['trap'].values)
test['trap'] = lbl.transform(test['trap'].values)

In [7]:
lbl.fit(list(train['block'].values) + list(test['block'].values))
train['block'] = lbl.transform(train['block'].values)
test['block'] = lbl.transform(test['block'].values)

In [8]:
lbl.fit(list(train['street'].values) + list(test['street'].values))
train['street'] = lbl.transform(train['street'].values)
test['street'] = lbl.transform(test['street'].values)
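
The three encoding cells above follow one pattern: fit on the union of train and test, then transform each set. Below is a hedged sketch of that pattern as a loop. encode_shared is a hypothetical helper name and the toy frames are invented; it uses a fresh encoder per column so each fitted encoder can be kept for later inspection.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_shared(train_df, test_df, columns):
    """Label-encode `columns` in place, using codes shared by both frames."""
    encoders = {}
    for col in columns:
        enc = LabelEncoder()
        # Fit on the union so a label seen only in test still gets a code.
        enc.fit(pd.concat([train_df[col], test_df[col]]))
        train_df[col] = enc.transform(train_df[col])
        test_df[col] = enc.transform(test_df[col])
        encoders[col] = enc
    return encoders


# Toy data for illustration only.
toy_train = pd.DataFrame({"trap": ["T002", "T001"]})
toy_test = pd.DataFrame({"trap": ["T003", "T001"]})
encoders = encode_shared(toy_train, toy_test, ["trap"])
print(toy_train["trap"].tolist(), toy_test["trap"].tolist())
```

LabelEncoder assigns codes in sorted label order, so the integers carry no physical meaning. That is worth keeping in mind wherever these codes are later multiplied with other features.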

In [9]:
#This is just a shortcut for now. If we rerun the original EDA doc, this row will kick an error out.
#test['distance'] = test['distance.1']

In [10]:
#test.drop(columns = 'distance.1', inplace=True)

## Distance and Species

Since vertical transmission of disease (the transmission of disease from a parent mosquito to its offspring) is prevalent amongst mosquitoes, we create interaction variables between each mosquito species and distance.

In [11]:
train['dist_species_culex_pipiens'] = train['species_culex_pipiens'] * train['distance']
test['dist_species_culex_pipiens'] = test['species_culex_pipiens'] * test['distance']

In [12]:
train['dist_species_culex_pipiens/restuans'] = train['species_culex_pipiens/restuans'] * train['distance']
test['dist_species_culex_pipiens/restuans'] = test['species_culex_pipiens/restuans'] * test['distance']

In [13]:
train['dist_species_culex_restuans'] = train['species_culex_restuans'] * train['distance']
test['dist_species_culex_restuans'] = test['species_culex_restuans'] * test['distance']

In [14]:
train['dist_species_culex_salinarius'] = train['species_culex_salinarius'] * train['distance']
test['dist_species_culex_salinarius'] = test['species_culex_salinarius'] * test['distance']

In [15]:
train['dist_species_culex_tarsalis'] = train['species_culex_tarsalis'] * train['distance']
test['dist_species_culex_tarsalis'] = test['species_culex_tarsalis'] * test['distance']

In [16]:
train['dist_species_culex_territans'] = train['species_culex_territans'] * train['distance']
test['dist_species_culex_territans'] = test['species_culex_territans'] * test['distance']

In [17]:
test['species_unspecified_culex_dist'] = test['species_unspecified_culex'] * test['distance']
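
The species and distance cells above repeat the same multiplication for each dummy column. As a sketch only, the loop below builds the same kind of product for every column that starts with species_. add_species_distance is a hypothetical helper and the toy frame is invented. It names every new column dist_ plus the species column, which matches most of the cells above but not the species_unspecified_culex_dist name used for the test-only column.

```python
import pandas as pd


def add_species_distance(df, prefix="species_", distance_col="distance"):
    """Return a copy of df with one species x distance column per dummy."""
    out = df.copy()
    species_cols = [c for c in df.columns if c.startswith(prefix)]
    for col in species_cols:
        # A 0/1 dummy times distance keeps distance only for that species.
        out["dist_" + col] = df[col] * df[distance_col]
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "species_culex_pipiens": [1, 0],
    "species_culex_restuans": [0, 1],
    "distance": [2.5, 4.0],
})
toy_out = add_species_distance(toy)
print(toy_out.filter(like="dist_"))
```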

## Interaction with Wetbulb

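Below, we create interaction terms that combine temperature, trap, and block with Wetbulb. Wetbulb is a wet-bulb temperature reading, which reflects humidity rather than rainfall; we use it here as a stand-in for precipitation, which is why the new columns carry "precip" in their names. The water columns seem like they would be similarly indicative of disease movement, but most of their values are missing from the initial dataset, so we leave them out of the interaction terms.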

**Temperature and Wetbulb**

In [18]:
train['wetbulb_x'] = train['wetbulb_x'].astype(int) #Might want to put this back into original EDA 

In [19]:
train['temp_and_precip_x'] = train['tmax_x'] * train['wetbulb_x']
# use tmax_x for test as well so the feature matches the train definition
test['temp_and_precip_x'] = test['tmax_x'] * test['wetbulb_x']

**Trap and Wetbulb**

In [20]:
train['trap_and_precip_y'] = train['trap'] * train['wetbulb_y']
# same column name and inputs as train, so the feature exists in both sets
test['trap_and_precip_y'] = test['trap'] * test['wetbulb_y']

**Block and Wetbulb**

In [21]:
train['block_and_precip_x'] = train['block'] * train['wetbulb_x'] 
test['block_and_precip_x'] = test['block'] * test['wetbulb_x'] 

**Trap and Temperature**

Since maximum temperature and trap rank as significant in the feature importances of our AdaBoost model, we create trap and temperature interaction variables here.

In [22]:
train['trap_and_temp_x'] = train['trap'] * train['tmax_x']
test['trap_and_temp_x'] = test['trap'] * test['tmax_x']

In [23]:
train['trap_and_temp_y'] = train['trap'] * train['tmax_y']
test['trap_and_temp_y'] = test['trap'] * test['tmax_y']

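## Sealevel Binary Variable Creation by Quartile

We called .describe() on our sealevel variables with the idea that mosquitoes gravitate toward water and that water will pool in areas with low sealevel. We then engineered a binary term called "low_sealevel" that is 1 when sealevel is at or below 29.89 and 0 otherwise. That cutoff is the 25th percentile of sealevel_x in both train and test, not a standard deviation band. The same 29.89 cutoff is applied to sealevel_y, whose 25th percentile is slightly lower at 29.87.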

In [24]:
train['sealevel_x'].describe()

count    8475.000000
mean       29.967791
std         0.119335
min        29.600000
25%        29.890000
50%        29.980000
75%        30.050000
max        30.330000
Name: sealevel_x, dtype: float64

In [25]:
test['sealevel_x'].describe() 

count    116293.000000
mean         29.985536
std           0.122456
min          29.710000
25%          29.890000
50%          29.990000
75%          30.070000
max          30.330000
Name: sealevel_x, dtype: float64

In [26]:
train['low_sealevel_x'] = (train['sealevel_x'] <= 29.89).astype(int)
test['low_sealevel_x'] = (test['sealevel_x'] <= 29.89).astype(int)

In [27]:
train['low_sealevel_y'] = (train['sealevel_y'] <= 29.89).astype(int)
test['low_sealevel_y'] = (test['sealevel_y'] <= 29.89).astype(int)

In [28]:
train['sealevel_y'].describe()

count    8475.000000
mean       29.955738
std         0.121189
min        29.590000
25%        29.870000
50%        29.960000
75%        30.050000
max        30.330000
Name: sealevel_y, dtype: float64

In [29]:
test['sealevel_y'].describe()

count    116293.000000
mean         29.970243
std           0.123908
min          29.690000
25%          29.870000
50%          29.980000
75%          30.050000
max          30.310000
Name: sealevel_y, dtype: float64
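
The 29.89 cutoff in this section is typed in by hand from the .describe() output. One alternative, sketched below under the assumption that the cutoff should always come from the training data, is to compute the quantile in code and apply it to both sets. low_flags is a hypothetical helper and the pressure values are made up.

```python
import pandas as pd


def low_flags(train_col, test_col, q=0.25):
    """Flag values at or below the q-th quantile of the training column."""
    cutoff = train_col.quantile(q)
    # The same training-derived cutoff is applied to both sets.
    train_flag = (train_col <= cutoff).astype(int)
    test_flag = (test_col <= cutoff).astype(int)
    return train_flag, test_flag, cutoff


# Invented pressure readings for illustration only.
toy_train_p = pd.Series([29.5, 29.75, 30.0, 30.25, 30.5])
toy_test_p = pd.Series([29.0, 30.0])
train_flag, test_flag, cutoff = low_flags(toy_train_p, toy_test_p)
print(cutoff, train_flag.tolist(), test_flag.tolist())
```

Deriving the cutoff from train alone keeps information about the test distribution out of the feature, and it lets each column (sealevel_x, sealevel_y) get its own threshold.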

## Interaction with Month

Visualizing the data in Tableau shows that West Nile cases peak in mid-summer, with fewer cases in the early and late summer. We therefore thought that distance, along with other important features, might interact with month.

In [30]:
train['month_and_dist'] = train['month'] * train['distance']
test['month_and_dist'] = test['month'] * test['distance']

**Sealevel and Month Interaction**

Since month may be a predictor of West Nile cases and sea-level pressure may indicate areas in which water may pool, we thought that month and the low_sealevel flag might interact.

In [31]:
train['month_and_sealevel_x'] = train['month'] * train['low_sealevel_x']
test['month_and_sealevel_x'] = test['month'] * test['low_sealevel_x']

In [32]:
train['month_and_sealevel_y'] = train['month'] * train['low_sealevel_y']
test['month_and_sealevel_y'] = test['month'] * test['low_sealevel_y']

**Month, Distance, and Temperature**

In [33]:
train['month_dist_temp_x'] = train['month'] * train['distance'] * train['tavg_x']
test['month_dist_temp_x'] = test['month'] * test['distance'] * test['tavg_x']

In [34]:
train['month_dist_3_temp_x'] = train['month'] * train['distance_3'] * train['tavg_x']
test['month_dist_3_temp_x'] = test['month'] * test['distance_3'] * test['tavg_x']

In [35]:
train['month_dist_5_temp_x'] = train['month'] * train['distance_5'] * train['tavg_x']
test['month_dist_5_temp_x'] = test['month'] * test['distance_5'] * test['tavg_x']

In [36]:
train['month_dist_temp_y'] = train['month'] * train['distance'] * train['tavg_y']
test['month_dist_temp_y'] = test['month'] * test['distance'] * test['tavg_y']

In [37]:
train['month_dist_maxtemp_x'] = train['month'] * train['distance'] * train['tmax_x']
test['month_dist_maxtemp_x'] = test['month'] * test['distance'] * test['tmax_x']

In [38]:
train['month_dist_maxtemp_y'] = train['month'] * train['distance'] * train['tmax_y']
test['month_dist_maxtemp_y'] = test['month'] * test['distance'] * test['tmax_y']

**Month, Distance, and Sealevel**

In [39]:
train['month_dist_sealevel_x'] = train['month'] * train['distance'] * train['low_sealevel_x']
test['month_dist_sealevel_x'] = test['month'] * test['distance'] * test['low_sealevel_x']

In [40]:
train['month_dist_sealevel_y'] = train['month'] * train['distance'] * train['low_sealevel_y']
test['month_dist_sealevel_y'] = test['month'] * test['distance'] * test['low_sealevel_y']
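
Every interaction term in this section is a row-wise product of two or three columns. The sketch below shows one way to express that once. add_product is a hypothetical helper, not part of the notebook, and the toy frame is invented. It passes skipna=False so that a missing input yields a missing product, which is how the * operator in the cells above behaves.

```python
import pandas as pd


def add_product(df, name, columns):
    """Add df[name] as the row-wise product of the given columns."""
    # skipna=False propagates NaN the way chained `*` does.
    df[name] = df[list(columns)].prod(axis=1, skipna=False)
    return df


# Toy rows for illustration only.
toy = pd.DataFrame({
    "month": [6, 7],
    "distance": [0.5, 2.0],
    "tavg_x": [70, 80],
})
add_product(toy, "month_dist_temp_x", ["month", "distance", "tavg_x"])
print(toy["month_dist_temp_x"].tolist())
```

Calling the same helper on train and test with the same column list would rule out the kind of mismatch where one set is built from an _x column and the other from its _y counterpart.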

### Creating Number of Mosquitoes Variable

The number of mosquitoes would be an important predictor if we could include it in the predictive model, but the feature is not in the test data. We tried to predict it with a regularized linear regression (RidgeCV). The model explained only about 10% of the variance in the target (R² of roughly 0.10 on the training split and 0.09 on the held-out split), so we abandoned it.

In [41]:
from sklearn.linear_model import RidgeCV, LassoCV, LinearRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler

In [42]:
train.corr()

Unnamed: 0,index,block,street,trap,latitude,longitude,addressaccuracy,nummosquitos,wnvpresent,year,...,month_and_sealevel_x,month_and_sealevel_y,month_dist_temp_x,month_dist_3_temp_x,month_dist_5_temp_x,month_dist_temp_y,month_dist_maxtemp_x,month_dist_maxtemp_y,month_dist_sealevel_x,month_dist_sealevel_y
index,1.000000,0.008347,-0.009874,0.131110,-0.000609,-0.037761,0.001047,0.119550,0.070844,0.963354,...,-0.189654,-0.214386,0.133743,0.673521,0.779104,0.136212,0.135732,0.136861,-0.098768,-0.102551
block,0.008347,1.000000,-0.080079,-0.090660,0.100549,-0.201589,0.182005,0.001565,0.037863,0.010502,...,0.021220,0.026242,-0.047829,-0.006426,-0.023931,-0.048472,-0.048443,-0.048427,0.003000,0.004967
street,-0.009874,-0.080079,1.000000,-0.056570,-0.065959,-0.203317,-0.007657,-0.024558,-0.015154,-0.011213,...,-0.004585,-0.002154,-0.078062,-0.079548,-0.046955,-0.077663,-0.077503,-0.077575,-0.040040,-0.038388
trap,0.131110,-0.090660,-0.056570,1.000000,-0.297636,0.191216,-0.293945,0.034896,-0.013194,0.114960,...,-0.036632,-0.050675,0.122161,0.090816,0.176777,0.122740,0.120412,0.120591,0.005067,-0.000698
latitude,-0.000609,0.100549,-0.065959,-0.297636,1.000000,-0.636842,0.326892,-0.049971,0.044596,0.014907,...,0.030909,0.041856,-0.102094,0.332160,0.040295,-0.102112,-0.099592,-0.099741,-0.008802,-0.004055
longitude,-0.037761,-0.201589,-0.203317,0.191216,-0.636842,1.000000,-0.370838,-0.053099,-0.072321,-0.046790,...,-0.017269,-0.023788,0.077633,-0.242769,-0.045631,0.077366,0.075769,0.075887,0.019391,0.016343
addressaccuracy,0.001047,0.182005,-0.007657,-0.293945,0.326892,-0.370838,1.000000,-0.047380,0.025230,0.011914,...,0.017051,0.024807,-0.102930,0.062615,-0.063854,-0.103464,-0.101933,-0.102001,-0.022966,-0.020735
nummosquitos,0.119550,0.001565,-0.024558,0.034896,-0.049971,-0.053099,-0.047380,1.000000,0.228145,0.132252,...,0.032070,0.025580,-0.024790,0.070036,0.065489,-0.025992,-0.027473,-0.027661,0.023507,0.019342
wnvpresent,0.070844,0.037863,-0.015154,-0.013194,0.044596,-0.072321,0.025230,0.228145,1.000000,0.054519,...,0.001004,0.023275,-0.016182,0.043937,0.019076,-0.017111,-0.018705,-0.019354,-0.022644,-0.010627
year,0.963354,0.010502,-0.011213,0.114960,0.014907,-0.046790,0.011914,0.132252,0.054519,1.000000,...,-0.139781,-0.167954,0.069486,0.644111,0.749867,0.070937,0.069011,0.070315,-0.061349,-0.066886


In [43]:
features = ['year', 'month', 'day', 'tmax_x',
       'tmin_x', 'tavg_x', 'depart_x', 'dewpoint_x', 'wetbulb_x', 'heat_x',
        'cool_x', 'sunrise_x', 'sunset_x',
       'snowfall_x', 'preciptotal_x', 'stnpressure_x',
       'resultspeed_x', 'resultdir_x', 'avgspeed_x', 'tmax_y', 'tmin_y',
       'tavg_y', 'depart_y', 'dewpoint_y', 'wetbulb_y', 'heat_y', 'cool_y', 
        'sunrise_y', 'sunset_y',
       'snowfall_y', 'preciptotal_y', 'stnpressure_y',
       'resultspeed_y', 'resultdir_y', 'avgspeed_y', 'species_culex_pipiens',
       'species_culex_pipiens/restuans', 'species_culex_restuans',
       'species_culex_salinarius', 'species_culex_tarsalis',
       'species_culex_territans', 'distance', 'distance_3', 'distance_5',
       'dist_species_culex_pipiens', 'dist_species_culex_pipiens/restuans',
       'dist_species_culex_restuans', 'dist_species_culex_salinarius',
       'dist_species_culex_tarsalis', 'dist_species_culex_territans',
       'temp_and_precip_x', 'trap_and_precip_y', 'block_and_precip_x',
       'trap_and_temp_x', 'trap_and_temp_y',
        'month_and_dist', 
        'month_dist_temp_x', 'month_dist_temp_y',
       'month_dist_maxtemp_x', 'month_dist_maxtemp_y', ]

X = train[features]
y = train.nummosquitos

In [44]:
train.columns

Index(['index', 'date', 'address', 'block', 'street', 'trap',
       'addressnumberandstreet', 'latitude', 'longitude', 'addressaccuracy',
       'nummosquitos', 'wnvpresent', 'year', 'month', 'day', 'tmax_x',
       'tmin_x', 'tavg_x', 'depart_x', 'dewpoint_x', 'wetbulb_x', 'heat_x',
       'cool_x', 'sunrise_x', 'sunset_x', 'codesum_x', 'depth_x', 'water1_x',
       'snowfall_x', 'preciptotal_x', 'stnpressure_x', 'sealevel_x',
       'resultspeed_x', 'resultdir_x', 'avgspeed_x', 'tmax_y', 'tmin_y',
       'tavg_y', 'depart_y', 'dewpoint_y', 'wetbulb_y', 'heat_y', 'cool_y',
       'sunrise_y', 'sunset_y', 'codesum_y', 'depth_y', 'water1_y',
       'snowfall_y', 'preciptotal_y', 'stnpressure_y', 'sealevel_y',
       'resultspeed_y', 'resultdir_y', 'avgspeed_y', 'species_culex_pipiens',
       'species_culex_pipiens/restuans', 'species_culex_restuans',
       'species_culex_salinarius', 'species_culex_tarsalis',
       'species_culex_territans', 'distance', 'distance_3', 'distance_5',
 

In [45]:
X_train, X_test, y_train, y_test = train_test_split( X, y, 
                                                   random_state = 6)

In [46]:
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

In [47]:
rd = RidgeCV(alphas = [.1, .5, 1])
rd.fit(X_train, y_train)
print(rd.score(X_train, y_train))
rd.score(X_test, y_test)

0.1016290399132086


0.09306349937301606

In [48]:
rd.coef_


array([  2.07747252,   2.38184887,   0.50741971, -11.16913625,
        -9.89926433,   8.24345007,  -0.81144202,   0.85828433,
        -4.98546666,  -7.70116422,   7.02138143,  -1.82955895,
         0.15308572,   0.28387552,  -0.67164537,   0.22057227,
         1.00796023,   1.66520133,  -2.13809597,  -3.97403508,
        -0.85427978,   0.37693452,   0.        ,  -3.5751169 ,
         5.81004667,   2.66815888,   1.70318576,   0.        ,
         0.        ,   0.        ,   0.44997159,   0.24061797,
        -2.11181648,  -1.90815657,   3.84866895,   0.95788759,
         0.43417812,  -1.05241257,  -0.51947567,  -0.06339131,
        -0.58097541,   2.3203377 ,   0.98299813,  -1.16206675,
        -0.43966414,   1.57352066,   1.06956196,   0.01126028,
         0.06267024,  -0.14762762,   6.83661322,  -1.14188909,
        -0.08106792,  -0.28664929,   1.88431189,  -0.84179614,
         7.02868058, -10.17084783,   0.11202251,   0.63066133])

In [49]:
cross_val_score(rd, X_test, y_test, cv = 5)

array([0.07499683, 0.05835795, 0.0610773 , 0.08141616, 0.0776309 ])
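
The cells above scale the data once and then cross-validate on the held-out split. As a hedged alternative, the sketch below wraps the scaler and RidgeCV in a pipeline so the scaler is refit inside every fold. The data is synthetic and the names X_demo, y_demo, and model are placeholders, so the scores say nothing about the mosquito data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 0.5, 1.0]))
scores = cross_val_score(model, X_demo, y_demo, cv=5)  # R^2 per fold
print(scores.round(3))
```

In the notebook's setting, the pipeline would be cross-validated on the training split, leaving the held-out split for a single final check.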

### KMeans Clustering

Since our attempt to regress the number of mosquitoes failed, we decided to cluster the traps by geographic location, so that the number of mosquitoes could be inferred from the average count across the training traps in each cluster. DBSCAN is imported below, but the cells that follow fit KMeans with 12 clusters on standardized latitude and longitude.

In [50]:
from sklearn.cluster import DBSCAN, KMeans

In [51]:
km = KMeans(n_clusters = 12)

In [52]:
features = ["latitude", 'longitude']
# .copy() gives independent frames, so adding a column does not trigger SettingWithCopyWarning
new = train[features].copy()
new2 = test[features].copy()
new['list'] = "train"
new2['list'] = 'test'

In [53]:

frames = [new, new2]  # renamed from list to avoid shadowing the built-in
cluster_data = pd.concat(frames, axis = 0)
features = ["latitude", 'longitude']
cluster_data1 = cluster_data[features]

In [54]:
ss = StandardScaler()
cluster_data1 = ss.fit_transform(cluster_data1)

clusters =  km.fit(cluster_data1)

In [55]:
km.cluster_centers_

array([[ 0.94178236, -0.63649406],
       [-1.35960128,  1.81890463],
       [-0.59252078, -0.63654908],
       [-1.24828552,  0.17871208],
       [ 0.38612606,  0.57607898],
       [ 1.17854368, -2.47861561],
       [ 1.10578985, -1.24563248],
       [ 1.23755019,  0.15897325],
       [-0.74438126,  1.12685943],
       [ 0.36267816, -0.13569519],
       [-1.3701292 ,  1.18437632],
       [-0.29841694,  0.38044286]])

In [56]:
#cluster_data['cluster'] = 
km.labels_

array([6, 6, 6, ..., 4, 4, 4], dtype=int32)
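
KMeans starts from randomly chosen centroids, so without a fixed seed the cluster numbering can change from run to run. The sketch below, on made-up coordinates, shows the same scale-then-fit steps with random_state set. The coordinates, the variable names, and the choice of two clusters are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented latitude/longitude pairs forming two well-separated groups.
coords = np.array([
    [41.95, -87.80], [41.96, -87.81], [41.94, -87.79],
    [41.70, -87.60], [41.71, -87.61], [41.69, -87.59],
])

# Standardize so latitude and longitude contribute on the same scale.
scaled = StandardScaler().fit_transform(coords)

# random_state pins the initialization, making the labels repeatable.
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = km_demo.labels_
print(labels)
```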

In [57]:
cluster_data['cluster'] = km.labels_

In [58]:
train['cluster'] = cluster_data[cluster_data['list']  =='train']['cluster']
test['cluster'] = cluster_data[cluster_data['list']  =='test']['cluster']
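
The section intro describes inferring mosquito counts from cluster averages, but the cells above stop at assigning cluster labels. The sketch below shows one way that last step could look. add_cluster_mean and the cluster_mean_mosquitos column name are hypothetical, and the toy frames are invented.

```python
import pandas as pd


def add_cluster_mean(train_df, test_df, cluster_col="cluster",
                     target_col="nummosquitos",
                     new_col="cluster_mean_mosquitos"):
    """Attach the training-set mean of target_col per cluster to both frames."""
    # Means come from train only, since test has no mosquito counts.
    means = train_df.groupby(cluster_col)[target_col].mean()
    train_out = train_df.copy()
    test_out = test_df.copy()
    train_out[new_col] = train_out[cluster_col].map(means)
    # Clusters that never appear in train map to NaN.
    test_out[new_col] = test_out[cluster_col].map(means)
    return train_out, test_out


# Toy data for illustration only.
toy_train = pd.DataFrame({"cluster": [0, 0, 1], "nummosquitos": [2, 4, 10]})
toy_test = pd.DataFrame({"cluster": [1, 0, 2]})
toy_train_out, toy_test_out = add_cluster_mean(toy_train, toy_test)
print(toy_test_out)
```

On the training rows, each mean includes the row's own count, so a feature built this way would look more informative on train than it can be on test.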

**Saving the Next CSV for Analysis Notebook**

In [59]:
train.to_csv('./data/trainw.csv')
test.to_csv('./data/testw.csv')

In [60]:
train.columns

Index(['index', 'date', 'address', 'block', 'street', 'trap',
       'addressnumberandstreet', 'latitude', 'longitude', 'addressaccuracy',
       'nummosquitos', 'wnvpresent', 'year', 'month', 'day', 'tmax_x',
       'tmin_x', 'tavg_x', 'depart_x', 'dewpoint_x', 'wetbulb_x', 'heat_x',
       'cool_x', 'sunrise_x', 'sunset_x', 'codesum_x', 'depth_x', 'water1_x',
       'snowfall_x', 'preciptotal_x', 'stnpressure_x', 'sealevel_x',
       'resultspeed_x', 'resultdir_x', 'avgspeed_x', 'tmax_y', 'tmin_y',
       'tavg_y', 'depart_y', 'dewpoint_y', 'wetbulb_y', 'heat_y', 'cool_y',
       'sunrise_y', 'sunset_y', 'codesum_y', 'depth_y', 'water1_y',
       'snowfall_y', 'preciptotal_y', 'stnpressure_y', 'sealevel_y',
       'resultspeed_y', 'resultdir_y', 'avgspeed_y', 'species_culex_pipiens',
       'species_culex_pipiens/restuans', 'species_culex_restuans',
       'species_culex_salinarius', 'species_culex_tarsalis',
       'species_culex_territans', 'distance', 'distance_3', 'distance_5',
 

In [61]:
test.columns

Index(['index', 'id', 'date', 'address', 'block', 'street', 'trap',
       'addressnumberandstreet', 'latitude', 'longitude', 'addressaccuracy',
       'year', 'month', 'day', 'tmax_x', 'tmin_x', 'tavg_x', 'depart_x',
       'dewpoint_x', 'wetbulb_x', 'heat_x', 'cool_x', 'sunrise_x', 'sunset_x',
       'codesum_x', 'depth_x', 'water1_x', 'snowfall_x', 'preciptotal_x',
       'stnpressure_x', 'sealevel_x', 'resultspeed_x', 'resultdir_x',
       'avgspeed_x', 'tmax_y', 'tmin_y', 'tavg_y', 'depart_y', 'dewpoint_y',
       'wetbulb_y', 'heat_y', 'cool_y', 'sunrise_y', 'sunset_y', 'codesum_y',
       'depth_y', 'water1_y', 'snowfall_y', 'preciptotal_y', 'stnpressure_y',
       'sealevel_y', 'resultspeed_y', 'resultdir_y', 'avgspeed_y',
       'species_culex_pipiens', 'species_culex_pipiens/restuans',
       'species_culex_restuans', 'species_culex_salinarius',
       'species_culex_tarsalis', 'species_culex_territans',
       'species_unspecified_culex', 'distance', 'distance_3', 'distance_