# Appendix ii - EDA: Searching for Optimal Lags Programmatically

Description: Searching for optimal feature lags to utilize in feature engineering.

*To Note: None of the engineered features here made it into our production model. Because they gave better performance and made the model easier to interpret, we moved forward with the feature lags previously used by Lebl, et al.*

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**Importing Final Dataframe**

In [2]:
df = pd.read_csv('../data/final_dataframe_features.csv')

### 'Grid Searching' Through Rolling & Shifted Feature Means

We created a function to search through different combinations of rolling-mean windows and day shifts to see which configuration gives the highest correlation with the number of mosquitos per trap. We'll compare that against the expanding mean of each feature and its correlation with `NumMosquitos`.
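
To make the two transformations concrete, here is a minimal sketch on a toy series. The values, the window `k`, and the shift `i` are made up for illustration and are not taken from our dataset. `rolling(k).mean()` averages the current row and the `k - 1` rows before it, `shift(i)` moves that average `i` rows later, and `expanding().mean()` averages everything seen so far.

```python
import pandas as pd

# Toy series, purely illustrative (not taken from the project data)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

k, i = 2, 1  # hypothetical window size and shift

# Mean of the current row and the previous k - 1 rows
rolled = s.rolling(k).mean()

# The same rolling mean moved i rows later, so each row only sees earlier values
lagged = rolled.shift(i)

# Mean of every value from the start of the series up to the current row
expanded = s.expanding().mean()

demo = pd.DataFrame({'raw': s, 'rolled': rolled, 'lagged': lagged, 'expanding': expanded})
print(demo)

# Number of leading rows with no lagged value: i + k - 1
print(lagged.isna().sum())
```

The lagged column starts with `i + k - 1` missing rows, which is why the same number of leading rows has to be trimmed from `NumMosquitos` before the two series can be correlated.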

Based on research from *Lebl, et al*, we believe the following features and ranges will be most telling for the number of mosquitos per trap:
- Daytime Length averaged 4-5 weeks prior
- Temperature averaged 2 weeks prior
- Wind speed averaged over 3 weeks prior

We'll then measure the difference between the most highly correlated feature lags and the un-lagged features to get an idea of how much these prior weather events tell us about mosquito populations.
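
One way to quantify that difference is sketched below. The helper name `lag_gain` and the synthetic data are assumptions for illustration only; the sketch compares the absolute correlation of a lagged rolling mean with that of the raw feature.

```python
import numpy as np
import pandas as pd

def lag_gain(feature, target, window, shift):
    """Return (raw_corr, lagged_corr, gain in absolute correlation).

    Hypothetical helper, not part of the original notebook.
    """
    raw_corr = feature.corr(target)
    lagged = feature.rolling(window).mean().shift(shift)
    # Series.corr ignores rows where either side is NaN
    lagged_corr = lagged.corr(target)
    return raw_corr, lagged_corr, abs(lagged_corr) - abs(raw_corr)

# Synthetic example: the target follows the feature with a 3-row delay
rng = np.random.default_rng(0)
feature = pd.Series(rng.normal(size=200))
target = feature.shift(3).fillna(0) + rng.normal(scale=0.1, size=200)

raw_corr, lagged_corr, gain = lag_gain(feature, target, window=1, shift=3)
print(raw_corr, lagged_corr, gain)
```

On the project data, a call such as `lag_gain(df['Tavg'], df['NumMosquitos'], window=1, shift=7)` would report how much a given lag adds over the un-lagged column.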

In [3]:
def hi_corr(f1, f2, rm_min=3, rm_max=7, min_val=1, max_val=30):
    """Print the strongest correlation between f2 and a rolled, shifted copy of f1."""
    results = []  # (correlation, rolling window k, shift i)
    for k in range(rm_min, rm_max + 1):
        for i in range(min_val, max_val + 1):
            # Rolling mean over k rows, shifted i rows; drop the leading NaNs
            lagged = f1.rolling(k).mean().shift(i).dropna()
            # Trim the same i + k - 1 leading rows from the target so lengths match
            target = f2.drop(f2.index[:i + k - 1])
            results.append((np.corrcoef(lagged, target)[0, 1], k, i))
    corrs = [r[0] for r in results]
    corr_max, corr_min = max(corrs), min(corrs)
    # Report whichever extreme has the larger magnitude
    best = corr_min if abs(corr_min) > corr_max else corr_max
    _, k, i = results[corrs.index(best)]
    print(best, 'Rolling Mean: {}'.format(k), 'Shifted Value: {}'.format(i))
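
Because `hi_corr` only prints the single best configuration, it can be useful to keep the whole grid for inspection. The following is a sketch of one possible variant, not something we ran for this appendix; the name `corr_grid` is hypothetical and the synthetic series stand in for real columns.

```python
import numpy as np
import pandas as pd

def corr_grid(f1, f2, rm_min=3, rm_max=7, min_val=1, max_val=30):
    """Return a DataFrame of correlations: rows are rolling windows, columns are shifts.

    Hypothetical companion to hi_corr, not part of the original notebook.
    """
    rows = {}
    for k in range(rm_min, rm_max + 1):
        rows[k] = {
            # Series.corr skips the leading NaN rows pairwise
            i: f1.rolling(k).mean().shift(i).corr(f2)
            for i in range(min_val, max_val + 1)
        }
    grid = pd.DataFrame.from_dict(rows, orient='index')
    grid.index.name = 'rolling_window'
    grid.columns.name = 'shift'
    return grid

# Synthetic check: the target echoes the feature five rows later
rng = np.random.default_rng(1)
feature = pd.Series(rng.normal(size=300))
target = feature.shift(5).fillna(0) + rng.normal(scale=0.1, size=300)

grid = corr_grid(feature, target, rm_min=1, rm_max=3, min_val=1, max_val=10)

# Position of the largest absolute correlation in the grid
r, c = np.unravel_index(np.nanargmax(grid.abs().to_numpy()), grid.shape)
best_window, best_shift = grid.index[r], grid.columns[c]
print(best_window, best_shift, grid.iloc[r, c])
```

The returned frame can be passed to `sns.heatmap(grid)` to see whether the best configuration is a clear peak or just the edge of the search range.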

### Daylength Related

Examining `Day_length`

In [4]:
hi_corr(df['Day_length'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=21, max_val=42)

0.03351765362052723 Rolling Mean: 1 Shifted Value: 40


In [5]:
np.corrcoef(df['Day_length'].expanding().mean(), df['NumMosquitos'])

array([[1.        , 0.04006146],
       [0.04006146, 1.        ]])

### Temperature Related

Examining `Tavg`

In [8]:
hi_corr(df['Tavg'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=7, max_val=21)

0.0666091429540884 Rolling Mean: 1 Shifted Value: 7


In [9]:
np.corrcoef(df['Tavg'].expanding().mean(), df['NumMosquitos'])

array([[1.        , 0.05431909],
       [0.05431909, 1.        ]])

In [10]:
df['Month'].unique()

array([ 5,  6,  7,  8,  9, 10])

Examining `Heat`

In [12]:
hi_corr(df['Heat'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=7, max_val=21)

-0.041308292774105575 Rolling Mean: 7 Shifted Value: 8


In [13]:
np.corrcoef(df['Heat'].expanding().mean(), df['NumMosquitos'])

array([[ 1.        , -0.04675775],
       [-0.04675775,  1.        ]])

Examining `Cool`

In [14]:
hi_corr(df['Cool'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=7, max_val=21)

0.06816173690042639 Rolling Mean: 1 Shifted Value: 7


In [15]:
np.corrcoef(df['Cool'].expanding().mean(), df['NumMosquitos'])

array([[1.        , 0.05245493],
       [0.05245493, 1.        ]])

Examining `Tmax`

In [16]:
hi_corr(df['Tmax'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=7, max_val=21)

0.059233391961211114 Rolling Mean: 1 Shifted Value: 7


In [17]:
np.corrcoef(df['Tmax'], df['NumMosquitos'])

array([[1.        , 0.05710975],
       [0.05710975, 1.        ]])

In [18]:
np.corrcoef(df['Tmax'].expanding().mean(), df['NumMosquitos'])

array([[1.       , 0.0469504],
       [0.0469504, 1.       ]])

Examining `Tmin`

In [19]:
hi_corr(df['Tmin'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=7, max_val=21)

0.06767516243864329 Rolling Mean: 1 Shifted Value: 10


In [20]:
np.corrcoef(df['Tmin'].expanding().mean(), df['NumMosquitos'])

array([[1.        , 0.05988869],
       [0.05988869, 1.        ]])

Examining `Depart`

In [22]:
hi_corr(df['Depart'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=7, max_val=21)

0.037569483464707176 Rolling Mean: 1 Shifted Value: 7


In [23]:
np.corrcoef(df['Depart'].expanding().mean(), df['NumMosquitos'])

array([[1.        , 0.00508592],
       [0.00508592, 1.        ]])

### Wind Related

Examining `ResultSpeed` (Wind Speed)

In [24]:
hi_corr(df['ResultSpeed'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=14, max_val=28)

-0.009319060722963545 Rolling Mean: 7 Shifted Value: 28


In [25]:
np.corrcoef(df['ResultSpeed'].expanding().mean(), df['NumMosquitos'])

array([[ 1.        , -0.00656348],
       [-0.00656348,  1.        ]])

Examining `ResultDir`

In [27]:
hi_corr(df['ResultDir'], df['NumMosquitos'], rm_min=1, rm_max=7, min_val=14, max_val=28)

0.00422555406684047 Rolling Mean: 1 Shifted Value: 18


In [28]:
np.corrcoef(df['ResultDir'].expanding().mean(), df['NumMosquitos'])

array([[1.       , 0.0291355],
       [0.0291355, 1.       ]])

### Precipitation Related

Examining `PrecipTotal`

In [29]:
hi_corr(df['PrecipTotal'], df['NumMosquitos'], rm_min=1, rm_max=100, min_val=7, max_val=100)

0.06796524984931208 Rolling Mean: 50 Shifted Value: 42


In [30]:
np.corrcoef(df['PrecipTotal'].expanding().mean(), df['NumMosquitos'])

array([[1.        , 0.07840287],
       [0.07840287, 1.        ]])

Examining `WetBulb`
>The lowest temperature that can be reached by evaporating water into the air. Note: the wet bulb temperature will always be less than or equal to the temperature. It feels more comfortable when wet-bulb temperature is low. [Source](http://apollo.lsc.vsc.edu/classes/met130/notes/chapter4/wet_bulb.html)

In [32]:
hi_corr(df['WetBulb'], df['NumMosquitos'], rm_min=1, rm_max=100, min_val=7, max_val=100)

0.07081425525830452 Rolling Mean: 100 Shifted Value: 7


In [33]:
np.corrcoef(df['WetBulb'].expanding().mean(), df['NumMosquitos'])

array([[1.        , 0.05820184],
       [0.05820184, 1.        ]])

Examining `DewPoint`
> The atmospheric temperature (varying according to pressure and humidity) below which water droplets begin to condense and dew can form.

In [34]:
hi_corr(df['DewPoint'], df['NumMosquitos'], rm_min=1, rm_max=100, min_val=7, max_val=100)

  c /= stddev[:, None]
  c /= stddev[None, :]


-0.002170705197829583 Rolling Mean: 9 Shifted Value: 74


In [35]:
np.corrcoef(df['DewPoint'].expanding().mean(), df['NumMosquitos'])

array([[1.00000000e+00, 8.56341637e-04],
       [8.56341637e-04, 1.00000000e+00]])

### Analysis of Shifted & Expanding Features Against `NumMosquitos`

Interestingly, many of the shifted features with the largest correlation with `NumMosquitos` weren't shifted to the same degree as found in *Lebl, et al*. After running several models on features engineered to the specifications we arrived at above, we concluded that a better-fitting model was possible with more thoughtfully engineered data.

So, given the extensive domain knowledge of *Lebl, et al* in the field, we moved forward with their conclusions when engineering our features:

- Daytime Length averaged 4-5 weeks prior
- Temperature averaged 2 weeks prior
- Wind speed averaged over 3 weeks prior
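
A minimal sketch of how these three lags could be engineered is shown below. It assumes a weather table with one row per calendar day. It also reads "4-5 weeks prior" as a 7-day mean covering days 28 through 34 before the observation, "2 weeks prior" as the mean of the previous 14 days, and "over 3 weeks prior" as the mean of the previous 21 days. Those readings, the synthetic data, and the `_lag` column names are assumptions for illustration, not the exact specification in *Lebl, et al* or our production code.

```python
import numpy as np
import pandas as pd

# Synthetic daily weather table standing in for the real data (one row per day)
rng = np.random.default_rng(2)
dates = pd.date_range('2013-05-01', periods=120, freq='D')
weather = pd.DataFrame({
    'Day_length': np.linspace(14.0, 15.2, 120),
    'Tavg': rng.normal(70, 8, size=120),
    'ResultSpeed': rng.normal(8, 3, size=120),
}, index=dates)

lagged = pd.DataFrame(index=weather.index)

# Day length: 7-day mean ending 28 days before each date (days 28-34 prior)
lagged['Day_length_lag'] = weather['Day_length'].rolling(7).mean().shift(28)

# Temperature: mean of the 14 days before each date
lagged['Tavg_lag'] = weather['Tavg'].rolling(14).mean().shift(1)

# Wind speed: mean of the 21 days before each date
lagged['ResultSpeed_lag'] = weather['ResultSpeed'].rolling(21).mean().shift(1)

print(lagged.dropna().head())
```

If the modelling frame holds several rows per date, the lags would need to be computed on a daily table like this one and then merged back on the date, since `rolling` and `shift` count rows rather than days.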