# Logistic Regression Model
Logistic regression is a generalized linear model with a straightforward implementation and fast compute times. These qualities make it a great benchmark before moving on to more complex models. Its strengths include low variance, resilience against overfitting, and a probabilistic output that is easy to interpret.


In [1]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit


# Loading Time Series Split assets

In [2]:
X_train = pd.read_pickle('../assets/X_train.pkl')
X_test = pd.read_pickle('../assets/X_test.pkl')
y_train = pd.read_pickle('../assets/y_train.pkl')
y_test = pd.read_pickle('../assets/y_test.pkl')

In [3]:
X_train.columns


Index(['Day_length', 'Tmax', 'Tmin', 'Tavg', 'ResultSpeed', 'ResultDir',
       'AvgSpeed', 'Sunset', 'Sunrise', 'Heat', 'Depart', 'DewPoint',
       'WetBulb', 'Cool', 'PrecipTotal', 'StnPressure', 'Latitude',
       'Longitude', 'Month', 'Day_length_exp', 'Tavg_shift', 'Heat_exp',
       'Cool_shift', 'Tmax_shift', 'Tmin_shift', 'Depart_shift',
       'ResultSpeed_shift', 'ResultDir_exp', 'PrecipTotal_exp', 'WetBulb_exp',
       'Species_CULEX ERRATICUS', 'Species_CULEX PIPIENS',
       'Species_CULEX PIPIENS/RESTUANS', 'Species_CULEX RESTUANS',
       'Species_CULEX SALINARIUS', 'Species_CULEX TARSALIS',
       'Species_CULEX TERRITANS'],
      dtype='object')

# Setting up a pipeline
We optimize the model first by scaling our features.
Next we try both L1 and L2 penalties to determine the better regularization term for our model.
Finally we iterate over 50 values of C between .001 and .95. C is the inverse regularization strength, which controls how strongly the coefficients in our model are shrunk: a higher C means weaker regularization. After grid searching, our best C turned out to be .95, the largest value in our grid, which means the weakest regularization we searched was preferred.
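
To make the role of C concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not the West Nile features). It fits the same liblinear model at three values of C and records the size of the coefficient vector, which should grow as regularization weakens.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's training set)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in (0.001, 0.1, 0.95):
    model = LogisticRegression(solver='liblinear', penalty='l2', C=C)
    model.fit(X_demo, y_demo)
    # L2 norm of the fitted coefficients: smaller C shrinks them harder
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(C, round(coef_norms[C], 3))
```

The exact norms depend on the data, but the ordering (smaller C, smaller coefficients) is the behaviour the grid search is trading off.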

In [4]:
pipe = Pipeline([
    ('ss',StandardScaler()),
    ('lr',LogisticRegression(solver='liblinear')),    
])

# Setting a range of hyperparameters

In [5]:
param_grid =  {
    'lr__penalty':['l1','l2'],
    'lr__C': np.linspace(.001,.95,50)
        
}
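
As a quick sanity check on the size of this search, the sketch below counts the candidate combinations with scikit-learn's `ParameterGrid`. The grid is redefined here so the snippet runs on its own, and the three folds are taken from the fit log shown further down.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so this snippet is self-contained
demo_grid = ParameterGrid({
    'lr__penalty': ['l1', 'l2'],
    'lr__C': np.linspace(.001, .95, 50),
})

n_candidates = len(demo_grid)   # 2 penalties x 50 values of C
n_fits = n_candidates * 3       # 3 folds, as reported in the fit log
print(n_candidates, n_fits)
```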

# Gridsearching
We grid search our parameter grid to get the best hyperparameters for our model.
We set the scoring to roc_auc because of the low count of mosquitoes with West Nile in our training data. If we scored on accuracy, a model that never predicts West Nile would still look good, because 95% of the observations do not contain West Nile. ROC AUC measures how well the model ranks positive observations above negative ones, so a model that cannot separate out West Nile scores no better than chance. That makes it a better metric for our current application.
We also use a time series split here because we felt that there was a strong spatiotemporal relationship in the data.
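
The sketch below shows what a time series split does to row indices, using a toy array of 12 ordered rows (an assumption for illustration). `n_splits=3` is set explicitly to match the three folds in the fit log, since the default may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 rows assumed to be in chronological order
X_demo = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, test_idx in tscv.split(X_demo):
    # Each fold trains on an expanding window and validates on the rows after it
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print('train:', train_idx.tolist(), 'validate:', test_idx.tolist())
```

Every validation row comes after every training row in the same fold, so the model is never scored on observations that precede its training data. This only holds if the rows are sorted by date before splitting.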

In [6]:
gs = GridSearchCV(pipe, param_grid=param_grid,verbose=1,scoring='roc_auc', cv=TimeSeriesSplit())

In [7]:
gs.fit(X_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:   17.9s finished


GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=3),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'lr__penalty': ['l1', 'l2'], 'lr__C': array([0.001  , 0.02037, 0.03973, 0.0591 , 0.07847, 0.09784, 0.1172 ,
       0.13657, 0.15594, 0.17531, 0.19467, 0.21404, 0.23341, 0.25278,
       0.27214, 0.29151, 0.31088, 0.33024, 0.34961, 0.36898, 0.38835,
       0.40771, 0.42708, 0.44645, 0.4658...69, 0.79506,
       0.81443, 0.8338 , 0.85316, 0.87253, 0.8919 , 0.91127, 0.93063,
       0.95   ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_

# Scoring 
The scoring metric used to evaluate our model was roc_auc. ROC AUC is an aggregate measure of performance across all classification thresholds and returns a float between 0 and 1. A score of 0.5 corresponds to random guessing, and a score of 0 means the model ranks every positive observation below every negative one. A score of 1 indicates a classifier that ranks every positive observation above every negative one.
We scored .87 on our training data and .76 on our test data. We interpreted the outcome of our modeling with cautious optimism. We feel confident that this model can be used to shape decisions regarding the frequency and location of pesticide application.
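
The difference between accuracy and ROC AUC on imbalanced data can be sketched with made-up labels (an assumption: 95 negatives and 5 positives, mirroring the roughly 95% negative rate mentioned earlier). A model that never flags West Nile gets high accuracy but only a chance-level ROC AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels for illustration: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the negative class with the same score
always_negative = np.zeros(100, dtype=int)
constant_scores = np.zeros(100)

acc = accuracy_score(y_true, always_negative)
auc = roc_auc_score(y_true, constant_scores)
print(acc, auc)  # 0.95 0.5
```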

In [8]:
gs.score(X_train,y_train)

0.8658258730915267

In [9]:
gs.score(X_test,y_test)

0.7611057928963493

# Checking the best parameters
Checking the optimal parameters for our model as determined by our grid search.
Our C value is .95, the largest value in our grid, which indicates that the weakest regularization we searched was selected. The penalty being used is L1, which can shrink some coefficients to exactly zero and so acts as a form of feature selection.

In [10]:
gs.best_params_

{'lr__C': 0.95, 'lr__penalty': 'l1'}
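
For intuition about what choosing between the two penalties implies, here is a minimal sketch on synthetic data (an assumption, not the notebook's features) that counts how many coefficients each penalty sets to exactly zero. The value `C=0.05` is chosen only to make the effect visible and is not taken from the grid search.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 informative features and 17 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for penalty in ('l1', 'l2'):
    model = LogisticRegression(solver='liblinear', penalty=penalty, C=0.05)
    model.fit(X_demo, y_demo)
    # Count coefficients that are exactly zero under each penalty
    zero_counts[penalty] = int(np.sum(model.coef_ == 0))
    print(penalty, zero_counts[penalty])
```

L1 typically zeroes out uninformative features, while L2 shrinks them without removing them.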

# Loading coefficients into a dataframe
Interpreting the feature weights gave us insight into what the model was using to predict the presence of West Nile. Features like the day length expanded mean and wet bulb temperature contributed positively to our predictions. An interesting note is that the species Culex pipiens, a known carrier of the disease, is also a positive indicator for West Nile in our model.

In [28]:
coefs = pd.DataFrame(data=gs.best_estimator_.named_steps['lr'].coef_.T,
                     index=X_train.columns,
                     columns=['importance']
                    )

In [None]:
coefs.head()
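
Because `head()` only shows the first five rows in column order, sorting by weight is one way to surface the strongest features. The sketch below uses a small hypothetical frame (the feature names and values are invented, not the fitted model's) with the same single `importance` column.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for illustration only
coefs_demo = pd.DataFrame(
    {'importance': [0.8, -0.3, 0.1, 0.0, -0.6]},
    index=['feat_a', 'feat_b', 'feat_c', 'feat_d', 'feat_e'],
)

# Largest positive weights first, largest negative weights last
ranked = coefs_demo.sort_values('importance', ascending=False)

# Features whose coefficient is exactly zero (as an L1 penalty can produce)
dropped = coefs_demo[coefs_demo['importance'] == 0].index.tolist()

# exp(coefficient) is the multiplicative change in the odds per unit of the
# (standardized) feature, holding the others fixed
odds_ratios = np.exp(ranked['importance'])
print(ranked)
print(dropped)
```

Since the pipeline standardizes the inputs, the weights are on a comparable scale, which is what makes ranking them meaningful.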

In [13]:
# with open('../assets/logistic_regression.pkl','wb+') as f:
#     pickle.dump(gs,f)