# Logistic Regression

In this notebook, I fit a Logistic Regression model which predicts crime occurrences in Chicago in 2017 based on weather features. I use a pipeline and grid search in order to optimize my hyperparameters and boost my accuracy score. I found that the Logistic Regression model had 58% accuracy for the training data, and 54% for the testing data.

## Importing python libraries and dataframes

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report
import pickle
import warnings
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv('../Data/train.csv')
test = pd.read_csv('../Data/test.csv')

In [3]:
with open('../Assets/columns.pkl', 'rb') as f:
    columns = pickle.load(f)

## Train Test Split

We are using 2016 data (train) to predict 2017 data (test).

In [4]:
X_train = train.drop('target', 1)
y_train = train[['target']]

In [5]:
X_test = test.drop('target', 1)
y_test = test[['target']]

## Baseline Accuracy

In [8]:
y_test.target.value_counts(normalize = True)

0.0    0.96339
1.0    0.03661
Name: target, dtype: float64

We can see that the baseline accuracy for this model is 96.34%. 

## Modeling

Utilizing an easily interpretable model like logistic regression for binary outcomes allows us to make inferences for how a feature impacts the classification probabilities. Additionally, the beta coefficients are easy to interpret compared to other models. In our case, we can use this model to predict whether or not a crime occured in Chicago in 2017 based on weather features. Our negative class represents no crime occurring while our positive class represents a crime occurring.

I use pipeline so that I can scale my data and instantiate the model in one step, then use grid search so that I can tune hyperparameters in order to optimize my accuracy score.

In [7]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression())
])

In [8]:
params= {
    'lr__penalty': ['l1', 'l2'],
    'lr__C': [.001, .01, .03, .05, .75, .9]}

In [12]:
LogReg = GridSearchCV(pipe, param_grid = params, cv = TimeSeriesSplit(5))
LogReg.fit(X_train_sc, y_train)

GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
       error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'lr__penalty': ['l1', 'l2'], 'lr__C': [0.001, 0.01, 0.03, 0.05, 0.75, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [13]:
for params in LogReg.best_params_:
    print('The value for',params,'that had the highest accuracy score is',LogReg.best_params_[params])

The value for lr__C that had the highest accuracy score is 0.001
The value for lr__penalty that had the highest accuracy score is l1


These were the best parameters based on grid search. The hyperparameter C represents 1 over alpha, with alpha being the strength of regularization for large coefficients. By having a low C value, we can say that alpha is increasing, resulting in weaker penalization of large coefficients. 

For the penalty, grid search chose lasso regularization as its optimal parameter. This means that some coefficients will have values of 0, or having no weight. Because a ridge norm would drive down features without eradicating them, we can say that lasso is a more punishing regularization method.

In [14]:
print('The training score of the model is',LogReg.score(X_train, y_train))
print('The testing score of the model is',LogReg.score(X_test, y_test))

The training score of the model is 0.5866628612448054
The testing score of the model is 0.5449864854968743


Our model can accurately predict 58% of all crime occurrences in 2016 (train data). For the crime instances for
2017 (test data), the model was able to accurately predict 54% of all instances. The training score has a slightly higher score, indicating slight overfitting.

## Predict proba

In [28]:
LogReg.predict_proba(X_test)

array([[0.57964856, 0.42035144],
       [0.58021103, 0.41978897],
       [0.58029136, 0.41970864],
       ...,
       [0.59007934, 0.40992066],
       [0.59015912, 0.40984088],
       [0.59023889, 0.40976111]])

This probability matrix tells me which class each crime occurence is predicted to be in. For example, the first crime observation in my test dataframe has a 57% chance of being in class 0, meaning that that crime is predicted to not have occurred at that time.

## Finding coefficients

In [16]:
named_steps = LogReg.best_estimator_.named_steps['lr']
named_steps.coef_

array([[ 0.29596414, -0.00062126, -0.01306306,  0.        ,  0.09637038,
         0.00781081, -0.02868743,  0.        ,  0.        ,  0.00192942,
         0.        , -0.02793597,  0.        , -0.0023323 , -0.2286979 ,
         0.03376778,  0.        ,  0.        , -0.0003219 ,  0.00232864,
         0.05808837,  0.03745805,  0.        , -0.00475741]])

In [57]:
lg_coef_df = pd.DataFrame(named_steps.coef_, columns = columns).T
lg_coef_df['coefficients'] = lg_coef_df[0]
lg_coef_df.drop(0, axis=1, inplace=True)

In [58]:
lg_coef_df.sort_values(by = 'coefficients', ascending = False).head(10)

Unnamed: 0,coefficients
hr,0.295964
a_temp,0.09637
prev_7_day_avg_Temp,0.058088
prev_7_day_avg_Daylight,0.037458
sunrise,0.033768
a_wdsp,0.007811
prev_7_day_avg_Precip,0.002329
a_prcp,0.001929
prev_7_day_Rain_drizzle,0.0
daylight,0.0


After sorting the weight values in descending order, we can see which features have the highest weights in our model. In other words, these features had the biggest effect in determining whether a crime occurred or not. It seems that hour of the day and average temperature of the day are highly informative in predicting class. 

In [59]:
lg_coef_df.sort_values(by = 'coefficients', ascending = False).tail(10)

Unnamed: 0,coefficients
a_max,0.0
a_year,0.0
nighttime,-0.000322
da,-0.000621
a_thunder,-0.002332
prev_7_day_Snow,-0.004757
mo,-0.013063
a_rain_drizzle,-0.027936
a_mxpsd,-0.028687
beat_label,-0.228698


By looking at the tail of the dataframe, we can see the features with the lowest weights. Some features have 0 as their coefficient value, resulting from the lasso regularization we implemented in our grid search. To have a negative coefficient means that the probability of crime decreases as that feature's values increase. For example, beat label had the lowest weight in our model with a coefficient value of -0.22. This means that as beat label increases, the probability of crime occurrences decreases. This is not to say that beat areas with higher labels are safer. We can see that there is some correlation, but cannot determine causation without further examining confounding variables such as patrol patterns and how tightly each beat area is clustered.

## Putting predictions into confusion matrix

In [22]:
lg_test_predictions = LogReg.predict(X_test)

In [70]:
lg_cm = confusion_matrix(y_test, lg_test_predictions)

In [71]:
lg_cm_df = pd.DataFrame(lg_cm, columns=['predicted no crime', 'predicted crime'], index=['actual no crime', 'actual crime'])
lg_cm_df

Unnamed: 0,predicted no crime,predicted crime
actual no crime,1258271,1057086
actual crime,36468,51519


In [25]:
tn, fp, fn, tp = confusion_matrix(y_test, lg_test_predictions).ravel() 
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 1258271
False Positives: 1057086
False Negatives: 36468
True Positives: 51519


In [72]:
print(classification_report(y_test, LogReg.predict(X_test), target_names=['No Crime', 'Crime']))

              precision    recall  f1-score   support

    No Crime       0.97      0.54      0.70   2315357
       Crime       0.05      0.59      0.09     87987

   micro avg       0.54      0.54      0.54   2403344
   macro avg       0.51      0.56      0.39   2403344
weighted avg       0.94      0.54      0.67   2403344



We have a high number of false positives, and a relatively low number of false negatives. Having low false negatives means that in comparison to the 3 million rows, only about 36,400 were falsely predicted to not have occurred. The classification report indicates that we have a slightly higher sensitivity score in comparison to specificity. 

Our model did a good job at correctly predicting true negatives, but when it came to predicting true positives, it was not able to correctly predict as many.

## Saving dataframes

In [26]:
X_train.to_csv('../Data/X_train.csv')
X_test.to_csv('../Data/X_test.csv')
y_train.to_csv('../Data/y_train.csv')
y_test.to_csv('../Data/y_test.csv')