<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Gridsearch and Multinomial Models with SF Crime Data

_Authors: Joseph Nelson (DC), Sam Stack (DC)_

---

### Multinomial logistic regression models

So far, we have been using logistic regression for binary problems where there are only two class labels. Logistic regression can be extended to dependent variables with multiple classes.

There are two ways sklearn solves multiple-class problems with logistic regression: a multinomial loss or a "one vs. rest" (OvR) process where a model is fit for each target class vs. all the other classes. 

**Multinomial vs. OvR**
- (both) 'k' classes
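- (M) one model fit jointly over all 'k' classes (equivalent to 'k-1' sets of coefficients measured against 1 reference category)
- (OvR) 'k' binary models, one per class vs. all the others ('k*(k-1)/2' is the count for one-vs-one, a different scheme)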
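
As a rough check of how many models each strategy fits, the sketch below uses synthetic data (not the crime data) with `k = 4` classes. The dataset and variable names are assumptions for illustration, and `OneVsOneClassifier` is included only to show where a `k*(k-1)/2` count arises.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class problem (illustrative only, not the SF crime data)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=4, random_state=0)

# A single logistic regression fit over all classes: one coefficient row per class
joint = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One-vs-rest: one binary model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

# One-vs-one: one binary model per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

n_ovr_models = len(ovr.estimators_)
n_ovo_models = len(ovo.estimators_)
print(joint.coef_.shape, n_ovr_models, n_ovo_models)
```

With four classes, one-vs-rest fits four binary models while one-vs-one fits six.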

You will use grid search together with multiclass logistic regression to optimize a model that predicts the category (type) of crime from features recorded by the San Francisco Police Department.

Rather than use the [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html), for this lab we are going to practice building individual models optimized to predict on _One class vs. the rest_.

**Necessary lab imports**

In [7]:
import numpy as np
import pandas as pd
import patsy

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


import seaborn as sns

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1. Read in the data

In [8]:
crime_csv = '../datasets/sf_crime_train.csv'

In [9]:
sf_crime = pd.read_csv(crime_csv)
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [10]:
# There is a column that holds datetimes; check whether it is currently stored as an object
sf_crime.Dates.dtype

dtype('O')

### 2. Create column for hour, month, and year from 'Dates' column.

> *Hint: `pd.to_datetime` may or may not be helpful.*


In [11]:
# pd.to_datetime was not helpful here, so the strings are split manually
# (n is passed by keyword because recent pandas versions no longer accept it positionally)
sf_time = pd.DataFrame(sf_crime['Dates'].str.split(' ', n=1).tolist(), columns=['date', 'time'])
# sf_time is a dataframe where the date and time are in separate columns

sf_date = pd.DataFrame(sf_time['date'].str.split('/').tolist(), columns=['month', 'day', 'year'])
# sf_date is a dataframe where the month, day, and year are in separate columns (stored as strings)
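
If `pd.to_datetime` is used instead, the hour, month, and year can be pulled from the parsed values with the `.dt` accessor. This is a minimal sketch on three sample strings in the same format as `Dates`. The explicit `format` string is an assumption based on the rows shown above.

```python
import pandas as pd

# Sample strings in the same layout as the 'Dates' column
dates = pd.Series(['5/13/15 23:53', '4/29/15 20:00', '4/16/15 6:00'])

# Parse month/day/two-digit-year followed by a 24-hour time
parsed = pd.to_datetime(dates, format='%m/%d/%y %H:%M')

# Extract numeric components with the .dt accessor
time_parts = pd.DataFrame({'hour': parsed.dt.hour,
                           'month': parsed.dt.month,
                           'year': parsed.dt.year})
print(time_parts)
```

Unlike the string split, this yields integer columns and also provides the hour asked for in the section heading.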

In [12]:
# Merge data frames with individual time values back onto main df
sf_crime = sf_crime.merge(sf_date, left_index = True, right_index = True,how = 'outer')
sf_crime = sf_crime.merge(sf_time, left_index = True, right_index = True,how = 'outer')

In [13]:
# Check out Current dataframe if you are interested
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,month,day,year,date,time
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,5,13,15,5/13/15,23:53
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,5,13,15,5/13/15,23:53


In [14]:
# Dropping columns where time is expressed in human speak
sf_crime.drop(['Dates','date'], axis = 1, inplace = True)

### 3. Validate and clean the data.

In [15]:
sf_crime['Category'].value_counts()

LARCENY/THEFT                  4885
OTHER OFFENSES                 2291
NON-CRIMINAL                   2255
ASSAULT                        1536
VEHICLE THEFT                   967
VANDALISM                       877
BURGLARY                        732
WARRANTS                        728
SUSPICIOUS OCC                  592
MISSING PERSON                  535
DRUG/NARCOTIC                   496
ROBBERY                         465
FRAUD                           363
SECONDARY CODES                 261
WEAPON LAWS                     212
TRESPASS                        130
STOLEN PROPERTY                 111
SEX OFFENSES FORCIBLE           103
FORGERY/COUNTERFEITING           85
DRUNKENNESS                      74
KIDNAPPING                       50
PROSTITUTION                     44
DRIVING UNDER THE INFLUENCE      42
DISORDERLY CONDUCT               37
ARSON                            35
LIQUOR LAWS                      25
RUNAWAY                          16
BRIBERY                     

In [16]:
# One entry is labeled 'TRESPASSING' while all others are 'TRESPASS', and 'ASSAULT' is misspelled as 'ASSUALT' in places.

In [17]:
sf_crime['DayOfWeek'].value_counts()
# all days of week are there

Wednesday    2930
Friday       2733
Saturday     2556
Thursday     2479
Sunday       2456
Monday       2447
Tuesday      2399
Name: DayOfWeek, dtype: int64

In [18]:
sf_crime['PdDistrict'].value_counts()
# Values look good

SOUTHERN      3287
NORTHERN      2250
CENTRAL       2206
MISSION       2118
BAYVIEW       1678
INGLESIDE     1628
TARAVAL       1426
TENDERLOIN    1327
RICHMOND      1101
PARK           979
Name: PdDistrict, dtype: int64

In [19]:
sf_crime['Resolution'].value_counts()
# 1 non prosecuted.  Seems legit

NONE                                      12862
ARREST, BOOKED                             4455
UNFOUNDED                                   367
ARREST, CITED                               100
JUVENILE BOOKED                              94
EXCEPTIONAL CLEARANCE                        58
PSYCHOPATHIC CASE                            28
LOCATED                                      25
CLEARED-CONTACT JUVENILE FOR MORE INFO       10
NOT PROSECUTED                                1
Name: Resolution, dtype: int64

In [20]:
sf_crime[['X','Y']].describe()
# all coordinates appear to be legitimate

Unnamed: 0,X,Y
count,18000.0,18000.0
mean,-122.423639,37.768466
std,0.026532,0.024391
min,-122.513642,37.708154
25%,-122.434199,37.753838
50%,-122.416949,37.775608
75%,-122.406539,37.78539
max,-122.365565,37.819923


In [21]:
# Figuring out where that wrong data exists in the DataFrame
sf_crime[sf_crime['Category'] == 'ASSUALT']
# rows 2750 and 4330

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,month,day,year,time
2750,ASSUALT,AGGRAVATED ASSAULT WITH A DEADLY WEAPON,Wednesday,MISSION,NONE,3000 Block of 16TH ST,-122.421083,37.764911,4,29,15,20:00
4330,ASSUALT,THREATS AGAINST LIFE,Saturday,MISSION,"ARREST, BOOKED",16TH ST / CALEDONIA ST,-122.421382,37.764948,4,18,15,23:05


In [22]:
sf_crime[sf_crime['Category'] == 'TRESPASSING']
# row 5519

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,month,day,year,time
5519,TRESPASSING,TRESPASSING,Thursday,CENTRAL,"ARREST, BOOKED",300 Block of MONTGOMERY ST,-122.402739,37.792375,4,16,15,6:00


In [23]:
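# Issues with data are small enough to be manually changed
# (DataFrame.set_value has been removed from recent pandas versions; .at is the label-based replacement)
sf_crime.at[2750, 'Category'] = 'ASSAULT'
sf_crime.at[4330, 'Category'] = 'ASSAULT'
sf_crime.at[5519, 'Category'] = 'TRESPASS'
sf_crime.head(10)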

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,month,day,year,time
0,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,5,13,15,23:53
1,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,5,13,15,23:53
2,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414,5,13,15,23:33
3,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873,5,13,15,23:30
4,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541,5,13,15,23:30
5,LARCENY/THEFT,GRAND THEFT FROM UNLOCKED AUTO,Wednesday,INGLESIDE,NONE,0 Block of TEDDY AV,-122.403252,37.713431,5,13,15,23:30
6,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,INGLESIDE,NONE,AVALON AV / PERU AV,-122.423327,37.725138,5,13,15,23:30
7,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,BAYVIEW,NONE,KIRKWOOD AV / DONAHUE ST,-122.371274,37.727564,5,13,15,23:30
8,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,RICHMOND,NONE,600 Block of 47TH AV,-122.508194,37.776601,5,13,15,23:00
9,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,CENTRAL,NONE,JEFFERSON ST / LEAVENWORTH ST,-122.419088,37.807802,5,13,15,23:00
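
Hard-coding row labels works for three rows, but a value-based replacement does not depend on knowing where the bad labels are. A small sketch, using a made-up Series in place of `sf_crime['Category']`:

```python
import pandas as pd

# Toy stand-in for the Category column (values chosen for illustration)
categories = pd.Series(['ASSAULT', 'ASSUALT', 'TRESPASS', 'TRESPASSING', 'FRAUD'])

# Map each bad label to its corrected form; other values are left untouched
fixes = {'ASSUALT': 'ASSAULT', 'TRESPASSING': 'TRESPASS'}
cleaned = categories.replace(fixes)
print(cleaned.value_counts())
```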


### 4. Set up a target and predictor matrix for predicting violent crime vs. non-violent crime vs. non-crimes.

**Non-Violent Crimes:**
- bad checks
- bribery
- drug/narcotic
- drunkenness
- embezzlement
- forgery/counterfeiting
- fraud
- gambling
- liquor
- loitering 
- trespass

**Non-Crimes:**
- non-criminal
- runaway
- secondary codes
- suspicious occ
- warrants

**Violent Crimes:**
- everything else



**What type of model do you need here? What should your "baseline" category be?**

In [24]:
# Multiclass classification (multinomial logistic regression). Our baseline will probably be Violent, as that is the left-over group, so to speak.
# However, it would naturally make sense to set our baseline to be the class that has the most observations.
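
One common way to put a number on the baseline is the share of the most frequent class, which is the accuracy obtained by always predicting that class. A sketch on a toy label Series (the labels below are made up for illustration):

```python
import pandas as pd

# Toy labels: 6 violent, 2 non-violent, 2 non-crime
labels = pd.Series(['violent'] * 6 + ['non-violent'] * 2 + ['non-crime'] * 2)

# Proportion of observations in each class
class_share = labels.value_counts(normalize=True)

# The majority class and the accuracy of always predicting it
baseline_class = class_share.idxmax()
baseline_accuracy = class_share.max()
print(baseline_class, baseline_accuracy)
```

A fitted model is only useful if its accuracy beats this figure.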

In [25]:
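# First I'll need to convert the subcategories into broader categories.

zeros = ['non-criminal', 'runaway', 'secondary codes', 'suspicious occ', 'warrants']
# Note: 'other offenses' is included here although it is not in the list above, and 'liquor'
# does not match the dataset's 'LIQUOR LAWS' label, so those rows fall into 'violent'.
ones  = ['bad checks', 'bribery', 'drug/narcotic', 'drunkenness', 'embezzlement', 'forgery/counterfeiting', 'fraud', 
         'gambling','liquor', 'loitering', 'trespass', 'other offenses']
# twos = everything else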

# Empty list to append values into
crime_cat = []
#iterate through sf_crime Category
for crime in sf_crime['Category']:
    # convert values to lower
    crime = crime.lower()
    # checks list of sub categories
    if crime in zeros:
        # appends the overlaying category
        crime_cat.append('non-crime')
    elif crime in ones:
        crime_cat.append('non-violent')
    else:
        crime_cat.append('violent')
        
# take that list and add it to the DF
sf_crime['cat_number'] = crime_cat
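
The same grouping can be written without an explicit loop by mapping each value through a small function. A sketch on a few sample categories follows. `to_group` is a hypothetical helper name, and the non-violent set is shortened for illustration.

```python
import pandas as pd

non_crime = {'non-criminal', 'runaway', 'secondary codes', 'suspicious occ', 'warrants'}
non_violent = {'bribery', 'drug/narcotic', 'fraud', 'trespass'}  # shortened for the sketch

def to_group(category):
    """Map a raw category label to one of the three broad groups."""
    category = category.lower()
    if category in non_crime:
        return 'non-crime'
    if category in non_violent:
        return 'non-violent'
    return 'violent'

# Apply the helper element-wise to a sample of raw labels
sample = pd.Series(['WARRANTS', 'FRAUD', 'ASSAULT', 'RUNAWAY'])
groups = sample.map(to_group)
print(groups.tolist())
```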

In [26]:
# also going to convert DayOfWeek, PdDistrict and Resolution to dummy variables.
dummies = pd.get_dummies(sf_crime[['DayOfWeek','PdDistrict','Resolution']], drop_first = True)

# Merge the dataframe result back onto the original dataframe
sf_crime = sf_crime.merge(dummies, left_index = True, right_index = True,how = 'outer')
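
To see what `drop_first=True` does, consider a single toy column: the alphabetically first level is dropped and becomes the reference category, encoded as all zeros.

```python
import pandas as pd

# Toy column with three distinct levels
toy = pd.DataFrame({'DayOfWeek': ['Wednesday', 'Friday', 'Monday', 'Friday']})

# 'Friday' sorts first, so it is dropped and acts as the reference level
toy_dummies = pd.get_dummies(toy[['DayOfWeek']], drop_first=True)
print(list(toy_dummies.columns))
```

This matches the `X.columns` listing, which has no `DayOfWeek_Friday` column.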

In [27]:
# Dropping all the categorical columns that I don't think will be relevant or that have been converted to dummies for X
X = sf_crime.drop(['Category','Descript','DayOfWeek','PdDistrict',
                   'Resolution','Address','X','Y','cat_number','time'], axis = 1)
y = sf_crime['cat_number'].values

In [43]:
y

array(['non-crime', 'non-violent', 'non-violent', ..., 'non-crime',
       'violent', 'violent'], dtype=object)

In [28]:
X.columns

Index([u'month', u'day', u'year', u'DayOfWeek_Monday', u'DayOfWeek_Saturday',
       u'DayOfWeek_Sunday', u'DayOfWeek_Thursday', u'DayOfWeek_Tuesday',
       u'DayOfWeek_Wednesday', u'PdDistrict_CENTRAL', u'PdDistrict_INGLESIDE',
       u'PdDistrict_MISSION', u'PdDistrict_NORTHERN', u'PdDistrict_PARK',
       u'PdDistrict_RICHMOND', u'PdDistrict_SOUTHERN', u'PdDistrict_TARAVAL',
       u'PdDistrict_TENDERLOIN', u'Resolution_ARREST, CITED',
       u'Resolution_CLEARED-CONTACT JUVENILE FOR MORE INFO',
       u'Resolution_EXCEPTIONAL CLEARANCE', u'Resolution_JUVENILE BOOKED',
       u'Resolution_LOCATED', u'Resolution_NONE', u'Resolution_NOT PROSECUTED',
       u'Resolution_PSYCHOPATHIC CASE', u'Resolution_UNFOUNDED'],
      dtype='object')

### 5. Standardize the predictor matrix

In [29]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
Xs = ss.fit_transform(X)
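
Fitting the scaler on all rows lets the test rows influence the means and standard deviations. A common alternative, sketched here on random data with hypothetical variable names, is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.randint(0, 2, size=200)

# Split first, then learn the scaling parameters from the training half only
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=12)
scaler = StandardScaler().fit(X_tr)

# Apply the same training-derived transformation to both halves
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)
```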

### 6. Find the optimal hyperparameters (optimal regularization) to predict your crime categories.

> **Note:** Gridsearching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently - the gridsearch object is more general and can be applied to any model. The `LogisticRegressionCV` is specific to tuning the logistic regression hyperparameters. I recommend the logistic regression one, but the downside is that lasso and ridge must be searched separately.

**Reference for logistic regression regularization hyperparameters:**
- `solver`: algorithm used for optimization (relevant for multiclass)
    - `newton-cg` - Handles Multinomial Loss, L2 only
    - `sag` - Handles Multinomial Loss, Large Datasets, L2 Only, Works best on scaled data
    - `lbfgs` - Handles Multinomial Loss, L2 Only
    - `liblinear` - Small Datasets, L1 or L2, one-vs-rest only (no Multinomial Loss), no Warm Starts
- `Cs`: Regularization strengths (smaller values are stronger penalties)
- `cv`: cross-validations or number of folds
- `penalty`: `'l1'` - LASSO, `'l2'` - Ridge 

In [30]:
# Example:
# fit model with five folds and lasso regularization
# Cs can be an integer (e.g. Cs=15 tests a grid of 15 values) or an explicit list of values, as below
# remember: C describes the inverse of regularization strength

# logreg_cv = LogisticRegressionCV(solver='liblinear', 
#                                  Cs=[1,5,10], 
#                                  cv=5, penalty='l1')
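
When `Cs` is an integer, scikit-learn builds the grid itself on a log scale between 1e-4 and 1e4. The sketch below checks this on a synthetic binary problem with the default solver and penalty. The data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Cs=5 asks for five candidate C values chosen automatically
cv_model = LogisticRegressionCV(Cs=5, cv=3, max_iter=1000).fit(X_demo, y_demo)

# Cs_ holds the candidate values that were cross-validated
print(cv_model.Cs_)
```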

**Split data into training and testing with 50% in testing.**

In [31]:
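# Train-test split (TTS) our data.
# We will have a holdout set to test on at the end.
# Note: this splits the unscaled X; pass Xs instead to use the standardized predictors from step 5.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=12)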

In [45]:
y_test

array(['violent', 'violent', 'non-violent', ..., 'non-crime', 'non-crime',
       'violent'], dtype=object)

**Gridsearch hyperparameters for the training data.**

In [32]:
# Let's set our model parameters
logreg_cv = LogisticRegressionCV(Cs=100, cv=5, penalty='l1', scoring='accuracy', solver='liblinear')
logreg_cv.fit(X_train, y_train)

LogisticRegressionCV(Cs=100, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
           refit=True, scoring='accuracy', solver='liblinear', tol=0.0001,
           verbose=0)

**Find the best parameters for each target class.**

In [33]:
# find best C per class
# LogisticRegressionCV (one-vs-rest) has already chosen one C per class during the fit.
# C_ is ordered like classes_, so pair them up in a dictionary.
best_C = dict(zip(logreg_cv.classes_, logreg_cv.C_))
print('best C for class:', best_C)

('best C for class:', {'non-violent': 0.17073526474706921, 'violent': 79.248289835391859, 'non-crime': 0.29836472402833403})


**Build three logistic regression models using the best parameters for each target class.**

In [34]:
# fit regular logit models using the best C found for the 'non-crime', 'non-violent', and 'violent' classes
# note: each model is still trained on the full three-class target; only C differs
# use lasso penalty
logreg_1 = LogisticRegression(C=best_C['non-crime'], penalty='l1', solver='liblinear', multi_class = 'ovr')
logreg_2 = LogisticRegression(C=best_C['non-violent'], penalty='l1', solver='liblinear', multi_class = 'ovr')
logreg_3 = LogisticRegression(C=best_C['violent'], penalty='l1', solver='liblinear', multi_class = 'ovr')

# Let's check out the outputs for all of our models
# fit model for predicting Non Crimes
logreg_1.fit(X_train, y_train)

LogisticRegression(C=0.29836472402833403, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [35]:
# fit model for predicting Non Violent
logreg_2.fit(X_train, y_train)

LogisticRegression(C=0.17073526474706921, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [36]:
# fit model for predicting Violent
logreg_3.fit(X_train, y_train)

LogisticRegression(C=79.248289835391859, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
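
The three models above are each fit on the full three-class target. If the aim is literally one model per class vs. the rest, one option is to binarize the target for each class before fitting. A sketch under that assumption, using synthetic data and made-up `C` values rather than the ones found above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data with string labels (illustrative only)
X_demo, y_int = make_classification(n_samples=300, n_features=6, n_informative=4,
                                    n_redundant=0, n_classes=3, random_state=0)
class_names = np.array(['non-crime', 'non-violent', 'violent'])
y_demo = class_names[y_int]

# Made-up regularization strengths, one per class
demo_C = {'non-crime': 0.3, 'non-violent': 0.2, 'violent': 10.0}

binary_models = {}
for name, c_value in demo_C.items():
    # True for rows of this class, False for every other class
    is_class = (y_demo == name)
    model = LogisticRegression(C=c_value, penalty='l1', solver='liblinear')
    binary_models[name] = model.fit(X_demo, is_class)
```

Each fitted model then answers a single yes/no question, so its confusion matrix is 2x2 rather than 3x3.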

### 7. Build confusion matrices for the models above
- Use the holdout test data from the train-test split

In [37]:
# using our logregs to predict on our test set and storing predictions
Y_1_pred = logreg_1.predict(X_test)
Y_2_pred = logreg_2.predict(X_test)
Y_3_pred = logreg_3.predict(X_test)

# stores confusion matrix for Y Test and Y Pred  
conmat_1 = confusion_matrix(y_test, Y_1_pred, labels=logreg_1.classes_)
# converts the NumPy array to a dataframe and adds index and column names
conmat_1 = pd.DataFrame(conmat_1, columns=logreg_1.classes_, index=logreg_1.classes_)

conmat_2 = confusion_matrix(y_test, Y_2_pred, labels=logreg_2.classes_)
conmat_2 = pd.DataFrame(conmat_2, columns=logreg_2.classes_, index=logreg_2.classes_)

conmat_3 = confusion_matrix(y_test, Y_3_pred, labels=logreg_3.classes_)
conmat_3 = pd.DataFrame(conmat_3, columns=logreg_3.classes_, index=logreg_3.classes_)

print('best params for non-crime:')
print(conmat_1)
print('best params for non-violent:')
print(conmat_2)
print('best params for violent:')
print(conmat_3)


best params for non-crime:
             non-crime  non-violent  violent
non-crime          132          362     1383
non-violent         38          867      869
violent             28          621     4700
best params for non-violent:
             non-crime  non-violent  violent
non-crime          132          360     1385
non-violent         38          870      866
violent             28          633     4688
best params for violent:
             non-crime  non-violent  violent
non-crime          132          352     1393
non-violent         38          862      874
violent             28          615     4706


### 8. Print classification reports for your three models.

In [38]:
print(classification_report(y_test, Y_1_pred))
print(classification_report(y_test, Y_2_pred))
print(classification_report(y_test, Y_3_pred))

             precision    recall  f1-score   support

  non-crime       0.67      0.07      0.13      1877
non-violent       0.47      0.49      0.48      1774
    violent       0.68      0.88      0.76      5349

avg / total       0.63      0.63      0.58      9000

             precision    recall  f1-score   support

  non-crime       0.67      0.07      0.13      1877
non-violent       0.47      0.49      0.48      1774
    violent       0.68      0.88      0.76      5349

avg / total       0.63      0.63      0.57      9000

             precision    recall  f1-score   support

  non-crime       0.67      0.07      0.13      1877
non-violent       0.47      0.49      0.48      1774
    violent       0.67      0.88      0.76      5349

avg / total       0.63      0.63      0.57      9000



**Describe the metrics in the classification report.**

In [39]:
# Precision (True Positives divided by Total Predicted Positives)

# Recall (True Positives divided by Total Actual Positives)

# f1-score (2 * (precision * recall) / (precision + recall)), the harmonic mean of precision and recall

# Support - Number of actual observations of each class in y_test
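
These definitions can be checked by hand against the first confusion matrix printed in section 7 (the model using the best `C` for non-crime). The sketch below copies that matrix into a NumPy array and recomputes each metric. The variable names are illustrative.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes
# (order: non-crime, non-violent, violent), copied from the first matrix in section 7
conmat = np.array([[132, 362, 1383],
                   [38, 867, 869],
                   [28, 621, 4700]])

true_pos = np.diag(conmat)                   # correct predictions per class
precision = true_pos / conmat.sum(axis=0)    # divide by column totals (predicted)
recall = true_pos / conmat.sum(axis=1)       # divide by row totals (actual)
f1 = 2 * precision * recall / (precision + recall)
support = conmat.sum(axis=1)                 # actual observations per class
accuracy = true_pos.sum() / conmat.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), support)
```

Rounded to two decimals, the results reproduce the per-class rows of the first classification report.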


In [40]:
# We can observe subtle differences in the confusion matrices, 
# but overall our classification reports are nearly identical.
# All three models are fit on the same three-class target and differ only in C.

# This leads us to believe that there are variables that are highly
# predictive of specific classes.

In [41]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

gs_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=5, verbose=1, n_jobs=-1)


In [42]:
lr_gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done 376 tasks      | elapsed:   13.5s
[Parallel(n_jobs=-1)]: Done 876 tasks      | elapsed:   38.4s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   53.0s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.12332e-05, ...,   8.90215e-01,   1.00000e+00]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
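
Once a grid search has been fit, the winning settings and scores can be read from its attributes. A sketch on synthetic binary data with a much smaller grid than the one above, where the data and the `demo_` names are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=12)

# Small grid: 2 penalties x 5 values of C
demo_params = {'penalty': ['l1', 'l2'],
               'solver': ['liblinear'],
               'C': np.logspace(-3, 0, 5)}
demo_search = GridSearchCV(LogisticRegression(), demo_params, cv=3).fit(X_tr, y_tr)

# Best hyperparameters and their mean cross-validated accuracy
print(demo_search.best_params_)
print(demo_search.best_score_)

# Accuracy of the refit best model on the holdout half
test_accuracy = demo_search.score(X_te, y_te)
print(test_accuracy)
```

Comparing `best_score_` with the holdout accuracy gives a quick check on whether the cross-validated estimate carries over to unseen rows.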