<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Gridsearch and Multinomial Models with SF Crime Data


---

### Multinomial logistic regression models

So far, we have been using logistic regression for binary problems where there are only two class labels. Logistic regression can be extended to dependent variables with multiple classes.

There are two ways sklearn solves multiple-class problems with logistic regression: a multinomial loss or a "one vs. rest" (OvR) process where a model is fit for each target class vs. all the other classes. 

**Multinomial vs. OvR**
- (M) 'k-1' models with 1 reference category
- (OvR) 'k' models, one for each class vs. all the others
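To make the difference concrete, here is a minimal NumPy sketch of how each scheme could turn `k` class scores into probabilities. The scores are made up for illustration and are not taken from the crime data. The multinomial loss uses a single softmax over all classes, while a common OvR approach applies a sigmoid to each class's score and renormalizes.

```python
import numpy as np

# Hypothetical linear scores for 2 observations and k = 3 classes
scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3]])

def softmax_probs(z):
    """Multinomial: one softmax across all k class scores."""
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ovr_probs(z):
    """OvR: k independent sigmoids, renormalized so each row sums to 1."""
    p = 1.0 / (1.0 + np.exp(-z))
    return p / p.sum(axis=1, keepdims=True)

multinomial = softmax_probs(scores)
ovr = ovr_probs(scores)
print(multinomial.round(3))
print(ovr.round(3))
```

In a real OvR fit, each column of scores would come from its own separately fitted binary model; here a single array is reused only to compare the two normalizations.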

You will use a grid search in conjunction with multinomial logistic regression to optimize a model that predicts the category (type) of crime based on various features recorded by the San Francisco police department.

**Necessary lab imports**

In [1]:
import numpy as np
import pandas as pd
import patsy

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


import seaborn as sns

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1. Read in the data

In [2]:
crime_csv = '/Users/jasmine/Desktop/GA-DSI/Week-1/DSI/week04/day3_gridsearching_objectorientedprogramming/gridsearching_lab/datasets/sf_crime_train.csv'


In [3]:
#read in the data using pandas
sf_crime = pd.read_csv(crime_csv)
sf_crime.drop('DayOfWeek',axis=1,inplace=True)
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
# check the shape of your dataframe
sf_crime.shape

(18000, 8)

In [5]:
#check whether there are any missing values
#do we need to fix anything here? NO
sf_crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 8 columns):
Dates         18000 non-null object
Category      18000 non-null object
Descript      18000 non-null object
PdDistrict    18000 non-null object
Resolution    18000 non-null object
Address       18000 non-null object
X             18000 non-null float64
Y             18000 non-null float64
dtypes: float64(2), object(6)
memory usage: 1.1+ MB


In [6]:
#check what your datatypes are
#do we need to fix anything here?
sf_crime.dtypes

Dates          object
Category       object
Descript       object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
dtype: object

### 2. Create columns for year, month, day, hour, time, and date from the 'Dates' column.

> *`pd.to_datetime` and `Series.dt` may be helpful here!*


In [7]:
# convert the 'Dates' column to a datetime object
sf_crime['Dates'] = pd.to_datetime(sf_crime.Dates)

In [8]:
sf_crime['Dates'].dtype #<M8[ns] means datetime

dtype('<M8[ns]')

In [9]:
# create new columns for 'Year', 'Month', and 'Day_of_Week'
sf_crime['Year'] = sf_crime['Dates'].dt.year
sf_crime['Month'] = sf_crime['Dates'].dt.month
# note: .dt.day is the day of the month (1-31), despite the column name;
# .dt.dayofweek would give the weekday instead (0 = Monday)
sf_crime['Day_of_Week'] = sf_crime['Dates'].dt.day
#check the first couple rows to make sure it's what you want
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,13
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,13


In [10]:
# create a column for the 'Hour','Time', and 'Date'
sf_crime['Hour'] = sf_crime['Dates'].dt.hour
sf_crime['Time'] = sf_crime['Dates'].dt.time
sf_crime['Date'] = sf_crime['Dates'].dt.date
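The `.dt` accessors used in this section can be checked on a couple of timestamps before applying them to the whole frame. The sketch below uses two values in the same text format as the 'Dates' column; the explicit `format` string and the `parts` frame are assumptions made for the illustration.

```python
import pandas as pd

# Two timestamps in the same text format as the 'Dates' column
dates = pd.Series(['5/13/15 23:53', '5/13/15 23:30'])
parsed = pd.to_datetime(dates, format='%m/%d/%y %H:%M')

parts = pd.DataFrame({
    'Year': parsed.dt.year,
    'Month': parsed.dt.month,
    'Day': parsed.dt.day,                   # day of the month (1-31)
    'Weekday': parsed.dt.dayofweek,         # 0 = Monday ... 6 = Sunday
    'Weekday_name': parsed.dt.day_name(),
    'Hour': parsed.dt.hour,
})
print(parts)
```

Keeping the day of the month and the weekday in separately named columns avoids mixing the two up later.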

In [11]:
# Drop the 'Dates' column
# drop() returns a new dataframe, so assign the result back
sf_crime = sf_crime.drop(columns='Dates')

### 3. Validate and clean the data.

In [12]:
# check the 'Category' value counts to see what sort of categories there are
# and to see if anything might require cleaning (particularly the ones with fewer values)
sf_crime['Category'].value_counts()

LARCENY/THEFT                  4885
OTHER OFFENSES                 2291
NON-CRIMINAL                   2255
ASSAULT                        1536
VEHICLE THEFT                   967
VANDALISM                       877
BURGLARY                        732
WARRANTS                        728
SUSPICIOUS OCC                  592
MISSING PERSON                  535
DRUG/NARCOTIC                   496
ROBBERY                         465
FRAUD                           363
SECONDARY CODES                 261
WEAPON LAWS                     212
TRESPASS                        130
STOLEN PROPERTY                 111
SEX OFFENSES FORCIBLE           103
FORGERY/COUNTERFEITING           85
DRUNKENNESS                      74
KIDNAPPING                       50
PROSTITUTION                     44
DRIVING UNDER THE INFLUENCE      42
DISORDERLY CONDUCT               37
ARSON                            35
LIQUOR LAWS                      25
RUNAWAY                          16
EMBEZZLEMENT                

In [13]:
# What's going on with 'TRESPASS' and 'TRESPASSING'?
# What's going on with 'ASSAULT' and 'ASSUALT'?
# fix these with .loc
TRESPASSING_mask = (sf_crime.Category == 'TRESPASS')
sf_crime.loc[TRESPASSING_mask, 'Category'] = 'TRESPASSING'

ASSAULT_mask = (sf_crime.Category == 'ASSUALT')
sf_crime.loc[ASSAULT_mask, 'Category'] = 'ASSAULT'
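As an alternative to building one mask per label, the same clean-up can be sketched with a single `Series.replace` call and a dictionary of fixes. The toy frame below is made up for the illustration; only the label spellings come from the lab.

```python
import pandas as pd

# Toy 'Category' column containing the two inconsistent labels
toy = pd.DataFrame({'Category': ['ASSAULT', 'ASSUALT', 'TRESPASS', 'TRESPASSING', 'ROBBERY']})

# Map each stray label to its canonical spelling; other labels are left untouched
fixes = {'ASSUALT': 'ASSAULT', 'TRESPASS': 'TRESPASSING'}
toy['Category'] = toy['Category'].replace(fixes)

print(toy['Category'].value_counts())
```

With a dictionary, `replace` matches whole values, so 'TRESPASSING' is not affected by the 'TRESPASS' rule.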


In [23]:
sf_crime['Category'].value_counts()

LARCENY/THEFT                  4885
OTHER OFFENSES                 2291
NON-CRIMINAL                   2255
ASSAULT                        1538
VEHICLE THEFT                   967
VANDALISM                       877
BURGLARY                        732
WARRANTS                        728
SUSPICIOUS OCC                  592
MISSING PERSON                  535
DRUG/NARCOTIC                   496
ROBBERY                         465
FRAUD                           363
SECONDARY CODES                 261
WEAPON LAWS                     212
TRESPASSING                     131
STOLEN PROPERTY                 111
SEX OFFENSES FORCIBLE           103
FORGERY/COUNTERFEITING           85
DRUNKENNESS                      74
KIDNAPPING                       50
PROSTITUTION                     44
DRIVING UNDER THE INFLUENCE      42
DISORDERLY CONDUCT               37
ARSON                            35
LIQUOR LAWS                      25
RUNAWAY                          16
BRIBERY                     

In [29]:
import calendar

In [32]:
# have a look to see whether you have all the days of the week in your data
Days = [calendar.day_name[d.weekday()] for d in sf_crime['Date']]


In [39]:
pd.Series(Days).value_counts()

Wednesday    2930
Friday       2733
Saturday     2556
Thursday     2479
Sunday       2456
Monday       2447
Tuesday      2399
dtype: int64

In [40]:
# have a look at the value counts for 'Descript', 'PdDistrict', and 'Resolution' to make sure it all checks out
sf_crime['Descript'].value_counts()


GRAND THEFT FROM LOCKED AUTO                              2127
STOLEN AUTOMOBILE                                          625
AIDED CASE, MENTAL DISTURBED                               591
DRIVERS LICENSE, SUSPENDED OR REVOKED                      589
BATTERY                                                    520
PETTY THEFT FROM LOCKED AUTO                               498
PETTY THEFT OF PROPERTY                                    484
LOST PROPERTY                                              468
WARRANT ARREST                                             429
MALICIOUS MISCHIEF, VANDALISM                              361
FOUND PROPERTY                                             353
MALICIOUS MISCHIEF, VANDALISM OF VEHICLES                  340
GRAND THEFT FROM UNLOCKED AUTO                             321
SUSPICIOUS OCCURRENCE                                      305
INVESTIGATIVE DETENTION                                    246
FOUND PERSON                                           

In [41]:
sf_crime['PdDistrict'].value_counts()

SOUTHERN      3287
NORTHERN      2250
CENTRAL       2206
MISSION       2118
BAYVIEW       1678
INGLESIDE     1628
TARAVAL       1426
TENDERLOIN    1327
RICHMOND      1101
PARK           979
Name: PdDistrict, dtype: int64

In [42]:
sf_crime['Resolution'].value_counts()

NONE                                      12862
ARREST, BOOKED                             4455
UNFOUNDED                                   367
ARREST, CITED                               100
JUVENILE BOOKED                              94
EXCEPTIONAL CLEARANCE                        58
PSYCHOPATHIC CASE                            28
LOCATED                                      25
CLEARED-CONTACT JUVENILE FOR MORE INFO       10
NOT PROSECUTED                                1
Name: Resolution, dtype: int64

In [43]:
# use .describe() to see whether the location coordinates seem appropriate
sf_crime.describe()

Unnamed: 0,X,Y,Year,Month,Day_of_Week,Hour
count,18000.0,18000.0,18000.0,18000.0,18000.0,18000.0
mean,-122.423639,37.768466,2015.0,3.489944,14.290167,13.646833
std,0.026532,0.024391,0.0,0.868554,8.955835,6.53904
min,-122.513642,37.708154,2015.0,2.0,1.0,0.0
25%,-122.434199,37.753838,2015.0,3.0,5.0,10.0
50%,-122.416949,37.775608,2015.0,3.0,16.0,15.0
75%,-122.406539,37.78539,2015.0,4.0,20.0,19.0
max,-122.365565,37.819923,2015.0,5.0,31.0,23.0


### 4. Set up a target and predictor matrix for predicting violent crime vs. non-violent crime vs. non-crimes.

**Non-Violent Crimes:**
- bad checks
- bribery
- drug/narcotic
- drunkenness
- embezzlement
- forgery/counterfeiting
- fraud
- gambling
- liquor
- loitering 
- trespass

**Non-Crimes:**
- non-criminal
- runaway
- secondary codes
- suspicious occ
- warrants

**Violent Crimes:**
- everything else



**What type of model do you need here? What is your baseline accuracy?**

In [49]:
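# note: 'TRESPASS' and 'LIQUOR' do not match the cleaned labels ('TRESPASSING', 'LIQUOR LAWS'),
# so those rows fall through to the violent list below;
# 'OTHER OFFENSES' is also counted as non-violent here
NVC = ['BAD CHECKS','BRIBERY','DRUG/NARCOTIC','DRUNKENNESS',
     'EMBEZZLEMENT','FORGERY/COUNTERFEITING','FRAUD',
     'GAMBLING','LIQUOR','LOITERING','TRESPASS','OTHER OFFENSES']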

NOT_C = ['NON-CRIMINAL','RUNAWAY','SECONDARY CODES','SUSPICIOUS OCC','WARRANTS']

#use a list comprehension to get all the categories in sf_crime['Category'].unique() that are NOT in the lists above
unique_cat = sf_crime['Category'].unique()
VC = [ cat_name for cat_name in unique_cat if (cat_name not in NVC) and (cat_name not in NOT_C)]
  

In [62]:
#add a column called 'Type' into your dataframe that stores whether the observation was:
#Non-Violent, Violent, or Non-Crime
#use .map()!
def typecrime(x):
    if x in NOT_C: return 'NOT_CRIMINAL'
    if x in NVC: return 'NON-VIOLENT'
    if x in VC: return 'VIOLENT_CRIME'

sf_crime['Type'] = sf_crime['Category'].map(typecrime)

In [64]:
#find the baseline accuracy:
baseline = sf_crime['Type'].value_counts(normalize=True)

In [65]:
baseline

VIOLENT_CRIME    0.600389
NOT_CRIMINAL     0.214000
NON-VIOLENT      0.185611
Name: Type, dtype: float64
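The baseline accuracy is the share of the largest class, i.e. the accuracy of a model that always predicts 'VIOLENT_CRIME'. A minimal sketch of that arithmetic follows; the class counts are an assumption reconstructed from the 18,000 rows and the proportions printed above.

```python
import pandas as pd

# Class counts implied by the 18,000 rows and the proportions shown above
counts = pd.Series({'VIOLENT_CRIME': 10807, 'NOT_CRIMINAL': 3852, 'NON-VIOLENT': 3341})

proportions = counts / counts.sum()
baseline_accuracy = proportions.max()   # accuracy of always predicting the largest class
majority_class = proportions.idxmax()

print(majority_class, round(baseline_accuracy, 6))
```

Any model worth keeping should beat this number on held-out data.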

In [69]:
#create a target array with 'Type'
#create a predictor matrix with 'Day_of_Week','Month','Year','PdDistrict','Hour', and 'Resolution'
y = sf_crime.Type
X = sf_crime[['Day_of_Week', 'Month', 'Year', 'PdDistrict', 'Hour', 'Resolution']]

In [73]:
#use pd.get_dummies() to dummify your categorical variables
#remember to drop a column!
X = pd.get_dummies(X, drop_first=True)

In [74]:
X.head()

Unnamed: 0,Day_of_Week,Month,Year,Hour,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,...,"Resolution_ARREST, BOOKED","Resolution_ARREST, CITED",Resolution_CLEARED-CONTACT JUVENILE FOR MORE INFO,Resolution_EXCEPTIONAL CLEARANCE,Resolution_JUVENILE BOOKED,Resolution_LOCATED,Resolution_NONE,Resolution_NOT PROSECUTED,Resolution_PSYCHOPATHIC CASE,Resolution_UNFOUNDED
0,13,5,2015,23,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
1,13,5,2015,23,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
2,13,5,2015,23,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
3,13,5,2015,23,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
4,13,5,2015,23,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0


### 5. Create a train/test/split and standardize the predictor matrices

In [82]:
from sklearn.model_selection import train_test_split

In [88]:
#create a 50/50 train test split; 
#stratify based on your target variable
#use a random state of 2018
#note: stratify=y is not passed below, so the recorded split is not stratified
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                            test_size=0.5, random_state=2018)
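For reference, a stratified 50/50 split could look like the sketch below. It runs on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names, with a 60/20/20 class mix chosen to resemble the crime types), so the counts it prints are not results from the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows with a 60/20/20 class mix
rng = np.random.RandomState(2018)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array(['VIOLENT_CRIME'] * 600 + ['NOT_CRIMINAL'] * 200 + ['NON-VIOLENT'] * 200)

# stratify=y_demo keeps the class proportions the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=2018)

for name, part in [('train', y_tr), ('test', y_te)]:
    labels, counts = np.unique(part, return_counts=True)
    print(name, dict(zip(labels, counts)))
```

Stratifying matters most for the rarer classes, where an unlucky random split can leave one half with noticeably fewer examples.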

In [89]:
#standardise your predictor matrices
from sklearn.preprocessing import StandardScaler
ss=StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)
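When the scaled data is later passed through cross-validation, one option is to wrap the scaler and the model in a pipeline so that the scaler is re-fit inside each fold. This is a sketch on synthetic data (`X_demo`, `y_demo`, and `pipe` are hypothetical names), not the lab's own solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the training set
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=5,
                                     n_classes=3, random_state=2018)

# The scaler is re-fit on the training portion of every fold, so a fold's
# validation rows never influence the means and standard deviations used to scale it
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print("Mean CV accuracy:", scores.mean())
```

The same pipeline object can be handed to `GridSearchCV`, with parameter names prefixed by the step name (for example `logisticregression__C`).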

### 6. Create a basic Logistic Regression model and use cross_val_score to assess its performance on your training data

In [92]:
#create a default Logistic Regression model and find its mean cross-validated accuracy with your training data
#use 5 cross-validation folds
#note: this call scores on the full, unscaled X and y rather than X_train_ss and y_train
lr = LogisticRegression()
scores = cross_val_score(lr, X, y, cv=5)
print("Cross-validated scores:", scores)
print("Mean CV accuracy:", np.mean(scores))
print('Std CV accuracy:', np.std(scores))

Cross-validated scores: [0.62965019 0.63093585 0.61794943 0.63545429 0.62656293]
Mean CV accuracy: 0.6281105405787544
Std CV accuracy: 0.005829697365548055


In [96]:
#create a confusion matrix with cross_val_predict
#(rows are the true classes, columns are the predicted classes)
predictions = cross_val_predict(lr, X, y, cv=5)
confusion = confusion_matrix(y, predictions)
pd.DataFrame(confusion,
             columns=sorted(y_train.unique()),
             index=sorted(y_train.unique()))

Unnamed: 0,NON-VIOLENT,NOT_CRIMINAL,VIOLENT_CRIME
NON-VIOLENT,1520,81,1740
NOT_CRIMINAL,736,254,2862
VIOLENT_CRIME,1193,82,9532
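The confusion matrix above can be turned into accuracy and per-class recall and precision with a few lines of NumPy. The sketch below copies the printed counts by hand and assumes the usual scikit-learn layout (true classes in rows, predictions in columns).

```python
import numpy as np

labels = ['NON-VIOLENT', 'NOT_CRIMINAL', 'VIOLENT_CRIME']
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = np.array([[1520,   81, 1740],
               [ 736,  254, 2862],
               [1193,   82, 9532]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # share of each true class that was found
precision = np.diag(cm) / cm.sum(axis=0)  # share of each predicted class that was right

for name, r, p in zip(labels, recall, precision):
    print(f"{name:14s} recall={r:.3f} precision={p:.3f}")
print(f"accuracy={accuracy:.4f}")
```

Reading the rows this way shows that most 'NOT_CRIMINAL' observations are predicted as 'VIOLENT_CRIME', which the overall accuracy alone hides.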


### 7. Find the optimal hyperparameters (optimal regularization) to predict your crime categories using GridSearchCV.

> **Note:** Gridsearching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently - the gridsearch object is more general and can be applied to any model. The `LogisticRegressionCV` is specific to tuning the logistic regression hyperparameters. I recommend the logistic regression one, but the downside is that lasso and ridge must be searched separately. To start with, use `GridSearchCV`.

**Reference for logistic regression regularization hyperparameters:**
- `solver`: algorithm used for optimization (relevant for multiclass)
    - `newton-cg` - Handles Multinomial Loss, L2 Only
    - `sag` - Handles Multinomial Loss, Large Datasets, L2 Only, Works best on scaled data
    - `lbfgs` - Handles Multinomial Loss, L2 Only
    - `liblinear` - Small Datasets, L1 or L2, OvR Only, no Warm Starts
- `C`: Inverse of the regularization strength (smaller values are stronger penalties)
- `penalty`: `'l1'` - Lasso, `'l2'` - Ridge
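Because not every solver supports every penalty, one way to search over both is to give `GridSearchCV` a list of sub-grids, one per solver. The sketch below is an illustration on synthetic data; it uses the `saga` solver, which is not in the list above but supports both L1 and L2 penalties with the multinomial loss, and the `C` values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data, scaled because 'saga' converges faster on scaled features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# One sub-grid per solver so that only supported penalties are tried with each
param_grid = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.01, 1.0, 100.0]},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': [0.01, 1.0, 100.0]},
]
print(len(ParameterGrid(param_grid)), "candidate settings")

search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Depending on the installed scikit-learn version, some of these arguments may emit deprecation warnings, so treat the parameter names as tied to the version used in this lab.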

In [97]:
lr.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'max_iter': 100,
 'multi_class': 'ovr',
 'n_jobs': 1,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [101]:
#create a hyperparameter dictionary for a logistic regression
params = {
    'C': np.logspace(-5, 5, 15),
    'penalty': ['l1','l2'],
    'fit_intercept':[True,False]
}
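The size of the search follows directly from this dictionary: every combination of values is one candidate, and each candidate is fit once per fold. A short sketch of that arithmetic, using scikit-learn's `ParameterGrid` helper and assuming 5 folds:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

params = {
    'C': np.logspace(-5, 5, 15),      # 15 values from 1e-5 to 1e5
    'penalty': ['l1', 'l2'],          # 2 values
    'fit_intercept': [True, False],   # 2 values
}

n_candidates = len(ParameterGrid(params))   # 15 * 2 * 2
n_fits = n_candidates * 5                   # one fit per candidate per fold with cv=5
print(n_candidates, n_fits)
```

Counting the fits first is a cheap way to estimate how long a grid search will take before launching it.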

In [103]:
#create a gridsearch object using LogisticRegression() and the dictionary you created above
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(lr, 
                  param_grid=params,
                 cv=5,
                 scoring='accuracy',
                 return_train_score= True,
                 verbose=1)


In [104]:
#fit the gridsearch object on your training data
gs.fit(X_train, y_train)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:   41.6s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([1.00000e-05, 5.17947e-05, 2.68270e-04, 1.38950e-03, 7.19686e-03,
       3.72759e-02, 1.93070e-01, 1.00000e+00, 5.17947e+00, 2.68270e+01,
       1.38950e+02, 7.19686e+02, 3.72759e+03, 1.93070e+04, 1.00000e+05]), 'penalty': ['l1', 'l2'], 'fit_intercept': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=1)

In [105]:
#print out the best parameters
gs.best_params_

{'C': 1.0, 'fit_intercept': False, 'penalty': 'l1'}

In [108]:
#print out the best mean cross-validated score
cross_val_score(gs.best_estimator_, X_train, y_train, cv=5).mean()

0.6311122106381156

In [124]:
#assign your best estimator to the variable 'best_logreg'
best_logreg = gs.best_estimator_

In [114]:
#score your model on your testing data
gs.score(X_test, y_test)

0.6334444444444445

### 8. Print out a classification report for your best_logreg model

In [132]:
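#use your test data to create your classification report
#note: cross_val_predict refits the model on folds of the test data;
#best_logreg.predict(X_test) would instead score the model fitted on the training data
predictions = cross_val_predict(best_logreg, X_test, y_test)
#cross_val_predict(gs.best_estimator_, X_test, y_test, cv=5)
print(classification_report(y_test, predictions))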

               precision    recall  f1-score   support

  NON-VIOLENT       0.46      0.46      0.46      1680
 NOT_CRIMINAL       0.66      0.07      0.13      1926
VIOLENT_CRIME       0.68      0.89      0.77      5394

  avg / total       0.63      0.64      0.58      9000
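A more conventional way to report test performance is to fit once on the training half and call `predict` on the untouched test half. The sketch below shows that pattern on synthetic data; `model` and the `_demo` arrays are hypothetical stand-ins for `best_logreg` and the crime features, so its numbers are not comparable to the report above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the crime features
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, n_informative=5,
                                     n_classes=3, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5,
                                          stratify=y_demo, random_state=2018)

# Fit once on the training half, then predict the untouched test half
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
test_predictions = model.predict(X_te)

print(classification_report(y_te, test_predictions))
print("Test accuracy:", accuracy_score(y_te, test_predictions))
```

With this pattern the accuracy in the report matches what `model.score(X_te, y_te)` returns, because both describe the same fitted model.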



### 9. Explore LogisticRegressionCV.  

With LogisticRegressionCV, you can access the best regularization strength for predicting each class! Read the documentation and see if you can implement a model with LogisticRegressionCV.
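As a starting point, a minimal `LogisticRegressionCV` sketch on synthetic data might look like the following. The range of `Cs`, the `_demo` arrays, and `max_iter=1000` are assumptions for the illustration. Whether `C_` holds a different value per class depends on the multiclass scheme and the scikit-learn version in use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic 3-class data standing in for the standardized training set
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=5,
                                     n_classes=3, random_state=2018)

Cs = np.logspace(-3, 3, 7)   # candidate regularization strengths (assumed range)

# Cross-validates every C in Cs internally, then refits with the best one
lr_cv = LogisticRegressionCV(Cs=Cs, cv=5, max_iter=1000)
lr_cv.fit(X_demo, y_demo)

best_C = np.atleast_1d(lr_cv.C_)   # selected C value(s)
print("Selected C:", best_C)
print("Training accuracy:", lr_cv.score(X_demo, y_demo))
```

Unlike `GridSearchCV`, the search over `C` happens inside `fit`, so no separate grid object is needed.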

In [134]:
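# A:
# note: this instantiates a plain LogisticRegression (tuned again with GridSearchCV below),
# not LogisticRegressionCV
lrCv = LogisticRegression()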

In [136]:
lrCv.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'max_iter': 100,
 'multi_class': 'ovr',
 'n_jobs': 1,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [143]:
gsCV = GridSearchCV(lrCv, 
                  param_grid=params,
                 cv=5,
                 scoring='accuracy',
                 return_train_score= True,
                 verbose=1)

In [144]:
gsCV.fit(X_train, y_train)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:   50.2s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([1.00000e-05, 5.17947e-05, 2.68270e-04, 1.38950e-03, 7.19686e-03,
       3.72759e-02, 1.93070e-01, 1.00000e+00, 5.17947e+00, 2.68270e+01,
       1.38950e+02, 7.19686e+02, 3.72759e+03, 1.93070e+04, 1.00000e+05]), 'penalty': ['l1', 'l2'], 'fit_intercept': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=1)

In [139]:
cross_val_score(gs.best_estimator_, X_train, y_train, cv=5).mean()

0.6312232599865191

In [145]:
print("Best score: ", gsCV.best_score_)

Best score:  0.6315555555555555


In [146]:
print("Best params: ", gsCV.best_params_)

Best params:  {'C': 1.0, 'fit_intercept': True, 'penalty': 'l1'}


In [147]:
gsCV.score(X_test, y_test)

0.6332222222222222