<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Grid Searching and Multinomial Models with San Francisco Crime Data

_Authors: Joseph Nelson (DC), Sam Stack (DC)_

---

### Multinomial Logistic Regression Models

So far, we've been using logistic regression for binary problems where there are only two class labels. Logistic regression can also be extended to dependent variables with multiple classes.

There are two ways scikit-learn solves multiple class problems with logistic regression: a multinomial loss or a "one-versus-rest" (OvR) process in which a model is fit for each target class versus all of the other classes. 

**Multinomial vs. OvR**
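- (both) `k` classes.
- (M) One joint model; in the classical formulation, `k-1` sets of coefficients relative to one reference category.
- (OvR) `k` binary models, one per class versus all the others. (`k*(k-1)/2` is the model count for one-versus-one, not OvR.)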
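To make the comparison concrete, here is a minimal sketch on synthetic data (the data set and the names `X_demo`, `y_demo`, `multinomial`, and `ovr` are illustrative assumptions, not part of the lab). It assumes a recent scikit-learn, where `LogisticRegression` with the default `lbfgs` solver fits a multinomial model on multiclass targets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem standing in for the crime data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

# One joint model fit with a multinomial (softmax) loss.
multinomial = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One-versus-rest: a separate binary model for each class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

print(multinomial.coef_.shape)  # one row of coefficients per class
print(len(ovr.estimators_))     # number of binary models that were fit
```

Note that scikit-learn stores a coefficient row for every class in the multinomial fit, rather than dropping a reference category as the textbook `k-1` formulation does.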

You'll use grid search in conjunction with multinomial logistic regression to optimize a model that predicts the category (type) of crime based on various features captured by San Francisco police departments.

Rather than use the [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html), we're going to practice building individual models optimized to predict on _one class versus the rest_ for this lab.

**Necessary Lab Imports**

In [1]:
import numpy as np
import pandas as pd
import patsy

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


import seaborn as sns

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1) Read in the data.

In [24]:
crime_csv = './datasets/sf_crime_train.csv'

In [25]:
df = pd.read_csv(crime_csv)
df.head(3)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 9 columns):
Dates         18000 non-null object
Category      18000 non-null object
Descript      18000 non-null object
DayOfWeek     18000 non-null object
PdDistrict    18000 non-null object
Resolution    18000 non-null object
Address       18000 non-null object
X             18000 non-null float64
Y             18000 non-null float64
dtypes: float64(2), object(7)
memory usage: 1.2+ MB


In [27]:
for col in df.columns[1:-3]:
    print(df[col].value_counts())

LARCENY/THEFT                  4885
OTHER OFFENSES                 2291
NON-CRIMINAL                   2255
ASSAULT                        1536
VEHICLE THEFT                   967
VANDALISM                       877
BURGLARY                        732
WARRANTS                        728
SUSPICIOUS OCC                  592
MISSING PERSON                  535
DRUG/NARCOTIC                   496
ROBBERY                         465
FRAUD                           363
SECONDARY CODES                 261
WEAPON LAWS                     212
TRESPASS                        130
STOLEN PROPERTY                 111
SEX OFFENSES FORCIBLE           103
FORGERY/COUNTERFEITING           85
DRUNKENNESS                      74
KIDNAPPING                       50
PROSTITUTION                     44
DRIVING UNDER THE INFLUENCE      42
DISORDERLY CONDUCT               37
ARSON                            35
LIQUOR LAWS                      25
RUNAWAY                          16
BRIBERY                     

### 2) Create column for hour, month, and year from "Dates" column.

> *Hint: `pd.to_datetime` may or may not be helpful.*


In [30]:
df['Month'] = pd.to_datetime(df['Dates']).map(lambda x: x.month)
df['Year'] = pd.to_datetime(df['Dates']).map(lambda x: x.year)
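The heading also asks for an hour column, which the cell above does not create. A possible sketch using the `.dt` accessor is shown here on a small stand-in frame (`sample` and `parsed` are hypothetical names; the date format string is an assumption based on the values shown in `df.head()`).

```python
import pandas as pd

# A few date strings in the same style as the "Dates" column.
sample = pd.DataFrame({'Dates': ['5/13/15 23:53', '5/13/15 23:33', '4/1/15 0:01']})

# Parse once, then pull each component from the datetime accessor.
parsed = pd.to_datetime(sample['Dates'], format='%m/%d/%y %H:%M')
sample['Hour'] = parsed.dt.hour
sample['Month'] = parsed.dt.month
sample['Year'] = parsed.dt.year

print(sample)
```

Parsing once and reusing the result avoids converting the same column to datetimes for every new feature.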

In [31]:
df.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Month,Year
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,5,2015
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,5,2015
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414,5,2015
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873,5,2015
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541,5,2015


### 3) Validate and clean the data.

In [33]:
df.describe(include='all')

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Month,Year
count,18000,18000,18000,18000,18000,18000,18000,18000.0,18000.0,18000.0,18000.0
unique,7855,38,510,7,10,10,6381,,,,
top,4/1/15 0:01,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,SOUTHERN,NONE,800 Block of BRYANT ST,,,,
freq,25,4885,2127,2930,3287,12862,402,,,,
mean,,,,,,,,-122.423639,37.768466,3.489944,2015.0
std,,,,,,,,0.026532,0.024391,0.868554,0.0
min,,,,,,,,-122.513642,37.708154,2.0,2015.0
25%,,,,,,,,-122.434199,37.753838,3.0,2015.0
50%,,,,,,,,-122.416949,37.775608,3.0,2015.0
75%,,,,,,,,-122.406539,37.78539,4.0,2015.0


### 4) Set up a target and predictor matrix for predicting violent, non-violent, and non-crimes.

**Non-Violent Crimes**
- Bad checks.
- Bribery.
- Drug/narcotic.
- Drunkenness.
- Embezzlement.
- Forgery/counterfeiting.
- Fraud.
- Gambling.
- Liquor laws.
- Loitering.
- Trespass.

**Non-Crimes**
- Non-criminal.
- Runaway.
- Secondary codes.
- Suspicious OCC.
- Warrants.

**Violent Crimes**
- Everything else.

**What type of model do you need here? What should your "baseline" category be?**

In [45]:
NVC = ['BAD CHECKS',
'BRIBERY',
'DRUG/NARCOTIC',
'DRUNKENNESS',
'EMBEZZLEMENT',
'FORGERY/COUNTERFEITING',
'FRAUD',
'GAMBLING',
'LIQUOR LAWS',
'LOITERING',
'TRESPASS']
NC = ['NON-CRIMINAL',
'RUNAWAY',
'SECONDARY CODES',
'SUSPICIOUS OCC',
'WARRANTS']

In [72]:
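# Binary indicator columns for non-violent crimes and non-crimes.
df['NVC'] = df['Category'].isin(NVC).astype(int)
df['NC'] = df['Category'].isin(NC).astype(int)
# Anything that is neither non-violent nor a non-crime is treated as violent.
df['VC'] = ((df['NVC'] == 0) & (df['NC'] == 0)).astype(int)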
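As an alternative to separate indicator columns, the three groups can be collapsed into a single label column, which also makes it easy to check the majority-class share (one common meaning of a "baseline" accuracy). The following is a sketch on a handful of category values; `nvc_demo`, `nc_demo`, `cats`, and `label` are illustrative names, and the lists are deliberately shortened.

```python
import numpy as np
import pandas as pd

# Shortened versions of the non-violent and non-crime lists (assumption).
nvc_demo = ['DRUG/NARCOTIC', 'FRAUD', 'TRESPASS']
nc_demo = ['NON-CRIMINAL', 'WARRANTS']

cats = pd.Series(['WARRANTS', 'OTHER OFFENSES', 'LARCENY/THEFT',
                  'FRAUD', 'NON-CRIMINAL', 'ASSAULT'])

# First matching condition wins; everything else falls through to 'VC'.
label = pd.Series(
    np.select([cats.isin(nvc_demo), cats.isin(nc_demo)], ['NVC', 'NC'], default='VC'),
    name='crime_type',
)

# Share of each class; the largest share is the accuracy of always guessing that class.
baseline = label.value_counts(normalize=True)
print(baseline)
```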

In [73]:
# Non-crime is the baseline category, so drop its indicator column.
df.drop('NC', axis=1, inplace=True)
df.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Month,Year,NVC,VC
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,5,2015,0,0
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,5,2015,0,1
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414,5,2015,0,1
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873,5,2015,0,1
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541,5,2015,0,1


### 5) Standardize the predictor matrix.

In [7]:
# A:
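One possible approach, sketched on toy rows rather than the real data: dummy-code the categorical predictors, then rescale every column with `StandardScaler`. The choice of `DayOfWeek`, `PdDistrict`, `X`, and `Y` as predictors is an assumption, and the rows in `demo` are illustrative, not actual records.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows with the same column names as the crime data (illustrative values).
demo = pd.DataFrame({
    'DayOfWeek': ['Wednesday', 'Wednesday', 'Tuesday', 'Monday'],
    'PdDistrict': ['NORTHERN', 'PARK', 'NORTHERN', 'SOUTHERN'],
    'X': [-122.425892, -122.438738, -122.424363, -122.406539],
    'Y': [37.774599, 37.771541, 37.800414, 37.785390],
})

# Dummy-code the categoricals, dropping one level of each as the reference.
X_raw = pd.get_dummies(demo, columns=['DayOfWeek', 'PdDistrict'],
                       drop_first=True, dtype=float)

# Rescale every column to mean 0 and standard deviation 1.
scaler = StandardScaler()
Xs = scaler.fit_transform(X_raw)

print(X_raw.columns.tolist())
print(np.round(Xs.mean(axis=0), 6))
```

Standardizing matters here because regularization penalizes coefficient size, so predictors on different scales would otherwise be penalized unevenly.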

### 6) Find the optimal hyperparameters (optimal regularization) to predict your crime categories.

> **Note:** Grid searching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently: `GridSearchCV` is general and can be applied to any model, while `LogisticRegressionCV` is specific to tuning the logistic regression regularization strength. `LogisticRegressionCV` is recommended here, but the downside is that lasso and ridge must be searched separately.

**References for logistic regression regularization hyperparameters:**
- `solver`: Algorithm used for optimization (relevant for multiclass).
    - `newton-cg`: Handles multinomial loss and L2 only.
    - `sag`: Handles multinomial loss, large data sets, and L2 only; works best on scaled data.
    - `lbfgs`: Handles multinomial loss and L2 only.
    - `liblinear`: Small data sets; supports L1 and L2; one-versus-rest only; no warm starts.
- `Cs`: Regularization strengths (smaller values are stronger penalties).
- `cv`: Cross-validations or number of folds.
- `penalty`: `'l1'` = lasso, `'l2'` = ridge.

In [8]:
# Example:
# Fit model with five folds and lasso regularization.
# Cs can be an integer (e.g., Cs=15 tests a grid of 15 values on a log scale)
# or an explicit list of values to try, as below.
# Remember: Cs describes the inverse of regularization strength.

# logreg_cv = LogisticRegressionCV(solver='liblinear', 
#                                  Cs=[1,5,10], 
#                                  cv=5, penalty='l1')
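A runnable version of the commented example might look like the following sketch. It uses synthetic binary data in place of the crime predictors, and the values in `Cs_grid` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary problem standing in for one class-versus-rest target (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0,
)

# Candidate inverse regularization strengths (smaller = stronger penalty).
Cs_grid = [0.01, 0.1, 1, 10]

# Five-fold cross-validation over the grid with a lasso (L1) penalty.
logreg_cv = LogisticRegressionCV(solver='liblinear', Cs=Cs_grid, cv=5, penalty='l1')
logreg_cv.fit(X_demo, y_demo)

# The selected C is stored on the fitted object.
best_C = float(np.ravel(logreg_cv.C_)[0])
print(best_C)
```

Repeating the fit with `penalty='l2'` gives the ridge counterpart, so the two penalties can be compared on their cross-validated scores.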

**Split data into training and testing sets with 50 percent in testing.**

In [9]:
# A:
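A minimal sketch of a 50/50 split, shown on synthetic arrays (`X_demo` and `y_demo` are stand-ins). Stratifying on the target is an optional choice that keeps the class proportions the same in both halves, which may help with imbalanced targets like these.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic predictors and an imbalanced binary target (80 zeros, 20 ones).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

# Hold out half of the rows for testing, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=42,
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```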

**Grid search hyperparameters for the training data.**

In [10]:
# A:
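If you prefer `GridSearchCV`, the penalty type can sit in the same grid as `C`. The sketch below uses synthetic data and an arbitrary grid; for the lab, the same search would be repeated once per target class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary problem standing in for one target class (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=42,
)

# Search lasso and ridge penalties across several strengths.
param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10]}

# liblinear supports both penalties, so one estimator covers the whole grid.
gs = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid,
                  cv=5, scoring='accuracy')
gs.fit(X_train, y_train)

print(gs.best_params_)
print(gs.best_score_)
```

After fitting, `gs.best_estimator_` is a model refit on all of the training data with the winning parameters.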

**Find the best parameters for each target class.**

In [11]:
# A:

**Build three logistic regression models using the best parameters for each target class.**

In [12]:
# A:

### 7) Build confusion matrices for the models above.
- Use the holdout test data from the train/test split.

In [13]:
# A:
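As a small worked illustration (the labels below are made up, not model output), `confusion_matrix` places true classes on the rows and predicted classes on the columns:

```python
from sklearn.metrics import confusion_matrix

# Hand-made true and predicted labels for a binary target.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary problem the four cells unpack in this order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 2 1 3
```

In the lab you would pass `y_test` and the model's predictions on `X_test` instead of the hand-made lists.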

### 8) Print classification reports for your three models.

In [14]:
# A:

**Describe the metrics in the classification report.**

In [15]:
# A:
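One way to ground the description: the headline numbers in a classification report can be recomputed by hand from confusion-matrix counts. The sketch below does this for the positive class using made-up labels.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made true and predicted labels for a binary target.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of the rows predicted positive, how many really are positive?
precision = tp / (tp + fp)
# Recall: of the rows that really are positive, how many did we catch?
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Support: number of true rows in the class.
support = tp + fn

print(precision, recall, round(f1, 4), support)

# The hand calculations agree with scikit-learn's metric functions.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      round(f1_score(y_true, y_pred), 4))
```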