<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Grid Searching and Multinomial Models with San Francisco Crime Data

_Authors: Joseph Nelson (DC), Sam Stack (DC)_

---

### Multinomial Logistic Regression Models

So far, we've been using logistic regression for binary problems where there are only two class labels. Logistic regression can also be extended to dependent variables with multiple classes.

There are two ways scikit-learn solves multiple class problems with logistic regression: a multinomial loss or a "one-versus-rest" (OvR) process in which a model is fit for each target class versus all of the other classes. 

**Multinomial vs. OvR**
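- (both) `k` classes.
- (M) One model fit with a multinomial (softmax) loss; in the classical formulation, `k-1` sets of coefficients measured against one reference category.
- (OvR) `k` binary models, one for each class versus all of the others.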
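
To make the model counts concrete, here is a minimal sketch on synthetic three-class data (generated with `make_classification`, not the crime data). It assumes a recent scikit-learn in which a plain `LogisticRegression` with the default `lbfgs` solver uses the multinomial loss for multiclass targets. Note that scikit-learn stores one coefficient row per class rather than dropping a reference category.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data (illustrative only).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
k = len(np.unique(y))

# One-versus-rest: one binary logistic regression per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
n_ovr_models = len(ovr.estimators_)

# Multinomial: a single model with one coefficient row per class.
softmax = LogisticRegression(max_iter=1000).fit(X, y)
coef_shape = softmax.coef_.shape

print(k, n_ovr_models, coef_shape)
```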

You'll use grid search in conjunction with multinomial logistic regression to optimize a model that predicts the category (type) of crime based on various features captured by San Francisco police departments.

Rather than use the [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html), we're going to practice building individual models optimized to predict on _one class versus the rest_ for this lab.

**Necessary Lab Imports**

In [1]:
import numpy as np
import pandas as pd
import patsy

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


import seaborn as sns

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1) Read in the data.

In [2]:
crime_csv = '../datasets/sf_crime_train.csv'

In [3]:
sf_crime = pd.read_csv(crime_csv)
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
# The "Dates" column holds datetimes; check whether pandas read it in as a generic object column.
sf_crime.Dates.dtype

dtype('O')

### 2) Create column for hour, month, and year from "Dates" column.

> *Hint: `pd.to_datetime` may or may not be helpful.*


In [7]:
# Parse the whole column at once rather than row by row.
sf_crime['date'] = pd.to_datetime(sf_crime['Dates'])

In [9]:
# The .dt accessor exposes datetime parts without a per-row lambda.
sf_crime['year'] = sf_crime['date'].dt.year
sf_crime['month'] = sf_crime['date'].dt.month
sf_crime['day'] = sf_crime['date'].dt.day
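
The section heading also asks for the hour, which the cells above do not extract. A minimal sketch of pulling it out with the `.dt` accessor follows. The `sample` frame is a hypothetical stand-in built from two timestamps in the rows shown earlier, and the explicit `format` string is an assumption based on how those timestamps look.

```python
import pandas as pd

# Two timestamps copied from the sample rows above (stand-in for sf_crime).
sample = pd.DataFrame({'Dates': ['5/13/15 23:53', '5/13/15 23:30']})

# An explicit format avoids per-row format inference.
sample['date'] = pd.to_datetime(sample['Dates'], format='%m/%d/%y %H:%M')

sample['year'] = sample['date'].dt.year
sample['month'] = sample['date'].dt.month
sample['hour'] = sample['date'].dt.hour

print(sample[['year', 'month', 'hour']])
```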

In [10]:
# Check out the current DataFrame if you're interested.
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,date,year,month,day
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015-05-13 23:53:00,2015,5,13
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015-05-13 23:53:00,2015,5,13


In [11]:
# Dropping columns where time is expressed in human language:
sf_crime.drop(['Dates','date'], axis = 1, inplace = True)

### 3) Validate and clean the data.

In [12]:
sf_crime['Category'].value_counts()

LARCENY/THEFT                  4885
OTHER OFFENSES                 2291
NON-CRIMINAL                   2255
ASSAULT                        1536
VEHICLE THEFT                   967
VANDALISM                       877
BURGLARY                        732
WARRANTS                        728
SUSPICIOUS OCC                  592
MISSING PERSON                  535
DRUG/NARCOTIC                   496
ROBBERY                         465
FRAUD                           363
SECONDARY CODES                 261
WEAPON LAWS                     212
TRESPASS                        130
STOLEN PROPERTY                 111
SEX OFFENSES FORCIBLE           103
FORGERY/COUNTERFEITING           85
DRUNKENNESS                      74
KIDNAPPING                       50
PROSTITUTION                     44
DRIVING UNDER THE INFLUENCE      42
DISORDERLY CONDUCT               37
ARSON                            35
LIQUOR LAWS                      25
RUNAWAY                          16
BRIBERY                     

In [13]:
# There is one instance of "TRESPASSING" while all others are "TRESPASS",
# as well as two instances of "ASSUALT", a misspelling of "ASSAULT".

In [14]:
sf_crime['DayOfWeek'].value_counts()
# All days of the week are there.

Wednesday    2930
Friday       2733
Saturday     2556
Thursday     2479
Sunday       2456
Monday       2447
Tuesday      2399
Name: DayOfWeek, dtype: int64

In [15]:
sf_crime['PdDistrict'].value_counts()
# The values look good.

SOUTHERN      3287
NORTHERN      2250
CENTRAL       2206
MISSION       2118
BAYVIEW       1678
INGLESIDE     1628
TARAVAL       1426
TENDERLOIN    1327
RICHMOND      1101
PARK           979
Name: PdDistrict, dtype: int64

In [16]:
sf_crime['Resolution'].value_counts()
# One instance of not prosecuted was found.

NONE                                      12862
ARREST, BOOKED                             4455
UNFOUNDED                                   367
ARREST, CITED                               100
JUVENILE BOOKED                              94
EXCEPTIONAL CLEARANCE                        58
PSYCHOPATHIC CASE                            28
LOCATED                                      25
CLEARED-CONTACT JUVENILE FOR MORE INFO       10
NOT PROSECUTED                                1
Name: Resolution, dtype: int64

In [17]:
sf_crime[['X','Y']].describe()
# All of the coordinates appear to be legitimate.

Unnamed: 0,X,Y
count,18000.0,18000.0
mean,-122.423639,37.768466
std,0.026532,0.024391
min,-122.513642,37.708154
25%,-122.434199,37.753838
50%,-122.416949,37.775608
75%,-122.406539,37.78539
max,-122.365565,37.819923


In [18]:
# Find where the misspelled values sit in the DataFrame:
sf_crime[sf_crime['Category'] == 'ASSUALT']
# Rows 2750 and 4330.

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,year,month,day
2750,ASSUALT,AGGRAVATED ASSAULT WITH A DEADLY WEAPON,Wednesday,MISSION,NONE,3000 Block of 16TH ST,-122.421083,37.764911,2015,4,29
4330,ASSUALT,THREATS AGAINST LIFE,Saturday,MISSION,"ARREST, BOOKED",16TH ST / CALEDONIA ST,-122.421382,37.764948,2015,4,18


In [19]:
sf_crime[sf_crime['Category'] == 'TRESPASSING']
# Row 5519.

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,year,month,day
5519,TRESPASSING,TRESPASSING,Thursday,CENTRAL,"ARREST, BOOKED",300 Block of MONTGOMERY ST,-122.402739,37.792375,2015,4,16


In [20]:
sf_crime.loc[5519, 'Category']

'TRESPASSING'

In [21]:
# The issues with the data are small enough to be changed manually.
sf_crime.loc[2750, 'Category'] = 'ASSAULT'
sf_crime.loc[4330, 'Category'] = 'ASSAULT'
sf_crime.loc[5519, 'Category'] = 'TRESPASS'
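
Fixing values by row label works for a handful of rows, but it depends on the index staying the same. As an alternative sketch, the same cleanup can be written as a value mapping with `Series.replace`. The `toy` frame here is a hypothetical stand-in for `sf_crime`.

```python
import pandas as pd

# Toy stand-in for the Category column, including both known bad spellings.
toy = pd.DataFrame({'Category': ['ASSAULT', 'ASSUALT', 'TRESPASS',
                                 'TRESPASSING', 'ASSUALT']})

# Map each bad label to its corrected form; other values pass through unchanged.
fixes = {'ASSUALT': 'ASSAULT', 'TRESPASSING': 'TRESPASS'}
toy['Category'] = toy['Category'].replace(fixes)

print(toy['Category'].value_counts())
```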

### 4) Set up a target and predictor matrix for predicting violent, versus non-violent, versus non-crimes.

**Non-Violent Crimes**
- Bad checks.
- Bribery.
- Drug/narcotic.
- Drunkenness.
- Embezzlement.
- Forgery/counterfeiting.
- Fraud.
- Gambling.
- Liquor.
- Loitering.
- Trespass.

**Non-Crimes**
- Non-criminal.
- Runaway.
- Secondary codes.
- Suspicious OCC.
- Warrants.

**Violent Crimes**
- Everything else.

**What type of model do you need here? What should your "baseline" category be?**
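
One way to build the three-class target is to map each `Category` value through the lists above. The sketch below is hedged in two ways. The uppercase spellings for categories that do not appear in the visible `value_counts()` output (for example `'BAD CHECKS'` or `'LOITERING'`) are assumptions, and `crime_type` is a hypothetical helper name. It also computes the majority-class share, which is one common reading of a "baseline" accuracy.

```python
import pandas as pd

# Spellings follow the dataset where visible; the rest are assumed.
NON_VIOLENT = {'BAD CHECKS', 'BRIBERY', 'DRUG/NARCOTIC', 'DRUNKENNESS',
               'EMBEZZLEMENT', 'FORGERY/COUNTERFEITING', 'FRAUD', 'GAMBLING',
               'LIQUOR LAWS', 'LOITERING', 'TRESPASS'}
NON_CRIME = {'NON-CRIMINAL', 'RUNAWAY', 'SECONDARY CODES',
             'SUSPICIOUS OCC', 'WARRANTS'}


def crime_type(category):
    """Hypothetical helper: collapse a raw category into one of three classes."""
    if category in NON_VIOLENT:
        return 'non-violent'
    if category in NON_CRIME:
        return 'non-crime'
    return 'violent'  # "Everything else" per the lists above.


# Toy stand-in for sf_crime['Category'].
toy = pd.Series(['LARCENY/THEFT', 'WARRANTS', 'FRAUD',
                 'ASSAULT', 'NON-CRIMINAL', 'VEHICLE THEFT'])
target = toy.map(crime_type)

# Share of the most frequent class: accuracy of always guessing that class.
baseline_accuracy = target.value_counts(normalize=True).max()
print(target.tolist(), baseline_accuracy)
```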

### 5) Standardize the predictor matrix.
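
Regularized logistic regression penalizes coefficient size, so predictors on very different scales (coordinates versus hours, say) should be put on a common scale first. A minimal sketch with `StandardScaler` follows. The small `X` array is made up for illustration and is not the lab's predictor matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up predictors: longitude, latitude, hour.
X = np.array([[-122.43, 37.77, 23.0],
              [-122.42, 37.80, 8.0],
              [-122.44, 37.77, 15.0],
              [-122.40, 37.79, 2.0]])

# Each column is centered to mean 0 and scaled to unit variance.
scaler = StandardScaler()
Xs = scaler.fit_transform(X)

print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6))
```

When a train/test split is involved, a common practice is to call `fit` on the training rows only and then `transform` the test rows with the same scaler.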

### 6) Find the optimal hyperparameters (optimal regularization) to predict your crime categories.

> **Note:** Grid searching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently: `GridSearchCV` is general and can be applied to any estimator, while `LogisticRegressionCV` is specific to tuning the regularization strength of a logistic regression. `LogisticRegressionCV` is recommended here, but the downside is that the lasso and ridge penalties must be searched separately.

**References for logistic regression regularization hyperparameters:**
- `solver`: Algorithm used for optimization (relevant for multiclass).
    - `newton-cg`: Handles multinomial loss and L2 only.
    - `sag`: Handles multinomial loss, large data sets, and L2 only; works best on scaled data.
    - `lbfgs`: Handles multinomial loss and L2 only.
    - `liblinear`: Small data sets; supports L1 and L2; one-versus-rest only (no multinomial loss); no warm starts.
- `Cs`: Candidate values of `C`, the inverse regularization strength (smaller values are stronger penalties). An integer `n` gives `n` values on a log scale between `1e-4` and `1e4`.
- `cv`: Cross-validation generator or number of folds.
- `penalty`: `'l1'` = lasso, `'l2'` = ridge.

**Split data into training and testing sets with 50 percent in testing.**
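
A minimal sketch of a 50/50 split with `train_test_split` follows. The arrays are toy stand-ins for the predictor matrix and target, and `stratify=y` is an optional choice (not stated in the lab) that keeps the class proportions the same in both halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 20 rows, 2 predictors, 3 imbalanced classes.
X = np.arange(40).reshape(20, 2)
y = np.array(['violent'] * 10 + ['non-violent'] * 6 + ['non-crime'] * 4)

# test_size=0.5 holds out half of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```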

**Grid search hyperparameters for the training data.**

LogisticRegressionCV(Cs=100, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
           refit=True, scoring='accuracy', solver='liblinear', tol=0.0001,
           verbose=0)

**Find the best parameters for each target class.**

('best C for class:', {'non-violent': 0.14174741629268062, 'violent': 138.48863713938746, 'non-crime': 0.298364724028334})
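
A dictionary like the one above can be produced by running one search per class on a binary "this class versus the rest" target. The sketch below assumes synthetic data from `make_classification` and a smaller grid (`Cs=10`) to keep it quick. The `best_C` name and the class labels attached to the synthetic targets are illustrative, and the C values it finds will not match the ones printed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic three-class data standing in for the standardized training set.
X, y_int = make_classification(n_samples=600, n_features=6, n_informative=4,
                               n_redundant=0, n_classes=3,
                               n_clusters_per_class=1, random_state=0)
labels = ['non-crime', 'non-violent', 'violent']
y = np.array(labels)[y_int]

best_C = {}
for label in labels:
    # Binary target: 1 for this class, 0 for every other class.
    y_binary = (y == label).astype(int)
    search = LogisticRegressionCV(Cs=10, cv=5, penalty='l1',
                                  solver='liblinear', scoring='accuracy')
    search.fit(X, y_binary)
    # C_ holds the selected C; ravel() copes with array or scalar storage.
    best_C[label] = float(np.ravel(search.C_)[0])

print('best C for class:', best_C)
```

Each selected value can then be passed as `C` to a plain `LogisticRegression` with the same penalty and solver to build the per-class models.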


**Build three logistic regression models using the best parameters for each target class.**

### 7) Build confusion matrices for the models above.
- Use the holdout test data from the train/test split.
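
As a small illustration of how to read a confusion matrix for one of the binary models, here is a sketch on made-up labels (1 = the target class, 0 = the rest). With `labels=[0, 1]`, rows are true classes and columns are predicted classes, so the layout is `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Made-up holdout labels and predictions for one class-versus-rest model.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Flattening row by row yields TN, FP, FN, TP for the binary case.
tn, fp, fn, tp = cm.ravel()
print(cm)
```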

### 8) Print classification reports for your three models.

**Describe the metrics in the classification report.**
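
As a reference point for that description: precision is TP / (TP + FP), recall is TP / (TP + FN), the F1 score is the harmonic mean of the two, and support is the number of true instances of the class. The sketch below checks these definitions against scikit-learn on made-up labels, with the counts worked out by hand in the comments.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 true positives-class rows, 4 rest-class rows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# By hand: TP = 2, FN = 2, FP = 1, TN = 3.
precision = precision_score(y_true, y_pred)  # 2 / (2 + 1)
recall = recall_score(y_true, y_pred)        # 2 / (2 + 2)
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R)
support = sum(y_true)                        # true instances of class 1

print(precision, recall, f1, support)
```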