<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Gridsearch and Multinomial Models with SF Crime Data


---



Predict the category (type) of crime based on various features captured by San Francisco police departments.

**Necessary lab imports**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

### 1. Read in the data

In [3]:
# read in the data using pandas
sf_crime = pd.read_csv(
    '../../../../resource-datasets/sf_crime/sf_crime_sample.csv')
sf_crime.drop('DayOfWeek', axis=1, inplace=True)
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052
2,2004-03-06 03:00:00,NON-CRIMINAL,LOST PROPERTY,SOUTHERN,NONE,800 Block of BRYANT ST,-122.403405,37.775421
3,2011-12-03 12:10:00,BURGLARY,"BURGLARY OF STORE, UNLAWFUL ENTRY",TARAVAL,"ARREST, BOOKED",3200 Block of 20TH AV,-122.475647,37.728528
4,2003-01-10 00:15:00,LARCENY/THEFT,PETTY THEFT OF PROPERTY,NORTHERN,NONE,POLK ST / BROADWAY ST,-122.421772,37.795946


In [4]:
# check the shape of your dataframe

In [5]:
# check whether there are any missing values
# do we need to fix anything here?

In [6]:
# check what your datatypes are
# do we need to fix anything here?
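
As a rough sketch of the three checks above, the snippet below runs them on a small made-up frame. The `toy` DataFrame and its values are invented for illustration and are not the lab data; the same calls should apply to `sf_crime`.

```python
import numpy as np
import pandas as pd

# toy stand-in for sf_crime; values are invented for illustration
toy = pd.DataFrame({
    'Category': ['ARSON', 'FRAUD', None],
    'X': [-122.38, -122.43, np.nan],
})

print(toy.shape)                # (number of rows, number of columns)

missing = toy.isnull().sum()    # count of missing values per column
print(missing)

print(toy.dtypes)               # dtype of each column
```

Columns with a non-zero count in `missing` are the ones that may need fixing before modelling.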

### 2. Create columns for year, month, day of week, hour, time, and date from the 'Dates' column.

> *`pd.to_datetime` and `Series.dt` may be helpful here!*


In [7]:
# convert the 'Dates' column to a datetime object
sf_crime['Dates'] = pd.to_datetime(sf_crime['Dates'])
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052


In [8]:
# create a new column for 'Year','Month',and 'Day_of_Week'
sf_crime['Year'] = sf_crime['Dates'].dt.year
sf_crime['Month'] = sf_crime['Dates'].dt.month
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same labels
sf_crime['Day_of_Week'] = sf_crime['Dates'].dt.day_name()
# check the first couple rows to make sure it's what you want
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday


In [9]:
# create a column for the 'Hour','Time', and 'Date'
sf_crime['Hour'] = sf_crime['Dates'].dt.hour
sf_crime['Time'] = sf_crime['Dates'].dt.time
sf_crime['Date'] = sf_crime['Dates'].dt.date
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week,Hour,Time,Date
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday,23,23:27:00,2003-03-23
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday,6,06:45:00,2006-03-07


In [10]:
# Drop the 'Dates' column

### 3. Validate and clean the data.

In [11]:
# check the 'Category' value counts to see what sort of categories there are
# and to see if anything might require cleaning (particularly the ones with fewer values)

In [12]:
# have a look to see whether you have all the days of the week in your data

In [13]:
# have a look at the value counts for 'Descript', 'PdDistrict', and 'Resolution' to make sure it all checks out

In [14]:
# use .describe() to see whether the location coordinates seem appropriate
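
One possible way to approach these validation checks is sketched below on invented data. The `toy` frame, its values, and the latitude bounds of 37 to 38 are assumptions chosen for illustration only.

```python
import pandas as pd

# toy stand-in for sf_crime; values are invented for illustration
toy = pd.DataFrame({
    'Category': ['FRAUD', 'FRAUD', 'ARSON', 'TREA'],
    'X': [-122.42, -122.40, -122.38, -120.5],
    'Y': [37.77, 37.78, 37.73, 90.0],
})

# rare categories show up at the bottom of the value counts
counts = toy['Category'].value_counts()
print(counts)

# min and max reveal coordinates that cannot be in San Francisco
summary = toy[['X', 'Y']].describe()
print(summary)

# flag rows whose latitude falls outside an assumed plausible range
suspect = toy[(toy['Y'] < 37) | (toy['Y'] > 38)]
print(suspect)
```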

### 4. Set up a target and predictor matrix for predicting violent crime vs. non-violent crime vs. non-crimes.

**Non-Violent Crimes:**
- bad checks
- bribery
- drug/narcotic
- drunkenness
- embezzlement
- forgery/counterfeiting
- fraud
- gambling
- liquor laws
- loitering
- trespass
- other offenses

**Non-Crimes:**
- non-criminal
- runaway
- secondary codes
- suspicious occ
- warrants

**Violent Crimes:**
- everything else



**What type of model do you need here? What is your baseline accuracy?**

In [15]:
NVC = ['BAD CHECKS', 'BRIBERY', 'DRUG/NARCOTIC', 'DRUNKENNESS',
       'EMBEZZLEMENT', 'FORGERY/COUNTERFEITING', 'FRAUD',
       'GAMBLING', 'LIQUOR LAWS', 'LOITERING', 'TRESPASS', 'OTHER OFFENSES']

NOT_C = ['NON-CRIMINAL', 'RUNAWAY',
         'SECONDARY CODES', 'SUSPICIOUS OCC', 'WARRANTS']

# use a list comprehension to get all the categories in sf_crime['Category'].unique() that are NOT in the lists above

VC = []
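
The "everything else" pattern can be sketched with a list comprehension as follows. The short lists here are stand-ins (the names `nvc`, `not_c`, and `observed` are hypothetical); in the lab, `observed` would be `sf_crime['Category'].unique()`.

```python
# shortened stand-ins for the NVC and NOT_C lists
nvc = ['FRAUD', 'GAMBLING']
not_c = ['NON-CRIMINAL', 'WARRANTS']

# stand-in for sf_crime['Category'].unique()
observed = ['ASSAULT', 'FRAUD', 'WARRANTS', 'ROBBERY']

# keep every category that appears in neither list
vc = [cat for cat in observed if cat not in nvc and cat not in not_c]
print(vc)  # ['ASSAULT', 'ROBBERY']
```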

In [16]:
# add a column called 'Type' into your dataframe that stores whether the observation was:
# Non-Violent, Violent, or Non-Crime
# use .map()!


def typecrime(x):
    if x in NOT_C:
        return 'NOT_CRIMINAL'
    if x in NVC:
        return 'NON-VIOLENT'
    if x in VC:
        return 'VIOLENT_CRIME'

# sf_crime['Type']=
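
As a minimal illustration of `.map()` with a function, the sketch below labels a small invented Series. The `groups` dictionary and `label` function are hypothetical helpers, not part of the lab.

```python
import pandas as pd

# hypothetical lookup from category to type; unlisted categories count as violent
groups = {'NON-CRIMINAL': 'NOT_CRIMINAL', 'FRAUD': 'NON-VIOLENT'}

def label(cat):
    # anything not listed falls through to the violent bucket
    return groups.get(cat, 'VIOLENT_CRIME')

cats = pd.Series(['FRAUD', 'ASSAULT', 'NON-CRIMINAL'])
types = cats.map(label)   # applies label() to every element
print(types.tolist())     # ['NON-VIOLENT', 'VIOLENT_CRIME', 'NOT_CRIMINAL']
```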

In [17]:
# find the baseline accuracy:
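
The baseline accuracy of a classifier is the share of the most frequent class, i.e. the accuracy of always predicting that class. A small sketch on an invented target (`y_toy` is a hypothetical stand-in for the 'Type' column):

```python
import pandas as pd

# invented target with 6 violent, 3 non-violent, and 1 non-crime observation
y_toy = pd.Series(['VIOLENT_CRIME'] * 6 + ['NON-VIOLENT'] * 3 + ['NOT_CRIMINAL'])

shares = y_toy.value_counts(normalize=True)  # class proportions
baseline = shares.max()                      # proportion of the majority class
print(baseline)  # 0.6
```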

In [18]:
# create a target array with 'Type'
# y =

In [19]:
# create a predictor matrix with 'Day_of_Week','Month','Year','PdDistrict','Hour', and 'Resolution'
# X =

In [20]:
# use pd.get_dummies() to dummify your categorical variables
# remember to drop a column!
# X =
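
A sketch of dummifying one categorical column while dropping a reference level, using an invented frame (`X_toy` is hypothetical). `drop_first=True` removes the first level of each encoded column, which avoids perfectly collinear dummies.

```python
import pandas as pd

# toy predictor frame; values are invented for illustration
X_toy = pd.DataFrame({
    'PdDistrict': ['BAYVIEW', 'NORTHERN', 'SOUTHERN'],
    'Hour': [23, 6, 3],
})

# encode only the categorical column and drop its first level (BAYVIEW)
X_dum = pd.get_dummies(X_toy, columns=['PdDistrict'], drop_first=True)
print(X_dum.columns.tolist())
# ['Hour', 'PdDistrict_NORTHERN', 'PdDistrict_SOUTHERN']
```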

### 5. Create a train/test split and standardize the predictor matrices

In [21]:
# create a 50/50 train test split;
# stratify based on your target variable
# use a random state of 2018
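
The effect of stratifying can be seen on a small invented dataset. In this sketch `X_toy` and `y_toy` are hypothetical; with `stratify` set, both halves should keep the class proportions of the full target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# invented data: 20 rows, classes in a 10 / 6 / 4 ratio
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(['a'] * 10 + ['b'] * 6 + ['c'] * 4)

# 50/50 split that preserves the class ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=2018)

train_counts = np.unique(y_train, return_counts=True)[1]
print(train_counts)
```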

In [22]:
# standardise your predictor matrices
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
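
A common pattern, sketched here on invented arrays, is to fit the scaler on the training data only and reuse its statistics on the test data, so that no information from the test set leaks into the model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# invented train and test matrices
X_train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
X_test = np.array([[2.0, 30.0], [6.0, 70.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_std = scaler.transform(X_test)        # reuse the training statistics

print(X_train_std.mean(axis=0))  # approximately zero in each column
print(X_test_std)
```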

### 6. Create a basic Logistic Regression model and use cross_val_score to assess its performance on your training data

In [23]:
# create a default Logistic Regression model and find its mean cross-validated accuracy with your training data
# use 5 cross-validation folds
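
For illustration, the sketch below cross-validates a logistic regression on synthetic three-class data from `make_classification` (an assumption, standing in for the standardized training data). `max_iter=1000` is also an assumption, set only to reduce the chance of convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic three-class data standing in for the training set
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=2018)

# one accuracy score per fold; the mean summarises them
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5)
print(scores)
print(scores.mean())
```

The mean score is the number to compare against the baseline accuracy.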

In [24]:
# create a confusion matrix
# predictions =
# confusion = confusion_matrix()
# pd.DataFrame(confusion,
#             columns=sorted(y_train.unique()),
#             index=sorted(y_train.unique()))
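
A small worked example of a labelled confusion matrix, using invented true and predicted labels. Rows are true classes and columns are predicted classes; passing `labels` explicitly keeps the ordering consistent with the index.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# invented labels for illustration
y_true = pd.Series(['a', 'a', 'b', 'b', 'c', 'c'])
y_pred = pd.Series(['a', 'b', 'b', 'b', 'c', 'a'])

labels = sorted(y_true.unique())
confusion = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=labels,      # true classes
    columns=labels)    # predicted classes
print(confusion)
```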

### 7. Find the optimal hyperparameters (optimal regularization) to predict your crime categories using GridSearchCV.

> **Note:** A grid search can be run with either `GridSearchCV` or `LogisticRegressionCV`. They operate differently: `GridSearchCV` is more general and can be applied to any model, while `LogisticRegressionCV` is specific to tuning the hyperparameters of a logistic regression. I recommend `LogisticRegressionCV` for this model, but the downside is that lasso and ridge penalties must be searched separately. To start with, use `GridSearchCV`.

**Reference for logistic regression regularization hyperparameters:**
- `solver`: algorithm used for optimization (relevant for multiclass)
    - `newton-cg` - handles multinomial loss, L2 only
    - `sag` - handles multinomial loss, suited to large datasets, L2 only, works best on scaled data
    - `lbfgs` - handles multinomial loss, L2 only
    - `liblinear` - suited to small datasets, supports L1 and L2, no warm starts
- `C`: inverse of regularization strength (smaller values mean stronger penalties)
- `penalty`: `'l1'` - Lasso, `'l2'` - Ridge
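
The grid search workflow could look roughly like the sketch below, run on synthetic data rather than the crime data. Two assumptions to note: the data comes from `make_classification`, and the solver is `saga`, which is not in the list above but is used here because it accepts both `'l1'` and `'l2'` with a multiclass target. The `penalty` spelling follows the reference above; some scikit-learn releases may warn about it or prefer a different parameter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic three-class data standing in for the crime predictors
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=2018)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=2018)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# hyperparameter dictionary: every combination is cross-validated
params = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1.0, 10.0]}

gs = GridSearchCV(LogisticRegression(solver='saga', max_iter=5000), params, cv=5)
gs.fit(X_train, y_train)

print(gs.best_params_)   # best combination found
print(gs.best_score_)    # its mean cross-validated accuracy

best_logreg = gs.best_estimator_             # refit on all training data
print(best_logreg.score(X_test, y_test))     # accuracy on held-out data
```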

In [25]:
# create a hyperparameter dictionary for a logistic regression

In [26]:
# create a gridsearch object using LogisticRegression() and the dictionary you created above

In [27]:
# fit the gridsearch object on your training data

In [28]:
# print out the best parameters

In [29]:
# print out the best mean cross-validated score

In [30]:
# assign your best estimator to the variable 'best_logreg'

In [31]:
# score your model on your testing data

### 8. Print out a classification report for your best_logreg model

In [32]:
# use your test data to create your classification report
# predictions =
# print(classification_report())
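
`classification_report` takes the true labels first and the predictions second. A brief sketch on invented labels:

```python
from sklearn.metrics import classification_report

# invented labels for illustration
y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']

# per-class precision, recall, f1-score, and support
report = classification_report(y_true, y_pred)
print(report)
```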

### 9. Explore LogisticRegressionCV.  

With `LogisticRegressionCV`, you can access the best regularization strength found for each class through the fitted model's `C_` attribute. Read the documentation and see if you can implement a model with `LogisticRegressionCV`.
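
As a starting point, here is a minimal sketch on synthetic data (an assumption, not the crime data). It searches 10 values of `C` with the default ridge penalty; the exact shape of `C_` depends on the scikit-learn version and the multiclass strategy, so it is printed rather than assumed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# synthetic three-class data standing in for the training set
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=2018)

# Cs=10 builds a grid of 10 candidate strengths on a log scale
lr_cv = LogisticRegressionCV(Cs=10, cv=5, max_iter=5000)
lr_cv.fit(X_toy, y_toy)

print(lr_cv.C_)                      # selected regularization strength(s)
print(lr_cv.score(X_toy, y_toy))     # accuracy on the data it was fit to
```

A lasso search would need a separate fit with an L1 penalty and a solver that supports it.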

In [33]:
# A: