# Prediction and Supervised Learning: SDPD

Today, we'll make predictions on the San Diego Police Department (SDPD) traffic stops dataset, paying attention to which attributes contribute most to our model.

Among the questions we'll ask:
* Can you predict age, gender, or ethnicity from other factors? (Which of these are regression problems, and which are classification?)
* Can you predict who will be searched or arrested, based on attributes of the stopped driver?

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
from datascience import *

### Clean the SDPD data

As machine learning algorithms usually require numeric input, the cell below cleans the SDPD data in a standard way.
* Yes/No fields are changed to 1/0.
* Fields not conforming to the expected values are given the value -1.
    - you will likely want to filter out rows with -1 values before doing predictions!
* The ethnicities are encoded with integers using `race_dict`.
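
As a minimal sketch of the filtering step mentioned above, the cell below drops every row that has a -1 in any of the chosen columns. The `toy` frame and the `columns_to_check` list are made up for illustration; on the real data you would apply the same idea to `cleaned` (or the equivalent on the `sdpd` table) with whichever columns your prediction uses.

```python
import pandas as pd

# Toy stand-in for the cleaned table (hypothetical values, not real SDPD data).
toy = pd.DataFrame({
    'subject_sex': [1, 0, -1, 1],
    'subject_age': [34.0, -1.0, 22.0, 51.0],
    'searched':    [0, 1, 0, -1],
})

# Columns that must hold a valid value for the prediction at hand.
columns_to_check = ['subject_sex', 'subject_age', 'searched']

# True for rows where none of the checked columns equals -1.
valid_rows = (toy[columns_to_check] != -1).all(axis=1)
filtered = toy[valid_rows]

print(len(toy), 'rows before,', len(filtered), 'rows after')
```

Filtering only on the columns you actually use keeps more rows than filtering on every column at once.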

In [None]:
import pandas as pd
sdraw = pd.read_csv('../01.Traffic_Stops/1.messy_data/data/vehicle_stops_2016_datasd.csv')

# convert string date to date-time object
sdraw['timestamp'] = pd.to_datetime(sdraw.timestamp)

cleaned = pd.DataFrame()
cleaned['stop_id'] = sdraw['stop_id']

# clean stop_cause
cleaned['is_moving_violation'] = sdraw.stop_cause.apply(lambda x:float(x == 'Moving Violation'))
cleaned['is_equipment_violation'] = sdraw.stop_cause.apply(lambda x:float(x == 'Equipment Violation'))

# service area: all non-digit values to -1, else convert to an integer
cleaned['service_area'] = sdraw.service_area.apply(lambda x: int(x) if str(x).isdigit() else -1)

# race: translate race codes to integers, given in race_dict
race_dict = dict(zip(sdraw.subject_race.unique(), range(1000)))
cleaned['subject_race'] = sdraw.subject_race.apply(lambda x: race_dict.get(x))

# sex: M=>1, F=>0, Else -1
def sex(s):
    if pd.isnull(s):
        return -1
    if s.upper() == 'M':
        return 1
    elif s.upper() == 'F':
        return 0
    else:
        return -1

cleaned['subject_sex'] = sdraw.subject_sex.apply(sex)

# Age: if not number, or > 100, then -1. Else make a float
def age(s):
    if pd.isnull(s):
        return -1
    if not s.isdigit():
        return -1
    if float(s) > 100:
        return -1
    else:
        return float(s)
    
cleaned['subject_age'] = sdraw.subject_age.apply(age)

# Datetime columns, using datetime methods
cleaned['hour'] = sdraw.timestamp.apply(lambda x:x.hour)
cleaned['day_of_week'] = sdraw.timestamp.apply(lambda x:x.dayofweek)
cleaned['day_of_month'] = sdraw.timestamp.apply(lambda x:x.day)
cleaned['month'] = sdraw.timestamp.apply(lambda x:x.month)

# SD resident / searched / arrested

def yes_no(s):
    if pd.isnull(s):
        return -1
    if s.upper() == 'Y':
        return 1
    elif s.upper() == 'N':
        return 0
    else:
        return -1
    
cleaned['sd_resident'] = sdraw.sd_resident.apply(yes_no)
cleaned['arrested'] = sdraw.arrested.apply(yes_no)
cleaned['searched'] = sdraw.searched.apply(yes_no)

cleaned = cleaned.dropna() # drops any row without a timestamp

In [None]:
sdpd = Table.from_df(cleaned)

In [None]:
sdpd

### Is a given traffic stop for a moving violation?

Given a traffic stop, we will try to predict whether it was for a moving violation.
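
Before fitting anything, it can help to know how well a model would do by always guessing the most common label. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels (a made-up 70/30 split, not the real SDPD proportions); on the real data you would fit it on `X_train, y_train` and score it on `X_test, y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels standing in for is_moving_violation (assumed 70% ones).
y_toy = np.array([1.0] * 70 + [0.0] * 30)
# This baseline ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

A trained model is only interesting to the extent that it beats this number.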

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# X=attributes for training; y=labels

# is_moving/equipment_violation are the labels, so we need to take them out!
train_table = sdpd.drop('stop_id', 'is_moving_violation', 'is_equipment_violation')
X = train_table.values
y = sdpd.column('is_moving_violation')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier 
rfc = RandomForestClassifier()

In [None]:
rfc.fit(X_train, y_train)

### Check the model's accuracy:
* Is this accuracy even any good?
* What are the model's "false positives"?
    - the model guessed moving violation when it wasn't
* What are the model's "false negatives"?
    - the model guessed it wasn't a moving violation when it was
* What are the properties of examples the model thought were moving violations?   
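
One possible way to count false positives and false negatives is scikit-learn's `confusion_matrix`. The sketch below uses small hand-written arrays so the counts are easy to check; in this notebook `y_pred` would presumably come from `rfc.predict(X_test)` and `y_true` would be `y_test`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-written example labels (1 = moving violation, 0 = not).
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so flattening the 2x2 matrix gives tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print('false positives:', fp, ' false negatives:', fn)

# Boolean masks select the individual examples behind each count.
false_positive_mask = (y_pred == 1) & (y_true == 0)
false_negative_mask = (y_pred == 0) & (y_true == 1)
```

Indexing the test features with one of these masks (for example `X_test[false_positive_mask]`) gives the rows to inspect for common properties.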

In [None]:
# Accuracy of the model
rfc.score(X_test, y_test)

In [None]:
# Code

In [None]:
# Code

### Feature importances
In decision-tree-based models, you can look at the importance of each feature to the model.

* Which features are most important?
* Which values of those features are associated with moving violations?
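
Feature importances say which columns matter, not which values push the prediction one way or the other. A simple way to explore the second question is to compare the moving-violation rate across the values of a feature. The sketch below does this with a pandas `groupby` on made-up data; the `hour` values and labels are illustrative only.

```python
import pandas as pd

# Made-up stops: three at 8am, three at 10pm.
toy = pd.DataFrame({
    'hour': [8, 8, 8, 22, 22, 22],
    'is_moving_violation': [1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})

# The mean of a 0/1 label within each group is the fraction of moving violations.
rate_by_hour = toy.groupby('hour')['is_moving_violation'].mean()
print(rate_by_hour)
```

The same pattern applies to any important feature; for a numeric feature with many values, such as age, binning it first may make the rates easier to read.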

In [None]:
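importances = rfc.feature_importances_
# sort by importance, largest first; the loop avoids the name y, which holds the labels
for label, importance in sorted(zip(train_table.labels, importances), key=lambda pair: pair[1], reverse=True):
    print(label, '\t\t\t', importance)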

In [None]:
# Code

In [None]:
# Code

### Try other ML model types
* `GradientBoostingClassifier` is a sequential decision tree model
    - each subsequent tree tries to correct the misclassifications of the previous trees.
    - it has a `feature_importances_` attribute as well
* `LogisticRegression` is a regression-based classifier
    - the importance of the features can be accessed through the attribute `coef_`.
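
As a rough sketch of how these two models expose their feature information, the cell below fits both on synthetic data in which only the first column determines the label. The data and variable names are invented for illustration; in the notebook you would fit on `X_train, y_train` instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: three features, but the label depends only on the first.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(float)

gbc = GradientBoostingClassifier(random_state=0).fit(X_toy, y_toy)
logreg = LogisticRegression().fit(X_toy, y_toy)

# One non-negative importance per feature.
print(gbc.feature_importances_)
# For binary labels, coef_ has shape (1, n_features); sign gives direction.
print(logreg.coef_)
```

Note that `coef_` values depend on the scale of each feature, so comparing their magnitudes is most meaningful when the features have been put on a similar scale first.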

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

## Next steps: 

Try other prediction problems, such as those listed at the top of the notebook. As always, try to understand *why* the models are making the decisions that they are!