# [LEGALST-123] Lab 14: Other Prediction Algorithms & Feature Selection

In [26]:
from datascience import *
from collections import Counter
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# Introduction
In this lab, we will look at several classification algorithms beyond the ones you have seen in previous labs. To aid us in our prediction tasks, we will also learn how to use feature selection effectively so that we make the most of our data. For this purpose, we'll use the Nashville police stops dataset that you have seen in several labs so far.

- **Other Prediction Algorithms**: 
    - We will cover a handful of other classification algorithms, including support vector machines (SVMs), decision trees, random forests, and naive Bayes. For each algorithm, we'll provide some intuition about the circumstances in which it is most useful and how to judge its performance.

- **Feature Selection**:
    - We will learn about the process of feature selection, which lets us simplify our model and reduce the overfitting that we learned about in the previous lab.

<br/>

<hr style="border: 1px solid #fdb515;" />

## Other Prediction Algorithms

As stated previously, we'll be using the Nashville police stops dataset again, but we'll be using a cleaned sample of it that we obtained in a previous lab. This sample contains 833 rows and 30 columns.

In [2]:
stops = pd.read_csv("https://github.com/ds-modules/data/raw/main/nashville_cleaned.csv", index_col=0).reset_index()
stops.head(5)

| | index | date | time | location | lat | lng | precinct | reporting_area | zone | subject_age | ... | contraband_drugs | contraband_weapons | frisk_performed | search_conducted | search_person | search_vehicle | search_basis | reason_for_stop | vehicle_registration_state | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1840907 | 2010-04-18 | 13140.0 | BURGESS AVE & WHITE BRIDGE PIKE, NASHVILLE, TN... | 36.145004 | -86.85797 | 1.0 | 5103.0 | 113.0 | 23.0 | ... | | | False | False | False | False | | moving traffic violation | TN | |
| 1 | 492044 | 2015-01-19 | 19920.0 | DUE WEST AVE W & S GRAYCROFT AVE, MADISON, TN,... | 36.249187 | -86.734459 | 7.0 | 1797.0 | 723.0 | 45.0 | ... | | | False | False | False | False | | vehicle equipment violation | TN | tail light out |
| 2 | 431170 | 2015-01-15 | 1020.0 | S GALLATIN PIKE & MADISON BLVD, MADISON, TN, 3... | 36.254979 | -86.715246 | 7.0 | 1623.0 | 711.0 | 21.0 | ... | | | False | False | False | False | | moving traffic violation | TN | |
| 3 | 2066423 | 2013-05-17 | 62760.0 | CHARLOTTE PIKE & W HILLWOOD DR, NASHVILLE, TN,... | 36.139093 | -86.880533 | 1.0 | 5009.0 | 123.0 | 35.0 | ... | | | False | False | False | False | | vehicle equipment violation | TN | |
| 4 | 2899480 | 2010-09-01 | 28140.0 | BELL RD & DODSON CHAPEL RD, HERMITAGE, TN, 37076 | 36.16331 | -86.613147 | 5.0 | 9501.0 | 521.0 | 53.0 | ... | | | False | False | False | False | | moving traffic violation | TN | |


Before we start getting into some of our prediction tasks, we'll drop the columns with many null values that are irrelevant for this analysis.

In [3]:
stops.drop(labels = ['contraband_found', 'contraband_drugs', 'contraband_weapons', 'search_basis', 'notes'], axis = 1, inplace = True)

Next, we have to decide which feature of our data we would like to predict. Because we'll be practicing with more classification algorithms today, we want to choose a column that can easily be split into two or more classes, if it's not already presented like that. Let's take a look at our columns again below:

In [4]:
stops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   index                       833 non-null    int64  
 1   date                        833 non-null    object 
 2   time                        833 non-null    float64
 3   location                    833 non-null    object 
 4   lat                         833 non-null    float64
 5   lng                         833 non-null    float64
 6   precinct                    833 non-null    float64
 7   reporting_area              833 non-null    float64
 8   zone                        833 non-null    float64
 9   subject_age                 833 non-null    float64
 10  subject_race                833 non-null    object 
 11  subject_sex                 833 non-null    object 
 12  officer_id_hash             833 non-null    object 
 13  type                        833 non

Based on this information, we can see that some of the columns, such as `arrest_made` or `citation_issued`, are presented as `True` or `False` values. These may be good options for our classification tasks, as they also raise some interesting questions: based on the features of an individual row, can we accurately predict whether an arrest was made or a citation was issued?

Let's take a look at the distribution of `True/False` values for these columns below with the `.value_counts()` method:

In [5]:
stops['arrest_made'].value_counts()

False    816
True      17
Name: arrest_made, dtype: int64

In [6]:
stops['citation_issued'].value_counts()

False    660
True     173
Name: citation_issued, dtype: int64

In [7]:
stops['warning_issued'].value_counts()

True     684
False    149

Among these three columns, `citation_issued` has the most balanced split between true and false values (173 of 833 rows are `True`, compared with 17 `True` rows for `arrest_made` and 149 `False` rows for `warning_issued`). Prediction algorithms typically do better when the split between the two classes is more even, so we'll use that column for the binary prediction tasks below!
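To compare class balance directly, it can help to look at proportions rather than raw counts. The sketch below rebuilds the three label columns from the counts printed above (the toy DataFrame `labels` is an assumption made so that the example runs on its own; in the lab you would use the columns of `stops`) and reports the share of the less common class in each column.

```python
import pandas as pd

# Hypothetical stand-in for the label columns, rebuilt from the counts printed above
labels = pd.DataFrame({
    "arrest_made": [True] * 17 + [False] * 816,
    "citation_issued": [True] * 173 + [False] * 660,
    "warning_issued": [True] * 684 + [False] * 149,
})

# Share of the less common class in each column (closer to 0.5 = more balanced)
minority_share = {
    col: labels[col].value_counts(normalize=True).min()
    for col in labels.columns
}

for col, share in minority_share.items():
    print(f"{col}: {share:.3f}")
```

The column with the largest minority share is the most balanced of the three.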

In the regression lab, we discussed the importance of doing train-test splits on our data. In the cell below, we'll define our 2-D DataFrame of features in a variable `X_binary` (for now, we'll use the numerical columns in the DataFrame as features for our prediction) and our 1-D array of labels in another variable `y_binary`, and then perform the split:

In [14]:
# the features used to predict citation issuing status
X_binary = stops[['time', 'lat', 'lng', 'precinct', 'reporting_area', 'zone', 'subject_age']]

# whether a citation was issued or not
y_binary = stops['citation_issued']

# set the random seed
np.random.seed(10)

# split the data with 0.2 proportion for test size
# train_test_split returns 4 values: X_train, X_test, y_train, y_test 
X_binary_train, X_binary_test, y_binary_train, y_binary_test = train_test_split(X_binary, y_binary, test_size = 0.2)
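Because only about a fifth of the labels are `True`, a purely random split can leave the training and test sets with somewhat different class ratios. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class ratio in both splits, and a `random_state` argument that fixes the shuffle for that call. The toy data below (`X_toy`, `y_toy`) is hypothetical and only there to make the sketch runnable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% positive labels
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "time": rng.uniform(0, 86400, 100),
    "subject_age": rng.integers(16, 80, 100),
})
y_toy = pd.Series([True] * 20 + [False] * 80)

# random_state fixes the shuffle for this call;
# stratify keeps the True/False ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```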

### Support Vector Machines (SVMs)

TO BE COMPLETED: SVM models

### Decision Trees

Decision trees predict target values by creating a set of decision rules. The tree is made up of *nodes*, which constitute decision points, and *branches*, which represent the outcome of the decision. Here's an example using the [Titanic](https://www.kaggle.com/c/titanic/data) data set to predict whether or not a passenger survived the sinking of the ship. Nodes are represented by the text, and branches by lines (left branch = 'yes', right branch = 'no').

Starting at the *root node* (which in computer science, somewhat counterintuitively, is at the top), the data is split into different subgroups at each decision node going top to bottom. The very bottom nodes in the tree (the *leaves*) assign prediction values to the data. 

<img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png" style="width: 400px; height: 400px;" />

> 'sibsp' gives the number of siblings or spouses a passenger had on board. The left number under a leaf is the chance of survival for that subgroup; the right number is the percentage of passengers in that subgroup.

**Question 1:** Based on the decision tree above, what would the model predict would happen to an 8-year-old boy with 2 sisters and a brother? What would the chance of survival be for a 28-year-old married man?

*YOUR ANSWER HERE*

In [7]:
# SOLUTION

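# The boy is male, no older than 9.5, and has 3 siblings on board (sibsp = 3 > 2.5), so the tree predicts that he would not survive. The man (male, older than 9.5) would have a 17% chance of survival.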

The process is very similar to the other scikit-learn models you've used.
1. Create the `DecisionTreeRegressor()`. Set `max_depth` equal to 3.
2. Fit the regressor to `X_binary_train` and `y_binary_train` to create the model.


Note: The `max_depth` parameter of `DecisionTreeRegressor` constrains how many successive splits can occur along any path from the root to a leaf. For example, the Titanic tree had a max depth of 3 (i.e. you could pass through at most 3 branches when going from the root to a leaf).
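As a quick illustration of what `max_depth` does, the sketch below fits one depth-limited tree and one unrestricted tree and compares their sizes. The synthetic data (`X_demo`, `y_demo`) is an assumption used only to keep the example self-contained; it is not the police stops data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data, used only to illustrate the parameter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# One tree limited to depth 3, one with no depth limit
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)
unbounded = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# A tree of depth 3 can have at most 2**3 = 8 leaves
print(shallow.get_depth(), shallow.get_n_leaves())
print(unbounded.get_depth(), unbounded.get_n_leaves())
```

The unrestricted tree keeps splitting until its leaves are pure, which usually means it memorizes the training data.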

In the cell below, we perform the two-step process listed above. You'll notice that we explicitly set many of the optional parameters of the `DecisionTreeRegressor` constructor! If we didn't want to specify so many parameters, we could simply write `dt_reg = DecisionTreeRegressor()`, and it would also create a decision tree model -- it would just use the default values, including no limit on the depth of the tree.

In [16]:
# make the DecisionTreeRegressor
dt_reg = DecisionTreeRegressor(criterion = 'squared_error',  # how to measure the quality of a split
                               splitter = 'best',  # 'best' split, or 'random' for the best random split
                               max_depth = 3,  # maximum depth of the tree
                               min_samples_split = 2,  # samples needed to split a node
                               min_samples_leaf = 1,  # samples needed at a leaf
                               min_weight_fraction_leaf = 0.0,  # fraction of total sample weight needed at a leaf
                               max_features = None,  # number of features to consider per split (None = all)
                               max_leaf_nodes = None,  # max number of leaves (None = unlimited)
                               min_impurity_decrease = 1e-07)  # split only if impurity drops by at least this much

# fit the model
dt_model = dt_reg.fit(X_binary_train, y_binary_train)

Now that we have created and fitted our model, we can use the `.score()` method to evaluate it on both the training set and the test set.

In [17]:
# score the model
print(dt_model.score(X_binary_train, y_binary_train))
print(dt_model.score(X_binary_test, y_binary_test))

0.151622010144
-0.220985663765


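For a regressor, the `.score()` method returns R-squared (the coefficient of determination). The best possible score is 1.0; a score of 0 means the model does no better than always predicting the mean of the labels, and a negative score means it does worse than that baseline. Here the model explains little of the variation in the training set and does worse than the mean baseline on the test set, so it does not perform well. This model was just a base model that used all numerical features from our dataset (regardless of whether they are good predictors of citation status or not). As we explore more models in this lab and get into feature importance and selection, we can get a better understanding of why the decision tree doesn't perform very well in this case and see what we can do to improve it!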
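To make the meaning of the score concrete, here is a small sketch that computes R-squared by hand (one minus the ratio of the model's squared error to the squared error of always predicting the mean) and checks it against `.score()`. The data is synthetic and hypothetical: the 0/1 labels are generated independently of the features, so there is nothing real for the tree to learn. The helper `r_squared` is defined here for illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: 0/1 labels drawn independently of the features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = (rng.random(300) < 0.2).astype(float)

# Simple holdout: first 200 rows to fit, last 100 rows to evaluate
X_fit, y_fit = X_demo[:200], y_demo[:200]
X_new, y_new = X_demo[200:], y_demo[200:]

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_fit, y_fit)

def r_squared(y_true, y_pred):
    """Illustrative helper: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

manual = r_squared(y_new, tree.predict(X_new))
baseline = r_squared(y_new, np.full_like(y_new, y_new.mean()))

print(manual, tree.score(X_new, y_new))  # the two values agree
print(baseline)                          # predicting the mean scores exactly 0
```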

However, one benefit of decision trees that we saw above with the Titanic example is that they are very interpretable and easy to visualize. We can use the `export_graphviz` function below to get a better understanding of the splits that constitute our decision tree:

In [18]:
print(export_graphviz(dt_model, out_file = None, feature_names = X_binary_train.columns))

digraph Tree {
node [shape=box, fontname="helvetica"] ;
edge [fontname="helvetica"] ;
0 [label="reporting_area <= 9403.0\nsquared_error = 0.165\nsamples = 666\nvalue = 0.209"] ;
1 [label="subject_age <= 52.5\nsquared_error = 0.149\nsamples = 614\nvalue = 0.182"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="time <= 65280.0\nsquared_error = 0.167\nsamples = 522\nvalue = 0.213"] ;
1 -> 2 ;
3 [label="squared_error = 0.192\nsamples = 367\nvalue = 0.259"] ;
2 -> 3 ;
4 [label="squared_error = 0.093\nsamples = 155\nvalue = 0.103"] ;
2 -> 4 ;
5 [label="reporting_area <= 9296.0\nsquared_error = 0.011\nsamples = 92\nvalue = 0.011"] ;
1 -> 5 ;
6 [label="squared_error = 0.0\nsamples = 91\nvalue = 0.0"] ;
5 -> 6 ;
7 [label="squared_error = 0.0\nsamples = 1\nvalue = 1.0"] ;
5 -> 7 ;
8 [label="lat <= 36.216\nsquared_error = 0.25\nsamples = 52\nvalue = 0.519"] ;
0 -> 8 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
9 [label="time <= 12570.0\nsquared_error = 0.23\n
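The Graphviz output above is meant to be rendered as a picture. If you only want a quick, readable summary of the splits, scikit-learn also provides `export_text` in `sklearn.tree`, which prints the rules as indented plain text. The sketch below uses a small hypothetical dataset (`X_demo`, `y_demo`) in which the label depends only on `subject_age`, so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: the label depends only on subject_age
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "time": rng.uniform(0, 86400, 200),
    "subject_age": rng.integers(16, 80, 200),
})
y_demo = (X_demo["subject_age"] < 40).astype(float)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One line per node; indentation shows the depth in the tree
rules = export_text(tree, feature_names=list(X_demo.columns))
print(rules)
```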

### Random Forests

A **Random Forest** is an example of an **ensemble method**, a family of prediction algorithms that we'll cover in a future lab. Essentially, ensemble methods work by building multiple estimators and then using the average (or majority) prediction as the final one. Random Forests do this by creating multiple decision trees (a 'forest' of them, if you will), each trained on a sample of data drawn at random with replacement from the given set. Additionally, when each tree is constructed, not every feature is considered as a candidate on which to split the tree at each decision point.

By adding some randomization into the subsets and features that are considered by each model, then averaging the predictions across models, Random Forest can typically produce a model that is better at generalization.
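The two sources of randomness described above can be sketched by hand with ordinary decision trees. This is a simplified illustration under stated assumptions (synthetic data, 25 trees, 2 candidate features per split), not a reproduction of how `RandomForestRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices at random with replacement
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features limits how many features are candidates at each split
    tree = DecisionTreeRegressor(max_features=2, random_state=i)
    trees.append(tree.fit(X_demo[rows], y_demo[rows]))

# The "forest" prediction is the average of the individual tree predictions
all_preds = np.stack([t.predict(X_demo[:5]) for t in trees])
forest_pred = all_preds.mean(axis=0)
print(forest_pred)
```

Each tree sees a slightly different dataset and a restricted set of candidate features, so the trees make different errors, and averaging smooths those errors out.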

To create a Random Forest model using scikit-learn, we can use the `RandomForestRegressor` class!

**Question 2:** In the cell below, create a base `RandomForestRegressor()` model (i.e. don't worry about passing in any hyperparameters!) and fit it to the training data. Then, use the `score()` method on our training and testing sets to see how the model performs!

In [19]:
# YOUR CODE HERE
# create the RandomForestRegressor model
# rf_reg = ...

# fit the model to our training data
# rf_model = ...

# print the scores of the model on training and testing data
# print(...)
# print(...)


# SOLUTION
# create the RandomForestRegressor model
rf_reg = RandomForestRegressor()

# fit the model to our training data
rf_model = rf_reg.fit(X_binary_train, y_binary_train)

# print the scores of the model on training and testing data
print(rf_model.score(X_binary_train, y_binary_train))
print(rf_model.score(X_binary_test, y_binary_test))

0.853639651618
-0.206872291022


You likely noticed that this model performed well on our training set, but just as badly as the decision tree on our testing set! A large gap like this between training and test scores is a sign of overfitting. As before, we'll see in the second part of this lab how to use feature engineering properly and pick the best features to improve these models.

Unfortunately, because our Random Forest model averages the predictions of many different decision trees, it's not as easy to interpret or visualize! However, one benefit of random forests is that they typically overfit the training set less than a single decision tree does, and so tend to perform better on the test set.

Thus, when deciding which model you would like to use for your prediction tasks, **it's important to consider how important interpretability is for your situation.**

### Naive Bayes

TO BE COMPLETED: Naive Bayes predictions

<br/>

<hr style="border: 1px solid #fdb515;" />

## Feature Selection

Before we get started with the process of feature selection, we'll use the process of **one-hot encoding** as a sort of feature engineering for our categorical variables. For example, features such as `subject_race`, `subject_sex`, and `violation` could end up being useful features for our classification models, but we currently can't use them because they're not in a numerical format!

To do this, we can use a helpful function from `pandas` called `pd.get_dummies` to obtain one-hot-encoded (also known as "dummy") variables for these features.

In [20]:
stops_w_dummies = pd.get_dummies(data = stops, columns = ['subject_sex', 'subject_race', 'violation'])
stops_w_dummies.columns

Index(['index', 'date', 'time', 'location', 'lat', 'lng', 'precinct',
       'reporting_area', 'zone', 'subject_age', 'officer_id_hash', 'type',
       'frisk_performed', 'search_conducted', 'search_person',
       'search_vehicle', 'reason_for_stop', 'vehicle_registration_state',
       'subject_sex_female', 'subject_sex_male',
       'subject_race_asian/pacific islander', 'subject_race_black',
       'subject_race_hispanic', 'subject_race_other', 'subject_race_unknown',
       'subject_race_white', 'violation_investigative stop',
       'violation_moving traffic violation', 'violation_parking violation',
       'violation_registration', 'violation_safety violation',
       'violation_seatbelt violation',
       'violation_vehicle equipment violation'],
      dtype='object')

Taking a look at our columns for this new DataFrame with our dummy variables, we see that there is a column for each possible value within the categorical variables that we one-hot-encoded. To see what these columns look like, let's take a look at the `"subject_sex_male"` column below:

In [21]:
stops_w_dummies[['index', 'subject_sex_male', 'citation_issued']]

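| | index | subject_sex_male | citation_issued |
|---|---|---|---|
| 0 | 1840907 | 1 | False |
| 1 | 492044 | 0 | False |
| 2 | 431170 | 0 | False |
| 3 | 2066423 | 1 | False |
| 4 | 2899480 | 0 | False |
| ... | ... | ... | ... |
| 828 | 2475346 | 0 | False |
| 829 | 211168 | 0 | True |
| 830 | 2697517 | 0 | False |
| 831 | 2782384 | 1 | False |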


Now, instead of having one column that encodes the subject's sex as a category, we have two different columns (`"subject_sex_male"` and `"subject_sex_female"`), each containing 1's and 0's that indicate whether or not the subject belongs to that category. We can now use these different categorical variables in our model creation!
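Here is the same idea on a tiny, hypothetical DataFrame (`toy`), so you can see exactly what `pd.get_dummies` produces. Passing `dtype=int` is a choice made for this sketch to get 1/0 values; depending on your pandas version, the default output may be `True`/`False` instead.

```python
import pandas as pd

# Hypothetical four-row example with one categorical column
toy = pd.DataFrame({
    "subject_sex": ["male", "female", "female", "male"],
    "subject_age": [23, 45, 21, 35],
})

# The categorical column is replaced by one 0/1 column per category
encoded = pd.get_dummies(toy, columns=["subject_sex"], dtype=int)
print(encoded)
```

Every row has exactly one 1 across the two new columns, because each subject falls into exactly one category.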

First, we'll drop the remaining non-numerical columns so that the DataFrame contains only numerical or one-hot-encoded columns:

In [22]:
stops_w_dummies.drop(['date', 'location', 'officer_id_hash', 'type', 'outcome', 'reason_for_stop', 'vehicle_registration_state'], axis = 1, inplace = True)

TODO: need to find some way of using .corr() or creating heatmap, but not possible with this many columns currently. may need to reevaluate columns

TODO: after looking into feature importance / correlations, can then copy over a lot of the methods from old lab 14 (feature selection lab). just needs some reframing for this data.

<br/>

<hr style="border: 1px solid #fdb515;" />

## Hyperparameter Tuning & Cross Validation

<br/>

<hr style="border: 1px solid #fdb515;" />

## Congrats! You've completed Lab 14.

Hopefully, you now understand how to create different types of classification models and fit them to your data, as well as the different types of feature selection and hyperparameter tuning we can perform to improve them. 

<br/>

<hr style="border: 1px solid #fdb515;" />