## [LEGALST-123] Lab 14: More Classifiers--Decision Tree, SVM
---
<img src="https://c.pxhere.com/photos/df/5e/uganda_signs_outdoor_wooden_sign_direction_this_wy_that_way-341892.jpg!d" style="width: 600px; height: 400px;" />


This lab used to come at the very end of the course, but it makes sense to include it a bit earlier, before we get into text, where we will explore yet more methods for classification (including predicting words). This lab will introduce more flexible methods of prediction that do not assume as much about the data they try to fit and are not bound to a fixed functional form the way linear models are. We will see if tree-based models or a linear support vector machine model are any better at the challenging task of predicting whether or not a traffic stop resulted in a search.

In this lab we will use the 2013 stop data with the added demographic information from the ACS that we then joined using the spatial join in the previous lab. Here is how the lab will proceed. 
1. add a couple of interaction terms to the features we used in Lab 13 to try to get a better fit for the logistic regression classifier
2. try a decision tree classifier
3. try tree-based models (random forest and boosted trees)
4. try a support vector machine model to separate classes 

Don't be discouraged if your classifiers don't work that well--it's not you, it's the data-generating process!

*Estimated Time: 60 minutes*


In [None]:
# **Dependencies:**
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
!pip install geopandas
import geopandas as gpd
import fiona
import plotly.express as px
import json
%matplotlib inline
!pip install scikit-learn
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split
import sklearn.linear_model as linear
import sklearn.svm as svm
import sklearn.tree as tree
import sklearn.ensemble as ensemble
import sklearn.neighbors as neighbors
import sklearn.metrics as metrics

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# import ipywidgets as widgets

# set the random seed so that results reproduce
np.random.seed(10)

## The Data: Nashville Police Stops 
#### (with added ACS demographic information for location of each stop)

This lab will use the dataset we put together in the "Performance Metrics" lab, where we joined the 2013 Nashville traffic stop data with the demographic data on census block groups from the American Community Survey. Remember that this was GeoPandas data. It is a big file so be patient. We will read it into a dataframe first. Then we will have to fix the column labels, which were truncated by GeoPandas when the dataframe was written out to a file.

In [None]:
# get the shapefile from the Git repo ds-modules/data
path = "https://github.com/ds-modules/data/raw/main/LS123_Nashville_2013_block_groups_stop_data.shp"
Nashville_stops_2013geo = gpd.read_file(path)
Nashville_stops_2013geo.info()

In [None]:
# interpreter on Datahub has problems with dataframes so this may not display right without restarting kernel
Nashville_stops_2013geo.head()

In [None]:
Nashville_stops_2013geo.columns

In [None]:
# we need to restore the column names that were truncated when the file was written out; those names are
# in the df we made with the spatial join in Lab 13
column_dict = {
    'med_inc_20':"median_income_2013",
    'hispanic_p':'hispanic_pct',
    'asianpac_1':'asianpacific_pct',
    'search_con':'search_conducted',
    'violation_':'violation_investigative_stop',
    'violatio_1':'violation_moving_traffic',
    'violatio_2':'violation_vehicle_equip',
    'subject_ag':'subject_age',
    'subject_se':'subject_sex_male',
    'subject_ra':'subject_race_white',
    'subject__1':'subject_race_hispanic',
    'subject__2':'subject_race_black',
    'citation_i':'citation_issued'
    }
Nashville_stops_2013geo.rename(columns=column_dict, inplace=True)
Nashville_stops_2013geo.info()

## 1. Improving the logistic classifier for search
In Lab 13 we created a logistic classifier for predicting searches after traffic stops. It did not perform all that well, and the naive prediction of "no search" was much more accurate. Nevertheless, we can perhaps do better than we did in Lab 13 by creating new features based on a theory from criminology: police are even more likely to be suspicious of "race out of place" than they are to be driven by suspicion based on race alone. For example, police will default to more suspicion of a black motorist in a white neighborhood, or a white motorist in a black neighborhood, than they would if the motorists were "in place" in a neighborhood that matched the officer's construction of their race.

We can operationalize this idea with interaction terms, which are the element-wise products of two features. Here we can construct an interaction term `wht_in_blk` (for stops of white motorists) as `wht_in_blk = subject_race_white * black_pct` and do the same thing for an interaction term `blk_in_wht`. When you include interaction terms you also need to include their component features, so that the coefficient on the interaction term captures the effect of the combination over and above the separate effect of each component. You do not necessarily need to use these interaction terms; the logic is that these may apply to the world of Nashville traffic stops.
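
As a minimal sketch of the idea, the cell below builds an interaction term on a made-up three-row dataframe. The values are invented for illustration and are not drawn from the Nashville data; only the column names `subject_race_white` and `black_pct` come from the text above, and whether `black_pct` is stored as a fraction or a percentage in the real file is an assumption here.

```python
import pandas as pd

# toy data (hypothetical values, not from the Nashville file)
toy = pd.DataFrame({
    "subject_race_white": [1, 0, 1],   # indicator: 1 if the motorist is white
    "black_pct": [0.8, 0.5, 0.1],      # share of black residents in the block group
})

# element-wise product: nonzero only for white motorists, and larger
# where the block group's black population share is larger
toy["wht_in_blk"] = toy["subject_race_white"] * toy["black_pct"]
print(toy)
```

Because the indicator is 0 or 1, the interaction term simply switches the neighborhood share on for one group of motorists and off for everyone else.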

First, we will create a dataframe with the features we want to use in our analysis; for purposes of comparison you will probably want to use the features from Lab 13, then add the interaction terms to them. Note that we changed the column names above. After that split and scale the data as before, and then train up a logistic regression classifier for searches. We may do better, but don't get your hopes up.
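
The two-step split and the scaling step can be sketched on synthetic data as follows. The arrays and the names `X_toy`, `y_toy`, and `toy_scaler` are stand-ins invented for this illustration, not the lab's data or required variable names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: 100 rows, 2 features (not the Nashville features)
rng = np.random.default_rng(10)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
y_toy = rng.integers(0, 2, size=100)

# step 1: hold out 20% of all rows as the test set
X_rest, X_te, y_rest, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)
# step 2: 0.25 of the remaining 80% is 20% of the original, leaving 60% for training
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=10)

# learn the mean and standard deviation from the training rows only,
# then apply that same transformation to all three sets
toy_scaler = StandardScaler().fit(X_tr)
X_tr_s = toy_scaler.transform(X_tr)
X_val_s = toy_scaler.transform(X_val)
X_te_s = toy_scaler.transform(X_te)

print(X_tr_s.shape, X_val_s.shape, X_te_s.shape)
```

Fitting the scaler on the training set alone keeps information about the validation and test rows from leaking into the model.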

In [None]:
# select the features you used in Lab 13 making sure to include components for the interaction terms

reasonable_features = [...]
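# .copy() makes reasonable_df an independent dataframe, so adding columns below does not
# trigger a pandas SettingWithCopyWarning
reasonable_df = Nashville_stops_2013geo[reasonable_features].copy()

# function to multiply two dataframe columns to make interaction terms; you may use another method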
def column_prod(ser1, ser2):
    prods = []
    arr1 = ...
    arr2 = ...
    prods = ...
    return prods

reasonable_df['wht_in_blk'] = ...
reasonable_df['blk_in_wht'] = ...

reasonable_df.info()

In [None]:
# split the data as before with the training proportion as 0.6, and validation & test as 0.2 each
# remember that you do it in two steps

y = reasonable_df['search_conducted']
X = ...
# split the sampled data into training and test sets 
X_train, X_test, y_train, y_test = train_test_split(...)
# split the sampled training set into training and validation sets
X_train, X_validate, y_train, y_validate = ...
# reminder of size of training, validation, and test sets
print("X_train shape: ", X_train.shape)
print("X_validate shape: ", X_validate.shape)
print("X_test shape: ", X_test.shape)

In [None]:
# scale the data on the training set and apply the scaling to validation and test sets


scaler = StandardScaler()
scaler.fit(...)
X_train = ...
X_validate = ...
X_test = ...

In [None]:
# construct and then fit the logistic regression model to the training set
# then report the accuracy on the training and validation sets

logit_cf = linear.LogisticRegression(penalty='l2', class_weight='balanced')
logit_cf.fit(...)

print("Accuracy on training set: ", ...)
print("Accuracy on validation set: ", ...)

In [None]:
# get model prediction and the probability estimate for search for each observation in validation set
predictions = ...
probabilities = ... # returns an array 

In [None]:
# make a histogram of the probabilities of search == True the model generates, noting that 
# .predict_proba returns an array that looks like [[probability negative, probability positive]]
import plotly.graph_objects as go
fig = go.Figure(data=[go.Histogram(x=...)])
fig.show()

**Question**

What does the distribution of the probability of the positive class tell us about how well the classifier works? 

_your answer here_



In [None]:
# let's illustrate the prediction accuracy with a confusion matrix for the validation set

logit_cf_matrix = ...
logit_cf_matrix

In [None]:
# display the confusion matrix as a heatmap

logit_cf_cm = pd.DataFrame(...)
logit_cf_cm = logit_cf_cm.rename(index=str, columns={0:'False', 1:'True'})
logit_cf_cm.index = ['False', 'True']
plt.figure(figsize = (6,4))
sns.set(font_scale=1.4)#for label size
sns.heatmap(logit_cf_cm, 
           annot=True,
           fmt = '9.0f',
           annot_kws={"size": 14})

plt.title("Logistic Classifier Confusion Matrix for Search Conducted")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# find the logistic regression classifier's recall and precision and AUC-ROC score on the validation set
# hint: use the sklearn 'metrics' module

logit_cf_recall = ...
logit_cf_precision = ...
logit_cf_auc_roc_score = ...
print("logistic regression classifier recall: ", logit_cf_recall)
print("logistic regression classifier precision: ", logit_cf_precision)
print("logistic regression classifier AUC-ROC score: ", logit_cf_auc_roc_score)

In [None]:
# show which features are most predictive of search (remember that the features have been standardized so you
# cannot interpret the coefficients as the actual increase in probability) by showing their coefficients

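# X (the unscaled feature dataframe) supplies the names, since the scaled X_train may be a plain NumPy array
feature_coefs = zip(list(X.columns),list(logit_cf.coef_[0,:]))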
list(feature_coefs)

**Note:** At this point you could also see if the logistic classifier gives different average predictions and probabilities for different population groups, as in Lab 13. 

**Questions**
1. Did the logistic classifier with the additional features do any better than the logistic classifier in the preceding lab? Why or why not?
2. Is there anything interesting about the relative weights of the coefficients for the standardized features? Does it seem like there might be something to the "race out of place" hypothesis?

_your answers here_

1. 

2. 


### Ensemble methods
Over the course of this class, we've seen that a single model may have significant trouble making accurate predictions. **Ensemble methods** seek to improve on the single-model method by combining the predictions from multiple base models.

This lab will cover two types of ensemble methods, averaging and boosting, using decision trees as our base model. One of the strengths of ensemble methods, though, is their ability to solve many kinds of problems using many different kinds of base models.


---
## 2. Decision Trees

[Decision trees](https://scikit-learn.org/stable/modules/tree.html#tree) predict target values by creating a set of decision rules and do not make as many assumptions about the data as parametric methods like linear regression, which make strong assumptions about the functional form of the prediction algorithm. Decision tree models are very flexible when it comes to fitting data, but by the same token often have the problem of overfitting to the data they are trained on.

The tree is made up of *nodes*, which constitute decision points, and *branches*, which represent the outcome of the decision. Here's an example using the [Titanic](https://www.kaggle.com/c/titanic/data) data set to predict whether or not a passenger survived the sinking of the ship. Nodes are represented by the text, and branches by lines (left branch = 'yes', right branch='no').

Starting at the *root node* (which in computer science, somewhat counterintuitively, is at the top), the data is split into different subgroups at each decision node going top to bottom. The very bottom nodes in the tree (the *leaves*) assign prediction values to the data. Trees can be used to predict continuous outcomes (see sklearn's [`DecisionTreeRegressor()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)), but they are commonly used to predict the class of an observation. We will work with decision tree classifiers.

<img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png" style="width: 400px; height: 400px;" />

> *'sibsp' gives the number of siblings or spouses a passenger had on board. The left number under a leaf is the chance of survival for that subgroup; the right number is the percentage of passengers in that subgroup.*


**QUESTION:** Based on this decision tree, what would the model predict would happen to an 8-year-old boy with 2 sisters and a brother? What would the chance of survival be for a 28-year-old married man?

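**ANSWER:** The boy would be predicted to die: he is male and younger than 9.5, but with three siblings aboard his `sibsp` is above 2.5, which puts him in a leaf with a 5% chance of survival. The man would also be predicted to die, with a 17% chance of survival.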
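
One way to see that a decision tree is just a set of nested rules is to write the Titanic tree out by hand. The sketch below assumes the thresholds and leaf values as they appear in the figure above (male or not, age 9.5, `sibsp` 2.5); the function name `titanic_tree` is made up for illustration.

```python
def titanic_tree(is_male, age, sibsp):
    """Hand-coded version of the pictured tree.

    Returns (predicted outcome, survival probability at that leaf).
    Thresholds and leaf values are read off the figure, so treat them as approximate.
    """
    if not is_male:
        return ("survived", 0.73)
    if age > 9.5:
        return ("died", 0.17)
    if sibsp > 2.5:
        return ("died", 0.05)
    return ("survived", 0.89)

# an 8-year-old boy with three siblings aboard, and a 28-year-old man with a spouse aboard
print(titanic_tree(is_male=True, age=8, sibsp=3))
print(titanic_tree(is_male=True, age=28, sibsp=1))
```

Each `if` corresponds to a decision node and each `return` to a leaf, so the depth of the nesting is the depth of the tree.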

Here we will use sklearn's [`DecisionTreeClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) to predict whether or not the Nashville police searched incident to a traffic stop. Keep your expectations reasonable, since here we are creating but a single tree.

1. Create the `DecisionTreeClassifier()`. Set `max_depth` equal to 4.
2. Fit the classifier to `X_train` and `y_train` to create the model


Note: There are a large number of model parameters to set. The `criterion=` parameter sets the measure of the quality of a split at a node in the tree; Gini impurity is the probability that a random draw from the node will be misclassified if it is labeled at random according to the node's class proportions. The `splitter=` parameter is the strategy for choosing the split at each node. The `max_depth=` parameter limits how many successive splits lie between the root and any leaf. For example, the Titanic tree had a max depth of 3 (i.e. you could pass through at most 3 branches when going from the root to a leaf). A decision tree with more layers may give terminal nodes (leaves) that are more homogeneous, at a greater risk of overfitting. Here we will set the depth to 4 (back in the day they used to be called 'ply') so that we have a tractable visualization of the resulting tree.
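
To get a feel for what `max_depth` does, here is a small sketch on synthetic data generated with `make_classification` (a stand-in, not the stop data). It compares a tree capped at depth 4 with an unconstrained one; the variable names are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data (a stand-in, not the stop data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=3,
                                     random_state=10)

shallow = DecisionTreeClassifier(max_depth=4, random_state=10).fit(X_demo, y_demo)
deep = DecisionTreeClassifier(max_depth=None, random_state=10).fit(X_demo, y_demo)

# get_depth() reports the depth actually reached; it can never exceed max_depth
print("shallow depth:", shallow.get_depth(), "leaves:", shallow.get_n_leaves())
print("deep depth:   ", deep.get_depth(), "leaves:", deep.get_n_leaves())

# an unconstrained tree keeps splitting until its leaves are pure,
# so it fits the training data at least as well as the shallow tree
print("shallow training accuracy:", shallow.score(X_demo, y_demo))
print("deep training accuracy:   ", deep.score(X_demo, y_demo))
```

High training accuracy from the deep tree is not evidence of a better model; what matters is how each tree scores on data it has not seen.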

In [None]:
# make the DecisionTreeClassifier
dt_cf = DecisionTreeClassifier(criterion='gini',  # how to measure fit
                               splitter='best',  # or 'random' for random best split
                               max_depth=4,  # how deep tree nodes can go
                               min_samples_split=2,  # samples needed to split node
                               min_samples_leaf=1,  # samples needed for a leaf
                               min_weight_fraction_leaf=0.0,  # weight of samples needed for a node
                               max_features=None,  # number of features to look for when splitting
                               max_leaf_nodes=None,  # max nodes
                               random_state=10,
                               min_impurity_decrease=1e-07)  # early stopping

# fit the model
model = ...

The `feature_importances_` show how much weight is given to each feature in the model. Higher numbers correspond to more important features. The importances correspond to features by their index: the importance weight in position 1 goes with feature 1, and so on.
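
A toy illustration of how importances line up with columns: in the sketch below one made-up feature (`signal`) fully determines the label and the other (`noise`) is unrelated to it, so the tree should put essentially all of its importance on the first. The data and names are invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(10)
# 'signal' determines the label; 'noise' is unrelated to it (both columns are made up)
toy_X = pd.DataFrame({"signal": rng.normal(size=300), "noise": rng.normal(size=300)})
toy_y = (toy_X["signal"] > 0).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=4, random_state=10).fit(toy_X, toy_y)

# importances come back in the same order as the columns the tree was trained on
importance_df = pd.DataFrame({"feature": toy_X.columns,
                              "importance": toy_tree.feature_importances_})
print(importance_df)
print("importances sum to:", importance_df["importance"].sum())
```

The importances are normalized to sum to 1, so they describe each feature's relative share of the impurity reduction rather than an effect size.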

In [None]:
# show the feature importance, in a data frame for convenience
pd.DataFrame({'feature': ...,
             'importance': ...})

Now, complete the final steps:

3. Check the model's accuracy on the training and validation data using `.score`
4. Show the confusion matrix for the classifier
5. Show the recall, precision, and AUC-ROC scores for the classifier
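
The metrics in steps 4 and 5 can be sketched on a handful of made-up labels, using the `sklearn.metrics` functions `confusion_matrix`, `recall_score`, `precision_score`, and `roc_auc_score`. The eight labels and probabilities below are invented so that the arithmetic is easy to check by hand.

```python
import numpy as np
import sklearn.metrics as metrics

# made-up labels and predicted probabilities of the positive class for 8 observations
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_pos = np.array([0.10, 0.20, 0.30, 0.60, 0.90, 0.70, 0.40, 0.35])
y_pred = (p_pos > 0.5).astype(int)   # classify as positive above a 0.5 threshold

# rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(metrics.confusion_matrix(y_true, y_pred))

# recall = TP / (TP + FN); precision = TP / (TP + FP)
print("recall:   ", metrics.recall_score(y_true, y_pred))
print("precision:", metrics.precision_score(y_true, y_pred))

# AUC-ROC is computed from the probabilities, not from the thresholded predictions
print("AUC-ROC:  ", metrics.roc_auc_score(y_true, p_pos))
```

Note that the AUC-ROC call takes the positive-class probabilities, while recall and precision take the hard class predictions.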

In [None]:
# score the model
print("accuracy on training set: ", ...)
print("accuracy on validation set: ", ...)

# generate predicted class and probability of search == True for the validation set
predictions = ...
probabilities = ...

# illustrate the prediction accuracy with a confusion matrix for the validation set

dtree_cf_matrix = metrics.confusion_matrix(y_validate, predictions)
dtree_cf_cm = pd.DataFrame(dtree_cf_matrix, range(2), range(2))
dtree_cf_cm = dtree_cf_cm.rename(index=str, columns={0:'False', 1:'True'})
dtree_cf_cm.index = ['False', 'True']
plt.figure(figsize = (6,4))
sns.set(font_scale=1.4)#for label size
sns.heatmap(dtree_cf_cm, 
           annot=True,
           fmt = '9.0f',
           annot_kws={"size": 12})

plt.title("Decision Tree Classifier Confusion Matrix for Search Conducted")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# find the decision tree classifier's recall, precision, and AUC-ROC scores on the validation set

dtree_cf_recall = ...
dtree_cf_precision = ...
dtree_cf_auc_roc_score = ...
print("decision tree classifier recall: ", dtree_cf_recall)
print("decision tree classifier precision: ", dtree_cf_precision)
print("decision tree classifier AUC-ROC score: ", dtree_cf_auc_roc_score)

#### Questions
1. How did the decision tree perform relative to both the logistic regression and naive classifiers? 
2. How would you characterize the bias of the decision tree classifier versus that of the logistic regression classifier? Is there a reason to prefer one over the other?

_your answer here_

1. 

2. 

### Visualization


One of the great things about decision trees is that, unlike many other models, it's relatively easy to visualize what is going on inside the model. The graphviz library can show the structure of the tree, as well as what decision is being made at each node.

Due to some datahub limitations we can't use graphviz directly through our notebook. However, we can use the [Webgraphviz site](http://webgraphviz.com/) as a workaround. Run the cell below to generate the graphviz data for the model you just trained. Then, copy the *entire* output of the cell, click the link to Webgraphviz, replace the sample text in the text box with your copied data, and hit the button to generate the graph.

In [None]:
# get the graphviz data
print(export_graphviz(model, out_file = None, feature_names = X.columns))

Some notes on the visualization:

- the top line of every node shows the decision that splits the data at that node
- `samples` is the number of samples (rows) that are going through that node on the way down the tree
- `value` is the class memberships expressed as `[negative, positive]` at that node
- `gini` is the Gini impurity G for that node 
$$G = \sum_{i=1}^{C} p(i) * (1-p(i))$$
where C is the total number of classes and $p(i)$ is the probability of selecting class $i$
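
The Gini formula is short enough to compute by hand. The helper below, `gini_impurity`, is a name invented for this sketch; it takes the class counts shown in a node's `value` and returns the node's impurity.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()          # class proportions p(i)
    return float(np.sum(p * (1 - p)))  # sum over classes of p(i) * (1 - p(i))

print(gini_impurity([50, 50]))   # evenly mixed node
print(gini_impurity([100, 0]))   # pure node
print(gini_impurity([90, 10]))   # mostly one class
```

For two classes the impurity runs from 0 (a pure node) to 0.5 (an even mix), which is a useful yardstick when reading the `gini` values in the visualization.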

#### Questions
1. How is the decision tree making the splits?
2. Do you think a deeper tree would give you a better classifier? Why or why not?

_your answers here_

1. 

2. 


---
## 3. Averaging Methods 

Although we did not have an overfitting problem (no doubt due to the data generating process for searches and the features that are actually available to us), tree-based models usually overfit their training data. We can try to address this overfitting issue using **averaging** ensemble methods. The intuition behind averaging is to build multiple estimators, then use the average of all their predictions as the final prediction.

### Random Forest

**Random Forest** accomplishes this by creating multiple decision trees (a 'forest' of them, if you will), each trained on a sample of data drawn at random with replacement from the given set. Additionally, when each tree is constructed, not every feature is considered as a candidate on which to split the tree for each decision point.

By adding some randomization into the subsets and features that are considered by each model, then averaging the predictions across models, [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#random-forests-and-other-randomized-tree-ensembles) can typically produce a model that is better at generalization.
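
The "averaging" in a random forest can be checked directly. The sketch below fits a small forest to synthetic data (a stand-in, not the stop data), averages the class probabilities of the individual trees by hand, and compares the result with the forest's own `predict_proba`; scikit-learn documents the forest's probabilities as the mean of the trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data, not the stop data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=10)

forest = RandomForestClassifier(n_estimators=25, random_state=10).fit(X_demo, y_demo)

# each fitted tree is available in .estimators_; average their class probabilities by hand
tree_probs = np.mean([t.predict_proba(X_demo) for t in forest.estimators_], axis=0)

# the forest's own probabilities match the hand-computed average
print(np.allclose(tree_probs, forest.predict_proba(X_demo)))
```

Any single tree in `forest.estimators_` may overfit its bootstrap sample; averaging across trees is what smooths those individual errors out.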

**EXERCISE:** Create an out-of-the-box [`RandomForestClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) (i.e. use all the default settings for the model hyperparameters, like `n_estimators`), then fit it to the data and get the model's scores on the training and validation data. As above, also report a confusion matrix (as a dataframe) and the recall, precision, and AUC-ROC score for the Random Forest Classifier on the validation set. 

In [None]:
# create the classifier
rf_cf = RandomForestClassifier(n_estimators=100,  # number of trees
                               criterion='gini',  # how to measure fit
                               max_depth=None,  # how deep tree nodes can go
                               min_samples_split=2,  # samples needed to split node
                               min_samples_leaf=1,  # samples needed for a leaf
                               min_weight_fraction_leaf=0.0,  # weight of samples needed for a node
                               max_features='sqrt',  # max feats
                               max_leaf_nodes=None,  # max nodes
                               n_jobs=1, # how many to run parallel
                               random_state=10,
                               class_weight=None) #using default even though classes unbalanced
# fit the data 
model = ...
#score the model on the training and validation data
print('mean accuracy on training set: ', ...)
print('mean accuracy on validation set: ', ...)

# generate predicted class and probability of search == True for the validation set
predictions = ...
probabilities = ...

# make confusion matrix
rf_cf_matrix = ...
rf_cf_cm = pd.DataFrame(rf_cf_matrix, range(2), range(2))
rf_cf_cm = rf_cf_cm.rename(index=str, columns={0:'Pred_False', 1:'Pred_True'})
rf_cf_cm.index = ['False', 'True']
print(rf_cf_cm)

rf_cf_recall = ...
rf_cf_precision = ...
rf_cf_auc_roc_score = ...
print("random forest classifer recall: ",rf_cf_recall)
print("random forest classifer precision: ", rf_cf_precision)
print("random forest classifer AUC-ROC score: ", rf_cf_auc_roc_score)

#### Question
1. How does Random Forest compare to Decision Tree on our metrics?

_your answer here_




### Hyperparameter tuning
As with most models, we can get better results by tuning the hyperparameters of the model. Let's try changing three: `max_depth`, `n_estimators`, and `min_samples_split`.

#### Grid Search

The process of choosing good hyperparameters can be tedious, involving a lot of trial and error. Fortunately, sklearn has a tool to help.

A [**grid search**](https://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search) tests different possible parameter combinations to see which combination yields the best results. The grid is formatted as a dictionary, where the keys are the parameter names and the values are the different values you want to try for each parameter.
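
To see what "every combination" means in practice, sklearn's `ParameterGrid` can enumerate a grid without fitting anything. The grid below, `demo_grid`, is a small illustrative example written with explicit lists.

```python
from sklearn.model_selection import ParameterGrid

# three values for n_estimators, two for max_depth, one for min_samples_split
demo_grid = {"n_estimators": [50, 100, 150],
             "max_depth": [1, 6],
             "min_samples_split": [2]}

# ParameterGrid enumerates every combination that a grid search would try
combos = list(ParameterGrid(demo_grid))
print(len(combos))      # 3 * 2 * 1 combinations
for combo in combos:
    print(combo)
```

The number of combinations is the product of the list lengths, and with 5-fold cross-validation each combination is fit five times, so grids grow expensive quickly.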

In [None]:
# create a parameter grid to look for optimal values for n_estimators, max_depth, and min_samples_split
param_grid = {'n_estimators': range(50, 151, 50),
              'max_depth': range(1, 11, 5),
              'min_samples_split': [2]}

Once the grid is made, it gets fed into a `GridSearchCV` along with the corresponding model. This may take a while to run: the computer calculates the score for every possible combination of parameter values in the grid, using 5-fold [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators) so that each combination is scored on data it was not trained on. Rather than use 'accuracy' as the metric, we will use the AUC-ROC score as the summary metric of performance. Remember that our Random Forest Classifier did not fit the data well, so we may do badly here too. Fear not, since now we know that search is hard to predict with the features at hand.

In [None]:
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='roc_auc')
grid_search.fit(...)

Once we've fit the model to the data, information about the search process and results is stored in `.cv_results_`. Here's what you can see:

In [None]:
# the keys stored in the grid search process results dictionary
sorted(grid_search.cv_results_.keys())

"params" contains the different combinations of parameters that were tried. "mean_test_score" gives the average score (which we set to AUC-ROC) for models using each set of parameters. Items are matched by index- the ith score is for the ith set of parameters.

In [None]:
grid_search.cv_results_["params"]

In [None]:
grid_search_results_df = pd.DataFrame(...)
grid_search_results_df.head()

**EXERCISE:** Find the set of parameters that got the best average score (`np.argmax` might help). Create a new random forest classifier using those parameters, then fit the model and print the scores for the training and validation data.

**Note:** Our Random Forest Classifier performed almost the same whatever the combination of parameters. If you are modeling something where the features you have capture the variation in the data better than ours do, this exercise can be quite helpful.
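
To see how the entries of `cv_results_` line up by index, here is a sketch of a very small grid search. It uses synthetic data and a single decision tree rather than a random forest so that it runs quickly; the names `demo_search` and `results` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data and a small grid over a single decision tree, to keep it fast
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=10)
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=10),
                           {"max_depth": [1, 3, 5]},
                           scoring="roc_auc", cv=5)
demo_search.fit(X_demo, y_demo)

# line up each parameter combination with its mean cross-validated score
results = pd.DataFrame({"params": demo_search.cv_results_["params"],
                        "mean_test_score": demo_search.cv_results_["mean_test_score"]})
print(results)

# the highest-scoring row corresponds to the search's best_params_
best_row = int(np.argmax(demo_search.cv_results_["mean_test_score"]))
print(results.loc[best_row, "params"], demo_search.best_params_)
```

The fitted search object also exposes `best_params_` directly, which is a convenient check on an index found by hand.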

In [None]:
best_index = ...
best_params = grid_search.cv_results_["params"][best_index]
best_params

In [None]:
# create the classifier
rf_cf_2 = RandomForestClassifier(n_estimators=...,
                                 max_depth=...,
                                 min_samples_split=...)

# fit the model 
model = ...

#score the model (this is accuracy) on the training and validation data
print("second random forest classifier accuracy on training set: ", ...)
print("second random forest classifier accuracy on validation set: ", ...)

And, as always, there is a shortcut: if you call `.predict` or `.score` on the grid search object you originally used to find the best parameters, it will do so using the best set of parameters automatically (by default, `GridSearchCV` refits a model with the best parameters on all the data passed to `.fit`). Note that `.score` here reports the metric given in `scoring` (AUC-ROC) rather than accuracy, and that results might differ slightly from the model you just fit due to the 'random' part of 'random forest'.

In [None]:
print("AUC-ROC for model with best parameters: ", grid_search.score(X_validate, y_validate))

## 3. Boosting Methods <a id='section 3'></a>

**Boosting** algorithms work roughly like so:
1. Make a weak predictor (one that makes predictions with slightly better-than-chance accuracy)
2. Train and evaluate the weak predictor
3. Make a new weak predictor that takes into account the errors made in the last model and improves on them.
4. Repeat steps 2 and 3 many times

### Ada Boost

[Ada Boost](https://scikit-learn.org/stable/modules/ensemble.html#adaboost) (for ADAptive BOOSTing) is one of the most popular boosting algorithms. The adaptive part of the algorithm comes from how it updates the data for each weak model in the sequence.

Each sample $i$ in the training set is given a weight $w_i$, and each sample's contribution to the model's training error is scaled by its weight. At first, all the $w_i$s are the same. After the first model is evaluated, the weights are updated so that samples that were predicted incorrectly get higher weights and samples that were predicted correctly get lower weights, which pushes the next model to concentrate on the earlier mistakes.
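
As a rough sketch of one round of reweighting, the cell below applies the classic two-class AdaBoost update to five equally weighted samples, one of which the weak model got wrong. This is the textbook form of the update; scikit-learn's implementation may differ in its details, and the numbers are invented for illustration.

```python
import numpy as np

# five samples with equal starting weights; suppose the first weak model gets only sample 4 wrong
weights = np.ones(5) / 5
misclassified = np.array([False, False, False, False, True])

# weighted error rate of the weak model, and the model's resulting "say" (alpha)
err = np.sum(weights[misclassified])
alpha = 0.5 * np.log((1 - err) / err)

# raise the weights of the mistakes, lower the rest, then renormalize to sum to 1
new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
new_weights = new_weights / new_weights.sum()
print(new_weights)
```

After the update the single misclassified sample carries as much total weight as the four correctly classified samples combined, so the next weak model is penalized heavily if it repeats the mistake.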

**QUESTION:** In the playground game "Duck, Duck, Goose", children are labeled as 'ducks' or 'geese' in the name of schoolyard mayhem. Suppose we want to build a classifier that predicts whether a sample is a duck or a goose based on two features: color, and whether or not it quacks.

In [None]:
birds = pd.DataFrame({'color':['white', 'grey', 'white'],
                      'quacks':['yes', 'yes', 'no'],
                     'type':['duck', 'duck', 'goose']})
birds

Initially, all the weights are the same.

In [None]:
birds['weights'] = np.ones(3) / 3
birds

The initial model in a sequence for Ada Boost outputs the following predictions:

In [None]:
birds['predictions'] = ['duck', 'goose', 'goose']
birds

For samples 0, 1 and 2, state whether their corresponding weight would be adjusted higher, lower, or stay the same before the data is fed into the next model.

**ANSWER:**
- Sample 0: 
- Sample 1: 
- Sample 2: 

Using an Ada Boost Classifier is similar to using the other tree-based classifiers above.

In [None]:
# construct the classifier
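ada_cf = AdaBoostClassifier(n_estimators=50,  # number of trees to try before stopping
                            learning_rate=1.0,  # decrease influence of each additional estimator
                            random_state=10) # sets the random seed
# the base model is left at its default, a depth-1 decision tree; the keyword for changing it is
# `estimator` in scikit-learn 1.2 and later (`base_estimator` in older versions)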
                            
# fit the data
model = ...

# get the model predictions and the probabilities of predicting positive class and metrics
predictions = ...
probabilities = ...
ada_cf_recall = ...
ada_cf_precision = ...
ada_cf_auc_roc_score = ...

# report metrics
print("AdaBoost Model accuracy on training set: ", ...)
print("AdaBoost Model accuracy on validation set: ", ...)
print("AdaBoost classifier recall: ", ada_cf_recall)
print("AdaBoost classifier precision: ", ada_cf_precision)
print("AdaBoost classifier AUC-ROC score: ", ada_cf_auc_roc_score)


**Note:** You could use grid search as we did above to find better values for the Ada Boost parameters. See the AdaBoost Classifier [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier) for more details.

#### Question
How does the AdaBoost classifier do compared to the other classifiers? What have we learned about the search process, at least with respect to search in Nashville traffic stops?


_your answer here_




#### In summary:
- decision trees make predictions using 'if/then/else' rules to split data into subsets, but are highly subject to overfitting
- grid search can be a useful tool for tuning hyperparameters
- ensemble methods seek to improve models by averaging or boosting multiple models
- random forest uses randomness and averaging to counter the overfitting problem for decision trees
- Ada Boost can be applied to a variety of model types, taking an initial weak model and improving it with sequential boosting

---
## 4. Support Vector Classifier
There is one other type of supervised learning method that is also used for classification: [support vector machines](https://scikit-learn.org/stable/modules/svm.html). The basic intuition is that we want to find a hyperplane in the $p$-dimensional feature space, defined by an equation of the form
$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} = 0$$
(where $x_{i1}, \dots, x_{ip}$ are the $p$ feature values of a point $i$ lying on the hyperplane), that separates the classes in our data. Observations for which the left-hand side is positive fall on one side of the hyperplane, and those for which it is negative fall on the other. When the hyperplane can completely separate the data, we have a maximal margin classifier, which must satisfy the constraints that all of the members of each class are on the appropriate side of the hyperplane and that the margin between the closest points to the boundary and the boundary is maximized. See Chapter 9 in _An Introduction to Statistical Learning_ for a thorough explanation.

The support vector classifier is a more flexible version of the maximal margin classifier that relaxes the requirement that all the points in each class be on the correct side of the margin, or even the correct side of the hyperplane. This allows support vector classifiers to misclassify some observations in return for doing a better job of classifying most of the training observations.

The decision rule of support vector classifiers relies only on the subset of training observations which, if they moved, would move the separating hyperplane: the observations that lie on the margin or on the wrong side of it. These observations, viewed as vectors in the feature space, are the support vectors. Support vector classifiers can work in high-dimensional spaces (text!), and because the decision rule depends only on that subset of the training observations they don't use up so much memory. Here the intent is just to try it out -- we already suspect that the Nashville police stops are not readily separable into true and false classes when it comes to `search_conducted`. Note that support vector machines do not directly provide probability estimates, and that sometimes the optimization algorithm that attempts to locate the hyperplane for maximum separation of classes fails to converge on a solution.
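
The hyperplane equation can be connected to code with a small sketch. Below, a `LinearSVC` is fit to two well-separated synthetic clusters (invented for this example, not the stop data), and the left-hand side of the hyperplane equation is computed by hand from the fitted `intercept_` and `coef_`; its sign reproduces the classifier's predictions. Setting `dual=False` is a choice made here only to keep the solver quiet on a small problem.

```python
import numpy as np
from sklearn.svm import LinearSVC

# two well-separated synthetic clusters in a 2-dimensional feature space
rng = np.random.default_rng(10)
X_neg = rng.normal(loc=[-3.0, -3.0], scale=0.5, size=(50, 2))
X_pos = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
X_demo = np.vstack([X_neg, X_pos])
y_demo = np.array([0] * 50 + [1] * 50)

demo_svc = LinearSVC(C=1.0, dual=False, random_state=10).fit(X_demo, y_demo)

# left-hand side of the hyperplane equation for every observation:
# beta_0 + beta_1 * x_i1 + beta_2 * x_i2
scores = demo_svc.intercept_[0] + X_demo @ demo_svc.coef_[0]

# observations with a positive value fall on the class-1 side of the hyperplane
by_hand = (scores > 0).astype(int)
print(np.array_equal(by_hand, demo_svc.predict(X_demo)))
print("training accuracy:", demo_svc.score(X_demo, y_demo))
```

The same signed values are available from `demo_svc.decision_function(X_demo)`; their magnitude grows with distance from the hyperplane, but they are not probabilities.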

In [None]:
# construct a linear support vector classifier
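# Note: C is a tuning parameter. In scikit-learn, C multiplies the penalty on margin violations, so the
# smaller C is, the more violations of being on the correct side of the margin are tolerated and the more
# the bias-variance tradeoff is shifted toward higher bias and lower variance (ISLR defines its C the
# opposite way, as a budget for violations)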

svcf = svm.LinearSVC(C=10, class_weight='balanced', verbose=1, max_iter=1000, random_state=10)

# fit the Linear SVC to the training data 
...

In [None]:
# get model predictions for validation set
predictions = ...
svc_recall = ...
svc_precision = ...

# get a read on its accuracy on the validation set
print("support vector classifier accuracy: ", ...)
print("support vector classifier recall: ", svc_recall)
print("support vector classifier precision: ", svc_precision)

# show a df with the features and their coefficients in the model
# X.columns supplies the names; feature_names_in_ exists only if the model was fit on a dataframe
svcf_features = pd.DataFrame({'feature':X.columns, 'coefficient':svcf.coef_[0,:]})
svcf_features

#### Question
Which classifier performed the best? Was the best performance good enough to allow us to predict search outcomes from a traffic stop?




---

## Bibliography

- Random Forest, ADA Boost, grid search code adapted from https://github.com/dlab-berkeley/python-machine-learning/blob/master
- Ensemble methods general reference: http://scikit-learn.org/stable/modules/ensemble.html

---
Notebook developed by: Keeley Takimoto (Spring 2018) and amended by Jonathan Marshall (Spring 2024)

Data Science Modules: http://data.berkeley.edu/education/modules
