# Model Building

This is a continuation from the `Feature Engineering` notebook. In that exploration, we prepared a dataset for modelling by clearing out NaNs, binning continuous numeric features, and encoding categorical features. 

Our dataset consists of roughly 50k offenders and 750k offenses from the Correctional Service of Canada. According to reporting by Tom Cardoso of the Globe and Mail, reintegration potential scores are one of the most important factors in determining parole and early release for offenders. Reintegration potential is composed of a few component reports, but has a large qualitative and subjective component. According to Cardoso, the lack of cultural nuance in putting together these evaluations has led to racial bias in the corrections system. 

We will look at whether we can predict reintegration potential scores for offenders, and whether there's a racial component in determining scores. We will also look at any other features that are predictive of reintegration potential.

Our features and target are as follows:

|Features|Target|
|:-----|:-------|
|INSTITUTIONAL SECURITY LEVEL, OFFENDER SECURITY LEVEL, DYNAMIC/NEED, STATIC/RISK, MOTIVATION, OFFENCE, RACE, AGE, SENTENCE LENGTH, CUSTODY, SUPERVISION|REINTEGRATION POTENTIAL|

Let's import relevant packages for our model building, read in our modelling data, and have a look at its head. 

In [23]:
#Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
import xgboost as xgb

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, f1_score, auc, confusion_matrix, classification_report

In [6]:
df = pd.read_csv('modelling_data.csv')
df.head(10)

Unnamed: 0,INSTITUTIONAL SECURITY LEVEL,OFFENDER SECURITY LEVEL,DYNAMIC/NEED,STATIC/RISK,REINTEGRATION POTENTIAL,MOTIVATION,offence_AGGRAVATED ASSAULT,offence_ARMED ROBBERY,offence_ARSON - DAMAGE TO PROPERTY,offence_ASSAULT - THREATS OF VIOLENCE,...,SENLENGTH_40-60Q,SENLENGTH_60-80Q,SENLENGTH_80-100Q,CUSTODY_Community,CUSTODY_In Custody,SUPERVISION_DAY PAROLE,SUPERVISION_FULL PAROLE,SUPERVISION_In Custody,SUPERVISION_LONG TERM SUPER,SUPERVISION_STAT RELEASE
0,4,4,2,2,2,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
1,3,4,2,2,2,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
2,3,4,2,2,2,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
3,3,4,2,2,2,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
4,2,2,2,2,2,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
5,2,2,2,2,2,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
6,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
7,0,1,1,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
8,0,1,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
9,2,2,2,2,2,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


Let's look at the balance of our target variable.

In [7]:
df['REINTEGRATION POTENTIAL'].value_counts()

1    57693
2    47380
0    36687
Name: REINTEGRATION POTENTIAL, dtype: int64

It looks like we have a rough balance between the three reintegration potential scores. As a reminder, we've encoded our reintegration potential scores as follows ('high' is the best reintegration potential score, 'low' is the worst):

* **HIGH:** 0
* **MEDIUM:** 1
* **LOW:** 2
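
To make "rough balance" concrete, the class shares can be computed directly from the counts printed above. This is a minimal sketch; the `counts` and `label_names` dictionaries are illustrative names, not part of the notebook's pipeline.

```python
# Class counts copied from the value_counts() output above
counts = {1: 57693, 2: 47380, 0: 36687}
label_names = {0: "HIGH", 1: "MEDIUM", 2: "LOW"}

total = sum(counts.values())
shares = {label_names[k]: v / total for k, v in counts.items()}

# Print the share of each class, largest first
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<6} {share:.1%}")

# Accuracy obtained by always predicting the most common class
majority_baseline = max(shares.values())
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

The majority-class baseline is a useful floor: any model worth keeping should beat it on accuracy.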

## Objectives

The goals of this exploration are two-fold. First, we want to put together a model that has a good accuracy score for predicting the reintegration potential of offenders in this dataset. Second, we want to explore whether there's a difference in the accuracy scores we get for a white offender test set vs. a test set composed of other races.

In order to achieve this, we'll take the following steps:

* Carve out the white offenders for most of our modelling - we will split this dataset into training, validation, and test sets.
    * The remaining data, composed of non-white offenders, will be a secondary test set - our expectation is that if there's no racial bias in how scores are assigned, a model trained on white offenders should perform comparably on our white and non-white test sets.
* We will fit our data to the following models:
    * **Multinomial Logistic Regression - OvR:** Baseline model.
    * **K-Nearest Neighbors:** Helpful since it's a nonparametric model. Since we have a lot of data and don't want to worry too much about choosing just the right features, this will be a good addition to the exploration.
    * **Random Forest:** Usually more accurate than a single decision tree and less prone to overfitting, since it averages many trees.
    * **SVM - OvR:** We use `LinearSVC`, a linear SVM that scales well to large sample sizes.
    * **XGBoost:** Gradient-boosted trees, typically fast to train compared with other boosting implementations.
* Test the accuracy score and ROC AUC for all classifiers to determine which is most promising.
* Tune hyperparameters with GridSearchCV for the most successful model.
* Retrain the most successful model on a combination of training and validation data, and then predict on both of our test sets (white and non-white offenders).
* Finally, we will examine feature importance and see what conclusions can be drawn. 
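
The comparison between the two test sets can be sketched as follows. This is a minimal illustration on made-up labels, not results from the dataset; `accuracy`, `per_class_recall`, and the toy label lists are hypothetical names. Per-class recall is included because overall accuracy alone can hide differences when two groups have different class mixes.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in sorted(totals)}

# Toy labels standing in for predictions on two test sets (made up)
y_a_true = [0, 0, 1, 1, 2, 2]
y_a_pred = [0, 0, 1, 1, 2, 1]
y_b_true = [0, 1, 1, 2, 2, 2]
y_b_pred = [1, 1, 1, 2, 1, 1]

acc_a = accuracy(y_a_true, y_a_pred)
acc_b = accuracy(y_b_true, y_b_pred)
print(f"Accuracy A: {acc_a:.3f}, accuracy B: {acc_b:.3f}, gap: {acc_a - acc_b:.3f}")
print("Recall A:", per_class_recall(y_a_true, y_a_pred))
print("Recall B:", per_class_recall(y_b_true, y_b_pred))
```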

Let's begin by creating two datasets: one for white and one for non-white offenders.

In [18]:
#Create a dataset consisting of only white offenders

w_offenders = df[df['RACE_White'] == 1].reset_index(drop=True)

#Create a dataset consisting of all non-white offenders

nw_offenders = df[df['RACE_White'] == 0].reset_index(drop=True)

Let's now separate our non-white dataset into our features and target. We will use this for a second round of testing later on in the exploration.

In [19]:
X_testtwo = nw_offenders.drop('REINTEGRATION POTENTIAL', axis=1)
y_testtwo = nw_offenders['REINTEGRATION POTENTIAL'].values

We'll do the same thing for our white offender dataset. First, let's confirm that we still have rough class balance in our target variable.

In [20]:
w_offenders['REINTEGRATION POTENTIAL'].value_counts()

1    35464
2    26642
0    25657
Name: REINTEGRATION POTENTIAL, dtype: int64
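
Since the non-white test set is simply the rest of the data, its class counts can be derived by subtracting the white counts from the totals shown earlier. The sketch below does that arithmetic, assuming every row has `RACE_White` equal to 0 or 1 so that the two subsets partition the data; the dictionary and function names are illustrative.

```python
label_names = {0: "HIGH", 1: "MEDIUM", 2: "LOW"}

# Counts copied from the two value_counts() outputs in this notebook
all_counts = {0: 36687, 1: 57693, 2: 47380}
white_counts = {0: 25657, 1: 35464, 2: 26642}

# Assumption: RACE_White is 0 or 1 for every row, so the rest is non-white
non_white_counts = {k: all_counts[k] - white_counts[k] for k in all_counts}

def class_shares(counts):
    """Convert raw class counts into proportions."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

white_shares = class_shares(white_counts)
non_white_shares = class_shares(non_white_counts)

for k in sorted(label_names):
    print(f"{label_names[k]:<6} white: {white_shares[k]:.1%}   "
          f"non-white: {non_white_shares[k]:.1%}")
```

If the class mix differs between the two groups, overall accuracy is not directly comparable across the two test sets, so per-class metrics are worth checking as well.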

Now we can separate our white offender dataset into features and targets, then split the data into training and test sets, and further split the training set into training and validation sets.

In [21]:
#Split the white offender dataset into features and a target
X = w_offenders.drop('REINTEGRATION POTENTIAL', axis=1)
y = w_offenders['REINTEGRATION POTENTIAL'].values

#Get training and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

#Split the training set into a training and validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 42)
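
As a sanity check on the two-step split, the sketch below applies the same `test_size` values to a plain index array of the same length as the white offender dataset. It suggests the result is roughly a 60/20/20 train/validation/test split; `idx` and the `idx_*` names are stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 87763           # size of w_offenders (sum of the class counts above)
idx = np.arange(n_rows)  # stand-in for the rows of X

# Same two-step split as in the notebook cell
idx_train_full, idx_test = train_test_split(idx, test_size=0.2, random_state=42)
idx_train, idx_val = train_test_split(idx_train_full, test_size=0.25, random_state=42)

for name, part in [("train", idx_train), ("validation", idx_val), ("test", idx_test)]:
    print(f"{name:<10} {len(part):>6} rows ({len(part) / n_rows:.1%})")
```

`train_test_split` also accepts a `stratify` argument, which could be used to keep the class proportions the same in each subset.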

Let's now instantiate all of the models we mentioned above, fit them to our training set, and predict on our validation set. As a reminder, those models are:

* **Multinomial Logistic Regression - OvR** 
* **K-Nearest Neighbors** 
* **Random Forest**  
* **SVM - OvR** 
* **XGBoost**

And we will be determining the _accuracy score_ as well as the _ROC AUC score_ of each model. 

In [25]:
#Instantiate all models

mlr = LogisticRegression(multi_class='ovr', random_state=42)
knn = KNeighborsClassifier()
rf = RandomForestClassifier(random_state=42)
svm = LinearSVC(multi_class='ovr')
#Use a separate name so the xgboost module (imported as xgb) is not overwritten
xgb_clf = xgb.XGBClassifier(objective='multi:softmax', random_state=42)
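
One way to score several models on both metrics is a small helper like the one below. This is a sketch on synthetic data, not the notebook's dataset; `evaluate`, `demo_models`, and the `Xd_*`/`yd_*` arrays are hypothetical names. With three classes, `roc_auc_score` needs the full probability matrix and a `multi_class` setting, and `LinearSVC` has no `predict_proba`, so the helper reports only accuracy for it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for X_train / X_val (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

def evaluate(model, X_tr, y_tr, X_va, y_va):
    """Fit the model and return (accuracy, one-vs-rest ROC AUC or None)."""
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_va, model.predict(X_va))
    auc_ovr = None
    if hasattr(model, "predict_proba"):
        # Multiclass ROC AUC: macro-average of one-vs-rest AUCs
        auc_ovr = roc_auc_score(y_va, model.predict_proba(X_va), multi_class="ovr")
    return acc, auc_ovr

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}

results = {}
for name, model in demo_models.items():
    acc, auc_ovr = evaluate(model, Xd_train, yd_train, Xd_val, yd_val)
    results[name] = (acc, auc_ovr)
    auc_text = "n/a" if auc_ovr is None else f"{auc_ovr:.3f}"
    print(f"{name}: accuracy={acc:.3f}, ROC AUC (OvR)={auc_text}")
```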


In [None]:
# functions =

def plot_confusion_matrix(y_true, y_pred):
    """Print a classification report and plot the confusion matrix."""
    class_ids = [0, 1, 2] #encoded target values
    labels = ['High', 'Medium', 'Low'] #display names, in the same order as class_ids
    print(classification_report(y_true, y_pred, labels=class_ids, target_names=labels)) #classification report from sklearn
    cnf_matrix = confusion_matrix(y_true, y_pred, labels=class_ids)
    plt.imshow(cnf_matrix, cmap=plt.cm.Blues) #plot confusion matrix grid
    threshold = cnf_matrix.max() / 2 #threshold to define text color
    for i in range(cnf_matrix.shape[0]): #print text in grid
        for j in range(cnf_matrix.shape[1]):
            plt.text(j, i, cnf_matrix[i,j], ha='center', va='center',
                     color="w" if cnf_matrix[i,j] > threshold else 'black')
    tick_marks = np.arange(len(labels)) #define labeling spacing based on number of classes
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.colorbar()
    plt.tight_layout()

In [None]:
import os

# ROC curves are defined for binary targets, so we score one class against the rest.
# Column 1 of predict_proba is the probability of class 1 (MEDIUM).

positive_class = 1
y_val_bin = (y_val == positive_class).astype(int)

# Generate a no-skill prediction (the same constant score for every sample)

ns_prob = [0 for _ in range(len(y_val))]

# Fit a classifier, then get its positive-class probability and predicted labels

def get_prob_pred(classifier, X_train, y_train, X_val):
    classifier.fit(X_train, y_train)
    cl_prob = classifier.predict_proba(X_val)
    cl_prob = cl_prob[:, positive_class]
    cl_y_pred = classifier.predict(X_val)
    return cl_prob, cl_y_pred

lr_prob, lr_pred = get_prob_pred(mlr, X_train, y_train, X_val)
knn_prob, knn_pred = get_prob_pred(knn, X_train, y_train, X_val)
rf_prob, rf_pred = get_prob_pred(rf, X_train, y_train, X_val)

# Calculate scores against the binarized target

ns_auc = roc_auc_score(y_val_bin, ns_prob)
lr_auc = roc_auc_score(y_val_bin, lr_prob)
knn_auc = roc_auc_score(y_val_bin, knn_prob)
rf_auc = roc_auc_score(y_val_bin, rf_prob)

# Summarize scores

print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic Regression: ROC AUC=%.3f' % (lr_auc))
print('KNN Classifier: ROC AUC=%.3f' % (knn_auc))
print('Random Forest Classifier: ROC AUC=%.3f' % (rf_auc))

# Calculate ROC curves (the returned thresholds are not needed)

ns_fpr, ns_tpr, _ = roc_curve(y_val_bin, ns_prob)
lr_fpr, lr_tpr, _ = roc_curve(y_val_bin, lr_prob)
knn_fpr, knn_tpr, _ = roc_curve(y_val_bin, knn_prob)
rf_fpr, rf_tpr, _ = roc_curve(y_val_bin, rf_prob)

# Plot the ROC curve for each model

fig = plt.figure(figsize=[12,10])

plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', markersize=3, label='Logistic Regression')
plt.plot(knn_fpr, knn_tpr, marker='.', markersize=3, label='KNN Classifier')
plt.plot(rf_fpr, rf_tpr, marker='.', markersize=3, label='Random Forest Classifier')

# Axis labels

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

# Show the legend

plt.legend()

# Save and show the plot (create the output folder if it doesn't exist)

os.makedirs("images", exist_ok=True)
fig.savefig("images/ROC_AUC.png")
plt.show()
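
The hyperparameter tuning step listed in the objectives could look something like the sketch below. It assumes, purely for illustration, that random forest is the model being tuned; the synthetic data and the small `param_grid` are placeholders rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the training set (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Illustrative grid only; real values would depend on the data
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy: %.3f" % search.best_score_)

# By default the best configuration is refit on all of the data passed to fit()
best_model = search.best_estimator_
```

Because `GridSearchCV` runs its own cross-validation, it could be fit on the combined training and validation data, keeping both test sets untouched until the end.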

Another option to explore:

* Ordinal regression, since the reintegration potential classes are ordered (high, medium, low).
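
Because the target classes are ordered (HIGH, MEDIUM, LOW), one simple way to try an ordinal approach with scikit-learn alone is to train one binary classifier per threshold ("is the score above class k?") and combine their probabilities. The sketch below is an assumption-laden illustration on synthetic data, not the notebook's method; `SimpleOrdinalClassifier` is a hypothetical helper, not a scikit-learn class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SimpleOrdinalClassifier:
    """Ordinal classification via one binary model per class threshold."""

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.estimators_ = []
        # One binary model per threshold k, estimating P(y > k)
        for threshold in self.classes_[:-1]:
            est = LogisticRegression(max_iter=1000)
            est.fit(X, (y > threshold).astype(int))
            self.estimators_.append(est)
        return self

    def predict_proba(self, X):
        # Columns are P(y > k) for each threshold k
        greater = np.column_stack(
            [est.predict_proba(X)[:, 1] for est in self.estimators_]
        )
        upper = np.column_stack([np.ones(len(X)), greater])
        lower = np.column_stack([greater, np.zeros(len(X))])
        # P(y = k) = P(y > k-1) - P(y > k); clip small negatives, then renormalize
        proba = np.clip(upper - lower, 0, None)
        return proba / proba.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Synthetic ordered target: a noisy latent score cut into three bands (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
latent = X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.3, size=500)
y_demo = np.digitize(latent, [-0.7, 0.7])  # ordered classes 0 < 1 < 2

model = SimpleOrdinalClassifier().fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)
train_accuracy = (model.predict(X_demo) == y_demo).mean()
print("Training accuracy on synthetic data: %.3f" % train_accuracy)
```

Dedicated ordinal models exist in other libraries; this decomposition is only meant to show the idea with the tools already imported in this notebook.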