# Model Building

This is a continuation from the `Feature Engineering` notebook. In that exploration, we prepared a dataset for modelling by clearing out NaNs, binning continuous numeric features, and encoding categorical features. 

Our dataset consists of roughly 50k offenders and 750k offences from the Correctional Service of Canada. According to reporting by Tom Cardoso of the Globe and Mail, reintegration potential scores are one of the most important factors in determining parole and early release for offenders. The score is built from a few component reports, but a large part of it is qualitative and subjective. According to Cardoso, the lack of cultural nuance in these evaluations has led to racial bias in the corrections system.

We will look at whether we can predict reintegration potential scores for offenders, and whether there's a racial component in determining scores. We will also look at any other features that are predictive of reintegration potential.

Our features and target are as follows:

|Features|Target|
|:-----|:-------|
|INSTITUTIONAL SECURITY LEVEL, OFFENDER SECURITY LEVEL, DYNAMIC/NEED, STATIC/RISK, MOTIVATION, OFFENCE, RACE, AGE, SENTENCE LENGTH, CUSTODY, SUPERVISION|REINTEGRATION POTENTIAL|

Let's import relevant packages for our model building, read in our modelling data, and have a look at its head. 

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('modelling_data.csv')
df.head(10)

|Unnamed: 0|INSTITUTIONAL SECURITY LEVEL|OFFENDER SECURITY LEVEL|DYNAMIC/NEED|STATIC/RISK|REINTEGRATION POTENTIAL|MOTIVATION|offence_AGGRAVATED ASSAULT|offence_ARMED ROBBERY|offence_ARSON - DAMAGE TO PROPERTY|offence_ASSAULT - THREATS OF VIOLENCE|...|SENLENGTH_40-60Q|SENLENGTH_60-80Q|SENLENGTH_80-100Q|CUSTODY_Community|CUSTODY_In Custody|SUPERVISION_DAY PAROLE|SUPERVISION_FULL PAROLE|SUPERVISION_In Custody|SUPERVISION_LONG TERM SUPER|SUPERVISION_STAT RELEASE|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|4|4|2|2|2|1|0|0|0|0|...|0|0|0|0|1|0|0|1|0|0|
|1|3|4|2|2|2|1|0|0|0|0|...|0|0|0|0|1|0|0|1|0|0|
|2|3|4|2|2|2|1|0|0|0|0|...|0|0|0|0|1|0|0|1|0|0|
|3|3|4|2|2|2|1|0|0|0|0|...|0|0|0|0|1|0|0|1|0|0|
|4|2|2|2|2|2|1|0|0|0|0|...|0|0|0|0|1|0|0|1|0|0|
|5|2|2|2|2|2|1|0|0|0|0|...|0|0|0|0|1|0|0|1|0|0|
|6|1|1|1|0|0|0|0|0|0|0|...|0|0|0|0|1|0|0|1|0|0|
|7|0|1|1|0|0|0|0|0|0|0|...|0|0|0|1|0|1|0|0|0|0|
|8|0|1|1|0|0|0|0|0|0|0|...|0|0|0|1|0|0|0|0|0|1|
|9|2|2|2|2|2|1|0|0|0|0|...|0|0|0|0|1|0|0|1|0|0|
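
The `Unnamed: 0` column in the output is most likely the row index that was written to the CSV when the modelling data was saved; that is an assumption, since this notebook doesn't show how the file was written. A minimal sketch, using a tiny made-up frame rather than the real data, of how such a column appears and how `index_col=0` keeps it out of the feature set:

```python
import io

import pandas as pd

# Tiny made-up frame standing in for the modelling data (not the real dataset).
toy = pd.DataFrame({"MOTIVATION": [1, 0, 1], "REINTEGRATION POTENTIAL": [2, 0, 1]})

# Saving with the default index=True writes the index as an unnamed first column.
csv_text = toy.to_csv()

# Default read: the saved index comes back as a regular column named 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores it as the index instead of a feature column.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

If the column really is a leftover index, it carries no information about the offender and should not be passed to the models as a feature.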


Let's look at the balance of our target variable.

In [10]:
df['REINTEGRATION POTENTIAL'].value_counts()

1    57693
2    47380
0    36687
Name: REINTEGRATION POTENTIAL, dtype: int64

Looks like we have a rough balance between the three reintegration potential scores. As a reminder, we've encoded our reintegration potential scores as follows ('high' is the best reintegration potential score, 'low' is the worst):

* **HIGH:** 0
* **MEDIUM:** 1
* **LOW:** 2
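
To put numbers on that balance, the counts printed above can be turned into proportions. The sketch below re-enters the three counts by hand so that it runs on its own; on the real frame, `df['REINTEGRATION POTENTIAL'].value_counts(normalize=True)` should give the same figures. The `labels` mapping simply restates the encoding listed above.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({1: 57693, 2: 47380, 0: 36687}, name="REINTEGRATION POTENTIAL")

# Encoding used in this notebook: 0 = HIGH, 1 = MEDIUM, 2 = LOW.
labels = {0: "HIGH", 1: "MEDIUM", 2: "LOW"}

# Share of rows in each class, relabelled for readability.
proportions = (counts / counts.sum()).rename(index=labels).round(3)
print(proportions)
```

This works out to roughly 41% medium, 33% low, and 26% high, so a classifier that always predicts the most common class would score about 41% accuracy. That is a useful floor to keep in mind when reading the accuracy scores later on.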

## Objectives

The goals of this exploration are twofold. First, we want to put together a model that has a good accuracy score for predicting the reintegration potential of offenders in this dataset. Second, we want to explore whether there's a difference in the accuracy scores we get for a white offender test set vs. a test set composed of other races.

In order to achieve this, we'll take the following steps:

* Carve out the white offenders for most of our modelling - we will split this dataset into training, validation, and test sets.
    * The remaining data, composed of non-white offenders, will be a secondary test set - our expectation is that if there's no racial bias in the dataset, our models should perform comparably on our white and non-white test sets.
* We will fit our data to the following models:
    * **Multiclass Logistic Regression - OvR:** Baseline model.
    * **K-Nearest Neighbors:** Helpful since it's a nonparametric model. Since we have a lot of data and don't want to worry too much about choosing just the right features, this will be a good addition to the exploration.
    * **Random Forest:** Usually more accurate than a single decision tree and less prone to overfitting.
    * **SVM - OvO:** Effective in high-dimensional feature spaces and uses only a subset of training points (the support vectors) in the decision function, so it's memory efficient. Training time grows quickly with sample size, so it may be slow on the full dataset.
    * **XGBoost:** A gradient-boosted tree ensemble that typically trains faster than many other boosting implementations.
* Compare the accuracy score and ROC AUC of all classifiers on the validation set to determine which is most promising.
* Tune hyperparameters for the most successful model.
* Retrain the most successful model on a combination of training and validation data, and then predict on both of our test sets (white and non-white offenders).
* Finally, we will examine feature importance and see what conclusions can be drawn. 
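
The comparison step in the list above could look roughly like the sketch below. It is an illustration under stated assumptions, not this notebook's actual modelling code: the data is synthetic (`make_classification`), the hyperparameters are scikit-learn defaults apart from the few shown, and XGBoost is left out to keep the example dependency-light (`xgboost.XGBClassifier` exposes the same `fit` / `predict_proba` interface and could be added to the dictionary).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the real features and target.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate classifiers, mirroring the list above (XGBoost omitted).
models = {
    "logreg_ovr": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # probability=True is needed so that SVC exposes predict_proba for ROC AUC.
    "svm_ovo": SVC(decision_function_shape="ovo", probability=True, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)
    results[name] = {
        "accuracy": accuracy_score(y_val, model.predict(X_val)),
        # One-vs-rest ROC AUC, macro-averaged over the three classes.
        "roc_auc_ovr": roc_auc_score(y_val, proba, multi_class="ovr"),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, roc_auc={scores['roc_auc_ovr']:.3f}")
```

Scoring every candidate on the same validation split keeps the comparison fair and leaves both test sets untouched until the final model has been chosen.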

Let's begin by creating our training dataset, validation dataset, and two testing datasets.
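
A minimal sketch of one way to build those four datasets, run on a made-up frame so that it is self-contained. The column name `RACE_White` is hypothetical (the real one-hot race columns are elided in the head output above), and the 60/20/20 split and `random_state` are assumptions rather than choices taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data; 'RACE_White' is a hypothetical stand-in for the real race columns.
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "RACE_White": rng.integers(0, 2, size=n),
    "MOTIVATION": rng.integers(0, 3, size=n),
    "REINTEGRATION POTENTIAL": rng.integers(0, 3, size=n),
})
target = "REINTEGRATION POTENTIAL"

# Carve out white offenders for modelling; the rest form the secondary test set.
white = toy[toy["RACE_White"] == 1]
non_white = toy[toy["RACE_White"] == 0]

# The race indicator is constant within the white subset, so it is dropped here.
X_white = white.drop(columns=[target, "RACE_White"])
y_white = white[target]
X_test_nonwhite = non_white.drop(columns=[target, "RACE_White"])
y_test_nonwhite = non_white[target]

# Hold out 20% of white offenders as the primary test set ...
X_rem, X_test_white, y_rem, y_test_white = train_test_split(
    X_white, y_white, test_size=0.2, stratify=y_white, random_state=0
)
# ... then split the remainder 75/25 into training and validation (60/20 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rem, y_rem, test_size=0.25, stratify=y_rem, random_state=0
)

print(len(X_train), len(X_val), len(X_test_white), len(X_test_nonwhite))
```

Stratifying on the target keeps the high/medium/low proportions similar across the training, validation, and white test sets, which makes their accuracy scores easier to compare.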

Option:

* Ordinal regression, since the three reintegration potential scores are ordered (high, medium, low).