### Group Project 4: Comparing 3 Models for Predicting Recidivism

For background on this project, please see the [README](../README.md).

**Notebooks**
- [Data Acquisition & Cleaning](./01_data_acq_clean.ipynb)
- [Exploratory Data Analysis](./02_eda.ipynb)
- [Modeling](./03_modeling.ipynb)
- [Experiments](./03a_experiments.ipynb)
- Results and Recommendations (this notebook)

**In this notebook, you'll find:**
- Comparative evaluation of models for each of the 3 datasets
- Identification of the best model for each dataset
- Identification of the best overall model
- Policy recommendations based on our investigations

In [27]:
# just need pandas for this notebook
import pandas as pd

In [28]:
# read in our model stats for all datasets, remind ourselves what they look like
stats = pd.read_csv('../data/model_stats.csv')
stats.head()

Unnamed: 0,dataset_name,model_used,model_params,train_score,test_score,accuracy,specificity,precision,recall,f1 score,true_neg,false_pos,false_neg,true_pos
0,NY,logreg gen/age,"{'C': 1.0, 'class_weight': None, 'dual': False...",0.586841,0.587914,0.587914,0.970683,0.561523,0.052514,0.096047,21356,645,14903,826
1,NY,rf gen/age,"{'bootstrap': True, 'ccp_alpha': 0.0, 'class_w...",0.588332,0.587013,0.587013,0.918049,0.519584,0.123975,0.200185,20198,1803,13779,1950
2,NY,ada gen/age,"{'algorithm': 'SAMME.R', 'base_estimator__ccp_...",0.588332,0.587013,0.587013,0.918049,0.519584,0.123975,0.200185,20198,1803,13779,1950
3,NY,grad gen/age,"{'ccp_alpha': 0.0, 'criterion': 'friedman_mse'...",0.588325,0.586986,0.586986,0.918186,0.519487,0.123721,0.199846,20201,1800,13783,1946
4,NY,stack gen/age,"{'cv': None, 'estimators': [('random_forest', ...",0.588325,0.586986,0.586986,0.918186,0.519487,0.123721,0.199846,20201,1800,13783,1946


In [29]:
# establish the columns we will export for the presentation
presentation_cols = ['model_used', 'train_score', 'test_score', 'specificity']

---
##### Model 1: Basic Dataset (NY)

In [19]:
# NY evaluation - top 5
stats[stats['dataset_name'] == 'NY'].sort_values(by='accuracy', ascending=False).head()[presentation_cols]

Unnamed: 0,model_used,train_score,test_score,specificity
11,fnn gen/age/cty,0.599808,0.600451,0.884096
8,ada gen/age/cty,0.599569,0.600027,0.870779
10,stack gen/age/cty,0.600981,0.599867,0.876233
6,logreg gen/age/cty,0.598734,0.599019,0.876778
9,grad gen/age/cty,0.59894,0.598834,0.918276


In [20]:
# NY evaluation - describe
stats[stats['dataset_name'] == 'NY'].describe()

Unnamed: 0,train_score,test_score,accuracy,specificity,precision,recall,f1 score,true_neg,false_pos,false_neg,true_pos
count,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0
mean,0.589404,0.588544,0.588544,0.884312,0.533269,0.174836,0.252499,19455.75,2545.25,12979.0,2750.0
std,0.019122,0.019024,0.019024,0.079806,0.037958,0.07957,0.078992,1755.802957,1755.802957,1251.564258,1251.564258
min,0.531884,0.531275,0.531275,0.646743,0.428025,0.052514,0.096047,14229.0,645.0,9913.0,826.0
25%,0.588325,0.587006,0.587006,0.876642,0.51956,0.123911,0.2001,19287.0,1800.0,12407.0,1949.0
50%,0.593533,0.592526,0.592526,0.906868,0.55013,0.165777,0.255499,19952.0,2049.0,13121.5,2607.5
75%,0.599629,0.599231,0.599231,0.918186,0.553201,0.211202,0.305259,20201.0,2714.0,13780.0,3322.0
max,0.602783,0.600451,0.600451,0.970683,0.570781,0.369763,0.396766,21356.0,7772.0,14903.0,5816.0


ANALYSIS

- A basic model that considers only age at release, gender, and county of indictment does not improve meaningfully on the baseline.
- We tried six different classification models on the gender, age, and county features; however, our best result was only a 2-percentage-point gain in accuracy, from a baseline of 58% to 60%, using a **fully connected neural network**.
- As these scores failed to meet our target requirements, we will push forward to the criminal history and behavioral datasets for further model evaluation.
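
The 58% baseline and the reported metrics can be cross-checked against the confusion-matrix counts stored in `model_stats.csv`. Below is a minimal sketch using the counts from the `logreg gen/age` row shown at the top of this notebook. It assumes the baseline is the majority-class rate (the share of the larger class in the test set); the helper name `metrics_from_counts` is ours, for illustration only.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive summary metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        # Assumed baseline: always predict the larger class
        "baseline": max(tn + fp, fn + tp) / total,
        "accuracy": (tn + tp) / total,
        # Share of actual negatives correctly identified
        "specificity": tn / (tn + fp),
        # Share of predicted positives that are actual positives
        "precision": tp / (tp + fp),
        # Share of actual positives correctly identified
        "recall": tp / (tp + fn),
    }


# Counts copied from the NY `logreg gen/age` row of model_stats.csv
ny_logreg = metrics_from_counts(tn=21356, fp=645, fn=14903, tp=826)
for name, value in ny_logreg.items():
    print(f"{name}: {value:.4f}")
```

For this row the majority-class rate comes out at roughly 0.583, which is why a test accuracy of about 0.60 represents only a small gain.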

---
##### Model 2: Criminal History Dataset (FL)

In [21]:
# FL evaluation - top 5
stats[stats['dataset_name'] == 'FL'].sort_values(by='accuracy', ascending=False).head()[presentation_cols]

Unnamed: 0,model_used,train_score,test_score,specificity
17,grad,0.886566,0.881635,0.871616
15,rf,0.891616,0.877877,0.860789
16,ada,0.863786,0.866134,0.859242
14,bag,0.996477,0.863786,0.878577
18,fnn1,0.852513,0.854392,0.861562


In [22]:
# FL evaluation - describe
stats[stats['dataset_name'] == 'FL'].describe()

Unnamed: 0,train_score,test_score,accuracy,specificity,precision,recall,f1 score,true_neg,false_pos,false_neg,true_pos
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,0.885404,0.854908,0.854908,0.872854,0.808361,0.827153,0.816372,1128.6,164.4,144.5,691.5
std,0.062418,0.018202,0.018202,0.015473,0.01289,0.062885,0.030376,20.006666,20.006666,52.572173,52.572173
min,0.828558,0.821982,0.821982,0.859242,0.794016,0.721292,0.760883,1111.0,118.0,80.0,603.0
25%,0.846554,0.847698,0.847698,0.861562,0.799401,0.817285,0.809574,1114.0,158.75,110.0,683.25
50%,0.858149,0.852513,0.852513,0.86891,0.80561,0.833134,0.815353,1123.5,169.5,139.5,696.5
75%,0.890353,0.865547,0.865547,0.877224,0.815004,0.868421,0.835179,1134.25,179.0,152.75,726.0
max,0.999413,0.881635,0.881635,0.908739,0.837017,0.904306,0.856164,1175.0,182.0,233.0,756.0


ANALYSIS

- The tuned **GradientBoostingClassifier** was the winner here, with excellent training and test accuracy (0.887 and 0.882), very little overfitting,
  and a reasonable specificity of 0.87.
- All of our top 5 models were in very similar ranges for both accuracy and specificity, and all met our criterion of accuracy of at least 80%.
- The BaggingClassifier would be one of the least attractive choices here, given the evident overfitting (training accuracy of 0.996 vs. test accuracy of 0.864).
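
The overfitting noted above can be quantified as the gap between training and test accuracy. This is a small sketch that rebuilds the top-5 FL rows by hand from the table above (rather than reading the CSV); the column name `overfit_gap` is our own.

```python
import pandas as pd

# Top-5 FL rows copied from the table above
fl_top5 = pd.DataFrame({
    "model_used": ["grad", "rf", "ada", "bag", "fnn1"],
    "train_score": [0.886566, 0.891616, 0.863786, 0.996477, 0.852513],
    "test_score": [0.881635, 0.877877, 0.866134, 0.863786, 0.854392],
})

# A large positive gap means the model fits the training data
# much better than it fits unseen data
fl_top5["overfit_gap"] = fl_top5["train_score"] - fl_top5["test_score"]

print(fl_top5.sort_values(by="overfit_gap", ascending=False))
```

On these numbers the bagging model's gap is about 0.133, while the other four models stay under 0.015.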

---
##### Model 3: Behavioral Dataset (GA)

In [23]:
# GA evaluation - top 5
stats[stats['dataset_name'] == 'GA'].sort_values(by='accuracy', ascending=False).head()[presentation_cols]

Unnamed: 0,model_used,train_score,test_score,specificity
31,gboost2,0.753582,0.71756,0.595335
30,gboost,0.739048,0.714695,0.567444
35,ada3,0.728454,0.713058,0.574544
34,ada,0.727175,0.71142,0.574037
33,rf3,0.745599,0.704871,0.480223


In [24]:
# GA evaluation - describe
stats[stats['dataset_name'] == 'GA'].describe()

Unnamed: 0,train_score,test_score,accuracy,specificity,precision,recall,f1 score,true_neg,false_pos,false_neg,true_pos
count,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0
mean,0.801472,0.688377,0.688377,0.548115,0.71943,0.783298,0.749264,1080.882353,891.117647,631.470588,2282.529412
std,0.130845,0.03281,0.03281,0.05073,0.023503,0.052654,0.031294,100.039919,100.039919,153.43448,153.43448
min,0.611771,0.608064,0.608064,0.425456,0.652986,0.668154,0.676276,839.0,724.0,417.0,1947.0
25%,0.721648,0.679288,0.679288,0.521805,0.715577,0.73164,0.725521,1029.0,840.0,540.0,2132.0
50%,0.742119,0.703438,0.703438,0.545639,0.725406,0.805422,0.763128,1076.0,896.0,567.0,2347.0
75%,0.971341,0.704871,0.704871,0.574037,0.736181,0.814688,0.770315,1132.0,943.0,782.0,2374.0
max,1.0,0.71756,0.71756,0.63286,0.745048,0.856898,0.775948,1248.0,1133.0,967.0,2497.0


ANALYSIS

Our best model for this dataset was the first Gradient Boost model found in the tuning portion (`gboost2`), with an accuracy score of 0.718. This was only slightly better than the trial GBoost and tuned Ada Boost models, which had accuracy scores of 0.715 and 0.713, respectively. Among these three, the tuned Gradient Boost model also had the highest specificity (0.595). Our overall mean accuracy was 0.688, but the average accuracy of just the tuned models was 0.703, which is better than our untuned average by about 0.03. The mean specificities were also close (0.556 tuned vs. 0.541 untuned), with the tuned models again coming out ahead.

All of that said, none of the tuned or untuned scores met our target accuracy of 0.80. The use of a model to predict recidivism, and its potential implications, is something we've been hyper-aware of throughout our analysis. Because these predictions (depending on how, when, and where they are used) can affect people's lives, we believe any model that cannot predict with at least 80% accuracy should not be used in a real-life setting. With that in mind, we would not move forward with any of the models listed above without significant further optimization.
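
The comparison against the 0.80 target can be made explicit. This is a minimal sketch using the top-5 GA test scores copied from the table above; the constant `TARGET_ACCURACY` and the `shortfall` and `meets_target` columns are names we introduce for illustration.

```python
import pandas as pd

TARGET_ACCURACY = 0.80  # target accuracy stated in the analysis above

# Top-5 GA rows copied from the table above
ga_top5 = pd.DataFrame({
    "model_used": ["gboost2", "gboost", "ada3", "ada", "rf3"],
    "test_score": [0.71756, 0.714695, 0.713058, 0.71142, 0.704871],
})

# How far each model falls below the target, and whether it clears it
ga_top5["shortfall"] = TARGET_ACCURACY - ga_top5["test_score"]
ga_top5["meets_target"] = ga_top5["test_score"] >= TARGET_ACCURACY

print(ga_top5)
```

Even the best GA model falls about 0.08 short of the target.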

---
##### Overall Best Model Evaluation

In [30]:
# overall evaluation - top 5
stats.sort_values(by='accuracy', ascending=False).head()[['dataset_name'] + presentation_cols]

Unnamed: 0,dataset_name,model_used,train_score,test_score,specificity
17,FL,grad,0.886566,0.881635,0.871616
15,FL,rf,0.891616,0.877877,0.860789
16,FL,ada,0.863786,0.866134,0.859242
14,FL,bag,0.996477,0.863786,0.878577
18,FL,fnn1,0.852513,0.854392,0.861562


ANALYSIS

- As we would expect, the criminal history (FL) dataset's models take all of the top 5 slots across the 3 datasets.
- This bears out the adage that "The best predictor of future behavior is past behavior."
- However, while the criminal history model may provide the most accurate predictions, it is very difficult to make policy recommendations on this basis: by the time this model becomes useful, someone has already committed one or more crimes.
- We will use learnings from the Exploratory Data Analysis, especially from the behavioral (GA) dataset, to make policy recommendations, even though it did not produce the most accurate model.
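
Because the global top 5 is made up entirely of FL models, a per-dataset view is useful for comparison. The sketch below uses the top two rows for each dataset, copied from the tables above, in a small hand-built frame named `sample`. We would expect the same `groupby`/`idxmax` pattern to work on the full `stats` frame loaded earlier, though that has not been run here.

```python
import pandas as pd

# Top two models per dataset, copied from the tables above
sample = pd.DataFrame({
    "dataset_name": ["NY", "NY", "FL", "FL", "GA", "GA"],
    "model_used": ["fnn gen/age/cty", "ada gen/age/cty",
                   "grad", "rf", "gboost2", "gboost"],
    "accuracy": [0.600451, 0.600027, 0.881635, 0.877877, 0.71756, 0.714695],
})

# Index label of the most accurate row within each dataset
best_idx = sample.groupby("dataset_name")["accuracy"].idxmax()

# One row per dataset, ordered from most to least accurate
best_per_dataset = sample.loc[best_idx].sort_values(by="accuracy", ascending=False)
print(best_per_dataset)
```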

---
##### Conclusions & Recommendations
- As noted above, the criminal history (FL) model does bear out the idea that useful models can be created to predict recidivism, given its 88% accuracy rate.
- In general, implicit bias can be a danger in this domain of modeling. We would recommend excluding features such as age, race, or gender, in favor of features oriented towards observing patterns of behavior (such as gang affiliation, steadiness of employment, and education level), along with some inclusion of criminal history.
- Increasing employment opportunities for prisoners and parolees is associated with a clear reduction in recidivism; other studies show that involvement in employment programs can reduce recidivism by as much as 60%.
- Targeting gang affiliations is another solid strategy based on our data exploration - gang affiliates were up to twice as likely to be rearrested as others. There are nonprofit organizations whose focus is transitioning prisoners with gang affiliations back into the mainstream of their community.
- Educational programs and support are key - our analysis showed up to a 20% decrease in recidivism with increased levels of education.
- Mental health and substance abuse programming can be a solid mitigator as well, since these conditions affect a large portion of the incarcerated population. Individuals with substance and/or mental health-related conditions are about twice as likely to be rearrested as others, and the majority of incarcerated individuals don't seek out or receive this type of help.


---
**NEXT STEPS**:
- Further modeling efforts should head in the direction of finding the best possible combination of behavioral and criminal history features.
- Given the fairly surprising success of the NLP experiment within the criminal history (FL) dataset, further investigation is warranted on understanding the relationship between the type of charges levied against first-time offenders and their eventual trajectory through the criminal justice system.
- We ended up with a few models that had solidly high specificity, indicating a fairly low incidence of false positives (people who did not reoffend but were predicted to). With further research, it might be possible to use similar models to inform decisions about parole, supervision, and treatment programs. However, we would not recommend using such models in sentencing at this time.
- These 3 models were each confined to a particular state (or even county) jurisdiction, and as such may not generalize well to other parts of the country. Although it would be ideal to find a model applicable across the U.S. (or even the world), care must be taken to understand regional differences before applying such a model on a larger scale.