### Group Project 4: Comparing 3 Models for Predicting Recidivism

For background on this project, please see the [README](../README.md).

**Notebooks**
- [Data Acquisition & Cleaning](./01_data_acq_clean.ipynb)
- [Exploratory Data Analysis](./02_eda.ipynb)
- [Modeling](./03_modeling.ipynb)
- [Experiments](./03a_experiments.ipynb)
- Results and Recommendations (this notebook)

**In this notebook, you'll find:**
- TODO etc.

In [1]:
import pandas as pd

In [2]:
stats = pd.read_csv('../data/model_stats.csv')
stats.head()

| | dataset_name | model_used | model_params | train_score | test_score | accuracy | specificity | precision | recall | f1 score | true_neg | false_pos | false_neg | true_pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NY | logreg gen/age | {'C': 1.0, 'class_weight': None, 'dual': False... | 0.586841 | 0.587914 | 0.587914 | 0.970683 | 0.561523 | 0.052514 | 0.096047 | 21356 | 645 | 14903 | 826 |
| 1 | NY | rf gen/age | {'bootstrap': True, 'ccp_alpha': 0.0, 'class_w... | 0.588332 | 0.587013 | 0.587013 | 0.918049 | 0.519584 | 0.123975 | 0.200185 | 20198 | 1803 | 13779 | 1950 |
| 2 | NY | ada gen/age | {'algorithm': 'SAMME.R', 'base_estimator__ccp_... | 0.588332 | 0.587013 | 0.587013 | 0.918049 | 0.519584 | 0.123975 | 0.200185 | 20198 | 1803 | 13779 | 1950 |
| 3 | NY | grad gen/age | {'ccp_alpha': 0.0, 'criterion': 'friedman_mse'... | 0.588325 | 0.586986 | 0.586986 | 0.918186 | 0.519487 | 0.123721 | 0.199846 | 20201 | 1800 | 13783 | 1946 |
| 4 | NY | stack gen/age | {'cv': None, 'estimators': [('random_forest', ... | 0.588325 | 0.586986 | 0.586986 | 0.918186 | 0.519487 | 0.123721 | 0.199846 | 20201 | 1800 | 13783 | 1946 |
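
The summary metrics in this table can be recomputed from the four confusion-matrix counts. The sketch below is not the code that produced `model_stats.csv`; the helper name `confusion_metrics` is hypothetical, and it assumes the standard definitions with recidivism as the positive class.

```python
def confusion_metrics(true_neg, false_pos, false_neg, true_pos):
    """Derive summary metrics from raw confusion-matrix counts (standard definitions)."""
    total = true_neg + false_pos + false_neg + true_pos
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return {
        'accuracy': (true_pos + true_neg) / total,
        'specificity': true_neg / (true_neg + false_pos),
        'precision': precision,
        'recall': recall,
        'f1 score': 2 * precision * recall / (precision + recall),
    }

# Counts copied from row 0 of the table above (NY, logreg gen/age)
row0 = confusion_metrics(true_neg=21356, false_pos=645, false_neg=14903, true_pos=826)
for metric, value in row0.items():
    print(f'{metric}: {value:.6f}')
```

Because the function uses only arithmetic, it should also accept the `true_neg`, `false_pos`, `false_neg`, and `true_pos` columns of `stats` directly and return one Series per metric.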


In [11]:
presentation_cols = ['model_used', 'train_score', 'test_score', 'specificity']

In [12]:
# NY evaluation - top 5
stats[stats['dataset_name'] == 'NY'].sort_values(by='accuracy', ascending=False).head()[presentation_cols]

| | model_used | train_score | test_score | specificity |
|---|---|---|---|---|
| 11 | fnn gen/age/cty | 0.599808 | 0.600451 | 0.884096 |
| 8 | ada gen/age/cty | 0.599569 | 0.600027 | 0.870779 |
| 10 | stack gen/age/cty | 0.600981 | 0.599867 | 0.876233 |
| 6 | logreg gen/age/cty | 0.598734 | 0.599019 | 0.876778 |
| 9 | grad gen/age/cty | 0.59894 | 0.598834 | 0.918276 |


In [None]:
# NY evaluation - describe
stats[stats['dataset_name'] == 'NY'].describe()

##### CONCLUSIONS
- Using a basic model that considers only age at release, gender, and county of indictment does not provide sufficient improvement over the baseline.
- We used six different classification models on the gender, age, and county features; however, our best score was only a 2-percentage-point increase in accuracy, from a baseline of 58% to 60%, using a stacked RF and gradient-boosted model.
- As these scores failed to meet our target requirements, we will move on to the demographic and behavioral datasets for further model evaluation.
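
As a rough check on the baseline figure, the majority-class share can be recovered from the confusion-matrix counts reported for the NY models. This is a sketch under two assumptions: that the baseline means "always predict the majority class", and that the gen/age/cty models were scored on the same test split as row 0 of `stats`. The helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(true_neg, false_pos, false_neg, true_pos):
    """Accuracy of always predicting the more common class."""
    negatives = true_neg + false_pos   # actual non-recidivists
    positives = false_neg + true_pos   # actual recidivists
    return max(negatives, positives) / (negatives + positives)

# Counts from the NY logreg gen/age row of model_stats.csv
baseline = majority_baseline(true_neg=21356, false_pos=645, false_neg=14903, true_pos=826)

# Highest NY test_score in the top-5 table (fnn gen/age/cty)
best_test_score = 0.600451
lift = best_test_score - baseline

print(f'baseline accuracy: {baseline:.4f}')
print(f'lift over baseline: {lift:.4f}')
```

Under these assumptions the baseline comes out at roughly 0.583, so the lift is a little under two percentage points.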

In [13]:
# FL evaluation - top 5
stats[stats['dataset_name'] == 'FL'].sort_values(by='accuracy', ascending=False).head()[presentation_cols]

| | model_used | train_score | test_score | specificity |
|---|---|---|---|---|
| 17 | grad | 0.886566 | 0.881635 | 0.871616 |
| 15 | rf | 0.891616 | 0.877877 | 0.860789 |
| 16 | ada | 0.863786 | 0.866134 | 0.859242 |
| 14 | bag | 0.996477 | 0.863786 | 0.878577 |
| 18 | fnn1 | 0.852513 | 0.854392 | 0.861562 |


In [None]:
# FL evaluation - describe
stats[stats['dataset_name'] == 'FL'].describe()

##### CONCLUSIONS
- An L2-regularized neural network with early stopping (at roughly 62 epochs) appears to give us the best combination of high accuracy and low variance.
- The recall, precision, and F1 for this model are also on par with, or better than, those of the other models.
- This will be our production model for Florida.
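
One simple way to read "low variance" from the table is the gap between train and test scores. The sketch below uses the five Florida rows shown above and assumes that `fnn1` is the neural network described in these conclusions; the dictionary name `fl_scores` is illustrative only.

```python
# (train_score, test_score) pairs copied from the FL top-5 table
fl_scores = {
    'grad': (0.886566, 0.881635),
    'rf': (0.891616, 0.877877),
    'ada': (0.863786, 0.866134),
    'bag': (0.996477, 0.863786),
    'fnn1': (0.852513, 0.854392),
}

# A large positive gap suggests overfitting; a gap near zero suggests the model generalizes
gaps = {name: train - test for name, (train, test) in fl_scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: abs(item[1])):
    print(f'{name:>5}: train - test = {gap:+.4f}')
```

By this measure `fnn1` has the smallest absolute gap of the five, while `bag` fits the training data almost perfectly and loses about 13 points on the test set.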

In [14]:
# GA evaluation - top 5
stats[stats['dataset_name'] == 'GA'].sort_values(by='accuracy', ascending=False).head()[presentation_cols]

| | model_used | train_score | test_score | specificity |
|---|---|---|---|---|
| 31 | gboost2 | 0.753582 | 0.71756 | 0.595335 |
| 30 | gboost | 0.739048 | 0.714695 | 0.567444 |
| 35 | ada3 | 0.728454 | 0.713058 | 0.574544 |
| 34 | ada | 0.727175 | 0.71142 | 0.574037 |
| 33 | rf3 | 0.745599 | 0.704871 | 0.480223 |


In [15]:
# GA evaluation - describe
stats[stats['dataset_name'] == 'GA'].describe()

##### CONCLUSIONS

Our best model overall was the first Gradient Boost model found in the trials portion, with an accuracy of 0.715. This was only slightly better than the tuned AdaBoost model, at 0.713, and of the two, AdaBoost had the higher specificity. Our overall mean accuracy was 0.686, while the mean accuracy of the tuned models alone was 0.703, about 0.03 better than the untuned average. The mean specificities were closer (0.555 vs. 0.544), but the tuned models came out ahead here as well.

All of that said, none of the tuned or untuned scores met our target accuracy of 0.80. The use of a model to predict recidivism, and the potential implications of doing so, is something we've been hyper-aware of throughout our analysis. Because these predictions, depending on how, when, and where they are used, have the ability to impact people's lives, we believe any model that cannot predict with at least 80% accuracy should not be used in a real-life setting. With that, we would not move forward with any of the models listed above without significant further optimization.
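
The 0.80 threshold can be checked mechanically against the best test score per dataset. This is a minimal sketch: the scores are copied from the top row of each top-5 table in this notebook, and applying the same 0.80 target to all three datasets is an assumption, since the figure is stated explicitly only for Georgia.

```python
TARGET_ACCURACY = 0.80  # target accuracy stated in the conclusions above

# Best test_score per dataset, copied from the top-5 tables in this notebook
best_test_score = {
    'NY': 0.600451,  # fnn gen/age/cty
    'FL': 0.881635,  # grad
    'GA': 0.717560,  # gboost2
}

meets_target = {
    dataset: score >= TARGET_ACCURACY
    for dataset, score in best_test_score.items()
}

for dataset, ok in meets_target.items():
    status = 'meets' if ok else 'falls short of'
    print(f'{dataset}: best test accuracy {best_test_score[dataset]:.3f} {status} the target')
```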

**FINAL NOTES/NEXT STEPS**:
- TODO