### Group Project 4: Comparing 3 Models for Predicting Recidivism

For background on this project, please see the [README](../README.md).

**Notebooks**
- [Data Acquisition & Cleaning](./01_data_acq_clean.ipynb)
- [Exploratory Data Analysis](./02_eda.ipynb)
- Modeling (this notebook)
- [Results and Recommendations](./04_results.ipynb)

**In this notebook, you'll find:**
- Classification models for each of the 3 datasets
- A comparison of train and test accuracy for each model against the baseline

**Model 1: Base feature set - New York**

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [2]:
%store -r ny_df

In [3]:
#Define X and y, then split the data into train and test sets

X = ny_df[['gender_map', 'Age at Release']]
y = ny_df['recidivism']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)
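
The `stratify = y` argument keeps the class balance the same in both splits, and the share of the majority class is the natural accuracy baseline to compare models against. The sketch below checks both on synthetic stand-in data. The toy frame, its 58/42 label split, and the `majority_share` helper are assumptions made for illustration and are not taken from `ny_df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ny_df (assumption): 1000 rows, 58% positive labels
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'gender_map': rng.integers(0, 2, size=1000),
    'Age at Release': rng.integers(18, 70, size=1000),
    'recidivism': np.repeat([1, 0], [580, 420]),
})

X_toy = toy[['gender_map', 'Age at Release']]
y_toy = toy['recidivism']

# Same call pattern as the notebook: default 75/25 split, stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)

def majority_share(labels):
    """Accuracy obtained by always predicting the most common class."""
    return labels.value_counts(normalize=True).max()

print(majority_share(y_tr), majority_share(y_te))
```

Computing the baseline this way avoids hard-coding a value that could drift if the data changes.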

##### Basic Models: Two Features (Gender and Age)

In [4]:
#Logistic Regression, Instantiate and Fit Model
lr = LogisticRegression()
lr.fit(X_train, y_train)

#Save train and test accuracy for model evaluation
log_train = lr.score(X_train, y_train)
log_test = lr.score(X_test, y_test)

In [5]:
#Random Forest, Instantiate and Fit
rf= RandomForestClassifier(random_state = 42)
rf.fit(X_train, y_train)

#Save train and test accuracy for model eval
rf_train = rf.score(X_train, y_train)
rf_test = rf.score(X_test, y_test)

In [6]:
#Ada Boost
ada= AdaBoostClassifier(base_estimator = DecisionTreeClassifier(random_state = 42))
ada.fit(X_train, y_train)

ada_train = ada.score(X_train, y_train)
ada_test = ada.score(X_test, y_test)

In [7]:
#Gradient Boost
gboost= GradientBoostingClassifier(random_state = 42)
gboost.fit(X_train, y_train)

gb_train = gboost.score(X_train, y_train)
gb_test = gboost.score(X_test, y_test)

In [8]:
#Stacked Model
level1_estimators = [
    ('random_forest', RandomForestClassifier()),
    ('gboost', GradientBoostingClassifier()),
    ('ada', AdaBoostClassifier())
]

stacked_model = StackingClassifier(estimators=level1_estimators,
                                 final_estimator = LogisticRegression())

stacked_model.fit(X_train, y_train)

#Save train and test accuracy for model eval
stack_train = stacked_model.score(X_train, y_train)
stack_test = stacked_model.score(X_test, y_test)

In [9]:
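#KNN is distance-based, so fit and score it on the scaled features
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train_sc, y_train)

#Save train and test accuracy for model eval
#The knn scores in the evaluation table come from a run on the unscaled features; rerun to refresh them
knn_train = knn.score(X_train_sc, y_train)
knn_test = knn.score(X_test_sc, y_test)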
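
Wrapping the scaler and the classifier in a `Pipeline` is one way to make sure KNN is always fit and scored on scaled features, since the scaling step can no longer be skipped by accident. The sketch below uses toy data in which one feature carries the signal and the other is large-scale noise. The data and the names `knn_pipe`, `pipe_acc`, and `raw_acc` are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): column 0 determines the label, column 1 is noise on a much larger scale
rng = np.random.default_rng(0)
signal = rng.normal(size=400)
noise = rng.normal(scale=1000, size=400)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

# The pipeline scales inside fit/score, so the classifier never sees raw features
knn_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])
knn_pipe.fit(X_demo[:300], y_demo[:300])
pipe_acc = knn_pipe.score(X_demo[300:], y_demo[300:])

# Same classifier without scaling, for comparison
raw_knn = KNeighborsClassifier().fit(X_demo[:300], y_demo[:300])
raw_acc = raw_knn.score(X_demo[300:], y_demo[300:])

print(pipe_acc, raw_acc)
```

How much scaling helps depends on the features; with gender and age only, the effect on the real data may be small.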

##### Models Including County, Gender, Age

In [10]:
#Creating a Dataframe including dummies of the county of indictment
df_dummies = pd.get_dummies(ny_df['County of Indictment'])
df_dummies['gender'] = ny_df['gender_map']
df_dummies['recidivism'] = ny_df['recidivism']
df_dummies['age'] = ny_df['Age at Release']

#Define X and y with dummy data
X = df_dummies.drop(columns = ['recidivism'])
y = df_dummies['recidivism']

#Train test split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, random_state =42, stratify = y)

In [11]:
#Logistic Regression Model
lr2 = LogisticRegression()
lr2.fit(X_train2, y_train2)

#Storing Accuracy Scores
log_train2 = lr2.score(X_train2, y_train2)
log_test2 = lr2.score(X_test2, y_test2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
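
The warning above suggests two remedies: scale the features or raise `max_iter`. A minimal sketch that applies both through a `Pipeline` is shown here on toy data. The data, the value `max_iter=1000`, and the name `lr_pipe` are assumptions rather than settings used in this notebook, and either change could shift the recorded scores slightly.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): an age-like column and a binary flag with a noisy label
rng = np.random.default_rng(1)
age = rng.integers(18, 70, size=500).astype(float)
flag = rng.integers(0, 2, size=500).astype(float)
X_demo = np.column_stack([age, flag])
y_demo = (age + 10 * flag + rng.normal(scale=10, size=500) > 50).astype(int)

lr_pipe = Pipeline([
    ('ss', StandardScaler()),                   # put features on a comparable scale
    ('lr', LogisticRegression(max_iter=1000)),  # allow more solver iterations
])

with warnings.catch_warnings():
    # Turn a convergence warning into an error so a failure cannot go unnoticed
    warnings.simplefilter('error', ConvergenceWarning)
    lr_pipe.fit(X_demo, y_demo)

# Number of iterations the solver actually used
n_iter = lr_pipe.named_steps['lr'].n_iter_[0]
print(n_iter)
```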


In [12]:
#Random Forest with Hyperparameter tuning
rf2= RandomForestClassifier(random_state = 42)
rf_params = {
    'n_estimators': [120, 130],
    'max_depth': [15,20],
    'max_features' : [15,20],
    'n_jobs': [-1]
}

gs1 = GridSearchCV(rf2, param_grid= rf_params, cv = 3, n_jobs = -1)

gs1.fit(X_train2, y_train2)

#Storing Accuracy Scores
rf_train2 = gs1.score(X_train2, y_train2)
rf_test2 = gs1.score(X_test2, y_test2)
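
After a grid search it is usually worth checking which combination won and what its cross-validated score was, since `score` on the training set reflects the refit best estimator rather than the cross-validation result. The sketch below shows this on a small synthetic problem. The data, the reduced grid, and the name `gs_demo` are assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic classification problem (assumption, not the New York data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A deliberately tiny grid; parallelism would be set once on the search, not listed as a grid entry
demo_params = {
    'n_estimators': [10, 20],
    'max_depth': [2, 4],
}

gs_demo = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid=demo_params, cv=3)
gs_demo.fit(X_demo, y_demo)

# Winning hyperparameters and their mean cross-validated accuracy
print(gs_demo.best_params_)
print(round(gs_demo.best_score_, 3))
```

If the best value sits at the edge of a grid, widening the grid in that direction is a common next step.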

In [13]:
#Ada Boost with Hyperparam tuning
ada= AdaBoostClassifier(base_estimator = DecisionTreeClassifier(random_state = 42))

ada_params= {
    'n_estimators': [50, 100],
    'base_estimator__max_depth':[1,3]
}

gs2 = GridSearchCV(ada, param_grid = ada_params, cv = 3, n_jobs = -1)
gs2.fit(X_train2, y_train2)

#Storing Accuracy Scores
ada_train2 = gs2.score(X_train2, y_train2)
ada_test2 = gs2.score(X_test2, y_test2)

In [14]:
#Gradient Boost
gboost2= GradientBoostingClassifier(random_state = 42)

gboost2.fit(X_train2, y_train2)


#Storing Accuracy Scores
gboost_train2 = gboost2.score(X_train2, y_train2)
gboost_test2 = gboost2.score(X_test2, y_test2)

In [15]:
#Stacked Model
level1_estimators = [
    ('random_forest', RandomForestClassifier()),
    ('gboost', GradientBoostingClassifier())
]

stacked_model2 = StackingClassifier(estimators=level1_estimators,
                                 final_estimator = LogisticRegression())

stacked_model2.fit(X_train2, y_train2)

#Storing Accuracy Scores
stack_train2 = stacked_model2.score(X_train2, y_train2)
stack_test2 = stacked_model2.score(X_test2, y_test2)

In [18]:
X_train2.shape

(141487, 65)

In [19]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train2)
X_test_sc = ss.transform(X_test2)

In [23]:
# Instantiate a feed-forward neural network (dense layers only, no convolutions).
model = Sequential()

model.add(Dense(84, input_shape = (65,), activation = 'relu'))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(32, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

# Compile it
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

# Fit it
res = model.fit(X_train_sc, y_train2,
                epochs = 10,
                batch_size = 32,
                validation_data = (X_test_sc, y_test2),
                verbose = 0) # No printing of our output

In [35]:
#Save Accuracy Scores
train_acc = res.history['accuracy']
test_acc = res.history['val_accuracy']

#Best accuracy reached in any epoch
neuralnet_train = max(train_acc)
neuralnet_test = max(test_acc)
neuralnet_test

0.5997921824455261
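
Taking `max` over the epochs reports the best epoch rather than the model as it stands after training, and picking that epoch by validation accuracy uses the test split for selection. The sketch below contrasts the two readings. The `history` dict is a hypothetical stand-in for `res.history`, and its values are made up for illustration.

```python
# Hypothetical per-epoch accuracies standing in for res.history (illustrative values only)
history = {
    'accuracy':     [0.590, 0.596, 0.598, 0.599, 0.598],
    'val_accuracy': [0.592, 0.597, 0.600, 0.598, 0.599],
}

# Accuracy of the model after the last epoch, which is the model that would actually be used
final_train = history['accuracy'][-1]
final_val = history['val_accuracy'][-1]

# Epoch (1-indexed) with the highest validation accuracy, for comparison
val_acc = history['val_accuracy']
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

print(final_train, final_val, best_epoch)
```

When the gap between the two readings is as small as it is here, the choice matters little, but reporting the final-epoch value is the more conservative option.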

##### Model Evaluation

In [16]:
#Model Evaluation No Dummies
train_scores = [log_train, rf_train, ada_train, gb_train, stack_train, knn_train]
eval_df = pd.DataFrame(train_scores, columns = ['train_scores'])

eval_df['test_scores'] = [log_test, rf_test, ada_test, gb_test, stack_test, knn_test]
eval_df.index = ['logreg', 'rf', 'ada', 'gb', 'stack', 'knn']
eval_df['diff'] = eval_df['train_scores'] - eval_df['test_scores']
eval_df['baseline'] = .58

eval_df

| model | train_scores | test_scores | diff | baseline |
|---|---|---|---|---|
| logreg | 0.586831 | 0.587728 | -0.000896 | 0.58 |
| rf | 0.588386 | 0.58707 | 0.001316 | 0.58 |
| ada | 0.588386 | 0.58707 | 0.001316 | 0.58 |
| gb | 0.588386 | 0.58707 | 0.001316 | 0.58 |
| stack | 0.588386 | 0.58707 | 0.001316 | 0.58 |
| knn | 0.534304 | 0.534614 | -0.00031 | 0.58 |


In [36]:
#Model Evaluation With Dummies
train_scores2 = [log_train2, rf_train2, ada_train2, gboost_train2, stack_train2, neuralnet_train]
eval_df2 = pd.DataFrame(train_scores2, columns = ['train_scores'])

eval_df2['test_scores'] = [log_test2, rf_test2, ada_test2, gboost_test2, stack_test2, neuralnet_test]
eval_df2.index = ['logreg', 'rf', 'ada', 'gb', 'stack', 'neuralnet']
eval_df2['diff'] = eval_df2['train_scores'] - eval_df2['test_scores']
eval_df2['baseline'] = .58

eval_df2

| model | train_scores | test_scores | diff | baseline |
|---|---|---|---|---|
| logreg | 0.598776 | 0.599262 | -0.000486 | 0.58 |
| rf | 0.603236 | 0.597714 | 0.005521 | 0.58 |
| ada | 0.599362 | 0.59922 | 0.000143 | 0.58 |
| gb | 0.598988 | 0.598711 | 0.000277 | 0.58 |
| stack | 0.600981 | 0.600301 | 0.00068 | 0.58 |
| neuralnet | 0.598945 | 0.599792 | -0.000847 | 0.58 |
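
The two evaluation tables could be combined into one frame and written to a CSV such as the model statistics file linked in the final notes. The sketch below is one possible approach, not the export code actually used. The toy frames, the `feature_set` column, and the temporary output path are assumptions; in the notebook the inputs would be `eval_df` and `eval_df2` and the path would be the one under `../data/`.

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for eval_df and eval_df2 (values are illustrative only)
eval_a = pd.DataFrame({'train_scores': [0.59, 0.53], 'test_scores': [0.59, 0.53]},
                      index=['logreg', 'knn'])
eval_b = pd.DataFrame({'train_scores': [0.60], 'test_scores': [0.60]},
                      index=['logreg'])

# Label each table with its feature set, stack them, and turn the model names into a column
model_stats = pd.concat([
    eval_a.assign(feature_set='gender_age'),
    eval_b.assign(feature_set='county_gender_age'),
]).rename_axis('model').reset_index()

# Write to a temporary directory so the sketch runs anywhere
out_path = os.path.join(tempfile.mkdtemp(), 'model_stats.csv')
model_stats.to_csv(out_path, index=False)

print(model_stats)
```

Keeping the feature set as its own column lets the same model name appear once per feature set without ambiguity.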


**Model 2: Criminal history feature set - Florida**

**Model 3: Behavioral feature set - Georgia**

**Final selected models for each dataset**

**FINAL NOTES**
- The model statistics are exported [here](../data/model_stats.csv).
- The next notebook in the series is [Results and Recommendations](./04_results.ipynb).