<a href="https://colab.research.google.com/github/bintualkassoum/fair-ml-project/blob/main/Fair_Machine_Learning_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Project Description 

#Data
The Challenge uses data on roughly 26,000 individuals from the State of Georgia released from Georgia prisons on discretionary parole to the custody of the Georgia Department of Community Supervision (GDCS) for the purpose of post-incarceration supervision between January 1, 2013 and December 31, 2015. 

The dataset is split into training and test sets using a 70/30 split: 70% of the records are in the training dataset and 30% are in the test dataset.
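
The challenge files arrive already split, so the notebook does not need to split anything itself. As a minimal sketch of what a 70/30 split looks like, assuming scikit-learn's `train_test_split` and a small made-up frame (`toy` is a hypothetical stand-in, not the challenge data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the full dataset: 10 rows with an ID column
toy = pd.DataFrame({"ID": range(1, 11)})

# test_size=0.3 sends 30% of the rows to the test set;
# random_state fixes the shuffle so the split is repeatable
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)

print(len(toy_train), len(toy_test))
```

The two pieces are disjoint, so no individual appears in both sets.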

##Training Set 
The training dataset includes the four dichotomous dependent variables measuring if an individual recidivated in the three-year follow-up period (yes/no) as well as recidivated by time period (year 1, year 2, or year 3). 

##Test Sets
The second test set contains only those individuals who did not recidivate in year 1, while the third contains only those who did not recidivate in year 1 or year 2.
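
The filtering described above can be sketched with boolean masks. This is an illustration on a made-up frame, not the challenge's own procedure, and it assumes the `Recidivism_Arrest_Year1` and `Recidivism_Arrest_Year2` columns hold booleans:

```python
import pandas as pd

# Hypothetical labeled records (the released test files omit these labels)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Recidivism_Arrest_Year1": [True, False, False, False],
    "Recidivism_Arrest_Year2": [False, True, False, False],
})

# Second set: individuals with no arrest in year 1
no_y1 = toy[~toy["Recidivism_Arrest_Year1"]]

# Third set: individuals with no arrest in year 1 or year 2
no_y1_y2 = toy[~toy["Recidivism_Arrest_Year1"] & ~toy["Recidivism_Arrest_Year2"]]

print(no_y1["ID"].tolist())
print(no_y1_y2["ID"].tolist())
```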

##Recidivism Measurement 
Recidivism is measured as an arrest for a new felony or misdemeanor crime within three years of the supervision start date. The test dataset does not include the four dependent variables. 

##Additional Info
Arrest episodes with multiple charges are described in this data by the most serious charge. The exceptions are the criminal history domestic violence and gun charge counts, which include all charges across all episodes. The recidivism measure also comes from GCIC data and is defined as a new felony or misdemeanor arrest episode within three years of the parole supervision start date.
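
Given that definition, one sanity check is that the three-year flag should be true exactly when at least one yearly flag is true. The sketch below runs that check on made-up rows. The relationship between the four columns is an assumption drawn from the descriptions above, not something the challenge documentation is quoted as guaranteeing:

```python
import pandas as pd

# Hypothetical rows using the four dependent-variable column names
toy = pd.DataFrame({
    "Recidivism_Arrest_Year1": [True, False, False, False],
    "Recidivism_Arrest_Year2": [False, True, False, False],
    "Recidivism_Arrest_Year3": [False, False, True, False],
    "Recidivism_Within_3years": [True, True, True, False],
})

year_cols = ["Recidivism_Arrest_Year1", "Recidivism_Arrest_Year2", "Recidivism_Arrest_Year3"]

# True where any of the yearly flags is set
any_year = toy[year_cols].any(axis=1)

# Rows where the yearly flags and the three-year flag disagree
mismatches = toy[any_year != toy["Recidivism_Within_3years"]]
print(len(mismatches))
```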

#Data Pre-Processing

##Import Packages

In [None]:
# Import Packages

%matplotlib inline

import pandas as pd
import numpy as np
import sklearn 
from sklearn import preprocessing 

# Visualizations 
import matplotlib.pyplot as plt
import seaborn as sns

# Standard plotly imports
! pip install plotly 
! pip install chart-studio
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

# Using plotly + cufflinks in offline mode
! pip install cufflinks --upgrade
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import plotly.io as pio
pio.renderers.default = "colab"

In [158]:
# Load Data 
from google.colab import drive 

drive.mount('/content/drive')

full = pd.read_csv('/content/drive/MyDrive/Grad School/Misc/Fair Machine Learning (COMPAS)/Data/NIJ_s_Recidivism_Challenge_Full_Dataset.csv')

train = pd.read_csv('/content/drive/MyDrive/Grad School/Misc/Fair Machine Learning (COMPAS)/Data/NIJ_s_Recidivism_Challenge_Training_Dataset.csv')

test1 = pd.read_csv('/content/drive/MyDrive/Grad School/Misc/Fair Machine Learning (COMPAS)/Data/NIJ_s_Recidivism_Challenge_Test_Dataset1.csv')

test2 = pd.read_csv('/content/drive/MyDrive/Grad School/Misc/Fair Machine Learning (COMPAS)/Data/NIJ_s_Recidivism_Challenge_Test_Dataset2.csv') 

test3 = pd.read_csv('/content/drive/MyDrive/Grad School/Misc/Fair Machine Learning (COMPAS)/Data/NIJ_s_Recidivism_Challenge_Test_Dataset3.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#Exploratory Analysis - Raw Data 

## Analyze Data 

In [None]:
full.head()

In [None]:
# Check Data Shape 
print(full.shape)

print(train.shape)

print(test1.shape)

print(test2.shape)

print(test3.shape)

In [None]:
# Data Stats 
full.describe()

In [None]:
# Check Data Types 
print(full.dtypes)

In [163]:
# Inspect Unique Values - Identifier Column 
full['ID'].unique()

array([    1,     2,     3, ..., 26759, 26760, 26761])

In [164]:
# Inspect Unique Values - Categorical Variable 
full['Race'].unique()

array(['BLACK', 'WHITE'], dtype=object)

In [165]:
# Inspect Unique Values - Numeric Score (contains NaN) 
full['Supervision_Risk_Score_First'].unique()

array([ 3.,  6.,  7.,  4.,  5.,  2.,  1., 10.,  9.,  8., nan])

##Visualize Data

In [None]:
# Age at Release Histogram
full['Age_at_Release'].iplot(kind='hist', xTitle='Age', yTitle='count', title='Age at Release')

In [None]:
# Gender Box Plot
full.pivot(columns='Gender', values='ID').iplot(
        kind='box',
        yTitle='Number of Inmates',
        title='Inmate Distribution by Gender')

In [None]:
# Race Box Plot
full.pivot(columns='Race', values='ID').iplot(
        kind='box',
        yTitle='Number of Inmates',
        title='Inmate Distribution by Race')

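In these box plots, the ID distributions look similar across the Race and Gender groups. Because the plots summarize ID values, they do not show how many individuals fall in each group.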
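
A box plot of `ID` by group shows how ID values are spread within each group, which says little about group sizes. One way to check balance directly is `value_counts(normalize=True)`. The frame below is made up for illustration and its proportions are not those of the challenge data:

```python
import pandas as pd

# Hypothetical records; values are placeholders, not real proportions
toy = pd.DataFrame({
    "Gender": ["M", "M", "M", "F"],
    "Race": ["BLACK", "WHITE", "BLACK", "WHITE"],
})

# Share of rows in each category, per column
shares = {col: toy[col].value_counts(normalize=True) for col in ["Gender", "Race"]}

for col, share in shares.items():
    print(col)
    print(share)
```

On the real data, the same call on `full['Gender']` and `full['Race']` would give the group shares.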

In [None]:
# Prior Felony Arrests Box Plot 
full.pivot(columns='Prior_Arrest_Episodes_Felony', values='ID').iplot(
        kind='box',
        yTitle='Number of Inmates',
        title='Inmate Distribution by Prior Felony Arrests')

In [None]:
# Prior Misdemeanor Arrests Box Plot 
full.pivot(columns='Prior_Arrest_Episodes_Misd', values='ID').iplot(
        kind='box',
        yTitle='Number of Inmates',
        title='Inmate Distribution by Prior Misdemeanor Arrests')

Our dataset contains a significant number of people with no recorded prior arrest episodes in these columns.

#Data Cleaning 

In [None]:
# Inspect Missing Values 
full.isna().sum()

In [172]:
# Drop Drug Features 
full.drop(columns = ['Avg_Days_per_DrugTest', 'DrugTests_THC_Positive', 'DrugTests_Cocaine_Positive',
                    'DrugTests_Meth_Positive', 'DrugTests_Other_Positive'], inplace = True)

These drug-test features were dropped because they contain a significant number of null values.
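
The same decision can be made from the data rather than by listing columns by hand. The sketch below computes the share of nulls per column and drops those above a cutoff. The 0.5 `threshold` and the toy values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one sparse column and one mostly complete column
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Avg_Days_per_DrugTest": [np.nan, np.nan, 30.0, np.nan],
    "Supervision_Risk_Score_First": [3.0, np.nan, 6.0, 7.0],
})

# Fraction of missing values in each column
null_share = toy.isna().mean()

# Hypothetical cutoff: drop columns that are more than half empty
threshold = 0.5
too_sparse = null_share[null_share > threshold].index.tolist()

cleaned = toy.drop(columns=too_sparse)
print(too_sparse)
```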

#Feature Engineering

##One Hot Encoding 

In [None]:
# One-Hot Encoding - Categorical, Nonordered Variables 
one_hot_cols = ['Gender', 'Race', 'Gang_Affiliated', 'Age_at_Release',
                'Employment_Exempt', 'Recidivism_Within_3years', 'Recidivism_Arrest_Year1',
                'Recidivism_Arrest_Year2', 'Recidivism_Arrest_Year3']
# Encode only the listed columns so one_hot holds just the new dummy columns
one_hot = pd.get_dummies(data = full[one_hot_cols], columns = one_hot_cols, drop_first = True)
# Assign back to full so the encoded columns are kept
full = pd.concat([full, one_hot], axis = 1)
full.head()

In [None]:
# Drop Non Encoded Columns
full.drop(columns = one_hot_cols, inplace = True)
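
As a small illustration of what `pd.get_dummies` returns when given a whole frame and a `columns` list, the sketch below encodes one made-up column. With `drop_first = True`, the first category becomes the implicit baseline, and the source column is replaced by its dummy columns:

```python
import pandas as pd

# Hypothetical frame with a single two-level categorical column
toy = pd.DataFrame({"ID": [1, 2, 3], "Race": ["BLACK", "WHITE", "BLACK"]})

# 'Race' is replaced by dummy columns; the first level (BLACK) is dropped
encoded = pd.get_dummies(toy, columns=["Race"], drop_first=True)

print(encoded.columns.tolist())
```

A row with `Race_WHITE` equal to 0 (or `False`, depending on the pandas version) therefore represents the baseline category.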

In [None]:
# Demographic Data Encoding

# Create Encoder Instance 
le = preprocessing.LabelEncoder()

# Supervision Level Encoded 
full['Supervision_Level_First'] = le.fit_transform(full['Supervision_Level_First'])
full['Supervision_Level_First']

# Education Level Encoded
full['Education_Level'] = le.fit_transform(full['Education_Level'])
full['Education_Level']

# Number of Dependents Encoded
full['Dependents'] = le.fit_transform(full['Dependents'])
full['Dependents']
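
One caveat with `LabelEncoder` on ordered categories is that it assigns codes by sorted class value, which for strings means alphabetical order. The sketch below shows the effect on made-up count-style labels and contrasts it with an explicit order built with `pd.Categorical`. The label values, including `"10 or more"`, are assumptions for illustration and may not match the actual column values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up ordered labels, loosely modeled on count-style columns
values = pd.Series(["0", "1", "2", "10 or more", "2"])

# LabelEncoder sorts classes as strings, so "10 or more" lands before "2"
le = LabelEncoder()
alphabetical = le.fit_transform(values)

# An explicit category order keeps the codes aligned with the intended ranking
order = ["0", "1", "2", "10 or more"]
ordered = pd.Categorical(values, categories=order, ordered=True).codes

print(list(le.classes_))
print(alphabetical.tolist())
print(ordered.tolist())
```

If the numeric codes are meant to carry rank information, the explicit order is the safer choice.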

In [None]:
# Prison History Feature Encoding 

# Prison Offense Encoded
full['Prison_Offense'] = le.fit_transform(full['Prison_Offense'])
full['Prison_Offense']

# Prison Years Encoded
full['Prison_Years'] = le.fit_transform(full['Prison_Years'])
full['Prison_Years']

In [None]:
# Arrests Feature Encoding 

# Prior Misdemeanor Arrests Encoded
full['Prior_Arrest_Episodes_Misd'] = le.fit_transform(full['Prior_Arrest_Episodes_Misd'])
full['Prior_Arrest_Episodes_Misd']

# Prior Violent Arrests Encoded
full['Prior_Arrest_Episodes_Violent'] = le.fit_transform(full['Prior_Arrest_Episodes_Violent'])
full['Prior_Arrest_Episodes_Violent']

# Prior Property Arrests Encoded
full['Prior_Arrest_Episodes_Property'] = le.fit_transform(full['Prior_Arrest_Episodes_Property'])
full['Prior_Arrest_Episodes_Property']

# Prior Drug Arrests Encoded
full['Prior_Arrest_Episodes_Drug'] = le.fit_transform(full['Prior_Arrest_Episodes_Drug'])
full['Prior_Arrest_Episodes_Drug']

# Prior PPV Violation Arrests Encoded
full['Prior_Arrest_Episodes_PPViolationCharges'] = le.fit_transform(full['Prior_Arrest_Episodes_PPViolationCharges'])
full['Prior_Arrest_Episodes_PPViolationCharges']

# Prior DV Arrests Encoded
full['Prior_Arrest_Episodes_DVCharges'] = le.fit_transform(full['Prior_Arrest_Episodes_DVCharges'])
full['Prior_Arrest_Episodes_DVCharges']

# Prior Gun Arrests Encoded
full['Prior_Arrest_Episodes_GunCharges'] = le.fit_transform(full['Prior_Arrest_Episodes_GunCharges'])
full['Prior_Arrest_Episodes_GunCharges']

In [None]:
# Convictions Feature Encoding 

# Prior Felony Convictions Encoded
full['Prior_Conviction_Episodes_Felony'] = le.fit_transform(full['Prior_Conviction_Episodes_Felony'])
full['Prior_Conviction_Episodes_Felony']

# Prior Misdemeanor Convictions Encoded
full['Prior_Conviction_Episodes_Misd'] = le.fit_transform(full['Prior_Conviction_Episodes_Misd'])
full['Prior_Conviction_Episodes_Misd']

# Prior Violent Convictions Encoded
full['Prior_Conviction_Episodes_Viol'] = le.fit_transform(full['Prior_Conviction_Episodes_Viol'])
full['Prior_Conviction_Episodes_Viol']

# Prior Drug Convictions Encoded
full['Prior_Conviction_Episodes_Drug'] = le.fit_transform(full['Prior_Conviction_Episodes_Drug'])
full['Prior_Conviction_Episodes_Drug']

# Prior PPV Convictions Encoded
full['Prior_Conviction_Episodes_PPViolationCharges'] = le.fit_transform(full['Prior_Conviction_Episodes_PPViolationCharges'])
full['Prior_Conviction_Episodes_PPViolationCharges']

# Prior Domestic Violence Convictions Encoded
full['Prior_Conviction_Episodes_DomesticViolenceCharges'] = le.fit_transform(full['Prior_Conviction_Episodes_DomesticViolenceCharges'])
full['Prior_Conviction_Episodes_DomesticViolenceCharges']

# Prior Gun Convictions Encoded
full['Prior_Conviction_Episodes_GunCharges'] = le.fit_transform(full['Prior_Conviction_Episodes_GunCharges'])
full['Prior_Conviction_Episodes_GunCharges']

In [None]:
# Revocations Feature Encoding 

# Prior Parole Revocations Encoded
full['Prior_Revocations_Parole'] = le.fit_transform(full['Prior_Revocations_Parole'])
full['Prior_Revocations_Parole']

# Prior Probation Revocations Encoded
full['Prior_Revocations_Probation'] = le.fit_transform(full['Prior_Revocations_Probation'])
full['Prior_Revocations_Probation']

In [None]:
# Violations Feature Encoding 

# Prior Instruction Violations Encoded
full['Violations_Instruction'] = le.fit_transform(full['Violations_Instruction'])
full['Violations_Instruction']

# Prior Fail to Report Violations Encoded
full['Violations_FailToReport'] = le.fit_transform(full['Violations_FailToReport'])
full['Violations_FailToReport']

# Prior Move w/o Permission Violations Encoded
full['Violations_MoveWithoutPermission'] = le.fit_transform(full['Violations_MoveWithoutPermission'])
full['Violations_MoveWithoutPermission']

In [None]:
# Other Parole Related Features 

# Delinquency Reports Encoded
full['Delinquency_Reports'] = le.fit_transform(full['Delinquency_Reports'])
full['Delinquency_Reports']

# Program Attendances Encoded
full['Program_Attendances'] = le.fit_transform(full['Program_Attendances'])
full['Program_Attendances']

# Program Unexcused Absences Encoded
full['Program_UnexcusedAbsences'] = le.fit_transform(full['Program_UnexcusedAbsences'])
full['Program_UnexcusedAbsences']

# Residence Changes Encoded
full['Residence_Changes'] = le.fit_transform(full['Residence_Changes'])
full['Residence_Changes']
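
The cells above reuse a single `le` instance, so after each `fit_transform` the mapping for the previous column is overwritten. A minimal sketch of an alternative, assuming it is useful to recover the original labels later, keeps one fitted encoder per column in a dictionary. The `encoders` dictionary and the toy values are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with two label-style columns
toy = pd.DataFrame({
    "Delinquency_Reports": ["0", "1", "4 or more", "0"],
    "Residence_Changes": ["0", "3 or more", "1", "1"],
})

# Fit a separate encoder for each column and keep it for later decoding
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Map the integer codes back to the original labels
original = encoders["Residence_Changes"].inverse_transform(toy["Residence_Changes"])
print(original.tolist())
```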

In [None]:
"""
'Prior_Arrest_Episodes_Violent',
       'Prior_Arrest_Episodes_Property', 'Prior_Arrest_Episodes_Drug',
       'Prior_Arrest_Episodes_PPViolationCharges',
       'Prior_Arrest_Episodes_DVCharges', 'Prior_Arrest_Episodes_GunCharges',
       'Prior_Conviction_Episodes_Felony', 'Prior_Conviction_Episodes_Misd',
       'Prior_Conviction_Episodes_Viol', 'Prior_Conviction_Episodes_Prop',
       'Prior_Conviction_Episodes_Drug',
       'Prior_Conviction_Episodes_PPViolationCharges',
       'Prior_Conviction_Episodes_DomesticViolenceCharges',
       'Prior_Conviction_Episodes_GunCharges', 'Prior_Revocations_Parole',
       'Prior_Revocations_Probation'
"""

In [None]:
full.dtypes