### Group Project 4: Comparing 3 Models for Predicting Recidivism

For background on this project, please see the [README](../README.md).

**Notebooks**
- Data Acquisition & Cleaning (this notebook)
- [Exploratory Data Analysis](./02_eda.ipynb)
- [Modeling](./03_modeling.ipynb)
- [Results and Recommendations](./04_results.ipynb)

**In this notebook, you'll find (for each of the 3 models):**
- Data ingestion
- Cleaning
- New feature engineering
- etc. TODO

**Model 1: Base feature set - New York**

This dataset was pulled directly from the ny.gov website. It represents the return status within three years of release from prison for former inmates in the State of New York.

|Feature|Type|Description|
|---|---|---|
|Release Year|int|The year the inmate was released from prison|
|County of Indictment|object|The county within the State of New York where the inmate was indicted|
|Gender|object|The inmate's gender|
|Age at Release|int|The age of the inmate at the time of release from prison|
|Return Status|object|The status of the inmate within 3 years of release (Returned because of New Offense or Parole Violation, Not Returned)|

In [2]:
import pandas as pd
ny_df = pd.read_csv('../data/NY/newyork.csv')
ny_df.head()

Unnamed: 0,Release Year,County of Indictment,Gender,Age at Release,Return Status
0,2008,UNKNOWN,MALE,55,Not Returned
1,2008,ALBANY,MALE,16,Returned Parole Violation
2,2008,ALBANY,MALE,17,Not Returned
3,2008,ALBANY,MALE,17,Returned Parole Violation
4,2008,ALBANY,MALE,18,Not Returned


In [3]:
# Quick overview: 188,650 observations and only a few features (release year, county of indictment, gender, age at release). Return Status will be the target.
ny_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188650 entries, 0 to 188649
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   Release Year          188650 non-null  int64 
 1   County of Indictment  188650 non-null  object
 2   Gender                188650 non-null  object
 3   Age at Release        188650 non-null  int64 
 4   Return Status         188650 non-null  object
dtypes: int64(2), object(3)
memory usage: 7.2+ MB


In [4]:
# The data is fairly clean: no column has null values.
ny_df.isnull().sum()

Release Year            0
County of Indictment    0
Gender                  0
Age at Release          0
Return Status           0
dtype: int64
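
`isnull()` only counts values that are actually missing. The first row of the preview above has `UNKNOWN` as its county, so placeholder strings may be worth counting separately. The following is a minimal sketch on a small stand-in frame, not the real file; `count_placeholders` is a hypothetical helper, and treating `UNKNOWN` as the only placeholder label is an assumption.

```python
import pandas as pd

# Stand-in rows shaped like the preview above; not the real dataset.
sample_df = pd.DataFrame({
    'County of Indictment': ['UNKNOWN', 'ALBANY', 'ALBANY'],
    'Gender': ['MALE', 'MALE', 'MALE'],
    'Age at Release': [55, 16, 17],
})

def count_placeholders(df, placeholders=('UNKNOWN',)):
    """Count, per column, the cells whose value is one of the placeholder labels."""
    return df.isin(list(placeholders)).sum()

placeholder_counts = count_placeholders(sample_df)
print(placeholder_counts)
```

On the real data, the same call would be `count_placeholders(ny_df)`. Whether such rows should be kept, dropped, or encoded as their own category is a modeling decision this sketch does not make.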

In [6]:
# Transform the target variable 'Return Status' into 1s and 0s: 1 for someone who returned to prison within 3 years of release, 0 for someone who did not.
ny_df['recidivism'] = ny_df['Return Status'].map({'Not Returned': 0, 'Returned Parole Violation' : 1, 'New Felony Offense' : 1})

In [7]:
# Transform Gender into 1s and 0s so it can be used in modeling: 1 for male, 0 for female.
ny_df['gender_map'] = ny_df['Gender'].map({'MALE': 1, 'FEMALE': 0})
ny_df.head()

Unnamed: 0,Release Year,County of Indictment,Gender,Age at Release,Return Status,recidivism,gender_map
0,2008,UNKNOWN,MALE,55,Not Returned,0,1
1,2008,ALBANY,MALE,16,Returned Parole Violation,1,1
2,2008,ALBANY,MALE,17,Not Returned,0,1
3,2008,ALBANY,MALE,17,Returned Parole Violation,1,1
4,2008,ALBANY,MALE,18,Not Returned,0,1
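
One caveat with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN` without raising an error. The sketch below shows one possible guard against that. `map_strict` is a hypothetical helper, not part of pandas, and the demo series is made up for illustration.

```python
import pandas as pd

# Same label-to-code mapping used for 'Return Status' above.
status_map = {
    'Not Returned': 0,
    'Returned Parole Violation': 1,
    'New Felony Offense': 1,
}

def map_strict(series, mapping):
    """Map labels to integer codes, raising if any label is not in the mapping."""
    mapped = series.map(mapping)
    unmapped = series[mapped.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f'Unmapped labels: {list(unmapped)}')
    return mapped.astype(int)

# Made-up labels for illustration only.
demo_status = pd.Series(['Not Returned', 'Returned Parole Violation', 'New Felony Offense'])
demo_codes = map_strict(demo_status, status_map)
print(demo_codes.tolist())  # [0, 1, 1]
```

The same helper could be applied to `ny_df['Return Status']` and `ny_df['Gender']` with their respective dictionaries, so that a misspelled or unexpected category fails loudly rather than turning into a missing target.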


In [8]:
%store ny_df

Stored 'ny_df' (DataFrame)


**Model 2: Criminal history feature set - Florida**

TODO provide some background

**Model 3: Behavioral feature set - Georgia**

TODO provide some background

**FINAL NOTES**:
- The final datasets for modeling are exported:
  - [here](../data/NY/NY_final.csv) for Model 1 (NY)
  - [here](../data/FL/FL_final.csv) for Model 2 (FL)
  - [here](../data/GA/GA_final.csv) for Model 3 (GA)
- The next notebook in the series is [Exploratory Data Analysis](./02_eda.ipynb).
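
The export step mentioned in the final notes is not shown in this section. The sketch below shows one way it might look. `export_final` is a hypothetical helper, the demo frame is made up, and the file is written to a temporary directory rather than to the real `../data/NY/NY_final.csv` path.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_final(df, path):
    """Write df to CSV without the row index, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path

# Made-up rows with the engineered columns from this notebook.
demo_df = pd.DataFrame({
    'Age at Release': [55, 16],
    'recidivism': [0, 1],
    'gender_map': [1, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = export_final(demo_df, Path(tmp_dir) / 'NY' / 'NY_final.csv')
    reloaded_df = pd.read_csv(out_path)

print(reloaded_df.columns.tolist())
```

Passing `index=False` keeps the row index out of the file, so reloading it with `pd.read_csv` returns only the data columns.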