# For a Better Criminal Justice System: Analyzing Recidivism and Prison Populations in the US

#### Data sourced from:
- [Sentencing Project, State Imprisonment Rate](https://www.sentencingproject.org/the-facts/#rankings)
    - "Research on incarceration has traditionally centered on state-level data: specifically state prison populations or the statewide combined prison and jail population. Using the state as the unit of analysis is sufficient for understanding the broad contours of incarceration in the United States, but it does not provide the level of detail necessary to unpack its causes and consequences."
    
- [Bureau of Justice Statistics, National Prisoners Statistics Program](https://www.kaggle.com/christophercorrea/prisoners-and-crime-in-united-states?select=crime_and_incarceration_by_state.csv)

    - "The Bureau of Justice Statistics administers the National Prisoners Statistics Program (NPS), an annual data collection effort that began in response to a 1926 congressional mandate. The population statistics reflect each state's prisoner population as of December 31 for the recorded year. Prisoners listed under federal jurisdiction are incarcerated by the U.S. Bureau of Prisons."
    
- [Iowa Department of Corrections](https://data.iowa.gov/Correctional-System/3-Year-Recidivism-for-Offenders-Released-from-Pris/mw8r-vqy4), [Kaggle, slonnadube](https://www.kaggle.com/slonnadube/recidivism-for-offenders-released-from-prison)
    - "This dataset reports whether an offender is re-admitted to prison or not within three years from being released from prison in Iowa. The recidivism reporting year is the fiscal year (year ending June 30) marking the end of the three year tracking period.
       The Department of Corrections uses recidivism as an indicator on whether strategies are reducing offenders relapse into criminal behavior. A three year time frame is used as studies have shown if an offender relapses into criminal behavior it is most likely to happen within three years of being released."
       
- [Prison Policy Initiative](https://www.prisonpolicy.org/profiles/)

## Directory
### 1. Importing Necessary Libraries
### 2. Importing Datasets, Merging, and Inspecting Data
### 3. Cleaning and Converting Iowa Data for Classification Modeling

## 1. Importing Necessary Libraries

In [1]:
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
from matplotlib.patches import Patch
from matplotlib import pyplot
plt.style.use('ggplot')
from imblearn.over_sampling import SMOTE

import pandas as pd

## 2. Importing Datasets, Merging, and Inspecting Data

### PRISON POPULATION
#### Crime and Incarceration by State

In [2]:
st_data=pd.read_csv('prison_custody_by_state.csv')
st_data.head()

Unnamed: 0,jurisdiction,includes_jails,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Federal,0,149852,158216,168144,177600,186364,190844,197285,198414,205087,206968,214774,216915,214989,209561,195622,188311
1,Alabama,0,24741,25100,27614,25635,24315,24103,25253,25363,27241,27345,26813,26768,26825,26145,25212,23745
2,Alaska,1,4570,4351,4472,4534,4798,5052,5151,4997,5472,5369,6216,6308,5081,6323,5247,4378
3,Arizona,0,27710,29359,31084,32384,33345,35752,37700,39455,40544,40130,39949,40013,41031,42136,42204,42248
4,Arkansas,0,11489,11849,12068,12577,12455,12854,13275,13135,13338,14192,14090,14043,14295,15250,15784,15833


In [3]:
st_data.describe()

Unnamed: 0,includes_jails
count,51.0
mean,0.117647
std,0.325396
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [4]:
st_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   jurisdiction    51 non-null     object
 1   includes_jails  51 non-null     int64 
 2   2001            51 non-null     object
 3   2002            51 non-null     object
 4   2003            51 non-null     object
 5   2004            51 non-null     object
 6   2005            51 non-null     object
 7   2006            51 non-null     object
 8   2007            51 non-null     object
 9   2008            51 non-null     object
 10  2009            51 non-null     object
 11  2010            51 non-null     object
 12  2011            51 non-null     object
 13  2012            51 non-null     object
 14  2013            51 non-null     object
 15  2014            51 non-null     object
 16  2015            51 non-null     object
 17  2016            51 non-null     object
dtypes: int64(1), object(17)

In [5]:
st_data['jurisdiction'].value_counts()

Michigan          1
Illinois          1
Wisconsin         1
New Hampshire     1
New Mexico        1
Virginia          1
Colorado          1
West Virginia     1
Connecticut       1
Maryland          1
Utah              1
Washington        1
South Dakota      1
Iowa              1
Pennsylvania      1
Alabama           1
Oklahoma          1
New York          1
Kentucky          1
New Jersey        1
Vermont           1
Florida           1
Minnesota         1
Rhode Island      1
Georgia           1
Tennessee         1
South Carolina    1
Nebraska          1
Indiana           1
Kansas            1
Delaware          1
Massachusetts     1
Ohio              1
Oregon            1
Missouri          1
Louisiana         1
Arizona           1
Montana           1
Mississippi       1
Idaho             1
Wyoming           1
North Dakota      1
North Carolina    1
Nevada            1
Texas             1
Arkansas          1
Alaska            1
Hawaii            1
California        1
Maine             1


In [6]:
st_data.isnull().sum()

jurisdiction      0
includes_jails    0
2001              0
2002              0
2003              0
2004              0
2005              0
2006              0
2007              0
2008              0
2009              0
2010              0
2011              0
2012              0
2013              0
2014              0
2015              0
2016              0
dtype: int64

In [7]:
st_data = st_data.dropna()
st_data.isnull().sum()

jurisdiction      0
includes_jails    0
2001              0
2002              0
2003              0
2004              0
2005              0
2006              0
2007              0
2008              0
2009              0
2010              0
2011              0
2012              0
2013              0
2014              0
2015              0
2016              0
dtype: int64

#### Dropping unnecessary columns

In [8]:
st_data = st_data.drop(['includes_jails'], axis = 1)

In [9]:
st_data.head()

Unnamed: 0,jurisdiction,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Federal,149852,158216,168144,177600,186364,190844,197285,198414,205087,206968,214774,216915,214989,209561,195622,188311
1,Alabama,24741,25100,27614,25635,24315,24103,25253,25363,27241,27345,26813,26768,26825,26145,25212,23745
2,Alaska,4570,4351,4472,4534,4798,5052,5151,4997,5472,5369,6216,6308,5081,6323,5247,4378
3,Arizona,27710,29359,31084,32384,33345,35752,37700,39455,40544,40130,39949,40013,41031,42136,42204,42248
4,Arkansas,11489,11849,12068,12577,12455,12854,13275,13135,13338,14192,14090,14043,14295,15250,15784,15833


#### Converting year columns into rows

In [10]:
st_data = st_data.melt(id_vars=["jurisdiction"], 
        var_name="Year", 
        value_name="Prison Population")

In [11]:
st_data.head()

Unnamed: 0,jurisdiction,Year,Prison Population
0,Federal,2001,149852
1,Alabama,2001,24741
2,Alaska,2001,4570
3,Arizona,2001,27710
4,Arkansas,2001,11489
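To make the wide-to-long reshaping above easier to follow, here is a minimal sketch of `melt` on a toy frame. The jurisdictions and numbers are made up for illustration and are not taken from the NPS data; only the `id_vars`, `var_name`, and `value_name` arguments mirror the cell above.

```python
import pandas as pd

# Toy wide-format frame: one row per jurisdiction, one column per year.
# Values are invented for illustration only.
wide = pd.DataFrame({
    "jurisdiction": ["A", "B"],
    "2001": ["1,000", "2,000"],
    "2002": ["1,100", "2,100"],
})

# Each year column becomes a (Year, Prison Population) pair per jurisdiction.
long = wide.melt(
    id_vars=["jurisdiction"],
    var_name="Year",
    value_name="Prison Population",
)
print(long)
```

The result has one row per jurisdiction and year, which is the shape the later filtering and pivoting steps rely on.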


#### Creating a list of states for individual analysis

In [12]:
STATES = ['Louisiana', 'Pennsylvania', 'Iowa','Massachusetts']

In [13]:
# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
st_data = st_data[st_data.jurisdiction.isin(STATES)].copy()
st_data.head()

Unnamed: 0,jurisdiction,Year,Prison Population
15,Iowa,2001,7962
18,Louisiana,2001,19660
21,Massachusetts,2001,10203
38,Pennsylvania,2001,37641
66,Iowa,2002,8398


In [14]:
st_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 64 entries, 15 to 803
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   jurisdiction       64 non-null     object
 1   Year               64 non-null     object
 2   Prison Population  64 non-null     object
dtypes: object(3)
memory usage: 2.0+ KB


#### Renaming 'jurisdiction' column to 'State'

In [15]:
st_data = st_data.rename(columns={"jurisdiction": "State"})

In [16]:
st_data.head()

Unnamed: 0,State,Year,Prison Population
15,Iowa,2001,7962
18,Louisiana,2001,19660
21,Massachusetts,2001,10203
38,Pennsylvania,2001,37641
66,Iowa,2002,8398


#### Converting the 'Year' column to datetime

In [17]:
st_data.Year = pd.to_datetime(st_data.Year, format='%Y')

In [18]:
st_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 64 entries, 15 to 803
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   State              64 non-null     object        
 1   Year               64 non-null     datetime64[ns]
 2   Prison Population  64 non-null     object        
dtypes: datetime64[ns](1), object(2)
memory usage: 2.0+ KB


#### Cleaning up data before pivoting 

In [19]:
st_data = st_data.replace(',','', regex=True)

In [20]:
st_data["Prison Population"] = st_data["Prison Population"].astype(float)
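As an alternative to the two cells above, the following sketch cleans a single column and converts it with `pd.to_numeric`. It assumes the population values are strings with thousands separators; the `"n/a"` entry is a hypothetical bad value added only to show how `errors="coerce"` behaves.

```python
import pandas as pd

# Two values from the Iowa/Louisiana rows shown above, written with
# thousands separators, plus a hypothetical unparseable entry.
raw = pd.Series(["7,962", "19,660", "n/a"], name="Prison Population")

# Strip commas from this column only, then convert; anything that still
# cannot be parsed becomes NaN instead of raising an error.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(cleaned)
```

Restricting the replacement to one column leaves the other columns untouched, and coercion surfaces bad entries as NaN so they can be counted with `isna().sum()`.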

#### Pivoting st_data for EDA

In [21]:
st_data = pd.pivot_table(st_data, values=['Prison Population'], index='Year',
                    columns=['State'])
st_data.head()

Unnamed: 0_level_0,Prison Population,Prison Population,Prison Population,Prison Population
State,Iowa,Louisiana,Massachusetts,Pennsylvania
Year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2001-01-01,7962.0,19660.0,10203.0,37641.0
2002-01-01,8398.0,20010.0,9879.0,39724.0
2003-01-01,8546.0,19498.0,9828.0,40879.0
2004-01-01,8525.0,19470.0,9825.0,40506.0
2005-01-01,8737.0,19371.0,10348.0,41941.0


In [22]:
st_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16 entries, 2001-01-01 to 2016-01-01
Data columns (total 4 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   (Prison Population, Iowa)           16 non-null     float64
 1   (Prison Population, Louisiana)      16 non-null     float64
 2   (Prison Population, Massachusetts)  16 non-null     float64
 3   (Prison Population, Pennsylvania)   16 non-null     float64
dtypes: float64(4)
memory usage: 640.0 bytes
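`pd.pivot_table` aggregates duplicate (Year, State) pairs with a mean by default, so a duplicated row would be averaged silently. A hedged alternative is `DataFrame.pivot`, which raises a `ValueError` on duplicates and, when `values` is a single column name, returns flat column labels instead of a two-level header. The sketch below uses four values from the table above.

```python
import pandas as pd

# Long-format rows taken from the pivoted output above (Iowa and Louisiana).
long = pd.DataFrame({
    "State": ["Iowa", "Louisiana", "Iowa", "Louisiana"],
    "Year": pd.to_datetime(["2001", "2001", "2002", "2002"], format="%Y"),
    "Prison Population": [7962.0, 19660.0, 8398.0, 20010.0],
})

# pivot() fails loudly if any (Year, State) pair appears more than once.
wide = long.pivot(index="Year", columns="State", values="Prison Population")
print(wide)
```

Whether flat or two-level columns are preferable depends on the plotting code that follows; this is one option, not a required change.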


### Iowa Recidivism 1
##### The two Iowa recidivism datasets are each incomplete, so both need to be cleaned and joined

In [23]:
df_IR = pd.read_csv('3YearIowa.csv')
df_IR.head()

Unnamed: 0,Fiscal Year Released,Recidivism Reporting Year,Main Supervising District,Release Type,Race - Ethnicity,Age At Release,Sex,Offense Classification,Offense Type,Offense Subtype,Return to Prison,Days to Return,Recidivism Type,New Offense Classification,New Offense Type,New Offense Sub Type,Target Population
0,2010,2013,7JD,Parole,Black - Non-Hispanic,25-34,Male,C Felony,Violent,Robbery,Yes,433.0,New,C Felony,Drug,Trafficking,Yes
1,2010,2013,,Discharged – End of Sentence,White - Non-Hispanic,25-34,Male,D Felony,Property,Theft,Yes,453.0,Tech,,,,No
2,2010,2013,5JD,Parole,White - Non-Hispanic,35-44,Male,B Felony,Drug,Trafficking,Yes,832.0,Tech,,,,Yes
3,2010,2013,6JD,Parole,White - Non-Hispanic,25-34,Male,B Felony,Other,Other Criminal,No,,No Recidivism,,,,Yes
4,2010,2013,,Discharged – End of Sentence,Black - Non-Hispanic,35-44,Male,D Felony,Violent,Assault,Yes,116.0,Tech,,,,No


In [24]:
df_IR.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26020 entries, 0 to 26019
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Fiscal Year Released        26020 non-null  int64  
 1   Recidivism Reporting Year   26020 non-null  int64  
 2   Main Supervising District   16439 non-null  object 
 3   Release Type                24258 non-null  object 
 4   Race - Ethnicity            25990 non-null  object 
 5   Age At Release              26017 non-null  object 
 6   Sex                         26017 non-null  object 
 7   Offense Classification      26020 non-null  object 
 8   Offense Type                26020 non-null  object 
 9   Offense Subtype             26020 non-null  object 
 10  Return to Prison            26020 non-null  object 
 11  Days to Return              8681 non-null   float64
 12  Recidivism Type             26020 non-null  object 
 13  New Offense Classification  671

##### Narrowing dataframe for merging 

In [25]:
df_IR = df_IR[['Fiscal Year Released', 'Race - Ethnicity', 'Main Supervising District', 'Sex',
              'Recidivism Type', 'Age At Release ']]

In [26]:
df_IR.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26020 entries, 0 to 26019
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Fiscal Year Released       26020 non-null  int64 
 1   Race - Ethnicity           25990 non-null  object
 2   Main Supervising District  16439 non-null  object
 3   Sex                        26017 non-null  object
 4   Recidivism Type            26020 non-null  object
 5   Age At Release             26017 non-null  object
dtypes: int64(1), object(5)
memory usage: 1.2+ MB


In [27]:
df_IR.head()

Unnamed: 0,Fiscal Year Released,Race - Ethnicity,Main Supervising District,Sex,Recidivism Type,Age At Release
0,2010,Black - Non-Hispanic,7JD,Male,New,25-34
1,2010,White - Non-Hispanic,,Male,Tech,25-34
2,2010,White - Non-Hispanic,5JD,Male,Tech,35-44
3,2010,White - Non-Hispanic,6JD,Male,No Recidivism,25-34
4,2010,Black - Non-Hispanic,,Male,Tech,35-44


### Iowa Recidivism 2

In [28]:
df_RI = pd.read_csv('3year_Iowa.csv')
df_RI.head()

Unnamed: 0,Fiscal Year Released,Recidivism Reporting Year,Race - Ethnicity,Age At Release,Convicting Offense Classification,Convicting Offense Type,Convicting Offense Subtype,Main Supervising District,Release Type,Release type: Paroled to Detainder united,Part of Target Population,Recidivism - Return to Prison numeric
0,2010,2013,White - Non-Hispanic,Under 25,D Felony,Violent,Assault,4JD,Parole,Parole,Yes,1
1,2010,2013,White - Non-Hispanic,55 and Older,D Felony,Public Order,OWI,7JD,Parole,Parole,Yes,1
2,2010,2013,White - Non-Hispanic,25-34,D Felony,Property,Burglary,5JD,Parole,Parole,Yes,1
3,2010,2013,White - Non-Hispanic,55 and Older,C Felony,Drug,Trafficking,8JD,Parole,Parole,Yes,1
4,2010,2013,Black - Non-Hispanic,25-34,D Felony,Drug,Trafficking,3JD,Parole,Parole,Yes,1


In [29]:
df_RI.isnull().sum()

Fiscal Year Released                            0
Recidivism Reporting Year                       0
Race - Ethnicity                               30
Age At Release                                  3
Convicting Offense Classification               0
Convicting Offense Type                         0
Convicting Offense Subtype                      0
Main Supervising District                    9581
Release Type                                 1762
Release type: Paroled to Detainder united    1762
Part of Target Population                       0
Recidivism - Return to Prison numeric           0
dtype: int64

In [30]:
df_RI.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26020 entries, 0 to 26019
Data columns (total 12 columns):
 #   Column                                     Non-Null Count  Dtype 
---  ------                                     --------------  ----- 
 0   Fiscal Year Released                       26020 non-null  int64 
 1   Recidivism Reporting Year                  26020 non-null  int64 
 2   Race - Ethnicity                           25990 non-null  object
 3   Age At Release                             26017 non-null  object
 4   Convicting Offense Classification          26020 non-null  object
 5   Convicting Offense Type                    26020 non-null  object
 6   Convicting Offense Subtype                 26020 non-null  object
 7   Main Supervising District                  16439 non-null  object
 8   Release Type                               24258 non-null  object
 9   Release type: Paroled to Detainder united  24258 non-null  object
 10  Part of Target Population         

##### Comparing 'Release Type' columns

In [31]:
df_RI['Release type: Paroled to Detainder united'].value_counts()

Parole                        15722
Discharged End of Sentence     7374
Special Sentence                748
Paroled to Detainer             414
Name: Release type: Paroled to Detainder united, dtype: int64

In [32]:
df_RI['Release Type'].value_counts()

Parole                                 15721
Discharged End of Sentence              7374
Special Sentence                         748
Paroled to Detainer - Out of State       137
Paroled to Detainer - INS                134
Paroled to Detainer - U.S. Marshall       77
Paroled to Detainer - Iowa                66
Interstate Compact Parole                  1
Name: Release Type, dtype: int64
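The two sets of counts are consistent with the consolidated column folding the four detainer subtypes into one category (137 + 134 + 77 + 66 = 414) and the single 'Interstate Compact Parole' record into 'Parole' (15721 + 1 = 15722). The sketch below reproduces that kind of consolidation on a toy Series; it is an assumption about how the column could be derived, not the publisher's documented method.

```python
import pandas as pd

# A few category labels copied from the 'Release Type' counts above.
release = pd.Series([
    "Parole",
    "Paroled to Detainer - INS",
    "Paroled to Detainer - Iowa",
    "Interstate Compact Parole",
    "Special Sentence",
])

# Fold every "Paroled to Detainer - ..." subtype into a single label.
collapsed = release.str.replace(
    r"^Paroled to Detainer - .*$", "Paroled to Detainer", regex=True
)
# Treat the interstate parole label as ordinary parole.
collapsed = collapsed.replace({"Interstate Compact Parole": "Parole"})
print(collapsed.value_counts())
```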

### Final Iowa (JOINING IOWA 1 & 2)

In [33]:
data = df_RI.join(df_IR, rsuffix="_right")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26020 entries, 0 to 26019
Data columns (total 18 columns):
 #   Column                                     Non-Null Count  Dtype 
---  ------                                     --------------  ----- 
 0   Fiscal Year Released                       26020 non-null  int64 
 1   Recidivism Reporting Year                  26020 non-null  int64 
 2   Race - Ethnicity                           25990 non-null  object
 3   Age At Release                             26017 non-null  object
 4   Convicting Offense Classification          26020 non-null  object
 5   Convicting Offense Type                    26020 non-null  object
 6   Convicting Offense Subtype                 26020 non-null  object
 7   Main Supervising District                  16439 non-null  object
 8   Release Type                               24258 non-null  object
 9   Release type: Paroled to Detainder united  24258 non-null  object
 10  Part of Target Population         
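`DataFrame.join` matches rows by index, so the join above is only meaningful if row *i* in both files describes the same release. The `head()` outputs earlier show different first rows in the two files (row 0 is 'Black - Non-Hispanic', 25-34 in one and 'White - Non-Hispanic', Under 25 in the other), so the files may not be in the same order. The sketch below is one way to check alignment on the columns both frames share; `rows_aligned` is a hypothetical helper and the frames are toy data.

```python
import pandas as pd

def rows_aligned(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True when both frames agree row by row on every shared column."""
    shared = [col for col in left.columns if col in right.columns]
    if not shared or len(left) != len(right):
        return False
    return left[shared].reset_index(drop=True).equals(
        right[shared].reset_index(drop=True)
    )

# Toy frames (made-up rows, not the Iowa files).
left = pd.DataFrame({
    "Age At Release": ["Under 25", "25-34", "35-44"],
    "Recidivism": [1, 0, 1],
})
right = pd.DataFrame({
    "Age At Release": ["Under 25", "25-34", "35-44"],
    "Sex": ["Male", "Female", "Male"],
})
shuffled = right.iloc[::-1].reset_index(drop=True)

print(rows_aligned(left, right))     # same row order
print(rows_aligned(left, shuffled))  # reversed row order
```

If the check fails on the real files, columns such as 'Sex' brought in by a positional join would be attached to the wrong records, and a key-based merge or a re-sort would be needed first.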

##### Narrowing and ordering final dataframe

In [34]:
data = data[['Fiscal Year Released', 'Recidivism Reporting Year', 'Race - Ethnicity', 'Sex', 'Age At Release ',
            'Convicting Offense Classification', 'Convicting Offense Type', 'Convicting Offense Subtype',
            'Release type: Paroled to Detainder united', 'Part of Target Population', 'Recidivism - Return to Prison numeric']]

In [35]:
data.head()

Unnamed: 0,Fiscal Year Released,Recidivism Reporting Year,Race - Ethnicity,Sex,Age At Release,Convicting Offense Classification,Convicting Offense Type,Convicting Offense Subtype,Release type: Paroled to Detainder united,Part of Target Population,Recidivism - Return to Prison numeric
0,2010,2013,White - Non-Hispanic,Male,Under 25,D Felony,Violent,Assault,Parole,Yes,1
1,2010,2013,White - Non-Hispanic,Male,55 and Older,D Felony,Public Order,OWI,Parole,Yes,1
2,2010,2013,White - Non-Hispanic,Male,25-34,D Felony,Property,Burglary,Parole,Yes,1
3,2010,2013,White - Non-Hispanic,Male,55 and Older,C Felony,Drug,Trafficking,Parole,Yes,1
4,2010,2013,Black - Non-Hispanic,Male,25-34,D Felony,Drug,Trafficking,Parole,Yes,1


## 3. Cleaning and Converting Iowa Data for Classification Modeling

In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26020 entries, 0 to 26019
Data columns (total 11 columns):
 #   Column                                     Non-Null Count  Dtype 
---  ------                                     --------------  ----- 
 0   Fiscal Year Released                       26020 non-null  int64 
 1   Recidivism Reporting Year                  26020 non-null  int64 
 2   Race - Ethnicity                           25990 non-null  object
 3   Sex                                        26017 non-null  object
 4   Age At Release                             26017 non-null  object
 5   Convicting Offense Classification          26020 non-null  object
 6   Convicting Offense Type                    26020 non-null  object
 7   Convicting Offense Subtype                 26020 non-null  object
 8   Release type: Paroled to Detainder united  24258 non-null  object
 9   Part of Target Population                  26020 non-null  object
 10  Recidivism - Return to Prison nume

##### Renaming columns for comprehension and ease of use

In [37]:
data = data.rename(columns={"Recidivism - Return to Prison numeric": "Recidivism", "Race - Ethnicity":'Ethnicity',
                           "Release type: Paroled to Detainder united": "Release Type"})

In [38]:
data.head()

Unnamed: 0,Fiscal Year Released,Recidivism Reporting Year,Ethnicity,Sex,Age At Release,Convicting Offense Classification,Convicting Offense Type,Convicting Offense Subtype,Release Type,Part of Target Population,Recidivism
0,2010,2013,White - Non-Hispanic,Male,Under 25,D Felony,Violent,Assault,Parole,Yes,1
1,2010,2013,White - Non-Hispanic,Male,55 and Older,D Felony,Public Order,OWI,Parole,Yes,1
2,2010,2013,White - Non-Hispanic,Male,25-34,D Felony,Property,Burglary,Parole,Yes,1
3,2010,2013,White - Non-Hispanic,Male,55 and Older,C Felony,Drug,Trafficking,Parole,Yes,1
4,2010,2013,Black - Non-Hispanic,Male,25-34,D Felony,Drug,Trafficking,Parole,Yes,1


##### Examining missing values and cleaning up data before Auto EDA

In [39]:
data.isnull().sum()

Fiscal Year Released                    0
Recidivism Reporting Year               0
Ethnicity                              30
Sex                                     3
Age At Release                          3
Convicting Offense Classification       0
Convicting Offense Type                 0
Convicting Offense Subtype              0
Release Type                         1762
Part of Target Population               0
Recidivism                              0
dtype: int64

In [40]:
data['Recidivism'].value_counts()

0    17339
1     8681
Name: Recidivism, dtype: int64
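The counts above can be turned into class shares to see how imbalanced the target is before modeling. This small sketch reuses the two counts printed above and nothing else.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 17339, 1: 8681}, name="Recidivism")

# Share of each class among all 26,020 releases.
shares = counts / counts.sum()
print(shares.round(3))
```

Roughly one third of releases are labeled as returning to prison, so accuracy alone would be a weak metric for a classifier on this target.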

In [41]:
data['Ethnicity'].value_counts()

White - Non-Hispanic                               17584
Black - Non-Hispanic                                6109
White - Hispanic                                    1522
American Indian or Alaska Native - Non-Hispanic      502
Asian or Pacific Islander - Non-Hispanic             192
Black - Hispanic                                      37
American Indian or Alaska Native - Hispanic           20
White -                                               12
N/A -                                                  5
Asian or Pacific Islander - Hispanic                   5
Black -                                                2
Name: Ethnicity, dtype: int64

##### Converting NaNs in 'Ethnicity' to 'Unknown'

In [42]:
data['Ethnicity'] = data['Ethnicity'].fillna('Unknown')

In [43]:
data['Ethnicity'].value_counts()

White - Non-Hispanic                               17584
Black - Non-Hispanic                                6109
White - Hispanic                                    1522
American Indian or Alaska Native - Non-Hispanic      502
Asian or Pacific Islander - Non-Hispanic             192
Black - Hispanic                                      37
Unknown                                               30
American Indian or Alaska Native - Hispanic           20
White -                                               12
N/A -                                                  5
Asian or Pacific Islander - Hispanic                   5
Black -                                                2
Name: Ethnicity, dtype: int64

##### Dropping NaNs for 'Release Type', 'Sex', and 'Age At Release'

In [44]:
data = data.dropna()

In [45]:
data.isnull().sum()

Fiscal Year Released                 0
Recidivism Reporting Year            0
Ethnicity                            0
Sex                                  0
Age At Release                       0
Convicting Offense Classification    0
Convicting Offense Type              0
Convicting Offense Subtype           0
Release Type                         0
Part of Target Population            0
Recidivism                           0
dtype: int64
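A bare `dropna()` removes any row with a missing value in any column. To make the intent of the heading explicit and to record how many rows are lost, the columns can be named through `subset`. The sketch below uses a toy frame with made-up rows; the column names match the renamed dataframe above.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two of the columns of interest.
toy = pd.DataFrame({
    "Sex": ["Male", np.nan, "Female"],
    "Release Type": ["Parole", "Parole", np.nan],
    "Ethnicity": ["Unknown", "White - Non-Hispanic", "Black - Non-Hispanic"],
})

before = len(toy)
# Only rows missing one of the named columns are removed.
toy = toy.dropna(subset=["Sex", "Release Type"])
dropped = before - len(toy)
print(f"Dropped {dropped} of {before} rows")
```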