<a href="https://colab.research.google.com/github/feaviolp/msc-project/blob/main/NIJ%20EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploratory Data Analysis of the NIJ recidivism dataset**

Import libraries

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
import scipy.stats as st
from sklearn import linear_model
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
# suppress all warnings, including future-deprecation warnings
warnings.filterwarnings('ignore')
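
`warnings.filterwarnings('ignore')` silences every warning category. If the goal is only to hide future-deprecation notices, a narrower filter is possible. The sketch below is an illustration and not part of the original notebook; the warning messages are made up, and it only uses the standard-library `warnings` module.

```python
import warnings

# Record warnings inside the block so we can inspect which ones get through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from "show everything"
    # Hide FutureWarning only; filters added later take precedence.
    warnings.filterwarnings("ignore", category=FutureWarning)
    warnings.warn("will change in a future release", FutureWarning)
    warnings.warn("something else worth seeing", UserWarning)

categories = [w.category for w in caught]
print(categories)
```

With this filter, other warnings (for example about dtype inference or convergence) would still be shown during the analysis.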

Load the CSV file into a pandas DataFrame

In [None]:
url = "https://raw.githubusercontent.com/feaviolp/msc-project/main/NIJ_s_Recidivism_Challenge_Full_Dataset_20240222.csv"
NIJ = pd.read_csv(url)

Look at the shape and features of the dataset

In [30]:
NIJ.shape

(25835, 54)

In [21]:
NIJ.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25835 entries, 0 to 25834
Data columns (total 54 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   ID                                                 25835 non-null  int64  
 1   Gender                                             25835 non-null  object 
 2   Race                                               25835 non-null  object 
 3   Age_at_Release                                     25835 non-null  object 
 4   Residence_PUMA                                     25835 non-null  int64  
 5   Gang_Affiliated                                    22668 non-null  object 
 6   Supervision_Risk_Score_First                       25360 non-null  float64
 7   Supervision_Level_First                            24115 non-null  object 
 8   Education_Level                                    25835 non-null  object 
 9   Depend

The dataset has 25,835 rows and 54 columns.

Some columns have missing values (fewer than 25,835 non-null entries).
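
To see which columns fall short of the full row count, and by how much, a small summary helper can be useful. The sketch below is a minimal illustration under stated assumptions: `toy` is a made-up stand-in for `NIJ` (column names borrowed from the `info()` output, values invented), and `missing_summary` is a hypothetical helper name, not something defined in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for NIJ: real column names, invented values.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Gang_Affiliated": [False, np.nan, True, np.nan],
    "Supervision_Risk_Score_First": [3.0, 6.0, np.nan, 7.0],
})

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest count first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(NIJ)` on the real DataFrame would give the same kind of table for all 54 columns.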


Next, describe the dataset, then look at the first and last 5 rows.

In [23]:
NIJ.describe()

| | ID | Residence_PUMA | Supervision_Risk_Score_First | Avg_Days_per_DrugTest | DrugTests_THC_Positive | DrugTests_Cocaine_Positive | DrugTests_Meth_Positive | DrugTests_Other_Positive | Percent_Days_Employed | Jobs_Per_Year | Training_Sample |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 25835.0 | 25835.0 | 25360.0 | 19732.0 | 20663.0 | 20663.0 | 20663.0 | 20663.0 | 25373.0 | 25027.0 | 25835.0 |
| mean | 13314.004838 | 12.361796 | 6.082216 | 93.890044 | 0.06335 | 0.013741 | 0.01289 | 0.00755 | 0.482331 | 0.769295 | 0.697813 |
| std | 7722.206327 | 7.133742 | 2.381442 | 117.169847 | 0.138453 | 0.061233 | 0.060581 | 0.04115 | 0.425004 | 0.813787 | 0.459215 |
| min | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 6626.5 | 6.0 | 4.0 | 28.837366 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 13270.0 | 12.0 | 6.0 | 55.424812 | 0.0 | 0.0 | 0.0 | 0.0 | 0.475728 | 0.635217 | 1.0 |
| 75% | 20021.5 | 18.0 | 8.0 | 110.333333 | 0.071429 | 0.0 | 0.0 | 0.0 | 0.969325 | 1.0 | 1.0 |
| max | 26761.0 | 25.0 | 10.0 | 1088.5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 8.0 | 1.0 |
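
By default, `describe()` summarises only the numeric columns, so categorical columns such as `Gender` and `Race` do not appear in its output. A minimal sketch of how the non-numeric columns could be summarised as well is shown below; `toy` is an invented stand-in for `NIJ` with made-up values, and excluding `np.number` is one possible way to select those columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for NIJ: real column names, invented values.
toy = pd.DataFrame({
    "Gender": ["M", "M", "F", "M"],
    "Race": ["BLACK", "WHITE", "WHITE", "BLACK"],
    "Supervision_Risk_Score_First": [3.0, 6.0, 5.0, 7.0],
})

# Excluding numeric dtypes leaves the categorical columns; for these,
# describe() reports count, unique, top (most frequent value) and freq.
cat_summary = toy.describe(exclude=[np.number])
print(cat_summary)
```

On the real data, `NIJ.describe(exclude=[np.number])` would also pick up the boolean columns, which pandas summarises in the same count/unique/top/freq form.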


In [24]:
NIJ.head()

| | ID | Gender | Race | Age_at_Release | Residence_PUMA | Gang_Affiliated | Supervision_Risk_Score_First | Supervision_Level_First | Education_Level | Dependents | ... | DrugTests_Meth_Positive | DrugTests_Other_Positive | Percent_Days_Employed | Jobs_Per_Year | Employment_Exempt | Recidivism_Within_3years | Recidivism_Arrest_Year1 | Recidivism_Arrest_Year2 | Recidivism_Arrest_Year3 | Training_Sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | BLACK | 43-47 | 16 | False | 3.0 | Standard | At least some college | 3 or more | ... | 0.0 | 0.0 | 0.488562 | 0.44761 | False | False | False | False | False | 1 |
| 1 | 2 | M | BLACK | 33-37 | 16 | False | 6.0 | Specialized | Less than HS diploma | 1 | ... | 0.0 | 0.0 | 0.425234 | 2.0 | False | True | False | False | True | 1 |
| 2 | 3 | M | BLACK | 48 or older | 24 | False | 7.0 | High | At least some college | 3 or more | ... | 0.166667 | 0.0 | 0.0 | 0.0 | False | True | False | True | False | 1 |
| 3 | 4 | M | WHITE | 38-42 | 16 | False | 7.0 | High | Less than HS diploma | 1 | ... | 0.0 | 0.0 | 1.0 | 0.718996 | False | False | False | False | False | 1 |
| 4 | 5 | M | WHITE | 33-37 | 16 | False | 4.0 | Specialized | Less than HS diploma | 3 or more | ... | 0.058824 | 0.0 | 0.203562 | 0.929389 | False | True | True | False | False | 1 |


In [25]:
NIJ.tail()

| | ID | Gender | Race | Age_at_Release | Residence_PUMA | Gang_Affiliated | Supervision_Risk_Score_First | Supervision_Level_First | Education_Level | Dependents | ... | DrugTests_Meth_Positive | DrugTests_Other_Positive | Percent_Days_Employed | Jobs_Per_Year | Employment_Exempt | Recidivism_Within_3years | Recidivism_Arrest_Year1 | Recidivism_Arrest_Year2 | Recidivism_Arrest_Year3 | Training_Sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25830 | 26756 | M | BLACK | 23-27 | 9 | False | 5.0 | Standard | At least some college | 1 | ... | 0.0 | 0.0 | 0.189507 | 0.572044 | False | True | True | False | False | 1 |
| 25831 | 26758 | M | WHITE | 38-42 | 25 | False | 5.0 | Standard | At least some college | 3 or more | ... | 0.0 | 0.0 | 0.757098 | 0.576104 | False | True | False | True | False | 1 |
| 25832 | 26759 | M | BLACK | 33-37 | 15 | False | 5.0 | Standard | At least some college | 3 or more | ... | NaN | NaN | 0.711138 | 0.894125 | False | True | False | True | False | 1 |
| 25833 | 26760 | F | WHITE | 33-37 | 15 | NaN | 5.0 | Standard | At least some college | 3 or more | ... | 0.0 | 0.0 | 0.0 | 0.0 | True | False | False | False | False | 1 |
| 25834 | 26761 | M | WHITE | 28-32 | 12 | False | 5.0 | Standard | High School Diploma | 3 or more | ... | 0.0 | 0.0 | 0.124454 | 0.398745 | False | True | True | False | False | 1 |


Now count the missing values in each column.

In [26]:
NIJ.isna().sum()

ID                                                      0
Gender                                                  0
Race                                                    0
Age_at_Release                                          0
Residence_PUMA                                          0
Gang_Affiliated                                      3167
Supervision_Risk_Score_First                          475
Supervision_Level_First                              1720
Education_Level                                         0
Dependents                                              0
Prison_Offense                                       3277
Prison_Years                                            0
Prior_Arrest_Episodes_Felony                            0
Prior_Arrest_Episodes_Misd                              0
Prior_Arrest_Episodes_Violent                           0
Prior_Arrest_Episodes_Property                          0
Prior_Arrest_Episodes_Drug                              0
Prior_Arrest_E