### Group Project 4: Comparing 3 Models for Predicting Recidivism

For background on this project, please see the [README](../README.md).

**Notebooks**
- Data Acquisition & Cleaning (this notebook)
- [Exploratory Data Analysis](./02_eda.ipynb)
- [Modeling](./03_modeling.ipynb)
- [Model Tuning](./04_tuning.ipynb)
- [Experiments](./04a_experiments.ipynb)
- [Results and Recommendations](./05_results.ipynb)

**In this notebook, you'll find (for each of the 3 models):**
- Data ingestion
- Cleaning
- New feature engineering
- etc. TODO

In [573]:
import pandas as pd

**Model 1: Base feature set - New York**

This dataset was pulled directly from the ny.gov website. It represents the return status within three years of release from prison for former inmates in the State of New York.

|Feature|Type|Description|
|---|---|---|
|Release Year|int|The year the inmate was released from prison|
|County of Indictment|object|The county within the State of New York where the inmate was indicted|
|Gender|object|The inmate's gender|
|Age at Release|int|The age of the inmate at the time of release from prison|
|Return Status|object|The status of the inmate within 3 years of release (Returned because of New Offense or Parole Violation, Not Returned)|

In [538]:
ny_df = pd.read_csv('../data/NY/newyork.csv')
ny_df.head()

Unnamed: 0,Release Year,County of Indictment,Gender,Age at Release,Return Status
0,2008,UNKNOWN,MALE,55,Not Returned
1,2008,ALBANY,MALE,16,Returned Parole Violation
2,2008,ALBANY,MALE,17,Not Returned
3,2008,ALBANY,MALE,17,Returned Parole Violation
4,2008,ALBANY,MALE,18,Not Returned


In [539]:
# Quick overview: about 188k observations and only a few features (release year, county of indictment, gender, age at release). Return Status will be the target.
ny_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188650 entries, 0 to 188649
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   Release Year          188650 non-null  int64 
 1   County of Indictment  188650 non-null  object
 2   Gender                188650 non-null  object
 3   Age at Release        188650 non-null  int64 
 4   Return Status         188650 non-null  object
dtypes: int64(2), object(3)
memory usage: 7.2+ MB


In [540]:
# Data is fairly clean: no null values
ny_df.isnull().sum()

Release Year            0
County of Indictment    0
Gender                  0
Age at Release          0
Return Status           0
dtype: int64

In [541]:
# Transform the target variable 'Return Status' into 1s and 0s: 1 = returned to prison within 3 years of release, 0 = did not return.
# Any label that is missing from the mapping becomes NaN, so the keys must match the data exactly.
ny_df['recidivism'] = ny_df['Return Status'].map({'Not Returned': 0, 'Returned Parole Violation' : 1, 'New Felony Offense' : 1})
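Because `Series.map` with a dictionary turns any label it does not know into `NaN`, a quick check for unmapped labels can be useful after a step like the one above. The following is a minimal sketch on toy data; the `'Some Other Status'` label is a made-up value used only to trigger the check, not something taken from the New York file.

```python
import pandas as pd

# Toy stand-in for the 'Return Status' column; the last label is hypothetical
# and deliberately absent from the mapping.
status = pd.Series(['Not Returned', 'Returned Parole Violation',
                    'New Felony Offense', 'Some Other Status'])
mapping = {'Not Returned': 0, 'Returned Parole Violation': 1, 'New Felony Offense': 1}

mapped = status.map(mapping)

# Labels that the mapping did not cover show up as NaN in `mapped`.
unmapped_labels = status[mapped.isna()].unique().tolist()
print(unmapped_labels)
```

An empty list from a check like this would indicate that every label in the column was covered by the mapping.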

In [542]:
# Transform Gender into 1s and 0s so it can be used in modeling: 1 = male, 0 = female.
ny_df['gender_map'] = ny_df['Gender'].map({'MALE': 1, 'FEMALE': 0})
ny_df.head()

Unnamed: 0,Release Year,County of Indictment,Gender,Age at Release,Return Status,recidivism,gender_map
0,2008,UNKNOWN,MALE,55,Not Returned,0,1
1,2008,ALBANY,MALE,16,Returned Parole Violation,1,1
2,2008,ALBANY,MALE,17,Not Returned,0,1
3,2008,ALBANY,MALE,17,Returned Parole Violation,1,1
4,2008,ALBANY,MALE,18,Not Returned,0,1


In [543]:
%store ny_df

Stored 'ny_df' (DataFrame)


**Model 2: Criminal history feature set - Florida**

The Florida dataset comprises a number of SQLite tables related to criminal history for around 11,000 Broward County citizens. This data was collected as part of the evaluation of the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) decision support tool, which is used in Broward County and in other U.S. states; a further description is available on [Wikipedia](https://en.wikipedia.org/wiki/COMPAS_(software)). As a result, the dataset also includes some COMPAS scores.

In order to retrieve the data and put it into a form that was conducive to Pandas analysis, two SQL queries were constructed:

- [people.sql](../data/FL/people.sql) - basic demographic data, incarceration dates, and COMPAS scores
- [charges.sql](../data/FL/charges.sql) - case management, charge and arrest data

The results of those queries were then exported to CSV as [final_people.csv](../data/FL/final_people.csv) and [final_charges.csv](../data/FL/final_charges.csv) within the **data** subfolder. They serve as the basis for the rest of this cleaning and transformation exercise.

In [544]:
# read in people dataset and take a look
people = pd.read_csv('../data/FL/final_people.csv')
people.head()

Unnamed: 0,person_id,sex,race,birth_date,first_incarceration_date,first_incarceration_release,last_incarceration_date,last_incarceration_release,num_incarcerations,comp_f_min_score,...,comp_f_max_score,comp_f_max_decile,comp_r_min_score,comp_r_min_decile,comp_r_max_score,comp_r_max_decile,comp_v_min_score,comp_v_min_decile,comp_v_max_score,comp_v_max_decile
0,1,Male,Other,1947-04-18 00:00:00.000000,2013-08-13 06:03:42.000000,2013-08-14 05:41:20.000000,2014-07-07 09:26:12.000000,2014-07-14 08:24:15.000000,2,13,...,13,1,-2.78,1,-2.78,1,-4.31,1,-4.31,1
1,2,Male,Caucasian,1985-02-06 00:00:00.000000,2014-12-30 10:47:52.000000,2015-01-03 02:18:24.000000,2014-12-30 10:47:52.000000,2015-01-03 02:18:24.000000,1,16,...,16,2,-0.34,5,-0.34,5,-2.75,2,-2.75,2
2,3,Male,African-American,1982-01-22 00:00:00.000000,2015-10-15 00:00:00.000000,2015-12-07 00:00:00.000000,2013-01-26 03:45:27.000000,2013-02-05 05:36:53.000000,2,25,...,25,6,-0.76,3,-0.76,3,-3.07,1,-3.07,1
3,4,Male,African-American,1991-05-14 00:00:00.000000,2013-04-13 04:58:34.000000,2013-04-14 07:02:04.000000,2016-01-08 09:59:55.000000,2016-01-09 04:41:39.000000,5,26,...,26,7,-0.66,4,-0.66,4,-2.26,3,-2.26,3
4,5,Male,African-American,1993-01-21 00:00:00.000000,,,,,0,19,...,19,3,0.16,8,0.16,8,-1.59,6,-1.59,6


In [545]:
# check people data types
people.dtypes

person_id                        int64
sex                             object
race                            object
birth_date                      object
first_incarceration_date        object
first_incarceration_release     object
last_incarceration_date         object
last_incarceration_release      object
num_incarcerations               int64
comp_f_min_score                 int64
comp_f_min_decile                int64
comp_f_max_score                 int64
comp_f_max_decile                int64
comp_r_min_score               float64
comp_r_min_decile                int64
comp_r_max_score               float64
comp_r_max_decile                int64
comp_v_min_score               float64
comp_v_min_decile                int64
comp_v_max_score               float64
comp_v_max_decile                int64
dtype: object

CONCLUSIONS
- all datetime columns must be converted
- sex and race should be dummified

In [546]:
# convert all datetime columns
people['birth_date'] = pd.to_datetime(people['birth_date'])
people['first_incarceration_date'] = pd.to_datetime(people['first_incarceration_date'])
people['first_incarceration_release'] = pd.to_datetime(people['first_incarceration_release'])
people['last_incarceration_date'] = pd.to_datetime(people['last_incarceration_date'])
people['last_incarceration_release'] = pd.to_datetime(people['last_incarceration_release'])
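The five conversions above all follow the same pattern, so they could also be written as a loop over the column names. This is a minimal sketch on a two-column toy frame (the values are copied from the preview above, the frame itself is illustrative only); it also shows that a missing value simply becomes `NaT`.

```python
import pandas as pd

# Toy frame with string timestamps and one missing value.
toy = pd.DataFrame({
    'birth_date': ['1947-04-18 00:00:00.000000', '1993-01-21 00:00:00.000000'],
    'first_incarceration_date': ['2013-08-13 06:03:42.000000', None],
})

# Convert every listed column in one pass.
date_cols = ['birth_date', 'first_incarceration_date']
for col in date_cols:
    toy[col] = pd.to_datetime(toy[col])

print(toy.dtypes)
print(toy['first_incarceration_date'].isna().sum())
```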

In [547]:
# dummify sex and race
people = pd.get_dummies(data = people, columns = ['sex', 'race'], drop_first = True)
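With `drop_first = True`, the alphabetically first category of each column becomes the implicit baseline and gets no column of its own. A small sketch on toy data (three made-up rows, using category labels seen in the preview above) shows which columns survive:

```python
import pandas as pd

# Three illustrative rows; not taken from the real file.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                    'race': ['Other', 'Caucasian', 'African-American']})

dummies = pd.get_dummies(data=toy, columns=['sex', 'race'], drop_first=True)

# 'Female' and 'African-American' sort first, so they are the dropped baselines.
print(dummies.columns.tolist())
```

A row with all zeros in the remaining dummy columns therefore belongs to the baseline category.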

In [548]:
# final dtype/null check
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11757 entries, 0 to 11756
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   person_id                    11757 non-null  int64         
 1   birth_date                   11757 non-null  datetime64[ns]
 2   first_incarceration_date     11127 non-null  datetime64[ns]
 3   first_incarceration_release  11127 non-null  datetime64[ns]
 4   last_incarceration_date      11127 non-null  datetime64[ns]
 5   last_incarceration_release   11127 non-null  datetime64[ns]
 6   num_incarcerations           11757 non-null  int64         
 7   comp_f_min_score             11757 non-null  int64         
 8   comp_f_min_decile            11757 non-null  int64         
 9   comp_f_max_score             11757 non-null  int64         
 10  comp_f_max_decile            11757 non-null  int64         
 11  comp_r_min_score             11757 non-nu

CONCLUSIONS
- Our data types are now looking good
- It appears that there are 630 people who have no incarceration data (the nulls from above) - they should be dropped from our analysis, since at least an initial incarceration is required to qualify for recidivism
- The COMPAS scores will be used for EDA and comparison with our model, but will not be part of our model

In [549]:
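# drop people with no incarceration history; .copy() avoids chained-assignment
# warnings when new columns are added to this filtered frame later
people = people[people['num_incarcerations'] > 0].copy()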
people.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11127 entries, 0 to 11756
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   person_id                    11127 non-null  int64         
 1   birth_date                   11127 non-null  datetime64[ns]
 2   first_incarceration_date     11127 non-null  datetime64[ns]
 3   first_incarceration_release  11127 non-null  datetime64[ns]
 4   last_incarceration_date      11127 non-null  datetime64[ns]
 5   last_incarceration_release   11127 non-null  datetime64[ns]
 6   num_incarcerations           11127 non-null  int64         
 7   comp_f_min_score             11127 non-null  int64         
 8   comp_f_min_decile            11127 non-null  int64         
 9   comp_f_max_score             11127 non-null  int64         
 10  comp_f_max_decile            11127 non-null  int64         
 11  comp_r_min_score             11127 non-nu

CONCLUSIONS
- Our demographic (birth date/sex/race) and incarceration statistics are now in appropriate format for feature engineering
- Now we need to incorporate data related to arrests and charges

In [550]:
# read in charges dataset and take a look
charges = pd.read_csv('../data/FL/final_charges.csv')
charges.head()

Unnamed: 0,person_id,case_number,offense_date,charge_degree,charge,arrest_date
0,1,09083797TI30A,2009-08-11 00:00:00.000000,(0),Unlawful Speed (Requires Speeds),
1,1,09098832TI20A,2009-10-24 00:00:00.000000,(0),Speed/65 Interstate,
2,1,13009443TI30A,2013-01-14 00:00:00.000000,(0),Disobey/Ran Stop Sign,
3,1,13009443TI30A,2013-01-14 00:00:00.000000,(0),Expired Tag/Infraction,
4,1,13011352CF10A,2013-08-13 00:00:00.000000,(F3),Aggravated Assault w/Firearm,2013-08-13 00:00:00.000000


In [551]:
charges.dtypes

person_id         int64
case_number      object
offense_date     object
charge_degree    object
charge           object
arrest_date      object
dtype: object

In [552]:
# convert datetimes
charges['offense_date'] = pd.to_datetime(charges['offense_date'])
charges['arrest_date'] = pd.to_datetime(charges['arrest_date'])

In [553]:
# final dtypes/null check
charges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140034 entries, 0 to 140033
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   person_id      140034 non-null  int64         
 1   case_number    140034 non-null  object        
 2   offense_date   140034 non-null  datetime64[ns]
 3   charge_degree  140034 non-null  object        
 4   charge         139605 non-null  object        
 5   arrest_date    92150 non-null   datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(3)
memory usage: 6.4+ MB


CONCLUSIONS
- We will collect first/last dates and charge degrees for charges and arrests per person as potentially useful features
- We will retrieve counts of charges by degree per person (pivoted) as well
- We will collect aggregated charge text data for NLP
- We will calculate mean time between offenses and between arrests - since this will not apply to non-recidivists (no second offense), this will be strictly for EDA and not part of the model

In [554]:
# Collect first/last charge and arrest data
# Using the offense date as our "first/last" marker
# And add to people dataframe
# sort once, then take the first/last arrest date and charge degree per person
ordered_charges = charges.sort_values(by = ['person_id', 'offense_date'])
first_last = ordered_charges.groupby('person_id').agg(
    first_arrest_date = ('arrest_date', 'first'),
    last_arrest_date = ('arrest_date', 'last'),
    first_charge_degree = ('charge_degree', 'first'),
    last_charge_degree = ('charge_degree', 'last'))
people = people.join(first_last, on = 'person_id', how = 'left')
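One detail worth keeping in mind for the first/last logic above: the groupby `first` and `last` aggregations skip nulls by default, so the "first arrest date" is the first non-null arrest date in offense order, not necessarily the arrest date on the person's earliest charge. A minimal sketch on toy data (one made-up person with three charges):

```python
import pandas as pd

# One person, three charges in offense order; only the last has an arrest date.
toy_charges = pd.DataFrame({
    'person_id': [1, 1, 1],
    'offense_date': pd.to_datetime(['2009-08-11', '2013-01-14', '2013-08-13']),
    'arrest_date': pd.to_datetime([None, None, '2013-08-13']),
})
ordered = toy_charges.sort_values(by=['person_id', 'offense_date'])

# first() returns the first NON-NULL arrest date per person.
first_non_null = ordered.groupby('person_id')['arrest_date'].first()

# head(1) returns the literal first row per person, nulls included.
first_row_arrest = ordered.groupby('person_id').head(1)['arrest_date']

print(first_non_null.loc[1])
print(first_row_arrest.iloc[0])
```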

In [555]:
# pivot the degrees of charges into counts per person
# (aggfunc = 'count' followed by ['arrest_date'] counts only charges with a non-null arrest_date)
# and add to people dataframe
charge_pivot = pd.pivot_table(charges, index = 'person_id', columns = 'charge_degree', aggfunc = 'count')['arrest_date']
charge_pivot.columns = ['charge_degree_count_' + col.replace('(', '').replace(')', '') for col in charge_pivot.columns]
people = people.join(charge_pivot, on = 'person_id', how = 'left')
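Since the pivot above counts non-null `arrest_date` values, charges that never led to a recorded arrest do not contribute to the per-degree counts. If counting every charge were preferred, `pd.crosstab` is one possible alternative. The sketch below compares the two on toy data; it is an illustration of the difference, not a claim about which definition the project intended.

```python
import pandas as pd

# Toy charges: person 1 has two '(0)' charges without an arrest date.
toy_charges = pd.DataFrame({
    'person_id': [1, 1, 1, 2],
    'charge_degree': ['(0)', '(0)', '(F3)', '(M1)'],
    'arrest_date': pd.to_datetime([None, None, '2013-08-13', '2014-01-01']),
})

# Count every charge per person and degree, regardless of arrest date.
all_charges = pd.crosstab(toy_charges['person_id'], toy_charges['charge_degree'])
all_charges.columns = ['charge_degree_count_' + col.strip('()')
                       for col in all_charges.columns]

# Count only charges that have an arrest date (non-null count per group).
arrest_only = (toy_charges
               .groupby(['person_id', 'charge_degree'])['arrest_date']
               .count()
               .unstack(fill_value=0))

print(all_charges)
print(arrest_only)
```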


In [556]:
# aggregate charges per person for NLP
# and add to people dataframe
# https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby
agg_charges = charges[charges['charge'].notnull()].groupby('person_id').agg({'charge': ' '.join})
people['agg_charges'] = \
    people.join(agg_charges, on = 'person_id', how = 'left')['charge']
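The null filter in the cell above matters because `' '.join` raises a `TypeError` when it meets a non-string value. A minimal sketch of the same aggregation on toy data (charge texts borrowed from the preview above, the person assignments are made up):

```python
import pandas as pd

# Person 1 has one charge with missing text.
toy_charges = pd.DataFrame({
    'person_id': [1, 1, 1, 2],
    'charge': ['Unlawful Speed', None, 'Expired Tag/Infraction', 'Disobey/Ran Stop Sign'],
})

# Drop null charge text first, then concatenate per person in row order.
agg = (toy_charges
       .dropna(subset=['charge'])
       .groupby('person_id')['charge']
       .agg(' '.join))

print(agg.loc[1])
```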

In [557]:
# calculate mean time between offenses and between arrests per person
# and add to people dataframe
# https://stackoverflow.com/questions/45241221/python-pandas-calculate-average-days-between-dates
charges['previous_offense'] = \
    charges.sort_values(by = ['person_id', 'offense_date']).groupby(['person_id'])['offense_date'].shift(1)
charges['days_between_offenses'] = \
    (charges['offense_date'] - charges['previous_offense']).dt.days
charges['previous_arrest'] = \
    charges.sort_values(by = ['person_id', 'arrest_date']).groupby(['person_id'])['arrest_date'].shift(1)
charges['days_between_arrests'] = \
    (charges['arrest_date'] - charges['previous_arrest']).dt.days

# exclude zero-day gaps (several charges or arrests on the same date) - don't want them pulling the averages down
days_between_offenses = charges[charges['days_between_offenses'] > 0].groupby('person_id')['days_between_offenses'].mean()
days_between_arrests = charges[charges['days_between_arrests'] > 0].groupby('person_id')['days_between_arrests'].mean()

people['avg_days_between_offenses'] = \
    people.join(days_between_offenses, on = 'person_id', how = 'left')['days_between_offenses']
people['avg_days_between_arrests'] = \
    people.join(days_between_arrests, on = 'person_id', how = 'left')['days_between_arrests']
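The same "average gap" idea can be sketched with `groupby(...).diff()`, which computes the difference to the previous row within each person directly. This is a toy illustration of the technique under the same rule of ignoring zero-day gaps; the dates are made up.

```python
import pandas as pd

# Person 1: two charges on the same day, then one 10 days later.
# Person 2: a single charge, so no gap exists.
toy = pd.DataFrame({
    'person_id': [1, 1, 1, 2],
    'offense_date': pd.to_datetime(['2013-01-01', '2013-01-01',
                                    '2013-01-11', '2014-05-05']),
})
ordered = toy.sort_values(by=['person_id', 'offense_date']).copy()

# Days since the previous offense of the same person (NaN for the first one).
ordered['gap_days'] = ordered.groupby('person_id')['offense_date'].diff().dt.days

# Average only the positive gaps per person.
avg_gap = ordered[ordered['gap_days'] > 0].groupby('person_id')['gap_days'].mean()
print(avg_gap)
```

People with a single offense date do not appear in the result at all, which is why a left join back onto the people frame produces nulls for them.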

In [558]:
people.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11127 entries, 0 to 11756
Data columns (total 50 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   person_id                       11127 non-null  int64         
 1   birth_date                      11127 non-null  datetime64[ns]
 2   first_incarceration_date        11127 non-null  datetime64[ns]
 3   first_incarceration_release     11127 non-null  datetime64[ns]
 4   last_incarceration_date         11127 non-null  datetime64[ns]
 5   last_incarceration_release      11127 non-null  datetime64[ns]
 6   num_incarcerations              11127 non-null  int64         
 7   comp_f_min_score                11127 non-null  int64         
 8   comp_f_min_decile               11127 non-null  int64         
 9   comp_f_max_score                11127 non-null  int64         
 10  comp_f_max_decile               11127 non-null  int64         
 11  co

CONCLUSIONS
- It appears that there are about 500 people with no arrest data - they should be dropped from our analysis
- The **first_charge_degree** and **last_charge_degree** columns should be dummified
- Zeros should be imputed for the **charge_degree_XXX** columns
- Empty string should be imputed for **agg_charges** column

In [559]:
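# drop people with no arrest data; .copy() avoids chained-assignment warnings in later cells
people = people[people['first_arrest_date'].notnull()].copy()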
people.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10645 entries, 0 to 11756
Data columns (total 50 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   person_id                       10645 non-null  int64         
 1   birth_date                      10645 non-null  datetime64[ns]
 2   first_incarceration_date        10645 non-null  datetime64[ns]
 3   first_incarceration_release     10645 non-null  datetime64[ns]
 4   last_incarceration_date         10645 non-null  datetime64[ns]
 5   last_incarceration_release      10645 non-null  datetime64[ns]
 6   num_incarcerations              10645 non-null  int64         
 7   comp_f_min_score                10645 non-null  int64         
 8   comp_f_min_decile               10645 non-null  int64         
 9   comp_f_max_score                10645 non-null  int64         
 10  comp_f_max_decile               10645 non-null  int64         
 11  co

In [560]:
# strip the parentheses from the charge degrees, then dummify first/last charge degree
people['first_charge_degree'] = people['first_charge_degree'].str.replace('(', '', regex = False).str.replace(')', '', regex = False)
people['last_charge_degree'] = people['last_charge_degree'].str.replace('(', '', regex = False).str.replace(')', '', regex = False)
people = pd.get_dummies(data = people, columns = ['first_charge_degree', 'last_charge_degree'], drop_first = True)

In [561]:
# impute zeros for the per-degree charge counts (a null means no counted charges of that degree)
charge_degree_cols = [col for col in people if col.startswith('charge_degree_')]
people[charge_degree_cols] = people[charge_degree_cols].fillna(value = 0)
people.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10645 entries, 0 to 11756
Data columns (total 75 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   person_id                       10645 non-null  int64         
 1   birth_date                      10645 non-null  datetime64[ns]
 2   first_incarceration_date        10645 non-null  datetime64[ns]
 3   first_incarceration_release     10645 non-null  datetime64[ns]
 4   last_incarceration_date         10645 non-null  datetime64[ns]
 5   last_incarceration_release      10645 non-null  datetime64[ns]
 6   num_incarcerations              10645 non-null  int64         
 7   comp_f_min_score                10645 non-null  int64         
 8   comp_f_min_decile               10645 non-null  int64         
 9   comp_f_max_score                10645 non-null  int64         
 10  comp_f_max_decile               10645 non-null  int64         
 11  co

In [562]:
# impute an empty string where a person has no charge text
people[['agg_charges']] = people[['agg_charges']].fillna(value = '')
people.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10645 entries, 0 to 11756
Data columns (total 75 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   person_id                       10645 non-null  int64         
 1   birth_date                      10645 non-null  datetime64[ns]
 2   first_incarceration_date        10645 non-null  datetime64[ns]
 3   first_incarceration_release     10645 non-null  datetime64[ns]
 4   last_incarceration_date         10645 non-null  datetime64[ns]
 5   last_incarceration_release      10645 non-null  datetime64[ns]
 6   num_incarcerations              10645 non-null  int64         
 7   comp_f_min_score                10645 non-null  int64         
 8   comp_f_min_decile               10645 non-null  int64         
 9   comp_f_max_score                10645 non-null  int64         
 10  comp_f_max_decile               10645 non-null  int64         
 11  co

In [563]:
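# people without a positive gap between offenses/arrests have no interval to average;
# fill with a large sentinel value (999999) - these two columns are for EDA only, not the model
people[['avg_days_between_offenses']] = people[['avg_days_between_offenses']].fillna(value = 999999)
people[['avg_days_between_arrests']] = people[['avg_days_between_arrests']].fillna(value = 999999)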
people.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10645 entries, 0 to 11756
Data columns (total 75 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   person_id                       10645 non-null  int64         
 1   birth_date                      10645 non-null  datetime64[ns]
 2   first_incarceration_date        10645 non-null  datetime64[ns]
 3   first_incarceration_release     10645 non-null  datetime64[ns]
 4   last_incarceration_date         10645 non-null  datetime64[ns]
 5   last_incarceration_release      10645 non-null  datetime64[ns]
 6   num_incarcerations              10645 non-null  int64         
 7   comp_f_min_score                10645 non-null  int64         
 8   comp_f_min_decile               10645 non-null  int64         
 9   comp_f_max_score                10645 non-null  int64         
 10  comp_f_max_decile               10645 non-null  int64         
 11  co

CONCLUSIONS
- We've resolved all null issues
- We lost about 1,000 rows, but the vast majority of those were people for whom recidivism was incalculable, so it's an acceptable loss
- We will keep agg_charges to the side for potential NLP analysis in vectorized form later
- We have 7 datetime columns that need to be reengineered into numerics, since we're not attempting time series analysis here - the most useful scenario is probably to convert the non-date-of-birth columns into "years since birth" quantities by referencing the date of birth column
- All charge degrees that are not 'FX' or 'MX' (where X is the degree of felony or misdemeanor) are just infractions and don't really need to be their own columns - we can combine those.
- There are also some potentially interesting features we could engineer that summarize charges - total number of charges, number of felonies/misdemeanors/infractions.
- We need to add our actual target column, **recidivism**! We will apply the agreed-upon definition that a recidivist is someone who was incarcerated at least once, and then had at least 1 subsequent arrest or incarceration.

In [564]:
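# construct "age at" columns as whole years (days since birth, floor-divided by 365);
# the datetime columns themselves are dropped a few cells further down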
people['age_at_first_incarceration'] = (people['first_incarceration_date'] - \
    people['birth_date']).dt.days // 365
people['age_at_first_release'] = (people['first_incarceration_release'] - \
    people['birth_date']).dt.days // 365
people['age_at_last_incarceration'] = (people['last_incarceration_date'] - \
    people['birth_date']).dt.days // 365
people['age_at_last_release'] = (people['last_incarceration_release'] - \
    people['birth_date']).dt.days // 365
people['age_at_first_arrest'] = (people['first_arrest_date'] - \
    people['birth_date']).dt.days // 365
people['age_at_last_arrest'] = (people['last_arrest_date'] - \
    people['birth_date']).dt.days // 365
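Floor-dividing by 365 ignores leap days, so for older ages the result can tick over to the next year slightly before the actual birthday. The sketch below compares it with division by 365.25 on a single toy date pair (the birth date appears in the preview above, the event date is made up to sit one day before the 66th birthday). It is meant only to show the size of the effect, not to suggest the ages above are materially wrong.

```python
import pandas as pd

birth = pd.Series(pd.to_datetime(['1947-04-18']))
event = pd.Series(pd.to_datetime(['2013-04-17']))  # one day before the 66th birthday

days = (event - birth).dt.days

# Approach used in the cell above: whole days floor-divided by 365.
age_365 = days // 365

# Alternative approximation that averages in leap days.
age_36525 = (days / 365.25).astype(int)

print(age_365.iloc[0], age_36525.iloc[0])
```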

In [565]:
# create our target column
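# if someone's last arrest date is at least one full day after their first incarceration release date, they are defined as recidivist
# (only arrests are checked here; subsequent incarcerations are not tested separately)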
people['recidivism'] = ((people['last_arrest_date'] - people['first_incarceration_release']).dt.days > 0).astype(int)
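The `.dt.days > 0` test above only fires when the arrest is at least 24 hours after the release. The following sketch contrasts that rule with a strict timestamp comparison on three toy rows (release times taken from the preview above, arrest times made up); the third row is the case where the two rules disagree.

```python
import pandas as pd

toy = pd.DataFrame({
    'first_incarceration_release': pd.to_datetime(
        ['2013-08-14 05:41:20', '2015-01-03 02:18:24', '2013-04-14 07:02:04']),
    'last_arrest_date': pd.to_datetime(
        ['2014-07-07 00:00:00', '2014-12-30 00:00:00', '2013-04-14 18:00:00']),
})

delta = toy['last_arrest_date'] - toy['first_incarceration_release']

# Rule used above: at least one whole day between release and last arrest.
toy['recidivism_whole_day'] = (delta.dt.days > 0).astype(int)

# Strict alternative: any arrest timestamp later than the release timestamp.
toy['recidivism_strict'] = (
    toy['last_arrest_date'] > toy['first_incarceration_release']).astype(int)

print(toy[['recidivism_whole_day', 'recidivism_strict']])
```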

In [566]:
# now that we're done with datetime columns, we can drop them
people = people.drop(columns = ['first_incarceration_date', 'last_incarceration_date',
    'first_incarceration_release', 'last_incarceration_release',
    'first_arrest_date', 'last_arrest_date', 'birth_date'])

In [567]:
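# summarize infraction charge degree counts and drop source columns
# (anything that is not F* or M* is treated as an infraction; MO* degrees are matched explicitly because they share the M prefix)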
infraction_chg_counts = [col for col in people if col.startswith('charge_degree_count_')
    and (col.startswith('charge_degree_count_MO') or
    (not col.startswith('charge_degree_count_F')
    and not col.startswith('charge_degree_count_M')))]
people['charge_degree_count_INF'] = people[infraction_chg_counts].sum(axis = 1)
people = people.drop(columns = infraction_chg_counts)

In [568]:
# summarize first infraction dummies - so we use max here, not sum!
first_infractions = [col for col in people if col.startswith('first_charge_degree_')
    and (col.startswith('first_charge_degree_MO') or
    (not col.startswith('first_charge_degree_F')
    and not col.startswith('first_charge_degree_M')))]
people['first_charge_degree_INF'] = people[first_infractions].max(axis = 1)
people = people.drop(columns = first_infractions)

In [569]:
# summarize last infraction dummies - so we use max here, not sum!
last_infractions = [col for col in people if col.startswith('last_charge_degree_')
    and (col.startswith('last_charge_degree_MO') or
    (not col.startswith('last_charge_degree_F')
    and not col.startswith('last_charge_degree_M')))]
people['last_charge_degree_INF'] = people[last_infractions].max(axis = 1)
people = people.drop(columns = last_infractions)
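The three cells above apply the same column-selection rule to three different prefixes. One way to express that rule once is a small helper, sketched below. The function name `infraction_columns` and the example column list are hypothetical and only serve to illustrate the rule.

```python
def infraction_columns(columns, prefix):
    """Return the columns under `prefix` whose degree is treated as an infraction:
    anything not starting with F or M, plus degrees starting with MO."""
    selected = []
    for col in columns:
        if not col.startswith(prefix):
            continue
        degree = col[len(prefix):]
        if degree.startswith('MO') or not degree.startswith(('F', 'M')):
            selected.append(col)
    return selected


# Hypothetical column names, for illustration only.
cols = ['person_id', 'charge_degree_count_0', 'charge_degree_count_F3',
        'charge_degree_count_M1', 'charge_degree_count_MO3', 'charge_degree_count_TCX']
picked = infraction_columns(cols, 'charge_degree_count_')
print(picked)
```

The same helper could then be called with `'first_charge_degree_'` and `'last_charge_degree_'`, followed by `sum(axis = 1)` for the counts and `max(axis = 1)` for the dummies, as in the cells above.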

In [570]:
# A couple more potentially interesting charge statistics:
# total misdemeanor charges, total felony charges, total charges overall
people['total_misdemeanor_charge_count'] = people[[col for col in people if col.startswith('charge_degree_count_M')]].sum(axis = 1)
people['total_felony_charge_count'] = people[[col for col in people if col.startswith('charge_degree_count_F')]].sum(axis = 1)
people['total_charge_count'] = people[[col for col in people if col.startswith('charge_degree_count_')]].sum(axis = 1)

In [571]:
people.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10645 entries, 0 to 11756
Data columns (total 61 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   person_id                       10645 non-null  int64  
 1   num_incarcerations              10645 non-null  int64  
 2   comp_f_min_score                10645 non-null  int64  
 3   comp_f_min_decile               10645 non-null  int64  
 4   comp_f_max_score                10645 non-null  int64  
 5   comp_f_max_decile               10645 non-null  int64  
 6   comp_r_min_score                10645 non-null  float64
 7   comp_r_min_decile               10645 non-null  int64  
 8   comp_r_max_score                10645 non-null  float64
 9   comp_r_max_decile               10645 non-null  int64  
 10  comp_v_min_score                10645 non-null  float64
 11  comp_v_min_decile               10645 non-null  int64  
 12  comp_v_max_score                

CONCLUSIONS
- We now have a number of potentially interesting model features, as well as a number of EDA-only features
- We have mitigated all null and datatype issues
- We're ready to export!

In [572]:
people.to_csv('../data/FL/FL_final.csv', index = False)

**Model 3: Behavioral feature set - Georgia**

TODO provide some background

**FINAL NOTES**:
- The final datasets for modeling are exported:
  - [here](../data/NY/NY_final.csv) for Model 1 (NY)
  - [here](../data/FL/FL_final.csv) for Model 2 (FL)
  - [here](../data/GA/GA_final.csv) for Model 3 (GA)
- The next notebook in the series is [Exploratory Data Analysis](./02_eda.ipynb).