*Created by Will Dinneen (willdinneen@gmail.com) for the PDRI-DevLab Junior Data Scientist Position @ UPenn*

## Introduction

- Here I will write my overall methodology and considerations, as well as a summary of key results

- In every section I will also explain my methodology more in depth before each code block

In [161]:
# Helper for printing tables as Markdown (kept commented out)

# df = grades_df[["person_id", "gr9_fall_math", "gr9_fall_hist"]].head(5)

# print(df.to_markdown(tablefmt="psql", index=None))

## Data Management

**case.csv** is the main dataset and reflects dates of arrest and disposition (trial or court appearance) during the period in which the program operated. The file also contains an indicator of whether the arrestee was referred to the intervention program for that arrest (i.e. whether they were treated), whether the person was rearrested while awaiting trial, the number of prior arrests at the time of program entry, and the arrest location. 

**demo.csv** contains demographic information about arrestees, including some who were not included in the program evaluation.

**prior_arrests.csv** reflects pre-period arrests among individuals in *case.csv*; the pre-period ran from 2008 to 2011.

**grades.csv** includes 9th and 10th grade course grades for a subset of individuals in *case.csv*.

In [162]:
# Imports
import pandas as pd

# Datasets
case_df = pd.read_csv('../case.csv')
demo_df = pd.read_csv('../demo.csv')
prior_arrests_df = pd.read_csv('../prior_arrests.csv')
grades_df = pd.read_csv('../grades.csv')

In [163]:
# 1. Recode it so that males are consistently coded as “M” and females are consistently coded as “F”.

print("Gender column values pre-recode:")
print(demo_df["gender"].unique())

# Gender column recoding
gender_recode = {
    'male': 'M',
    'female': 'F'
}

demo_df["gender"] = demo_df["gender"].replace(gender_recode)

print("\nGender column values post-recode:")
print(demo_df["gender"].unique())

Gender column values pre-recode:
['F' 'M' 'male' 'female']

Gender column values post-recode:
['F' 'M']


In [180]:
# 2. Merge the case and demo datasets together so that each row in the case dataset also contains the demographics of the defendant.

# Confirm Data Integrity
print("---- Checking Data Integrity ----\n")

missing_demo_values = demo_df["person_id"].isnull().sum()
missing_case_values = case_df["person_id"].isnull().sum()
print(f"Missing demographic values: {missing_demo_values}")
print(f"Missing case values: {missing_case_values}")

duplicate_demo_ids = demo_df["person_id"].duplicated().sum()
print(f"\nDuplicate demographic ids: {duplicate_demo_ids}")
# Check whether rows that share a person_id are exact duplicates (identical in every column)
duplicate_ids = demo_df[demo_df["person_id"].duplicated(keep=False)]
duplicated_rows = duplicate_ids[duplicate_ids.duplicated(keep=False)]
print(f"Number of contradictory demographic ids: {len(duplicate_ids) - len(duplicated_rows)}")

# Dropping duplicates
demo_df = demo_df.drop_duplicates()

# Measure Differences Across Demo & Case data
print("\n\n---- Comparing Demographic & Case Representation ----\n")

unique_case_ids = case_df["person_id"].unique()
unique_demo_ids = demo_df["person_id"].unique()
print("Number of unique persons in case data:")
print(len(unique_case_ids))

print("\nNumber of unique persons in demographic data:")
print(len(unique_demo_ids))

print("\nDifference in number of unique persons in demographic vs case data:")
print(len(unique_demo_ids) - len(unique_case_ids))

# Identify extra demo IDs & save to file for later investigation 
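extra_demo_ids = set(unique_demo_ids) - set(unique_case_ids)
# Sort before building the DataFrame: sets are unordered, so this keeps the saved file deterministic
extra_demo_ids_df = pd.DataFrame(sorted(extra_demo_ids), columns=['person_id'])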
extra_demo_ids_df.to_csv('./outputs/extra_demo_ids.csv', index=False)

# Merge the df
merged_df = case_df.merge(demo_df, on='person_id', how='left')

# Check for missing values
print(f"\nMissing values after merge:\n{merged_df.isnull().sum()}")

---- Checking Data Integrity ----

Missing demographic values: 0
Missing case values: 0

Duplicate demographic ids: 0
Number of contradictory demographic ids: 0


---- Comparing Demographic & Case Representation ----

Number of unique persons in case data:
15353

Number of unique persons in demographic data:
15715

Difference in number of unique persons in demographic vs case data:
362

Missing values after merge:
caseid           0
person_id        0
arrest_date      0
dispos_date      0
treat            0
re_arrest        0
prior_arrests    0
address          0
race             0
gender           0
bdate            0
dtype: int64
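
The merge above can also be made self-checking. The following is a minimal sketch on made-up toy frames (not the real data) that uses the `validate` and `indicator` options of `DataFrame.merge`. The names `case_toy`, `demo_toy` and `merged_toy` are hypothetical stand-ins for `case_df`, `demo_df` and `merged_df`.

```python
import pandas as pd

# Toy stand-ins for case_df and demo_df (values are made up for illustration).
case_toy = pd.DataFrame({
    "caseid": [1, 2, 3],
    "person_id": [10, 10, 20],
})
demo_toy = pd.DataFrame({
    "person_id": [10, 20, 30],
    "gender": ["F", "M", "F"],
})

# validate="many_to_one" raises pandas.errors.MergeError if person_id is
# duplicated on the demographics side, which would silently duplicate cases.
# indicator=True adds a "_merge" column recording where each row came from.
merged_toy = case_toy.merge(
    demo_toy,
    on="person_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Cases with no matching demographics row are flagged as "left_only".
unmatched_cases = (merged_toy["_merge"] == "left_only").sum()
print(f"Rows after merge: {len(merged_toy)} (cases in: {len(case_toy)})")
print(f"Cases without demographics: {unmatched_cases}")
```

With a left merge and a unique key on the right, the row count after the merge should equal the row count of the case data, which is a quick way to confirm that no cases were duplicated or dropped.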


In [188]:
# 3. For the purpose of this analysis, please restrict the data to only individuals who were arrested in Chicago.

# Check the format of the address column
address_format_check = merged_df["address"].str.contains(", ")
print(f"Number of addresses with correct format: {address_format_check.sum()}")
print(f"Number of addresses with incorrect format: {(~address_format_check).sum()}")

# Extract the city from the address column
merged_df["city"] = merged_df["address"].str.split(", ").str[1]

# Normalize city names to lowercase so the comparison below is case-insensitive
merged_df["city"] = merged_df["city"].str.lower()

# Check to see if there are any misspelled values
print(f"\nRepresented cities: {merged_df['city'].unique()}")

# Extract all rows with arrests in Chicago
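# .copy() makes chicago_df an independent frame, so adding columns later does not trigger SettingWithCopyWarning
chicago_df = merged_df[merged_df["city"] == "chicago"].copy()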

print(f"\nNumber of cases in Chicago: {len(chicago_df)}")

Number of addresses with correct format: 26000
Number of addresses with incorrect format: 0

Represented cities: ['chicago' 'oak lawn' 'cicero']

Number of cases in Chicago: 25000


## Variable Creation

In [4]:
# 1. Create an age variable equal to the defendant’s age at the time of arrest for each case.
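
One possible approach to the age variable, sketched on toy data: compute age in completed years from `bdate` and `arrest_date` (both columns appear in the merged data above). The helper name `age_at_arrest` is hypothetical, and the sketch assumes both columns hold dates in a format that `pd.to_datetime` can parse.

```python
import pandas as pd

def age_at_arrest(bdate: pd.Series, arrest_date: pd.Series) -> pd.Series:
    """Age in completed years on the arrest date."""
    bdate = pd.to_datetime(bdate)
    arrest_date = pd.to_datetime(arrest_date)
    # True when the birthday has already occurred in the arrest year.
    had_birthday = (arrest_date.dt.month > bdate.dt.month) | (
        (arrest_date.dt.month == bdate.dt.month)
        & (arrest_date.dt.day >= bdate.dt.day)
    )
    # Subtract one year when the birthday has not yet occurred.
    return arrest_date.dt.year - bdate.dt.year - (~had_birthday).astype(int)

# Toy rows: day before birthday, on birthday, and a leap-day birth date.
toy = pd.DataFrame({
    "bdate": ["1990-06-15", "1990-06-15", "1992-02-29"],
    "arrest_date": ["2012-06-14", "2012-06-15", "2013-02-28"],
})
toy["age"] = age_at_arrest(toy["bdate"], toy["arrest_date"])
print(toy)
```

Comparing month and day directly avoids the drift that comes from dividing a day count by 365.25, at the cost of treating a February 29 birthday as not yet reached on February 28.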

In [5]:
# 2. Please construct measures for 9th and 10th grade GPA for this target population.
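
A hedged sketch of one way to build the GPA measures. The column naming pattern (`gr9_fall_math`, `gr9_fall_hist`) follows the columns referenced earlier in this notebook, but the `gr10_` columns, the letter-grade values and the 4.0 scale in `GRADE_POINTS` are assumptions, since the contents of grades.csv are not shown here. The helper `gpa_for_grade` is hypothetical.

```python
import pandas as pd

# Assumed mapping from letter grades to grade points (unweighted 4.0 scale).
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def gpa_for_grade(grades: pd.DataFrame, prefix: str) -> pd.Series:
    """Unweighted mean of grade points over all columns starting with prefix."""
    cols = [c for c in grades.columns if c.startswith(prefix)]
    points = grades[cols].apply(lambda col: col.map(GRADE_POINTS))
    # mean() skips missing courses by default, so a student with a missing
    # grade is averaged over the courses that are present.
    return points.mean(axis=1)

# Toy data with made-up grades.
toy = pd.DataFrame({
    "person_id": [1, 2],
    "gr9_fall_math": ["A", "C"],
    "gr9_fall_hist": ["B", None],
    "gr10_fall_math": ["B", "D"],
    "gr10_fall_hist": ["B", "F"],
})
toy["gpa_gr9"] = gpa_for_grade(toy, "gr9_")
toy["gpa_gr10"] = gpa_for_grade(toy, "gr10_")
print(toy[["person_id", "gpa_gr9", "gpa_gr10"]])
```

If the real file stores numeric grades, the mapping step can be dropped and the mean taken directly over the selected columns.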

In [6]:
# 3.a. Please reconstruct the variable (prior arrests) using the prior_arrests.csv file. 

# 3.b. Please reconstruct this indicator (re_arrest).

# 3.c. Please show that the variables you reconstructed are equal to the versions in the provided datasets.
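
A minimal sketch of how both variables might be reconstructed, using toy data. Several things here are assumptions rather than facts about the provided files: that prior_arrests.csv has `person_id` and `arrest_date` columns with one row per pre-period arrest, that prior arrests count both pre-period arrests and earlier arrests within the study period, and that a re-arrest means another arrest of the same person after the case's arrest date and on or before its disposition date.

```python
import pandas as pd

# Toy data; the prior_arrests.csv columns (person_id, arrest_date) are assumed.
cases = pd.DataFrame({
    "caseid": [1, 2, 3],
    "person_id": [10, 10, 20],
    "arrest_date": pd.to_datetime(["2012-01-10", "2012-03-01", "2012-05-05"]),
    "dispos_date": pd.to_datetime(["2012-04-01", "2012-06-01", "2012-08-01"]),
})
pre_period = pd.DataFrame({
    "person_id": [10, 10, 30],
    "arrest_date": pd.to_datetime(["2009-02-01", "2010-07-15", "2011-01-01"]),
})

# Prior arrests = pre-period arrests + earlier arrests in the study period.
pre_counts = pre_period.groupby("person_id").size()
cases = cases.sort_values(["person_id", "arrest_date"])
# cumcount numbers each person's cases 0, 1, 2, ... in arrest-date order
# (same-day arrests for one person would be ordered arbitrarily).
earlier_in_period = cases.groupby("person_id").cumcount()
cases["prior_arrests_new"] = (
    cases["person_id"].map(pre_counts).fillna(0).astype(int) + earlier_in_period
)

# Re-arrest = same person arrested again after this arrest and on or before
# this case's disposition date. The self-merge pairs every case with every
# other case of the same person.
pairs = cases.merge(cases, on="person_id", suffixes=("", "_other"))
hit = (pairs["arrest_date_other"] > pairs["arrest_date"]) & (
    pairs["arrest_date_other"] <= pairs["dispos_date"]
)
rearrested_ids = pairs.loc[hit, "caseid"].unique()
cases["re_arrest_new"] = cases["caseid"].isin(rearrested_ids).astype(int)
print(cases)
```

For step 3.c, the reconstructed columns could then be compared with the provided ones, for example with `(df["prior_arrests_new"] == df["prior_arrests"]).all()`, where `df` is the merged case data.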

## Statistical Analysis

In [7]:
# 1. Describe the demographic characteristics of the study population based on the data available to you.

# 1.a. Are the treatment and control groups balanced? Please present your answer in the form of a table.

# 1.b. Choose one observable characteristic and visualize the difference between enrolled and not enrolled subjects.
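
For the balance question in 1.a, one common presentation is a table of group means with a standardized mean difference per covariate. The sketch below uses toy data; the helper `balance_table` and the `female` indicator column are hypothetical, and the covariates are assumed to be numeric or 0/1 coded.

```python
import numpy as np
import pandas as pd

def balance_table(df: pd.DataFrame, treat_col: str, covariates: list) -> pd.DataFrame:
    """Group means and standardized mean difference for each covariate."""
    treated = df[df[treat_col] == 1]
    control = df[df[treat_col] == 0]
    rows = []
    for cov in covariates:
        mean_t, mean_c = treated[cov].mean(), control[cov].mean()
        # Pooled SD: square root of the average of the two sample variances.
        pooled_sd = np.sqrt((treated[cov].var() + control[cov].var()) / 2)
        std_diff = (mean_t - mean_c) / pooled_sd if pooled_sd > 0 else np.nan
        rows.append({
            "covariate": cov,
            "treated_mean": mean_t,
            "control_mean": mean_c,
            "std_diff": std_diff,
        })
    return pd.DataFrame(rows).set_index("covariate")

# Toy data with made-up values.
toy = pd.DataFrame({
    "treat": [1, 1, 0, 0],
    "age": [20, 22, 24, 26],
    "female": [1, 0, 1, 0],
})
table = balance_table(toy, "treat", ["age", "female"])
print(table)
```

A rule of thumb sometimes used is that absolute standardized differences above roughly 0.1 suggest imbalance, though the threshold is a convention rather than a test.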

In [None]:
# 3. Did participating in the program reduce the likelihood of re-arrest before disposition? Explain your answer and your methodology.
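
As a starting point for this question, the sketch below compares raw re-arrest rates between treated and untreated cases with a two-proportion z-test, using the `treat` and `re_arrest` column names from the case data and made-up toy values. This is an unadjusted comparison, not the analysis itself: it only supports a causal reading if the groups are balanced, and it ignores that one person can contribute several cases. The helper `two_proportion_ztest` is hypothetical.

```python
import math
import pandas as pd

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Difference in proportions (a - b), z statistic, two-sided p-value."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value

# Toy data: 100 treated cases (20 re-arrests), 100 control cases (30 re-arrests).
toy = pd.DataFrame({
    "treat": [1] * 100 + [0] * 100,
    "re_arrest": [1] * 20 + [0] * 80 + [1] * 30 + [0] * 70,
})
counts = toy.groupby("treat")["re_arrest"].agg(["sum", "count"])
diff, z, p_value = two_proportion_ztest(
    counts.loc[1, "sum"], counts.loc[1, "count"],
    counts.loc[0, "sum"], counts.loc[0, "count"],
)
print(f"Difference in re-arrest rate (treated - control): {diff:.3f}")
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

A regression that adjusts for covariates such as age and prior arrests, with standard errors clustered by `person_id`, would be a natural next step if the groups turn out to be imbalanced.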

In [None]:
# 4. Using the data available to you, what recommendation would you make regarding who to serve?

## Conclusion