## To-do:
1. Adjust the code for the clean "DOF" issue -- done
2. Fix the issue -- adjust the isues with the at_risk date -- done
2. Fix the "Free Time Issue" -- done
3. Address issues with negative time to recidivate (current strategy may not be perfect)
3. Make sure that the final exported dataframe is at the id_variable, dos level
4. Document (in Word Doc or even at the end of this file the logic used at each step of the process and why decisions where made)

## Ongoing Questions:
insert any lingering questions here
1. ~~Should we subset the data to where dof < OR equal to dos or JUST keep data that is strictly less then (dof < dos)?~~
2. What additional variables besides JP_MIN, OGS, need to be aggregated at the ID_VARIABLE, DOS-LEVELL (perhaps PRS?() Is this something that Audrey has already done for the demographics dataset?

## Load Data

In [74]:
#this jupyter notebook is essentially the same as the "recidivism-check" notebook, just cleaned up a bit (hence the name)
#import required libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import sqlite3

#get the folder path for this data
pa_sentencing_path = os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd())))

#read in the correct data file (need to read in this file because of the additional columns it has)
psc_trimmed = pd.read_csv(os.path.join(pa_sentencing_path, "Project", "data", "PSC_data_trimmed_v1.csv"))

  interactivity=interactivity, compiler=compiler, result=result)


## Run this code chunk below only once 

In [75]:
## Run this only once to create an in-memory database for the data (should speed up retrieval times)

#read in the trimmed dataset
# psc_trimmed = pd.read_csv(os.path.join(pa_sentencing_path, "Project", "data", "Main.csv"),
# usecols = ['JPR_ID','id_variable','dof','dos','prs','INC_SANCTION_EXISTS','JP_CC_BUG','JP_MIN','JP_LIFE_DEATH','OFN_LIFE_DEATH'])

# conn = sqlite3.connect("psc_data2.db")
# cur = conn.cursor()
# ## Creating Table
# cur.execute('''CREATE TABLE IF NOT EXISTS raw_data  
#         (JPR_ID	INTEGER, 
#         prs TEXT, 
#         INC_SANCTION_EXISTS TEXT, 
#         dof TEXT,	
#         dos TEXT, 
#         JP_MIN TEXT,	
#         OFN_LIFE_DEATH TEXT,
#         id_variable INTEGER, 
#         JP_LIFE_DEATH TEXT,
#         JP_CC_BUG TEXT)''')

# insert_query = """ INSERT INTO raw_data (
#    JPR_ID,	
#    prs,	
#    INC_SANCTION_EXISTS,
#    dof,	
#    dos,	
#    JP_MIN,	
#    OFN_LIFE_DEATH,	
#    id_variable,	
#    JP_LIFE_DEATH,	
#    JP_CC_BUG)
#    VALUES (?,?,?,?,?,?,?,?,?,?)"""
# raw_list = psc_trimmed.values.tolist()    
# cur.executemany(insert_query, raw_list)
# conn.commit()

In [76]:
# import os
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# import datetime
# import sqlite3
# def load_data():
#     """ Read the sqlite database, from the file ".db" into a pandas dataframe
#     Returns:
#         pd.DataFrame : a dataframe 
#     """    
#     conn = sqlite3.connect("psc_data2.db")
#     df_raw = pd.read_sql_query("SELECT * FROM raw_data", conn)
#     return(df_raw)

# df_tbl_db = load_data()

In [77]:
# copying the original loaded data to a working data frame to use and compare with later
#df = df_tbl_db.copy() #if accessing the database

df = psc_trimmed.copy() # if accessing the psc_trimmed file directly


#change column names to uppercase
df.columns = df.columns.str.upper()


In [78]:
df.head() #inspect the dataset

Unnamed: 0,JPR_ID,OFF_SEX,OFF_RACE,DOFAGE,OTN,OFN_TITLE,OFN_COUNT,OFN_LABEL,OFN_GRADE,GRADE,...,STAT_MIN,DISPOSITION,CONFORMITY,REASON_ONE,REASON_TWO,REASON_THREE,MORE_REASONS,PRS_MANUAL,PRS_LAPSING,PRS_NONLAPSING
0,640001,F,White,36.689938,H182628-5,18,1,Corruption of Minors - when of a sexual nature,M-1,4,...,30,Nolo Contendere,Standard,,,,False,,0,0
1,642480,M,White,18.540726,0,75,1,DUI - M-2,M-2,3,...,12,Neg. Guilty Plea,Standard/Mandatory,,,,False,,0,0
2,660434,M,White,36.914442,H3618344,75,1,DUI - M-2,M-2,3,...,12,Non-Neg. Guilty Plea,Standard/Mandatory,,,,False,,0,0
3,628940,M,Black,22.297057,G0816126,18,1,Simple Assault,M-2,3,...,12,Neg. Guilty Plea,Standard,,,,False,,0,0
4,594048,M,White,40.087611,H240127-6,75,1,DUI - M-1,M-1,4,...,30,,Standard/Mandatory,,,,False,,1,1


## Convert Dates

In [79]:
#convert date strings to datetime variable
df[['DOF','DOS']] = df[['DOF','DOS']].apply(pd.to_datetime,format="%d %b %y")


In [80]:
# extracting out the just the year from the date to be used later 
df['DOF_YEAR'] = pd.DatetimeIndex(df['DOF']).year
df['DOS_YEAR'] = pd.DatetimeIndex(df['DOS']).year

In [81]:
#checking the range of values for the DOF and DOS variables
print("The minimum date of offense in the dataset is: {}".format(df[["DOF"]].min()[0]))
print("The maximum date of offense in the dataset is: {}".format(df[["DOF"]].max()[0]))
print("The minimum date of sentencing in the dataset is: {}".format(df[["DOS"]].min()[0]))
print("The maximum date of sentencing in the dataset is: {}".format(df[["DOS"]].max()[0]))

The minimum date of offense in the dataset is: 1984-11-14 00:00:00
The maximum date of offense in the dataset is: 2020-05-08 00:00:00
The minimum date of sentencing in the dataset is: 2001-01-01 00:00:00
The maximum date of sentencing in the dataset is: 2019-12-31 00:00:00


##### Note: As shown in the above code chunk, there **isn't** anamolous behavior in the date ranges (i.e. a date in the year 1909 or 2090) for the date of offense (DOF) or date of sentence (DOS) variables -- therefore, an additional date correction was **not** applied in this case.

## Clean DOS > DOF

Note: group offense by ID_VAR, JPR_ID, MIN(DOF) to get the first DOF associated for a single JPR_ID

In [82]:
# df_no_missing = df[df['DOF'].notnull()]

# #subset to just those rows where DOF is missing
# dof_missing = df[df['DOF'].isnull()]
# not needed

#df[:20].head() 

#count how many values of DOF are missing in the original dataset
dof_missing = df[df['DOF'].isnull()]


print("There are {:,} rows with missing DOFs in the dataset.".format(len(dof_missing)))



There are 15,965 rows with missing DOFs in the dataset.


In [83]:
#group offense by ID_VAR, JPR_ID, MIN(DOF) to get the first DOF associated for a single JPR_ID -- fix issue with missing DOF
# for name, group in df.groupby("JPR_ID"):
#     #only applies if there is more than one charge 

#     #number of charges in a group
#     num_chargs = len(group)

#     if num_chargs > 1:
#         list_of_dof = group['DOF'].tolist()
#         min_dof = min(list_of_dof)

#         #check if nas are in the list and ONLY change the DOF to MIN IF the DOF is missing
#         updated_dof  = []
#         for x in list_of_dof:
#             if pd.isnull(x):
#                 updated_dof.append(min_dof)
#             else:
#                 updated_dof.append(x)
  
#         #only reassign IF for that row the DOF is missing
#         df.loc[df["JPR_ID"] == name, 'DOF'] = updated_dof

#at the JPR_ID level we only want ONE DOF because becuase we don't want to take into account DOF's that occur
#BEFORE the DOS (associated with the JPR_ID) as an instance of recidivism. -- each JPR_ID should have only ONE DOS
# for name, group in df[:20].groupby("JPR_ID"):  
#     #only applies if there is more than one charge 

#     #number of charges in a group
#     num_chargs = len(group)

#     if num_chargs > 1:
#         list_of_dof = group['DOF'].tolist()
#         min_dof = min(list_of_dof)
#         print(list_of_dof)
#         print(min_dof)

#         updated_dof = [min_dof for x in list_of_dof]

#         #reassign the DOF to the minimum for that JPR_ID
#         df.loc[df["JPR_ID"] == name, 'DOF'] = updated_dof


# df[:20].head()[["JPR_ID", "DOF"]]


In [84]:
#subset the data just to where the number of charges for a given JPR_ID is > 1
# counts_by_jpr_id = df.groupby(["JPR_ID"]).agg({"DOF": "count"})

# multiple_chargs_jpr_ids = list(counts_by_jpr_id.loc[counts_by_jpr_id["DOF"] > 1].index)

# #print(len(multiple_chargs_jpr_ids))

# test_df = df[df["JPR_ID"].isin(multiple_chargs_jpr_ids)]

#print(len(test_df))

In [85]:
#test_df.head()[["JPR_ID", "DOF"]]

#want the earliest offense for a given JPR_ID

In [86]:
#at the JPR_ID level we only want ONE DOF because becuase we don't want to take into account DOF's that occur
#BEFORE the DOS (associated with the JPR_ID) as an instance of recidivism. -- each JPR_ID should have only ONE DOS

#test_df["NEW_DOF"] = test_df.groupby(["JPR_ID"])["DOF"].transform("min")


#test_df.head()[["JPR_ID", "DOF", "NEW_DOF"]]
#one_test_example = test_df.loc[test_df["JPR_ID"] == 658826][["DOF", "NEW_DOF"]] 
# one_test_example["NEW_DOF"] = one_test_example.groupby(["JPR_ID"])["DOF"].transform("min")
# one_test_example


### **Step 1**: Make sure that we are only looking at the **minimum** value for the DOF across all of the charges associated with **one** JPR_ID. This is the procedure because we don't wan't to count a DOF as an instance of recidivism if it occurs BEFORE the date of sentencing

In [87]:
#at the JPR_ID level we only want ONE DOF because becuase we don't want to take into account DOF's that occur
#BEFORE the DOS (associated with the JPR_ID) as an instance of recidivism. -- each JPR_ID should have only ONE DOS

df["NEW_DOF"] = df.groupby(["JPR_ID"])["DOF"].transform("min")

In [88]:
df.head()[["JPR_ID", "DOF", "NEW_DOF"]]

Unnamed: 0,JPR_ID,DOF,NEW_DOF
0,640001,2000-04-01,2000-04-01
1,642480,1999-12-31,1999-12-31
2,660434,2000-12-23,2000-12-23
3,628940,2000-06-26,2000-06-26
4,594048,2000-10-15,2000-10-15


In [89]:
dof_missing = df[df['NEW_DOF'].isnull()]

percent_missing = len(dof_missing)/len(df)
print("After cleaning, there are {:,} ({:%}) rows with missing DOFs in the dataset.".format(len(dof_missing), percent_missing))



#random testing code here
#df.head()
# test = [pd.NaT, 1,2, 3]
# updated_dof = [x for x in test if pd.isnull(x)] 
# updated_dof

After cleaning, there are 11,785 (0.454381%) rows with missing DOFs in the dataset.


### **Step 2**: Subset the data to just include those rows where NEW_DOF <= DOS

In [90]:
#make sure the sentencing 
before_length = len(df)
df = df[df.NEW_DOF <= df.DOS] #should this be <= ?
after_length = len(df)

print("Before DOF <= DOS correction there were {:,} rows and after cleaning there were {:,} rows. A change of {:,}.".format(before_length, after_length, before_length - after_length))


Before DOF <= DOS correction there were 2,593,636 rows and after cleaning there were 2,581,813 rows. A change of 11,823.


In [91]:
#print(len(psc_trimmed), len(df), len(psc_trimmed) - len(df))

## Clean Missing PRS Score 

In [92]:
#subset to just the id variables with a PRS score missing
# id_varswith_prsmissing= set(df[df.prs.isnull()].id_variable)
# #remove id vars with missing PRS
# df_prs_notaffected = df[~df.id_variable.isin(id_varswith_prsmissing)]
# #reassign to working dataframe
# df = df_prs_notaffected 

In [93]:
# PRS SCORE CLEANING - removes only the rows where the PRS score is missing not the entire individual 
before_length = len(df)
df = df.loc[df['PRS'].notnull()]
after_length = len(df)

print("Before PRS correction there were {:,} rows and after cleaning there were {:,} rows. A change of {:,}.".format(before_length, after_length, before_length - after_length))




Before PRS correction there were 2,581,813 rows and after cleaning there were 2,581,787 rows. A change of 26.


## Clean JP CC Bug

In [94]:
# Obtaining the id variables with jp_bug
id_varswith_jpbug= set(df[df.JP_CC_BUG=='Y'].ID_VARIABLE)

In [95]:
# assigning all the rows associated with the jp bugs to a seperate dataframe 
df_with_jpbug=  df[df.ID_VARIABLE.isin(id_varswith_jpbug)]

In [96]:
# Removing the rows of these that are beyond 2016(rows in the future that are affected by the JP_CC_BUG)
df_jp_bug_cleaned = df_with_jpbug[df.DOS_YEAR<2016]

  


In [97]:
# Isolating the rows associated with id_vars in the original dataframe that is not associated with the bug
df_jpbug_notaffected = df[~df.ID_VARIABLE.isin(id_varswith_jpbug)]

In [98]:
# Rejoining the rows affected by the JP_CC_bug after cleaning them to the rows not affected by the bug
df_cleaned_1 = pd.concat([df_jpbug_notaffected,df_jp_bug_cleaned])  #new working df

df = df_cleaned_1

In [99]:
after_length = len(df)

print("After the JP_CC_BUG correction there are {:,} rows. ".format(after_length))


After the JP_CC_BUG correction there are 2,562,426 rows. 


## Implement At Risk Date Calculation Logic

NEED TO ADD THIS TO TAKE CARE OF TIME SERVED AND POTENTIALLY DIFFERENT JP_MIN VALS

IF INC_SANCTION_EXISTS = ‘N’: do nothing 
ELIF INC_SANCTION_EXISTS on at least 1 charge:
If DOS is the same across all charges:
Get MAX(JP_MIN) across charges for this JPR_ID
If DOS is different across charges:
Use JP_MIN + MAX(DOS) - [ MAX(DOS) - MIN(DOS) ] to get time incarcerated 
Should this be: JP_MIN + (MAX(DOS) + [ MAX(DOS) - MIN(DOS) ])


In [100]:
#Fix Issues with the missing JP_MIN
num_missing_jp_min = len(df.loc[pd.isna(df["JP_MIN"])]) #[["JPR_ID", "JP_MIN"]]
print("There are {:,} entries in the dataset missing a JP_MIN value.".format(num_missing_jp_min))

df["ADJ_JPMIN"] = df.groupby(["JPR_ID"])["JP_MIN"].transform("min")

#df.head()[["JPR_ID", "JP_MIN", "ADJ_JPMIN"]]

num_missing_jp_min = len(df.loc[pd.isna(df["ADJ_JPMIN"])]) #[["JPR_ID", "JP_MIN"]]
print("There are {:,} entries in the dataset missing a  ADJ_JPMIN value.".format(num_missing_jp_min))


There are 338,885 entries in the dataset missing a JP_MIN value.
There are 338,811 entries in the dataset missing a  ADJ_JPMIN value.


### **STEP 1**: Collapse the data at the ID_VARIABLE, DOS-LEVEL


In [116]:
df_collapsed = df.copy()

# #get the max values of the OGS and JP_MIN values -- possibly further adjustments need to be at this level
df_collapsed['OGS'] = df_collapsed.groupby(["ID_VARIABLE", "DOS"])["OGS"].transform(max)
df_collapsed["ADJ_JPMIN"] = df_collapsed.groupby(["ID_VARIABLE", "DOS"])["ADJ_JPMIN"].transform(max)

#collapse data to be at the id variable, DOS level (need to ungroup the data for the at_risk date calculation to work)
df_collapsed = df_collapsed.copy().groupby(["ID_VARIABLE", "DOS"]).first().reset_index()

#inspect the results
df_collapsed[["ID_VARIABLE", "DOS", "NEW_DOF", "INC_SANCTION_EXISTS", "ADJ_JPMIN", "OFN_LIFE_DEATH", "JP_LIFE_DEATH"]] 
#df_collapsed.loc[df_collapsed["JP_LIFE_DEATH"] == "N"][["NEW_DOF", "INC_SANCTION_EXISTS", "ADJ_JPMIN", "OFN_LIFE_DEATH", "JP_LIFE_DEATH"]] 

Unnamed: 0,ID_VARIABLE,DOS,NEW_DOF,INC_SANCTION_EXISTS,ADJ_JPMIN,OFN_LIFE_DEATH,JP_LIFE_DEATH
0,1000001,2010-02-18,2009-06-25,N,16.0,,
1,1000002,2017-01-31,2015-09-01,Y,120.0,,
2,1000003,2002-05-08,2001-09-07,N,0.0,,
3,1000003,2009-03-04,2009-03-04,Y,92.0,,
4,1000004,2013-12-10,2013-09-19,N,0.0,,
...,...,...,...,...,...,...,...
1481210,1916193,2002-01-07,2001-05-03,N,0.0,,
1481211,1916194,2016-11-14,2015-03-30,N,0.0,,
1481212,1916195,2009-06-04,2009-05-16,N,0.0,,
1481213,1916196,2014-03-03,2013-07-05,Y,31.0,,


### **STEP 2:** Calculate the AT_RISK_DT using the following logic

In [117]:
def create_at_risk_date(row):
    #need to account for REALLY large JP_MIN values
    #20 years in days = 10 * 365
    
    # Because of this error message OverflowError: Python int too large to convert to C long
    # 25 is more years than we have in our data, so their at_risk date also get set to some value far in the future
    upper_limit = 25.0 * 365.0
    
    num_days_in_month = 30.0
    
    #if offense has a life or death flag, set their at_risk_date abritarily large
    if row['OFN_LIFE_DEATH'] == "Y":
        at_risk_date = pd.to_datetime('2035-12-31')

    #if they were not incarcerated, then their at risk date is just their date of offense
    if row["INC_SANCTION_EXISTS"] == "N":
        at_risk_date = row['NEW_DOF']
    
    #if they were incarcerated, look at the below logic to determine their at-risk date
    else:

        if row["ADJ_JPMIN"] < upper_limit:

            if row["INC_SANCTION_EXISTS"] == "Y" and pd.notna(row['ADJ_JPMIN']):
                at_risk_date = row['DOS'] + pd.Timedelta(days = row['ADJ_JPMIN'])
            
            elif row["INC_SANCTION_EXISTS"] == "Y" and pd.notna(row['INCMIN']):
                at_risk_date = row['DOS'] + pd.Timedelta(days = row['INCMIN'] * num_days_in_month)

            #these are individuals who have life and death sentences
            #their inc_end date = 12/31/9999 -- see codebook for more details
            # elif row["INC_SANCTION_EXISTS"] == "Y" and pd.isna(row['INCMIN']) and pd.isna(row['ADJ_JPMIN']):
            #     at_risk_date = row["INC_END"]

            # elif row["INC_SANCTION_EXISTS"] == "N":
            #     at_risk_date = row['NEW_DOF']

            else:
                at_risk_date = row['INC_END']

        else:
            at_risk_date = pd.to_datetime('2035-12-31')
    
    return at_risk_date


# df["AT_RISK_DT"] = np.where(
#     df['INC_SANCTION_EXISTS'] == "Y" and pd.notna(df['JP_MIN']), 1, 0)

# test = df[:2000]
# #apply the function to the data (row by row)
# test["AT_RISK_DT"] = test.apply(create_at_risk_date, axis = 1)

#  #adjust so that the times do not include minutes and seconds
# test["AT_RISK_DT"] = pd.to_datetime(test["AT_RISK_DT"]).dt.date

# # #inspect the results
# test[['ID_VARIABLE', 'JPR_ID',"JP_MIN", "INCMIN", "INC_END", "ADJ_JPMIN", "INC_SANCTION_EXISTS", "DOS", "NEW_DOF", "AT_RISK_DT"]]

#test = df[:2000]
#apply the function to the data (row by row)
df_collapsed["AT_RISK_DT"] = df_collapsed.apply(create_at_risk_date, axis = 1)

 #adjust so that the times do not include minutes and seconds
df_collapsed["AT_RISK_DT"] = pd.to_datetime(df_collapsed["AT_RISK_DT"]).dt.date

# #inspect the results
df_collapsed[['ID_VARIABLE', 'JPR_ID',"JP_MIN", "INCMIN", "INC_END", "ADJ_JPMIN", "INC_SANCTION_EXISTS", "DOS", "NEW_DOF", "AT_RISK_DT"]]




Unnamed: 0,ID_VARIABLE,JPR_ID,JP_MIN,INCMIN,INC_END,ADJ_JPMIN,INC_SANCTION_EXISTS,DOS,NEW_DOF,AT_RISK_DT
0,1000001,4915383,16.0,0.526316,17 Jan 12,16.0,N,2010-02-18,2009-06-25,2009-06-25
1,1000002,5678165,120.0,4.000000,30 Jan 19,120.0,Y,2017-01-31,2015-09-01,2017-05-31
2,1000003,1070248,0.0,,,0.0,N,2002-05-08,2001-09-07,2001-09-07
3,1000003,4902797,92.0,3.000000,03 Sep 09,92.0,Y,2009-03-04,2009-03-04,2009-06-04
4,1000004,5318013,0.0,,,0.0,N,2013-12-10,2013-09-19,2013-09-19
...,...,...,...,...,...,...,...,...,...,...
1481210,1916193,425719,0.0,,,0.0,N,2002-01-07,2001-05-03,2001-05-03
1481211,1916194,5557602,0.0,,,0.0,N,2016-11-14,2015-03-30,2015-03-30
1481212,1916195,3884756,0.0,,,0.0,N,2009-06-04,2009-05-16,2009-05-16
1481213,1916196,5351194,31.0,1.000000,02 Mar 15,31.0,Y,2014-03-03,2013-07-05,2014-04-03


In the above at_risk_date calculation code, there is an "upper_limit" because the largest JP_MIN value is 230,000+ days, which is the equivalent of about 631 years. This person would not recidivate in our dataset and Python throws a "OverflowError: Python int too large to convert to C long" for these individuals. So, in order to allow the code to run, those with jp_min values equivalent to more days than we have data for, will just get an at-risk date very far into the future.

In [112]:
#OverflowError: Python int too large to convert to C long
# df["ADJ_JPMIN"].min()
largest_jpmin =  df_collapsed["ADJ_JPMIN"].max()
largest_jpmin_in_years = largest_jpmin/365.0
print("The largest JP_MIN value is {:,} days, which is {} years. This causes Python to throw the following error: OverflowError: Python int too large to convert to C long.".format(largest_jpmin, largest_jpmin_in_years))

#df[:1]["DOS"] + pd.Timedelta(days = largest_jpmin)

The largest JP_MIN value is 230,468.0 days, which is 631.4191780821918 years. This causes Python to throw the following error: OverflowError: Python int too large to convert to C long.


## Populate Next DOF

In [120]:
#sort the data
df_collapsed = df_collapsed.sort_values(by = ["ID_VARIABLE", "NEW_DOF"])

#shift the data up by one to create the new vaariable "NEXT_DOF"
df_collapsed['NEXT_DOF'] = df_collapsed.groupby(['ID_VARIABLE'])['NEW_DOF'].shift(-1).dt.date

df_collapsed[:20][["ID_VARIABLE", "JPR_ID", "DOS", "NEW_DOF", "NEXT_DOF", "AT_RISK_DT", "INC_SANCTION_EXISTS"]]

Unnamed: 0,ID_VARIABLE,JPR_ID,DOS,NEW_DOF,NEXT_DOF,AT_RISK_DT,INC_SANCTION_EXISTS
0,1000001,4915383,2010-02-18,2009-06-25,NaT,2009-06-25,N
1,1000002,5678165,2017-01-31,2015-09-01,NaT,2017-05-31,Y
2,1000003,1070248,2002-05-08,2001-09-07,2009-03-04,2001-09-07,N
3,1000003,4902797,2009-03-04,2009-03-04,NaT,2009-06-04,Y
4,1000004,5318013,2013-12-10,2013-09-19,2018-07-09,2013-09-19,N
5,1000004,5922309,2018-09-26,2018-07-09,NaT,2018-07-09,N
6,1000005,1203958,2008-08-11,2006-08-14,NaT,2009-02-10,Y
7,1000006,378762,2006-08-30,2005-10-08,NaT,2007-11-30,Y
8,1000007,43891,2004-03-02,2003-04-18,NaT,2003-04-18,N
9,1000008,4992054,2011-05-13,2011-01-16,NaT,2011-11-13,Y


## Check for "Free Time" -- Make Sure Individuals in the Dataset Have Enough Free Time

Individuals whose **first*** offense results in an AT_RISK DATE of 2017 or greater = remove these individuals


1. subset just to those whose at_risk date < max DOS df[["DOS"]].max()
2. then, we also want to remove those whose last next_dof is null and whose last dof > 2017

Essentially, we want to subset (whatever grouping variable we're using) to just those entries where next_dof is null and FOR THIS SAME ROW, if the dof >= pd.todatetime("2017-01-01") -- remove these entries

should get us to have the correct individuals with free time

In [143]:
#subset to those whose at_risk_date > the largest sentencing date that we have
#what is the maximum sentence date?

before_length = len(df_collapsed)

last_day = pd.to_datetime(df_collapsed[["DOS"]].max())[0]  
df_collapsed = df_collapsed[df_collapsed["AT_RISK_DT"] <= last_day]

after_length = len(df_collapsed) 

print("There are {:,} id_var, dos combos where the at risk date is after the last date of sentence available.".format(before_length - after_length))


There are 31,886 id_var, dos combos where the at risk date is after the last date of sentence available.


Here, I calculate a "LAST_DOF" variable, which will then be used to subset the data to only those whose latest offense was before 2017

In [148]:
df_collapsed["LAST_DOF"] = df_collapsed.loc[df_collapsed["NEXT_DOF"].isnull(), "NEW_DOF"]

#np.where(
   # df_collapsed['NEXT_DOF'].isnull(), df_collapsed['NEW_DOF'], pd.nan)


df_collapsed[["ID_VARIABLE", "DOS", "NEW_DOF", "NEXT_DOF", "LAST_DOF"]]

Unnamed: 0,ID_VARIABLE,DOS,NEW_DOF,NEXT_DOF,LAST_DOF
0,1000001,2010-02-18,2009-06-25,NaT,2009-06-25
1,1000002,2017-01-31,2015-09-01,NaT,2015-09-01
2,1000003,2002-05-08,2001-09-07,2009-03-04,NaT
3,1000003,2009-03-04,2009-03-04,NaT,2009-03-04
4,1000004,2013-12-10,2013-09-19,2018-07-09,NaT
...,...,...,...,...,...
1481210,1916193,2002-01-07,2001-05-03,,2001-05-03
1481211,1916194,2016-11-14,2015-03-30,,2015-03-30
1481212,1916195,2009-06-04,2009-05-16,,2009-05-16
1481213,1916196,2014-03-03,2013-07-05,,2013-07-05


In [155]:
#subset the data to only those whose last_dof is before 2017
before_length = len(df_collapsed)

last_day = pd.to_datetime("2017-01-01") 

#subset the dataset to either where the LAST_DOF is null OR LAST_DOF < last_day
df_collapsed = df_collapsed.loc[(df_collapsed["LAST_DOF"].isnull()) | (df_collapsed["LAST_DOF"] < last_day)]

after_length = len(df_collapsed) 

print("There are {:,} id_var, dos combos whose's last dof is not in scope.".format(before_length - after_length))
df_collapsed[["ID_VARIABLE", "DOS", "NEW_DOF", "NEXT_DOF", "LAST_DOF"]]


There are 125,474 id_var, dos combos whose's last dof is not in scope.


## CREATE TIME TO RECIDIVATE AND RECIDIVSM VARIABLES

In [None]:
# MAKE SURE THE AT_RISK_DT CALCULATIONS ARE ACCURATE

In [176]:
#inspect the results -- an issue where the next_dof is less than the next_dof
#print(df_collapsed["TIME_TO_RECIDIVATE"].min(), df_collapsed["TIME_TO_RECIDIVATE"].max())
list_neg_time = list(df_collapsed.loc[df_collapsed["TIME_TO_RECIDIVATE"] < 0]["ID_VARIABLE"])
neg_time_analysis = df_collapsed.loc[df_collapsed["ID_VARIABLE"].isin(list_neg_time)]

neg_time_analysis[["ID_VARIABLE", "DOS", "ADJ_JPMIN", "NEW_DOF", "NEXT_DOF", "AT_RISK_DT", "TIME_TO_RECIDIVATE"]]

#they committed another offense before their previous at risk date 
# df_collapsed['NEXT_DOF'] = df_collapsed.groupby(['ID_VARIABLE'])['NEW_DOF'].shift(-1).dt.date

# df_collapsed["PREVIOUS_AT_RISK_DT"] = df_collapsed.groupby(['ID_VARIABLE'])['AT_RISK_DT'].shift(1) #.dt.date
#df_collapsed[["ID_VARIABLE", "DOS", "ADJ_JPMIN", "NEW_DOF", "NEXT_DOF", "AT_RISK_DT", "PREVIOUS_AT_RISK_DT"]]

# test = df_collapsed.loc[df_collapsed["NEXT_DOF"] <df_collapsed["PREVIOUS_AT_RISK_DT"] ]
# test[["ID_VARIABLE", "DOS", "ADJ_JPMIN", "NEW_DOF", "NEXT_DOF", "AT_RISK_DT", "PREVIOUS_AT_RISK_DT"]]


Unnamed: 0,ID_VARIABLE,DOS,ADJ_JPMIN,NEW_DOF,NEXT_DOF,AT_RISK_DT,TIME_TO_RECIDIVATE
27,1000016,2003-01-10,1096.0,2001-04-20,2006-04-15,2006-01-10,95.0
28,1000016,2006-09-18,5.0,2006-04-15,2006-04-16,2006-09-23,-160.0
29,1000016,2006-10-16,1004.0,2006-04-16,NaT,2009-07-16,
30,1000017,2013-03-05,0.0,2012-10-12,2015-12-05,2012-10-12,1149.0
32,1000017,2016-06-10,44.0,2015-12-05,2016-02-07,2016-07-24,-168.0
...,...,...,...,...,...,...,...
1481165,1916166,2005-04-07,364.0,2004-02-07,2006-03-04,2006-04-06,-33.0
1481168,1916166,2009-06-01,1096.0,2006-03-04,2006-03-04,2012-06-01,-2281.0
1481169,1916166,2009-06-15,0.0,2006-03-04,2008-01-23,2006-03-04,690.0
1481167,1916166,2009-04-17,183.0,2008-01-23,2008-10-10,2008-01-23,261.0


In [156]:
#df_collapsed[["ID_VARIABLE", "DOS", "NEW_DOF", "NEXT_DOF", "LAST_DOF"]]

Unnamed: 0,ID_VARIABLE,DOS,NEW_DOF,NEXT_DOF,LAST_DOF
0,1000001,2010-02-18,2009-06-25,NaT,2009-06-25
1,1000002,2017-01-31,2015-09-01,NaT,2015-09-01
2,1000003,2002-05-08,2001-09-07,2009-03-04,NaT
3,1000003,2009-03-04,2009-03-04,NaT,2009-03-04
4,1000004,2013-12-10,2013-09-19,2018-07-09,NaT
...,...,...,...,...,...
1481209,1916192,2015-10-20,2014-05-06,,2014-05-06
1481210,1916193,2002-01-07,2001-05-03,,2001-05-03
1481211,1916194,2016-11-14,2015-03-30,,2015-03-30
1481212,1916195,2009-06-04,2009-05-16,,2009-05-16


In [178]:
#subtract the next_dof and at_risk_dt variables
df_collapsed['TIME_TO_RECIDIVATE'] = np.where(
    df_collapsed['NEXT_DOF'] > df_collapsed['AT_RISK_DT'],  pd.to_datetime(df_collapsed['NEXT_DOF']) - pd.to_datetime(df_collapsed['AT_RISK_DT']),

    pd.to_datetime(df_collapsed['NEXT_DOF']) - pd.to_datetime(df_collapsed['NEW_DOF'])

    )



#update the time to recidivate column to JUST be the number of days as an integer/float
df_collapsed['TIME_TO_RECIDIVATE'] = df_collapsed['TIME_TO_RECIDIVATE'].dt.days

df_collapsed[["ID_VARIABLE", "DOS", "NEW_DOF", "NEXT_DOF", "TIME_TO_RECIDIVATE"]]


Unnamed: 0,ID_VARIABLE,DOS,NEW_DOF,NEXT_DOF,TIME_TO_RECIDIVATE
0,1000001,2010-02-18,2009-06-25,NaT,
1,1000002,2017-01-31,2015-09-01,NaT,
2,1000003,2002-05-08,2001-09-07,2009-03-04,2735.0
3,1000003,2009-03-04,2009-03-04,NaT,
4,1000004,2013-12-10,2013-09-19,NaT,
...,...,...,...,...,...
1481209,1916192,2015-10-20,2014-05-06,,
1481210,1916193,2002-01-07,2001-05-03,,
1481211,1916194,2016-11-14,2015-03-30,,
1481212,1916195,2009-06-04,2009-05-16,,


In [179]:
# list_neg_time = list(df_collapsed.loc[df_collapsed["TIME_TO_RECIDIVATE"] < 0]["ID_VARIABLE"])
# neg_time_analysis = df_collapsed.loc[df_collapsed["ID_VARIABLE"].isin(list_neg_time)]
# neg_time_analysis[["ID_VARIABLE", "DOS", "ADJ_JPMIN", "NEW_DOF", "NEXT_DOF", "AT_RISK_DT", "TIME_TO_RECIDIVATE"]]


Unnamed: 0,ID_VARIABLE,DOS,ADJ_JPMIN,NEW_DOF,NEXT_DOF,AT_RISK_DT,TIME_TO_RECIDIVATE


In [180]:
#number of days in  years
three_years_in_days = float(3) * 365.0  
five_years_in_days = float(5) * 365.0  

#JUDICIAL-PROCEEDING LEVEL RECIDIVISM
#final_next_dof["RECIDIVISM_3Y"] = final_next_dof.apply(create_recidivism_var, years = 3, axis = 1)

df_collapsed["RECIDIVISM_3Y"] = np.where(
    df_collapsed['TIME_TO_RECIDIVATE'] <= three_years_in_days, 1, 0)

df_collapsed["RECIDIVISM_5Y"] = np.where(
    df_collapsed['TIME_TO_RECIDIVATE'] <= five_years_in_days, 1, 0)

# Export The Results to CSV (MAKE SURE DATA IS AT THE DOS, ID_VARIABLE LEVEL)

In [None]:
#Export the Results to a CSV

#double check that the 

#export the dataframe with the recidivism variables to a new dataframe
output_path = os.path.join(pa_sentencing_path, "Project", "data", "recidivsm_dataset.csv")

df_collapsed.to_csv(output_path) #export the final results