# Data Preparation for the Interactive Analysis Layer

The questions provided by Citizens for Juvenile Justice (CfJJ) share an underlying structure: given several categories of incident (for example, incidents that are 100J eligible and have a non-guilty disposition) and a maximum number of incidents per category, how many individuals have only incidents that belong to at least one of the given categories, without exceeding the maximum in any category?

Answering a question with this structure requires a holistic view of all of an individual's recorded incidents. For the interactive analysis layer to answer such questions efficiently, this per-individual information, which spans multiple dataframe rows and columns, must be condensed into a compact form.

Each output file of this notebook contains a single column, with each row representing one individual (or one case, for Middlesex). These values are **all-incident codes.** Each all-incident code is a string of 1s and 0s formed by concatenating six-digit single **incident codes.**

Take the example all-incident code: 100110011001. It contains 12 digits, so this individual has two 6-digit incidents. Their individual incident codes are 100110 and 011001.
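
The split described above can be sketched in a few lines. This is a minimal illustration rather than code from the notebook; the helper name `split_incident_codes` is hypothetical.

```python
def split_incident_codes(all_incident_code, width=6):
    """Split an all-incident code into fixed-width single incident codes."""
    if len(all_incident_code) % width != 0:
        raise ValueError('all-incident code length must be a multiple of %d' % width)
    return [all_incident_code[i:i + width]
            for i in range(0, len(all_incident_code), width)]

codes = split_incident_codes('100110011001')
print(codes)  # ['100110', '011001']
```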

>If the first digit is 1, the individual was a juvenile at the time of the incident.

>If the second digit is 1, every offense in the incident is considered eligible under 100J.

>If the third digit is 1, the incident contains at least one sex or murder offense.

>If the fourth digit is 1, the incident contains at least one offense with a guilty disposition.

>If the fifth digit is 1, then the incident contains at least one offense with missing disposition data.

>If the sixth digit is 1, then the incident contains at least one offense for which the waiting period has not yet finished.

Incident code 100110 would be interpreted as a juvenile incident with at least one offense that is not expungeable under 100J, no sex or murder offenses, and at least one guilty disposition. It is missing some disposition data, and the waiting period has already finished.

Incident code 011001 would be interpreted as an adult incident in which every offense is eligible under 100J, with at least one sex or murder offense. There are no guilty dispositions or missing dispositions, but the waiting period has not yet finished.
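
To make the digit positions concrete, the sketch below decodes a single incident code into named flags, following the six digit definitions listed above. The flag names and the function `decode_incident_code` are hypothetical and chosen only for this illustration.

```python
# Hypothetical flag names, one per digit position (first to sixth)
FLAG_NAMES = [
    'juvenile',
    'all_100j_eligible',
    'sex_or_murder',
    'guilty',
    'missing_disposition',
    'waiting_period_open',
]

def decode_incident_code(code):
    """Map a six-digit incident code to a dict of boolean flags."""
    if len(code) != 6 or set(code) - {'0', '1'}:
        raise ValueError('incident code must be six digits of 0 or 1')
    return {name: digit == '1' for name, digit in zip(FLAG_NAMES, code)}

first = decode_incident_code('100110')
second = decode_incident_code('011001')
print(first)
print(second)
```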

---
In addition to generating all-incident codes for the interactive analysis layer to use, this notebook answers the three specific questions provided by CfJJ for each region with available data.

**IMPORTANT**: dummy data is currently being used to determine which offenses are misdemeanors and which are felonies. Until that information has been properly derived, the distinction between individuals eligible today and individuals eligible in the future is not meaningful.

In [354]:
import requests
import pandas as pd
import numpy as np
import re
import copy

In [355]:
# Create dataframes for Northwest, Suffolk, and Middlesex
nw = pd.read_csv('../data/cleaned/clean_northwestern.csv', encoding='utf8',
                    dtype={})
sf = pd.read_csv('../data/cleaned/clean_suffolk.csv', encoding='utf8',
                    dtype={})
ms = pd.read_csv('../data/cleaned/clean_middlesex.csv', encoding='utf8',
                    dtype={'Incident_Guilty_or_missing':str}, low_memory=False)
pd.set_option("display.max_columns", None)

## Step 1: Additional Columns

In [356]:
# Add column: 'Inc_Juvenile', so the information is found under the same column name in all regions
ms['Inc_Juvenile'] = ms['JuvenileC']
# Suffolk has no juvenile data; all incidents are treated as juvenile
sf['Inc_Juvenile'] = True
nw['Inc_Juvenile'] = nw['Age at Offense'] < 21

In [357]:
# Middlesex has no personal identifier, but the 'Case Number' column functions as if it were a personal identifier
# for the intents and purposes of this notebook. These adjustments allow Middlesex data to be processed by
# the same functions that process Northwest and Suffolk data
ms.rename(columns={'Case Number': 'Person ID'}, inplace=True)
ms['Incidents per Person'] = 1

In [358]:
# Add column: 'Inc_Felony' (boolean); TRUE if the incident contains at least one felony offense
# This is dummy data; eventually this information will be implemented earlier in the data pipeline
np.random.seed(42)
for x in [nw, ms, sf]:
    x['Inc_Felony'] = (np.random.randint(0,20, x.shape[0]))
    x['Inc_Felony'] = x['Inc_Felony'] == 19
    
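# Collapse to one value per incident. Note: with booleans, 'min' yields True only when every
# row of the incident was flagged, whereas 'max' would implement the "at least one felony"
# rule described in the comment above.
nw['Inc_Felony'] = nw.groupby(['Person ID', 'Offense Date'])['Inc_Felony'].transform('min')
sf['Inc_Felony'] = sf.groupby(['Person ID', 'Offense Date'])['Inc_Felony'].transform('min')
ms['Inc_Felony'] = ms.groupby(['Person ID'])['Inc_Felony'].transform('min')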
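
As a small illustration of how a per-offense boolean can be rolled up to the incident level, the sketch below compares `transform('max')` and `transform('min')` on toy data. The column names `Offense_Felony`, `Any_Felony`, and `All_Felony` are invented for this example and do not appear in the real data.

```python
import pandas as pd

# Toy data: person 1 has one incident with two offenses, one of which is a felony
toy = pd.DataFrame({
    'Person ID': [1, 1, 2, 2],
    'Offense Date': ['2015-01-01', '2015-01-01', '2016-06-01', '2016-06-01'],
    'Offense_Felony': [True, False, False, False],
})

by_incident = toy.groupby(['Person ID', 'Offense Date'])['Offense_Felony']
# 'max' over booleans behaves like "any offense in the incident is a felony"
toy['Any_Felony'] = by_incident.transform('max')
# 'min' over booleans behaves like "every offense in the incident is a felony"
toy['All_Felony'] = by_incident.transform('min')
print(toy)
```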

In [359]:
# Add column: 'Inc_Years_Remaining'; the number of years that must still pass before the incident may be eligible
# That's a maximum of 3 years for a misdemeanor, and 7 years for a felony
# Note that any incident for which the waiting period has already passed will have a value <= 0
for x in [nw, ms, sf]:
    # Waiting period is 7 years for a felony incident and 3 years otherwise
    x['Inc_Years_Remaining'] = np.where(x['Inc_Felony'], 7, 3) - x['years_since_offense']
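
The waiting-period arithmetic can be checked on a toy frame. This is only a sketch: the values are made up, and the `Still_Waiting` column is a hypothetical helper that mirrors the test later used for the sixth digit of the incident code.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Inc_Felony': [True, True, False, False],
    'years_since_offense': [2.0, 9.5, 1.0, 3.0],
})

# 7-year waiting period for felony incidents, 3-year period otherwise
waiting_period = np.where(toy['Inc_Felony'], 7, 3)
toy['Inc_Years_Remaining'] = waiting_period - toy['years_since_offense']

# A value <= 0 means the waiting period has already passed
toy['Still_Waiting'] = toy['Inc_Years_Remaining'] > 0
print(toy)
```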

### Incident Codes

In [360]:
def generateIncidentCode(row):
    # This function returns a single incident code. For each digit position, a 1 indicates:
    # First: incident occurred at a juvenile age
    # Second: all incident offenses are eligible for expungement under 100J
    # Third: at least one incident offense is a sex or murder offense
    # Fourth: at least one incident offense has a guilty disposition
    # Fifth: at least one incident offense lacks all disposition data
    # Sixth: not enough years have passed for the incident to be currently expungeable
    
    result = list('000000')
    if row['Inc_Juvenile']:
        result[0] = '1'
    if row['Inc_Expungeable_Attempts_Are']:
        result[1] = '1'
    if row['Inc_Sex_or_Murder']:
        result[2] = '1'
    if row['Incident_Guilty_or_missing'] == 'True':
        result[3] = '1'
    if row['Inc_Missing_Any_Dispo']:
        result[4] = '1'
    if row['Inc_Years_Remaining'] > 0:
        result[5] = '1'
        
    return ''.join(result)

In [361]:
# Add column: 'Incident Code'
for x in [nw, ms, sf]:
    x['Incident Code'] = x.apply(generateIncidentCode, axis=1)
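
For large dataframes, a row-wise `apply` can be slow. The sketch below shows one possible vectorized way to build the same six-digit codes by concatenating string columns. It is an alternative illustration on toy data, not the method used in this notebook, and it assumes the flag columns hold plain booleans (with the guilty column stored as the strings 'True' and 'False').

```python
import pandas as pd

toy = pd.DataFrame({
    'Inc_Juvenile': [True, False],
    'Inc_Expungeable_Attempts_Are': [False, True],
    'Inc_Sex_or_Murder': [False, True],
    'Incident_Guilty_or_missing': ['True', 'False'],
    'Inc_Missing_Any_Dispo': [True, False],
    'Inc_Years_Remaining': [-1.5, 2.0],
})

# One boolean Series per digit position, in the order first to sixth
flags = [
    toy['Inc_Juvenile'].astype(bool),
    toy['Inc_Expungeable_Attempts_Are'].astype(bool),
    toy['Inc_Sex_or_Murder'].astype(bool),
    toy['Incident_Guilty_or_missing'] == 'True',
    toy['Inc_Missing_Any_Dispo'].astype(bool),
    toy['Inc_Years_Remaining'] > 0,
]

# Convert each flag to '1' or '0' and concatenate the digits element-wise
code = flags[0].astype(int).astype(str)
for flag in flags[1:]:
    code = code + flag.astype(int).astype(str)
toy['Incident Code'] = code
print(toy['Incident Code'].tolist())  # ['100110', '011001']
```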

## Step 2: Reduce Dataframes Into All-Incident Codes (One Code per Individual)

In [362]:
def summarize(df):
    # Keep one row per incident (Person ID + Offense Date) with that incident's code
    incidents = df.groupby(['Person ID', 'Offense Date'])['Incident Code'].first().reset_index()
    # Concatenate each person's incident codes into a single all-incident code
    codes = incidents.groupby('Person ID')['Incident Code'].agg(''.join)
    # Keep only the all-incident code column, one row per person
    return codes.rename('All Incident Codes').reset_index(drop=True).to_frame()

# Bind the results to the names used by printAnswers and by the output step
nw_summary = summarize(nw)
sf_summary = summarize(sf)
ms_summary = summarize(ms)
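
The reduction in this step can be followed on a toy example: offense rows are first collapsed to one row per incident, and the incident codes are then concatenated per person. The data below is invented for illustration only.

```python
import pandas as pd

# Person A has two incidents (the first spans two offense rows); person B has one
toy = pd.DataFrame({
    'Person ID': ['A', 'A', 'A', 'B'],
    'Offense Date': ['2014-03-01', '2014-03-01', '2016-07-15', '2015-01-20'],
    'Incident Code': ['100110', '100110', '011001', '110000'],
})

# One row per incident
incidents = toy.groupby(['Person ID', 'Offense Date'])['Incident Code'].first().reset_index()
# One all-incident code per person
all_codes = incidents.groupby('Person ID')['Incident Code'].agg(''.join)
print(all_codes)
```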

## Step 3: Answers to questions provided by CfJJ

In [363]:
# Given an all-incident code and the incident categories associated with a specific question, this function returns:
# 0 if the individual is never eligible
# 1 if the individual is eligible today and has no missing data
# 2 if the individual is eligible today but has some missing disposition data
# 3 if the individual will become eligible in the future and has no missing data
# 4 if the individual will become eligible in the future but has some missing disposition data
def determineEligibility(row, categories):
    categoryList = copy.deepcopy(categories)
    incidentString = row['All Incident Codes']
    incidents = re.findall('......', incidentString)
    eligibleToday = True
    missingDispo = False
    categoryFound = False
    
    for incident in incidents:
        if incident[-1] == '1':
            eligibleToday = False
        if incident[-2] == '1':
            missingDispo = True
        
        for category in categoryList:
            categoryRegex = re.compile(category[1])
            if categoryRegex.match(incident):
                categoryFound = True
                # The focused incident matches the focused category; decrement the category allotment
                category[0] = category[0] - 1
                if category[0] < 0:
                    # If any category exceeds its allotment, this individual is never eligible
                    return 0
        # If the incident does not belong to any of the given categories, this individual is never eligible
        if not categoryFound:
            return 0
        categoryFound = False
    # If this point is reached, the individual is eligible, but may still need to wait for the 3 or 7 years to pass
    if eligibleToday and not missingDispo:
        return 1
    elif eligibleToday and missingDispo:
        return 2
    elif not eligibleToday and not missingDispo:
        return 3
    elif not eligibleToday and missingDispo:
        return 4
    
    # This point shouldn't ever be reachable
    return -1

In [364]:
def printAnswers(categories, region):
    if region == 'nw':
        regionName = 'Northwest'
        df = nw_summary.copy()
        unit = 'individuals'
    elif region == 'sf':
        regionName = 'Suffolk'
        df = sf_summary.copy()
        unit = 'individuals'
    elif region == 'ms':
        regionName = 'Middlesex'
        df = ms_summary.copy()
        unit = 'cases'
    else:
        print('Invalid region provided')
        return
    
    df['Result'] = df.apply(lambda row: determineEligibility(row, categories), axis=1)
    
    neverEligible = (df['Result'].values == 0).sum()
    eligibleNow = (df['Result'].values == 1).sum()
    eligibleNowIncomplete = (df['Result'].values == 2).sum()
    eligibleLater = (df['Result'].values == 3).sum()
    eligibleLaterIncomplete = (df['Result'].values == 4).sum()
    
    print(regionName)
    print(eligibleNow + eligibleNowIncomplete, unit, 'are eligible today.', eligibleNowIncomplete, 'of them have incomplete disposition data.')
    print('An additional', eligibleLater + eligibleLaterIncomplete, unit, 'will become eligible after their waiting period has ended.', eligibleLaterIncomplete, 'of them have incomplete disposition data.')
    print(neverEligible, unit, 'will never be eligible.\n')

### Using the answerQuestion Function

To print the answers to any question of the CfJJ question structure discussed earlier, the question must be translated into a list that can be interpreted by the answerQuestion function.

Each item in the list represents one of the question's incident categories. These items are themselves two-item lists; the first value is the maximum allowed number of incidents within that category before an individual is ineligible, and the second value contains a four-digit **category code** representing the demands of that particular question category.

These four digits correspond exactly to the first four digits of an incident code, also discussed earlier. Each digit can be 1 or 0, and a period in any position indicates that either 1 or 0 is acceptable.

For example, the first category of the first question from CfJJ indicates that an eligible individual may have up to two incidents that meet these criteria: the individual was a juvenile, no offenses are ineligible under 100J, and at least one offense has a guilty disposition. Following the meanings of the digits in an incident code, the category code representing these criteria would be '11.1'. The first digit is 1 because it must be a juvenile incident; the second digit is 1 because it must be a 100J eligible incident; the third digit is a period because the criteria do not make any demands regarding sex/murder offenses; the fourth digit is 1 because there must be a guilty disposition.

Because the question allows up to 2 such incidents, this would be passed to answerQuestion as [2, '11.1']. A question can contain any number of these question categories.
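
The sketch below shows how a category list of this shape can be checked against one all-incident code using regular expressions. It is a simplified, standalone illustration that returns only True or False; the function name `within_limits` and the example codes are hypothetical.

```python
import re

def within_limits(all_incident_code, categories):
    """True if every incident matches some category and no category exceeds its cap."""
    remaining = [limit for limit, _ in categories]
    # re.match anchors at the start, so a 4-character code tests the first 4 digits
    patterns = [re.compile(code) for _, code in categories]
    incidents = [all_incident_code[i:i + 6]
                 for i in range(0, len(all_incident_code), 6)]

    for incident in incidents:
        matched = False
        for idx, pattern in enumerate(patterns):
            if pattern.match(incident):
                matched = True
                remaining[idx] -= 1
                if remaining[idx] < 0:
                    return False  # too many incidents in this category
        if not matched:
            return False  # incident belongs to none of the categories
    return True

question_one = [[2, '11.1'], [2, '11.0']]
two_juvenile = within_limits('110100110000', question_one)   # one guilty, one not guilty
has_adult = within_limits('110100010000', question_one)      # second incident is adult
three_guilty = within_limits('110100' * 3, question_one)     # exceeds the cap of 2
print(two_juvenile, has_adult, three_guilty)  # True False False
```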


In [365]:
def answerQuestion(categories):
    printAnswers(categories, 'nw')
    printAnswers(categories, 'sf')
    printAnswers(categories, 'ms')

**Question 1**

> Category 1: Up to 2 (Code 11.1)
>> - Must be Juvenile
>> - Must be fully expungeable by 100J
>> - Must have at least one guilty disposition

> Category 2: Up to 2 (Code 11.0)
>> - Must be Juvenile
>> - Must be fully expungeable by 100J
>> - Must not have any guilty dispositions

In [366]:
answerQuestion([[2, '11.1'], [2, '11.0']])

Northwest
980 individuals are eligible today. 34 of them have incomplete disposition data.
An additional 386 individuals will become eligible after their waiting period has ended. 19 of them have incomplete disposition data.
18151 individuals will never be eligible.

Suffolk
40728 individuals are eligible today. 5859 of them have incomplete disposition data.
An additional 9168 individuals will become eligible after their waiting period has ended. 3263 of them have incomplete disposition data.
40544 individuals will never be eligible.

Middlesex
2443 cases are eligible today. 0 of them have incomplete disposition data.
An additional 767 cases will become eligible after their waiting period has ended. 0 of them have incomplete disposition data.
160501 cases will never be eligible.



**Question 2**

> Category 1: Up to 2 (Code 1.01)
>> - Must be Juvenile
>> - Must not have any sex or murder offenses
>> - Must have at least one guilty disposition

> Category 2: Up to 2 (Code 1.00)
>> - Must be Juvenile
>> - Must not have any sex or murder offenses
>> - Must not have any guilty dispositions

In [367]:
answerQuestion([[2, '1.01'], [2, '1.00']])

Northwest
1510 individuals are eligible today. 47 of them have incomplete disposition data.
An additional 610 individuals will become eligible after their waiting period has ended. 30 of them have incomplete disposition data.
17397 individuals will never be eligible.

Suffolk
62112 individuals are eligible today. 7878 of them have incomplete disposition data.
An additional 15376 individuals will become eligible after their waiting period has ended. 6406 of them have incomplete disposition data.
12952 individuals will never be eligible.

Middlesex
4216 cases are eligible today. 0 of them have incomplete disposition data.
An additional 1466 cases will become eligible after their waiting period has ended. 0 of them have incomplete disposition data.
158029 cases will never be eligible.



**Question 3**

> Category 1: Must be 0 (Code 10.1)
>> - Must be Juvenile
>> - Must NOT be fully expungeable by 100J
>> - Must have at least one guilty disposition

> Category 2: Up to 4 (Code 1...)
>> - Must be Juvenile

In [368]:
answerQuestion([[0, '10.1'], [4, '1...']])

Northwest
1357 individuals are eligible today. 46 of them have incomplete disposition data.
An additional 567 individuals will become eligible after their waiting period has ended. 35 of them have incomplete disposition data.
17593 individuals will never be eligible.

Suffolk
62486 individuals are eligible today. 8898 of them have incomplete disposition data.
An additional 17184 individuals will become eligible after their waiting period has ended. 7843 of them have incomplete disposition data.
10770 individuals will never be eligible.

Middlesex
4111 cases are eligible today. 0 of them have incomplete disposition data.
An additional 1443 cases will become eligible after their waiting period has ended. 0 of them have incomplete disposition data.
158157 cases will never be eligible.



## Step 4: Answers to questions provided by CfJJ (Alternate Method)

The incident-code based approach used to answer the questions above is efficient and flexible, ideal for use by the interactive analysis layer. However, the approach is also complex enough to be prone to errors. As part of the testing to ensure the accuracy of the above method, the three questions provided by CfJJ are answered here using a different method not involving incident codes or regex pattern matching. The approach below is less flexible and efficient, but more straightforward to debug for accuracy.

It is important that the answers given to each question above exactly match the answers given for the same questions below. A mismatch indicates that an error has been made and must be located.
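
One way to make this comparison explicit is to collect each method's tallies in a dictionary and assert that they agree. The sketch below is a hypothetical helper, not part of the notebook; the key names are invented, and the numbers are the Question 1 Northwest tallies printed in this notebook.

```python
def find_mismatches(primary, alternate):
    """Return {key: (primary_value, alternate_value)} for every tally that differs."""
    keys = set(primary) | set(alternate)
    return {
        key: (primary.get(key), alternate.get(key))
        for key in sorted(keys)
        if primary.get(key) != alternate.get(key)
    }

# Question 1, Northwest: tallies from the incident-code method and the alternate method
incident_code_method = {'eligible_now': 980, 'eligible_later': 386, 'never_eligible': 18151}
alternate_method = {'eligible_now': 980, 'eligible_later': 386, 'never_eligible': 18151}

mismatches = find_mismatches(incident_code_method, alternate_method)
assert not mismatches, mismatches
```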

In [369]:
# Given a dataframe with one row for each incident, these functions return
# a slice of that dataframe containing incidents eligible under the specified question

# QUESTION ONE
def q1(df):
    # Q1 Category 1: Juvenile Incident, 100J Eligible, Guilty (Up to 2)
    df['Inc_C1'] = 0
    df.loc[
      (df['Inc_Juvenile'] == True) &
      (df['Inc_Expungeable_Attempts_Are'] == True) &
      (df['Incident_Guilty_or_missing'] == 'True'),
      'Inc_C1'
    ] = 1
    df['C1_Per_Person'] = df.groupby(['Person ID'])['Inc_C1'].transform('sum')

    # Q1 Category 2: Juvenile Incident, 100J Eligible, Not Guilty (Up to 2)
    df['Inc_C2'] = 0
    df.loc[
      (df['Inc_Juvenile'] == True) &
      (df['Inc_Expungeable_Attempts_Are'] == True) &
      (df['Incident_Guilty_or_missing'] != 'True'),
      'Inc_C2'
    ] = 1
    df['C2_Per_Person'] = df.groupby(['Person ID'])['Inc_C2'].transform('sum')

    # Determine eligible individuals based on the per-category caps
    eligible = df.loc[
      (df['C1_Per_Person'] <= 2) &
      (df['C2_Per_Person'] <= 2) &
      ((df['C1_Per_Person'] + df['C2_Per_Person']) == df['Incidents per Person'])
    ]
    
    return eligible

# QUESTION TWO
def q2(df):
    # Q2 Category 1: Juvenile Incident, Not Sex/Murder, Guilty (Up to 2)
    df['Inc_C1'] = 0
    df.loc[
      (df['Inc_Juvenile'] == True) &
      (df['Inc_Sex_or_Murder'] == False) &
      (df['Incident_Guilty_or_missing'] == 'True'),
      'Inc_C1'
    ] = 1
    df['C1_Per_Person'] = df.groupby(['Person ID'])['Inc_C1'].transform('sum')

    # Q2 Category 2: Juvenile Incident, Not Sex/Murder, Not Guilty (Up to 2)
    df['Inc_C2'] = 0
    df.loc[
      (df['Inc_Juvenile'] == True) &
      (df['Inc_Sex_or_Murder'] == False) &
      (df['Incident_Guilty_or_missing'] != 'True'),
      'Inc_C2'
    ] = 1
    df['C2_Per_Person'] = df.groupby(['Person ID'])['Inc_C2'].transform('sum')

    # Determine eligible individuals based on the per-category caps
    eligible = df.loc[
      (df['C1_Per_Person'] <= 2) &
      (df['C2_Per_Person'] <= 2) &
      ((df['C1_Per_Person'] + df['C2_Per_Person']) == df['Incidents per Person'])
    ]
    
    return eligible


# QUESTION THREE
def q3(df):
    # Q3 Category 1: Juvenile Incident, Anything that isn't both 100J ineligible and guilty (Up to 4)
    df['Inc_C1'] = 0
    df.loc[
      (df['Inc_Juvenile'] == True) &
      ((df['Inc_Expungeable_Attempts_Are'] == True) | (df['Incident_Guilty_or_missing'] != 'True')),
      'Inc_C1'
    ] = 1
    df['C1_Per_Person'] = df.groupby(['Person ID'])['Inc_C1'].transform('sum')
    
    # Determine eligible individuals based on the per-category caps
    eligible = df.loc[
      (df['C1_Per_Person'] <= 4) &
      (df['C1_Per_Person'] == df['Incidents per Person'])
    ]

    return eligible

In [370]:
# Process answers for each region, for each question

regionNames = ['Northwest', 'Suffolk', 'Middlesex']
unit = 'individuals'

for qcount, question in enumerate([q1, q2, q3]):
    print('QUESTION', qcount + 1, '\n')
    for count, region in enumerate([nw, sf, ms]):
        
        regionName = regionNames[count]
        df = region.copy()
        totalIndividuals = df['Person ID'].nunique()

        # Reduce dataframe to keep only the first row associated with each unique incident
        df = df.groupby(['Person ID', 'Offense Date']).first().reset_index()

        # Call function from above associated with the current question
        eligible = question(df)

        totalEligible = eligible['Person ID'].nunique()
        neverEligible = totalIndividuals - totalEligible

        eligible = eligible.copy()

        # Derive per-person column to indicate how long each eligible person must wait for ALL of their incidents to have finished the waiting period
        eligible['Highest_Years_Remaining'] = eligible.groupby(['Person ID'])['Inc_Years_Remaining'].transform('max')

        # Derive per-person column to indicate if any of the person's incidents have some missing disposition data
        eligible['Person_Missing_Any_Dispo'] = eligible.groupby(['Person ID'])['Inc_Missing_Any_Dispo'].transform('max')

        eligibleToday = eligible.loc[
          (eligible['Highest_Years_Remaining'] <= 0)
        ]
        eligibleNow = eligibleToday['Person ID'].nunique()

        eligibleFuture = eligible.loc[
          (eligible['Highest_Years_Remaining'] > 0)
        ]
        eligibleLater = eligibleFuture['Person ID'].nunique()

        eligibleTodayIncomplete = eligibleToday.loc[
          (eligibleToday['Person_Missing_Any_Dispo'] == True)
        ]
        eligibleNowIncomplete = eligibleTodayIncomplete['Person ID'].nunique()

        eligibleFutureIncomplete = eligibleFuture.loc[
          (eligibleFuture['Person_Missing_Any_Dispo'] == True)
        ]
        eligibleLaterIncomplete = eligibleFutureIncomplete['Person ID'].nunique()
        
        # Middlesex is discussed in terms of cases rather than individuals
        if count == 2:
            unit = 'cases'
        
        print(regionName)
        print(eligibleNow, unit, 'are eligible today.', eligibleNowIncomplete, 'of them have incomplete disposition data.')
        print('An additional', eligibleLater, unit, 'will become eligible after their waiting period has ended.', eligibleLaterIncomplete, 'of them have incomplete disposition data.')
        print(neverEligible, unit, 'will never be eligible.\n')
        
        unit = 'individuals'

QUESTION 1 

Northwest
980 individuals are eligible today. 34 of them have incomplete disposition data.
An additional 386 individuals will become eligible after their waiting period has ended. 19 of them have incomplete disposition data.
18151 individuals will never be eligible.

Suffolk
40728 individuals are eligible today. 5859 of them have incomplete disposition data.
An additional 9168 individuals will become eligible after their waiting period has ended. 3263 of them have incomplete disposition data.
40544 individuals will never be eligible.

Middlesex
2443 cases are eligible today. 0 of them have incomplete disposition data.
An additional 767 cases will become eligible after their waiting period has ended. 0 of them have incomplete disposition data.
160501 cases will never be eligible.

QUESTION 2 

Northwest
1510 individuals are eligible today. 47 of them have incomplete disposition data.
An additional 610 individuals will become eligible after their waiting period has ended. 30

## Step 5: Output Summary Files

These output files contain only a single column. Each row represents an individual (or case, for Middlesex), and contains a single all-incident code associated with that individual.

In [371]:
# Save the summary dataframes as csv files, overwriting them in the cleaned data folder
# to_csv returns None when given a path, so there is nothing useful to assign
nw_summary.to_csv('../data/cleaned/interactive_northwestern.csv', index=False)
ms_summary.to_csv('../data/cleaned/interactive_middlesex.csv', index=False)
sf_summary.to_csv('../data/cleaned/interactive_suffolk.csv', index=False)
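
Because all-incident codes consist only of digits and may begin with 0, a consumer that reads these files with default type inference can lose leading zeros. The sketch below demonstrates this with an in-memory buffer instead of the real files and shows that passing `dtype=str` keeps the codes intact. How the interactive analysis layer actually loads the files is not shown in this notebook, so treat this as an assumption about a pandas-based reader.

```python
import io
import pandas as pd

summary = pd.DataFrame({'All Incident Codes': ['011001', '100110011001']})

# Write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
summary.to_csv(buffer, index=False)

buffer.seek(0)
as_default = pd.read_csv(buffer)  # digits are inferred as numbers
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'All Incident Codes': str})  # codes kept as strings

print(as_default['All Incident Codes'].iloc[0])  # 11001 (leading zero lost)
print(as_text['All Incident Codes'].iloc[0])     # 011001
```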