# Clean Slate: Estimating offenses eligible for expungement under varying conditions
> Prepared by [Laura Feeney](https://github.com/laurafeeney) for Code for Boston's [Clean Slate project](https://github.com/codeforboston/clean-slate).

## Purpose
The purpose of this notebook is to get a list of all charges from both Northwestern and Suffolk data, then map expungement eligibility to each.

The resulting datasets will be output as a separate .csv that can be saved and shared in the Clean Slate GitHub repo.


The purpose of this notebook is to join prosecution charges with expungability, and estimate the number of offenses which are eligible for expungement. This analysis will be run a number of times: under existing regulation, and under varying conditions allowing additional offenses to be expunged.

The target questions, provided June 28, 2020: 

1. How many people (under age 21) are eligible for expungement today? This would be people with only **one charge** that is not part of the list of ineligible offenses (per section 100J). 


2. How many people (under age 21) would be eligible based on only having **one incident** (which could include multiple charges) that are not part of the list of ineligible offenses?
 - How many people (under age 21) would be eligible based on only having **one incident** if only sex-based offenses or murder were excluded from expungement?
 

3. How many people (under age 21) would be eligible based on who has **not been found guilty** (given current offenses that are eligible for expungement)?
 - How many people (under age 21) would be eligible based on who has **not been found guilty** for all offenses except for murder or sex-based offenses?

-----

## Step 0
Import data, programs, etc.

-----

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 200)
import numpy as np
import regex as re
import glob, os
import datetime 
from datetime import date 
from collections import defaultdict, Counter

In [2]:
# processed individual-level data from NW district with expungability.

nw = pd.read_csv('../../data/processed/merged_nw.csv', encoding='cp1252',
                    dtype={'Analysis notes':str, 'extra_criteria':str, 'Expungeable': str}) 
#nw.head()
nw['Expungeable'].value_counts(dropna=False)

#nw.loc[nw['Person ID']=="NW-12460", ['CMRoffense']] = "no"

Yes              55052
No               20013
NotApplicable      439
Attempt            220
NaN                  1
Name: Expungeable, dtype: int64

## Step 1
Add a few additional variables needed, look at summary stats

Many offenses are violations of the Code of Massachusetts Regulations (CMR) rather than criminal offenses. These include things like some driving or boating infractions (e.g., not having headlights on), or not having a hunting/fishing license. It's not clear whether these should be included at all. We'll run the analysis first including them at all stages, and then excluding them at all stages. 
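
One way to run the same count both with and without CMR offenses is to wrap the filter in a small function with a flag. This is only a sketch on toy data: the helper name `count_people` is hypothetical, and the `'yes'` value of `CMRoffense` is an assumption (this notebook only relies on `'no'` marking non-CMR offenses).

```python
import pandas as pd

# Toy data; the 'yes' label for CMR rows is an assumption for illustration.
toy = pd.DataFrame({
    'Person ID': ['A', 'A', 'B', 'C'],
    'CMRoffense': ['no', 'yes', 'no', 'yes'],
})

def count_people(df, include_cmr):
    """Count unique people, optionally dropping CMR offenses first (hypothetical helper)."""
    if not include_cmr:
        df = df.loc[df['CMRoffense'] == 'no']
    return df['Person ID'].nunique()

with_cmr = count_people(toy, include_cmr=True)
without_cmr = count_people(toy, include_cmr=False)
print(with_cmr, without_cmr)
```

Person C has only a CMR offense in the toy data, so C drops out of the count entirely when CMR offenses are excluded.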

In [3]:
### number of unique people
nw['Person ID'].nunique()

19753

In [4]:
### dates ###

reference_date = datetime.date(2020, 9, 1) # fixed date: using date.today() wouldn't give stable results

nw['Offense Date'] = pd.to_datetime(nw['Offense Date']).dt.date
nw['years_since_offense'] = (reference_date - nw['Offense Date'])/pd.Timedelta(1, 'D')/365

nw[nw['Offense Date']<datetime.date(1950, 1, 1)]

print("The earliest offense date is", min(nw['Offense Date']))
print("The max offense date is", max(nw['Offense Date']), "\n")

print(nw['years_since_offense'].describe())

print("\n There is a tail of dates that are probably wrong; age is missing for most of them. The most egregious:")
nw[nw['years_since_offense'] > nw['years_since_offense'].quantile(.9997)][[
    'Offense Date', 'Age at Offense', 'years_since_offense']]

The earliest offense date is 1750-12-03
The max offense date is 2018-12-30 

count    74915.000000
mean         4.460396
std          2.798864
min          1.673973
25%          3.060274
50%          4.328767
75%          5.638356
max        269.926027
Name: years_since_offense, dtype: float64

 There is a tail of dates that are probably wrong; age is missing for most of them. The most egregious:


| | Offense Date | Age at Offense | years_since_offense |
|---|---|---|---|
| 2293 | 1882-11-30 | NaN | 137.846575 |
| 2294 | 1882-11-30 | NaN | 137.846575 |
| 2295 | 1882-11-30 | NaN | 137.846575 |
| 43250 | 1975-06-30 | NaN | 45.205479 |
| 54403 | 1961-05-14 | NaN | 59.342466 |
| 56406 | 1750-12-03 | NaN | 269.926027 |
| 56407 | 1750-12-03 | NaN | 269.926027 |
| 60782 | 1975-08-01 | 28.0 | 45.117808 |
| 60783 | 1975-08-01 | 28.0 | 45.117808 |
| 60784 | 1975-08-01 | 28.0 | 45.117808 |
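
As a possible alternative to the date arithmetic above, the elapsed years can be computed on pandas datetime values directly, and implausible dates can be flagged instead of only inspected. This is a sketch on toy data; the `date_suspect` column and the 100-year cutoff are assumptions for illustration, not rules used in this analysis.

```python
import pandas as pd

reference = pd.Timestamp('2020-09-01')

# Toy offense dates, including one implausible value like those in the tail above.
toy = pd.DataFrame({'Offense Date': ['2018-12-30', '2015-09-01', '1882-11-30']})
toy['Offense Date'] = pd.to_datetime(toy['Offense Date'])

# Elapsed days divided by 365, matching the convention used in this notebook.
toy['years_since_offense'] = (reference - toy['Offense Date']).dt.days / 365

# Hypothetical cutoff: treat anything older than 100 years as a likely data-entry error.
toy['date_suspect'] = toy['years_since_offense'] > 100

print(toy)
```

Flagging rather than dropping keeps the suspect rows available for review while making it easy to exclude them from a count.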


In [5]:
#Manipulating the dataframe to 1. Drop all CMR offenses 2. Drop CMR-related columns

print(f'There are {nw.shape[0]} total offenses including CMR.')

nw = nw.loc[nw['CMRoffense'] == 'no']
nw = nw.drop(columns = ['CMRoffense'])

print(f'After we drop CMR, there are {nw.shape[0]} total offenses.')



There are 75725 total offenses including CMR.
After we drop CMR, there are 75285 total offenses.


In [6]:
# distribution of # of charges

nw['num_offenses']=nw.groupby('Person ID')['Person ID'].transform('count')
print(nw['num_offenses'].describe())

#nw.loc[nw['num_offenses']<30].hist(column='num_offenses', bins=15)

count    75285.000000
mean        10.075752
std         18.230430
min          1.000000
25%          3.000000
50%          5.000000
75%         11.000000
max        202.000000
Name: num_offenses, dtype: float64


In [7]:
a = nw[nw['num_offenses']==1]['Person ID'].nunique()

#print(a, b, c, d, e)
print("# People with only 1 offense", a)

# People with only 1 offense 5026


In [8]:
print(nw['Expungeable'].value_counts())

Yes        55052
No         20013
Attempt      220
Name: Expungeable, dtype: int64


## Question 1
- How many people are eligible for expungement today? 
    - ---> *Only one charge*, Offense committed before 21st birthday, charge is not part of the list of ineligible offenses (per section 100J), charge is in the correct timeframe from today's date

We don't have misdemeanor / felony info, so we report the number of people whose offense occurred more than 3 years and more than 7 years before the reference date (2020-09-01).
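
The cells that follow repeat the same three counts (everyone selected, those more than 3 years out, those more than 7 years out) for each scenario. A small helper could express that pattern once. The sketch below uses toy data and a hypothetical function name `summarize`; it is not part of the original notebook.

```python
import pandas as pd

def summarize(df, mask):
    """Return (people, people > 3 years, people > 7 years) for the rows selected by mask."""
    x = df.loc[mask]
    people = x['Person ID'].nunique()
    over3 = x.loc[x['years_since_offense'] > 3, 'Person ID'].nunique()
    over7 = x.loc[x['years_since_offense'] > 7, 'Person ID'].nunique()
    return people, over3, over7

# Toy data: person C has two charges, person D has no recorded age.
toy = pd.DataFrame({
    'Person ID': ['A', 'B', 'C', 'C', 'D', 'E'],
    'num_offenses': [1, 1, 2, 2, 1, 1],
    'Expungeable': ['Yes', 'No', 'Yes', 'Yes', 'Attempt', 'Yes'],
    'Age at Offense': [19.0, 20.0, 18.0, 18.0, None, 20.0],
    'years_since_offense': [8.0, 4.0, 5.0, 5.0, 2.0, 4.0],
})

# Question 1 criteria: one charge, not ineligible, under 21 at the time of the offense.
# A missing age compares as False, so rows without an age are excluded.
mask = (
    (toy['num_offenses'] == 1)
    & (toy['Expungeable'] != 'No')
    & (toy['Age at Offense'] < 21)
)

result = summarize(toy, mask)
print(result)
```

Only A and E pass every condition in the toy data, and only A is more than 7 years out.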

----

In [9]:
eligible_all_ages = nw.loc[
    (nw['num_offenses']==1) &
    (nw['Expungeable'] != 'No') &
    (~nw['Age at Offense'].isnull())
]['Person ID'].nunique()

print("Before filtering on age, there are", eligible_all_ages, "eligible.")

Before filtering on age, there are 3533 eligible.


In [10]:
x = nw.loc[
    (nw['num_offenses']==1) &
    (nw['Expungeable'] != 'No') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21)
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA,",
      "if we include 'attempts' as expungeable.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

There are 525 people with charges eligible for expungement in the NW district of MA, if we include 'attempts' as expungeable.
Of these, 373 occurred more than 3 years before 2020-09-01
Of these, 10 occurred more than 7 years before 2020-09-01


In [11]:
x = nw.loc[
    (nw['num_offenses']==1) &
    (nw['Expungeable'] == 'Yes') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, and we assume all 'attempts' are not expungeable.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

There are 523 people with charges eligible for expungement in the NW district of MA, and we assume all 'attempts' are not expungeable.
Of these, 371 occurred more than 3 years before 2020-09-01
Of these, 10 occurred more than 7 years before 2020-09-01


## Question 2: incidents rather than offenses

Luke Schissler

**Question 2a** - If we only change the number of offenses, how many are eligible based on only having one incident (that does not include an ineligible offense)? 

Based on the 06/26/20 meeting notes, we assume that 'one incident' means all charges for an individual on a single day. In theory, someone could be charged in two separate incidents on a single day, but this is likely rare. 

Another question is whether a single non-expungeable offense within an incident makes the entire incident non-expungeable. For this analysis, I am proceeding with the assumption that one non-expungeable offense in the incident makes the entire incident non-expungeable.
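
The two assumptions above (an incident is a person-date pair, and one ineligible charge taints the whole incident) can also be expressed with `groupby` and `transform`. This is a sketch on toy data; the intermediate column `charge_blocked` is a hypothetical name, and this variant treats attempts as expungeable.

```python
import pandas as pd

toy = pd.DataFrame({
    'Person ID': ['A', 'A', 'A', 'B', 'B'],
    'Offense Date': ['2017-01-05', '2017-01-05', '2018-03-02', '2016-06-01', '2016-06-01'],
    'Expungeable': ['Yes', 'No', 'Yes', 'Yes', 'Attempt'],
})

# Number of incidents = number of distinct offense dates per person.
toy['Incidents'] = toy.groupby('Person ID')['Offense Date'].transform('nunique')

# A charge blocks its incident if it is explicitly not expungeable.
toy['charge_blocked'] = toy['Expungeable'] == 'No'

# An incident is blocked if any charge on that person-date is blocked.
blocked = toy.groupby(['Person ID', 'Offense Date'])['charge_blocked'].transform('any')
toy['Incident Attempt Expungeable'] = blocked.map({True: 'No', False: 'Yes'})

print(toy)
```

In the toy data, person A's first incident is blocked by its single 'No' charge, while the later incident on a different date is unaffected.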

In [34]:
incidents = nw.groupby('Person ID')['Offense Date'].nunique()

#If an offense is not expungeable, or an attempt, mark corresponding incident as unexpungeable 
def generate_incident_hash(dic):
    for index, row in nw.iterrows():
        if row['Expungeable'] in ['No', 'Attempt']:
            dic[(row['Person ID'], row["Offense Date"])] = 'No'
    return dic


#If an offense is not expungeable, mark corresponding incident as unexpungeable 
def generate_attempt_incident_hash(dic):
    for index, row in nw.iterrows():
        if row['Expungeable'] == 'No':
            dic[(row['Person ID'], row['Offense Date'])] ='No'
    return dic

#If any offense is related to murder or sex, mark that incident as unexpungeable
def generate_murder_sex_hash(dic):
    for index, row in nw.iterrows():
        if row['murder'] == 1 or row['sex'] == 1:
            dic[(row['Person ID'], row['Offense Date'])] = 1
    return dic

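#If any offense in an incident has a guilty disposition, mark that incident as unexpungeable 
#Note: '' does not match missing values (NaN) as loaded by pandas, so a charge with a missing
#disposition does not mark its incident as guilty here.
def generate_guilty_hash(dic):
    for index, row in nw.iterrows():
        if row['Disposition'] in ['Guilty', 'Guilty Filed', 'Guilty on Lesser Included Offense', '']:
            dic[(row['Person ID'], row['Offense Date'])] = 'Guilty'
    return dic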

incident_dict = generate_incident_hash({})
incident_dict_attempt = generate_attempt_incident_hash({})
murder_sex_dict = generate_murder_sex_hash({})
guilty_dic = generate_guilty_hash({})

def find_inc(per_id):
    return incidents[per_id]

nw['Incidents'] = np.vectorize(find_inc)(nw['Person ID']) #Incidents column is number of unique offense dates per Person
nw['Incident Expungeable'] = np.vectorize(lambda x, y : incident_dict.get((x, y), 'Yes'))(nw['Person ID'], nw['Offense Date'])
nw['Incident Attempt Expungeable'] = np.vectorize(lambda x, y : incident_dict_attempt.get((x, y), 'Yes'))(nw['Person ID'], nw['Offense Date'])
nw['Incident Murder/Sex'] = np.vectorize(lambda x, y : murder_sex_dict.get((x, y), 0))(nw['Person ID'], nw['Offense Date'])
nw['Incident Guilty'] = np.vectorize(lambda x, y : guilty_dic.get((x, y), 'Not Guilty'))(nw['Person ID'], nw['Offense Date'])

In [44]:
# Find how many people whose offenses fall into a single incident, where all the offenses are expungeable 
# (Not including attempts). 

x = nw.loc[
    (nw['Incidents'] == 1) &
    (nw['Incident Expungeable'] == 'Yes') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has a single incident" +
      " in which all the offenses are expungeable" + 
     " if we consider attempts to be unexpungeable.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

#print(x.sample(n=5)) # grab a random sample to verify

There are 1191 people with charges eligible for expungement in the NW district of MA, if the person has a single incident in which all the offenses are expungeable if we consider attempts to be unexpungeable.
Of these, 826 occurred more than 3 years before 2020-09-01
Of these, 21 occurred more than 7 years before 2020-09-01


In [45]:
# Find how many people whose offenses fall into a single incident, where all the offenses are expungeable 
# (Including Attempts). 

x = nw.loc[
    (nw['Incidents'] == 1) &
    (nw['Incident Attempt Expungeable'] == 'Yes') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has a single incident," +
      " in which all the offenses are expungeable and attempts are considered expungeable.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 1198 people with charges eligible for expungement in the NW district of MA, if the person has a single incident, in which all the offenses are expungeable and attempts are considered expungeable.
Of these, 832 occurred more than 3 years before 2020-09-01
Of these, 21 occurred more than 7 years before 2020-09-01




**Question 2b** -  How many would be eligible if only sex-based offenses and murder were excluded?

Assuming here we are not limiting based on expungability, but only on if the person has 1 incident, it is related to neither murder nor sex, and the person is under 21. 
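
It may help to see how a charge-level filter differs from a person-level one under this assumption. A charge-level filter keeps any row that is itself unrelated to murder or sex, so a person with a mix of charges is still counted; a person-level filter drops that person entirely. The sketch below uses toy data, and the variable names are hypothetical.

```python
import pandas as pd

# Toy data: person A has one ordinary charge and one murder charge.
toy = pd.DataFrame({
    'Person ID': ['A', 'A', 'B'],
    'murder': [0, 1, 0],
    'sex': [0, 0, 0],
})

# Charge-level filter: keeps any row that is itself unrelated to murder or sex.
charge_level = toy.loc[(toy['murder'] == 0) & (toy['sex'] == 0), 'Person ID'].nunique()

# Person-level filter: keeps only people with no murder- or sex-related charge at all.
flagged = (toy['murder'] == 1) | (toy['sex'] == 1)
person_flagged = flagged.groupby(toy['Person ID']).transform('any')
person_level = toy.loc[~person_flagged, 'Person ID'].nunique()

print(charge_level, person_level)
```

Person A is counted by the charge-level filter but not by the person-level one, which is why the two approaches can give different totals.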

In [46]:
x = nw.loc[
    (nw['murder'] == 0) &
    (nw['sex'] == 0) &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has charges unrelated to" +
      " murder or sex.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 2750 people with charges eligible for expungement in the NW district of MA, if the person has charges unrelated to murder or sex.
Of these, 2189 occurred more than 3 years before 2020-09-01
Of these, 67 occurred more than 7 years before 2020-09-01




In [85]:
x = nw.loc[
    (nw['Incidents'] == 1) &
    (nw['Incident Murder/Sex'] == 0) &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has a single incident" +
      " in which all offenses are unrelated to sex or murder")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 1704 people with charges eligible for expungement in the NW district of MA, if the person has a single incident in which all offenses are unrelated to sex or murder
Of these, 1204 occurred more than 3 years before 2020-09-01
Of these, 29 occurred more than 7 years before 2020-09-01




## Question 3: Verdict as Determiner

**Question 3a** -  How many would be eligible based on who has not been found guilty (given current offenses that are eligible for expungement)?

Defining 'guilty' as one of these three dispositions: Guilty, Guilty Filed, Guilty on Lesser Included Offense
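
The three-disposition definition can be written once as a list and applied with `isin`, which keeps the filter in sync across cells. This is a sketch on toy data; the constant name `GUILTY` and the `'Dismissed'` label are assumptions for illustration.

```python
import pandas as pd

# The three dispositions treated as guilty in this analysis.
GUILTY = ['Guilty', 'Guilty Filed', 'Guilty on Lesser Included Offense']

# Toy data; 'Dismissed' is a hypothetical disposition label.
toy = pd.DataFrame({
    'Person ID': ['A', 'B', 'C', 'D'],
    'Disposition': ['Guilty', 'Dismissed', None, 'Guilty Filed'],
})

# Missing dispositions are excluded explicitly; isin() alone would count them as not guilty.
not_guilty = toy['Disposition'].notnull() & ~toy['Disposition'].isin(GUILTY)

print(toy.loc[not_guilty, 'Person ID'].tolist())
```

Only person B is selected: A and D have guilty dispositions, and C has no disposition recorded.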

In [28]:
# Find how many people are eligible for expungement if they have a verdict other than guilty and their offense is 
# expungeable (Attempts not included)

x = nw.loc[
    (nw['Expungeable'] == 'Yes') &
    (~nw['Disposition'].isnull()) &
    (nw['Disposition'] != 'Guilty') &
    (nw['Disposition'] != 'Guilty Filed') &
    (nw['Disposition'] != 'Guilty on Lesser Included Offense') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has an expungeable" +
      " offense and has a verdict other than guilty.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 2333 people with charges eligible for expungement in the NW district of MA, if the person has an expungeable offense and has a verdict other than guilty.
Of these, 1822 occurred more than 3 years before 2020-09-01
Of these, 38 occurred more than 7 years before 2020-09-01




In [32]:
# Find how many people are eligible for expungement if they have a verdict other than guilty and their offense is 
# expungeable (Attempts included)

x = nw.loc[
    (nw['Expungeable'] != 'No') &
    (~nw['Disposition'].isnull()) &
    (nw['Disposition'] != 'Guilty') &
    (nw['Disposition'] != 'Guilty Filed') &
    (nw['Disposition'] != 'Guilty on Lesser Included Offense') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has an expungeable" + 
      " offense and has a verdict other than guilty. Attempts considered expungeable.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 2336 people with charges eligible for expungement in the NW district of MA, if the person has an expungeable offense and has a verdict other than guilty. Attempts considered expungeable.
Of these, 1825 occurred more than 3 years before 2020-09-01
Of these, 38 occurred more than 7 years before 2020-09-01




In [37]:
# Find how many people are eligible for expungement if they have a single incident, in which all the offenses are 
# expungeable, and none have a guilty verdict. Attempts are not included. 

x = nw.loc[
    (nw['Incidents'] == 1) &
    (nw['Incident Expungeable'] == 'Yes') &
    (nw['Incident Guilty'] == 'Not Guilty') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21)
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has a single" +
      " incident, where all charges are expungeable, and all charges have a disposition other than guilty.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 1174 people with charges eligible for expungement in the NW district of MA, if the person has a single incident, where all charges are expungeable, and all charges have a disposition other than guilty.
Of these, 814 occurred more than 3 years before 2020-09-01
Of these, 21 occurred more than 7 years before 2020-09-01




In [36]:
# Find how many people are eligible for expungement if they have a single incident, in which all the offenses are 
# expungeable, and none have a guilty verdict. Attempts are included. 

x = nw.loc[
    (nw['Incidents'] == 1) &
    (nw['Incident Attempt Expungeable'] == 'Yes') &
    (nw['Incident Guilty'] == 'Not Guilty') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21)
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has a single" +
      " incident, where all charges are expungeable, and all charges have a disposition other than guilty. Attempts" + 
      " considered expungeable.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 1180 people with charges eligible for expungement in the NW district of MA, if the person has a single incident, where all charges are expungeable, and all charges have a disposition other than guilty. Attempts considered expungeable.
Of these, 819 occurred more than 3 years before 2020-09-01
Of these, 21 occurred more than 7 years before 2020-09-01




**Question 3b** -  How many would be eligible based on who has not been found guilty, except murder or sex offenses?


In [43]:
x = nw.loc[
    (nw['murder'] == 0) &
    (nw['sex'] == 0) &
    (~nw['Disposition'].isnull()) &
    (nw['Disposition'] != 'Guilty') &
    (nw['Disposition'] != 'Guilty Filed') &
    (nw['Disposition'] != 'Guilty on Lesser Included Offense') &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has an offense" +
      " unrelated to sex or murder and has a verdict other than guilty.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 2656 people with charges eligible for expungement in the NW district of MA, if the person has an offense unrelated to sex or murder and has a verdict other than guilty.
Of these, 2113 occurred more than 3 years before 2020-09-01
Of these, 55 occurred more than 7 years before 2020-09-01




In [42]:
#Find out how many people would be eligible if they have a single incident, in which all the charges are unrelated to
#sex or murder, and all charges have a disposition other than guilty. Attempts not included. 

x = nw.loc[
    (nw['Incidents'] == 1) &
    (nw['Incident Guilty'] == 'Not Guilty') &
    (nw['Incident Murder/Sex'] == 0) &
    (~nw['Age at Offense'].isnull()) &
    (nw['Age at Offense']<21) 
]

People_eligible = x['Person ID'].nunique()

greater3 = x.loc[
    (x['years_since_offense'] > 3)
]['Person ID'].nunique()

greater7 = x.loc[
    (x['years_since_offense'] > 7)
]['Person ID'].nunique()

print("There are", People_eligible,
      "people with charges eligible for expungement in the NW district of MA, if the person has a single incident" +
      " in which all the charges are unrelated to sex or murder and have a non-guilty verdict.")
print("Of these,", greater3, "occurred more than 3 years before", reference_date)
print("Of these,", greater7, "occurred more than 7 years before", reference_date)

print("\n")
#print(x.sample(n=3)) # grab a random sample to verify

There are 1666 people with charges eligible for expungement in the NW district of MA, if the person has a single incident in which all the charges are unrelated to sex or murder and have a non-guilty verdict.
Of these, 1180 occurred more than 3 years before 2020-09-01
Of these, 27 occurred more than 7 years before 2020-09-01


