# Clean Slate: Estimating offenses eligible for expungement under varying conditions
> Prepared by [Laura Feeney](https://github.com/laurafeeney) for Code for Boston's [Clean Slate project](https://github.com/codeforboston/clean-slate).

## Summary
This notebook takes partially processed data from the Middlesex DA and estimates how many individuals may be eligible for expungement under varying conditions.

This dataset does not contain any information to identify specific individuals across multiple cases. We can see which charges are heard in juvenile court, but we have no other indicator of age.

As a result, we can only provide the count and percentage of incidents heard in juvenile court that are expungeable.

### Original Questions

1. How many people (under age 21) are eligible for expungement today? This would be people with only **one charge** that is not part of the list of ineligible offenses (per section 100J). 


2. How many people (under age 21) would be eligible based on only having **one incident** (which could include multiple charges) that are not part of the list of ineligible offenses?
 - How many people (under age 21) would be eligible based on only having **one incident** if only sex-based offenses or murder were excluded from expungement?
 

3. How many people (under age 21) would be eligible based on who has **not been found guilty** (given current offenses that are eligible for expungement)?
 - How many people (under age 21) would be eligible based on who has **not been found guilty** for all offenses except for murder or sex-based offenses?

-----

## Step 0
Import data, programs, etc.

-----

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 200)
import numpy as np
import regex as re
import glob, os
import datetime 
from datetime import date 
from collections import defaultdict, Counter

In [2]:
# processed individual-level data from MS district with expungability.

ms = pd.read_csv('../../data/processed/merged_ms.csv', encoding='cp1252',
                    dtype={'Analysis notes':str, 'extra_criteria':str, 'Expungeable': str}, low_memory=False) 

ms['Expungeable'].value_counts(dropna=False)

Yes         273094
No          112554
NA - CMR      5047
Attempt       1519
NaN            383
m                8
Name: Expungeable, dtype: int64
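The counts above include a few values (`NaN` and `m`) that fall outside the main categories. A minimal sketch of how such rows could be isolated for review, using toy data; the `expected_values` set is an assumption for illustration, not a list defined by the notebook:

```python
import pandas as pd

# Toy stand-in for ms['Expungeable']; the values mirror the categories printed above.
expungeable = pd.Series(["Yes", "No", "NA - CMR", "Attempt", None, "m", "Yes"])

# Hypothetical set of values the analysis knows how to interpret.
expected_values = {"Yes", "No", "NA - CMR", "Attempt"}

# Rows that are missing or carry a value outside the expected set.
unexpected_mask = ~expungeable.isin(expected_values)
n_unexpected = int(unexpected_mask.sum())

print(expungeable[unexpected_mask])
```

Reviewing these rows before filtering matters because a condition such as `!= 'No'` keeps them, while a condition such as `== 'Yes'` drops them.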

In [3]:
ms['offenses_per_case']=ms.groupby('Case Number')['Case Number'].transform('count')
ms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392605 entries, 0 to 392604
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Case Number              392605 non-null  object 
 1   Offense Date             392578 non-null  object 
 2   Court Location           392605 non-null  object 
 3   Charge                   392605 non-null  object 
 4   Charge/Crime Type        392605 non-null  object 
 5   Disposition Description  392605 non-null  object 
 6   CMRoffense               392605 non-null  bool   
 7   Chapter                  392605 non-null  object 
 8   Section                  389306 non-null  object 
 9   Paragraph                309266 non-null  object 
 10  JuvenileC                392605 non-null  bool   
 11  years_since_offense      392578 non-null  float64
 12  sex                      360336 non-null  float64
 13  murder                   360336 non-null  float64
 14  Expu

In [4]:
# Only indication of juvenile is if tried in juvenile court. Looks like no cases are heard in 2 courts (presumably would get 
# a different case number)
ms['juvenile'] = ms.groupby('Case Number')['JuvenileC'].transform('max')
pd.crosstab(ms['JuvenileC'], ms['juvenile'])

JuvenileC,juvenile=False,juvenile=True
False,378306,0
True,0,14299
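The zero counts off the diagonal suggest that no case mixes juvenile-court and non-juvenile-court charges. A more direct check is to count the distinct `JuvenileC` values per case. This is a sketch on toy data; the `toy` frame and variable names are illustrative only:

```python
import pandas as pd

# Toy data: every case has a single JuvenileC value across its charges.
toy = pd.DataFrame({
    "Case Number": ["A", "A", "B", "C", "C"],
    "JuvenileC": [True, True, False, False, False],
})

# Number of distinct court flags per case; a value above 1 would mean a mixed case.
courts_per_case = toy.groupby("Case Number")["JuvenileC"].nunique()
mixed_cases = courts_per_case[courts_per_case > 1]

print(len(mixed_cases), "cases have charges in both court types")
```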


## Step 1
Add a few additional variables needed, look at summary stats

Many offenses are violations of the Code of Massachusetts Regulations (CMR) rather than criminal offenses. These include some driving or boating infractions (e.g., not having headlights on) and not having a hunting or fishing license. It's not clear whether these should be included at all. The plan is to run the analysis both ways, including them at all stages and excluding them at all stages; the cells in this notebook drop them before the incident-level analysis.
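One way to compare the two treatments of CMR offenses is a helper that takes the inclusion choice as a parameter. The following is a minimal sketch on toy data; `expungeable_share` and the toy rows are hypothetical and not part of the notebook:

```python
import pandas as pd

# Toy data with the same column names as the Middlesex file.
toy = pd.DataFrame({
    "Case Number": ["A", "B", "C", "D"],
    "CMRoffense": [False, True, False, True],
    "Expungeable": ["Yes", "NA - CMR", "No", "NA - CMR"],
})

def expungeable_share(df, include_cmr):
    """Share of offenses marked 'Yes', with or without CMR rows (hypothetical helper)."""
    subset = df if include_cmr else df.loc[~df["CMRoffense"]]
    return (subset["Expungeable"] == "Yes").mean()

share_with_cmr = expungeable_share(toy, include_cmr=True)      # 1 of 4 rows
share_without_cmr = expungeable_share(toy, include_cmr=False)  # 1 of 2 rows
print(share_with_cmr, share_without_cmr)
```

Keeping the filter inside a function leaves the original frame untouched, so both variants can be reported side by side.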

In [5]:
### dates ###

reference_date = datetime.date(2020, 9, 1) # fixed date; using date.today() would not be stable across runs

ms['Offense Date'] = pd.to_datetime(ms['Offense Date']).dt.date

ms[ms['Offense Date']<datetime.date(1950, 1, 1)]

print("The earliest offense date is", min(ms['Offense Date']))
print("The max offense date is", max(ms['Offense Date']), "\n")

print(ms['years_since_offense'].describe())

The earliest offense date is 1951-06-30
The max offense date is 2019-12-30 

count    392578.000000
mean          4.259063
std           2.434936
min           0.673973
25%           2.635616
50%           4.243836
75%           5.649315
max          69.221918
Name: years_since_offense, dtype: float64
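The `years_since_offense` column arrives precomputed in the processed file. Dividing the day count from `reference_date` by 365 reproduces the printed minimum and maximum, so the sketch below uses that convention; that the upstream processing used exactly this formula is an assumption, and `years_since` is a hypothetical helper:

```python
import datetime

reference_date = datetime.date(2020, 9, 1)

def years_since(offense_date, reference=reference_date):
    """Elapsed time in 365-day years (assumed convention, see lead-in)."""
    return (reference - offense_date).days / 365

earliest = datetime.date(1951, 6, 30)   # earliest offense date printed above
latest = datetime.date(2019, 12, 30)    # latest offense date printed above

print(round(years_since(earliest), 6))  # 69.221918
print(round(years_since(latest), 6))    # 0.673973
```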


In [6]:
# CMR offenses -- Drop all CMR offenses and Drop CMR-related columns

print(f'There are {ms.shape[0]} total offenses including CMR.')

ms = ms.loc[ms['CMRoffense'] == False]
ms = ms.drop(columns = ['CMRoffense'])

print(f'After we drop CMR, there are {ms.shape[0]} total offenses.')

# Check that the 'expungeable' column no longer has CMRs 
print(ms['Expungeable'].value_counts())
print("\n", ms['Expungeable'].value_counts(normalize=True))

There are 392605 total offenses including CMR.
After we drop CMR, there are 387558 total offenses.
Yes        273094
No         112554
Attempt      1519
m               8
Name: Expungeable, dtype: int64

 Yes        0.705350
No         0.290706
Attempt    0.003923
m          0.000021
Name: Expungeable, dtype: float64


In [7]:
# Data prep.
# We only have Case Number, and each case covers offenses on a single date.

# If an incident includes one offense that is not expungeable, we mark the entire incident as not expungeable.
# Attempts *are not* considered expungeable in this version.
ms['Exp'] = ms['Expungeable']=="Yes"
ms['Inc_Expungeable_Attempts_Not'] = ms.groupby(['Case Number'])['Exp'].transform('min')

# Same rule, but attempts *are* considered expungeable in this version.
ms['ExpAtt'] = (ms['Expungeable']=="Yes") | (ms['Expungeable']=="Attempt")
ms['Inc_Expungeable_Attempts_Are'] = ms.groupby(['Case Number'])['ExpAtt'].transform('min')

# If an incident includes an offense that is a murder and/or sex crime, we code the whole incident as
# involving murder and/or sex.
ms['sm'] = (ms['sex'] == 1) | (ms['murder'] ==1)
ms['Incident_Murder_Sex'] = ms.groupby(['Case Number'])['sm'].transform('max')

# Drop the intermediate helper columns, which are not needed later
ms = ms.drop(columns=['Exp', 'ExpAtt', 'sm'])
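The incident-level flags rely on `groupby(...).transform('min')` and `transform('max')` over boolean columns. A small sketch on toy data shows why: `min` behaves like "all charges" and `max` like "any charge". The column names `charge_ok`, `case_ok`, and `case_has_attempt` are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    "Case Number": ["A", "A", "B", "B", "C"],
    "Expungeable": ["Yes", "No", "Yes", "Attempt", "Yes"],
})

# Charge-level flag: counts attempts as expungeable.
toy["charge_ok"] = toy["Expungeable"].isin(["Yes", "Attempt"])

# min over booleans acts like "all": one False makes the whole case False.
toy["case_ok"] = toy.groupby("Case Number")["charge_ok"].transform("min")

# max over booleans acts like "any": one True marks the whole case True.
is_attempt = toy["Expungeable"] == "Attempt"
toy["case_has_attempt"] = is_attempt.groupby(toy["Case Number"]).transform("max")

print(toy)
```

Because `transform` returns one value per original row, the case-level flag can be used directly in row filters without a merge.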

In [8]:
# There is no indicator for unique individuals; the only proxy is case number, so any counts of people are likely overestimated.
Number_Cases = ms['Case Number'].nunique()
Number_Cases_Juvenile = ms[ms['juvenile']==True]['Case Number'].nunique()
Percent_Juvenile = "{:.2%}".format(Number_Cases_Juvenile/Number_Cases)
print('There are', Number_Cases, 'unique cases in the Middlesex file. Of those,', Number_Cases_Juvenile,
      'ie' , Percent_Juvenile,
     'were brought to juvenile court.')

There are 163727 unique cases in the Middlesex file. Of those, 5816 ie 3.55% were brought to juvenile court.


In [9]:
## dispositions
dispos = ms['Disposition Description'].unique()
#print(sorted(dispos))

guilty_dispos = ['DELINQUENT BENCH TRIAL', 'DELINQUENT CHANGE OF PLEA', 
                'DELINQUENT CHANGE OF PLEA LESSER OFFENSE', 'DELINQUENT JURY TRIAL',
                'GUILTY BENCH TRIAL', 'GUILTY BENCH TRIAL LESSER INCLUDED',
                'GUILTY CHANGE OF PLEA', 'GUILTY CHANGE OF PLEA LESSER OFFENSE', 
                'GUILTY FILED', 'GUILTY FINES', 'GUILTY JURY TRIAL', 
                'GUILTY JURY TRIAL LESSER INCLUDED', 
                'Guilty Jury Trial (and Bench) Lesser Included', 'RESPONSIBLE']
ms['guilty'] = ms['Disposition Description'].isin(guilty_dispos)
#ms.loc[ms['Disposition Description'].isnull(), 'guilty'] = None 
ms['Incident_Guilty'] = ms.groupby(['Case Number', 'Offense Date'])['guilty'].transform('max')
print(ms.guilty.value_counts(normalize=True))
print(ms.Incident_Guilty.value_counts(normalize=True))

False    0.781426
True     0.218574
Name: guilty, dtype: float64
False    0.678596
True     0.321404
Name: Incident_Guilty, dtype: float64


In [10]:
a = ms['Disposition Description'].value_counts().rename_axis('unique_values').to_frame('counts')
b = ms['Disposition Description'].value_counts(normalize=True).rename_axis('unique_values').to_frame('percent')*100
disp_stats = pd.concat([a, b], axis=1)

disp_stats['cumulative percent'] = disp_stats.percent.cumsum()
print('top 10 dispositions for all cases')
disp_stats[0:10]

# The top 10 dispositions account for 90% of all dispositions

top 10 dispositions for all cases


unique_values,counts,percent,cumulative percent
DISMISSED W/O PREJUDICE,102614,26.477069,26.477069
GUILTY CHANGE OF PLEA,47114,12.156632,38.633701
CONTINUED W/O FINDING,42796,11.042476,49.676178
NOT RESPONSIBLE,37155,9.586952,59.26313
DISMISSED BY FINES,31854,8.219157,67.482287
RESPONSIBLE,23818,6.145661,73.627947
NOLLE PROSEQUI,22360,5.769459,79.397406
PRE-TRIAL PROBATION,20293,5.236119,84.633526
DISMISSED W/O PREJUDICE LACK OF PROSECUTION,11428,2.94872,87.582246
DISMISSED ON COURT COSTS,10689,2.758039,90.340285
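The counts, percent, and cumulative percent table is built the same way for all cases and for juvenile cases, so it could be wrapped in a small function. This is a sketch under that assumption; `disposition_table` is a hypothetical helper and the toy series stands in for `ms['Disposition Description']`:

```python
import pandas as pd

def disposition_table(dispositions, top_n=10):
    """Counts, percent, and cumulative percent for the most common values."""
    counts = dispositions.value_counts()  # sorted from most to least frequent
    table = pd.DataFrame({
        "counts": counts,
        "percent": counts / counts.sum() * 100,
    })
    table["cumulative percent"] = table["percent"].cumsum()
    return table.head(top_n)

toy = pd.Series(
    ["DISMISSED W/O PREJUDICE"] * 3
    + ["PRE-TRIAL PROBATION"] * 2
    + ["NOLLE PROSEQUI"]
)
table = disposition_table(toy, top_n=2)
print(table)
```

The juvenile version would then be a call on the filtered series rather than a copy of the cell.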


In [11]:
a = ms['Disposition Description'].loc[ms['juvenile']==True].value_counts().rename_axis('unique_values').to_frame('counts')
b = ms['Disposition Description'].loc[ms['juvenile']==True].value_counts(normalize=True).rename_axis('unique_values').to_frame('percent')*100
disp_stats = pd.concat([a, b], axis=1)

disp_stats['cumulative percent'] = disp_stats.percent.cumsum()
print('top 10 dispositions for all cases in juvenile court')
disp_stats[0:10]

top 10 dispositions for all cases in juvenile court


unique_values,counts,percent,cumulative percent
DISMISSED W/O PREJUDICE,4287,30.021008,30.021008
PRE-TRIAL PROBATION,3402,23.823529,53.844538
CONTINUED W/O FINDING,2025,14.180672,68.02521
DISMISSED PRIOR TO ARRAIGNMENT,745,5.217087,73.242297
DELINQUENT CHANGE OF PLEA,694,4.859944,78.102241
NOT RESPONSIBLE,401,2.808123,80.910364
NOLLE PROSEQUI,342,2.394958,83.305322
DISMISSED BY COURT (PRIOR TO ARRAIGNMENT),339,2.37395,85.679272
DISMISSED W/O PREJUDICE LACK OF PROSECUTION,310,2.170868,87.85014
GUILTY CHANGE OF PLEA,221,1.547619,89.397759


In [12]:
a = ms['Disposition Description'].loc[ms['juvenile']==True].value_counts()


## Question 1
- How many cases include only 1 offense, were heard in juvenile court, and have a charge that is not on the list of ineligible offenses from section 100J?

We don't have misdemeanor / felony info, so we show the number of cases whose offense occurred more than 3 years or more than 7 years before the reference date (2020-09-01).
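The 3-year and 7-year cutoffs amount to counting unique cases above a threshold on `years_since_offense`. A minimal sketch on toy data, returning the counts instead of printing them; `cases_older_than` is a hypothetical helper, not a function from the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "Case Number": ["A", "A", "B", "C"],
    "years_since_offense": [8.2, 8.2, 4.5, 1.0],
})

def cases_older_than(df, years):
    """Count unique cases whose offense is more than `years` years old."""
    return df.loc[df["years_since_offense"] > years, "Case Number"].nunique()

older_than_3 = cases_older_than(toy, 3)  # cases A and B
older_than_7 = cases_older_than(toy, 7)  # case A only
print(older_than_3, older_than_7)
```

Counting with `nunique()` on the case number avoids double-counting cases that have several charges.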

----

In [13]:
def date_range(x):
    greater3 = x.loc[(x['years_since_offense'] > 3)]['Case Number'].nunique()
    greater7 = x.loc[(x['years_since_offense'] > 7)]['Case Number'].nunique()

    print(greater3, "occured more than 3 years before", reference_date)
    print(greater7, "occured more than 7 years before", reference_date)
    

In [14]:
x = ms.loc[
    (ms['offenses_per_case']==1) &
    (ms['Expungeable'] != 'No') &
    (ms['juvenile'] == True)
]

People_eligible = x['Case Number'].nunique()

print(f"There are {People_eligible} cases with charges eligible for expungement in the Middlesex district of MA") 
print(f"if we include 'attempts' as expungeable. This is {People_eligible/Number_Cases:.2%} of all cases. \n", 
     f"This is {People_eligible/Number_Cases_Juvenile: .2%} of juvenile cases")   
      
date_range(x)

x['Disposition Description'].value_counts(dropna=False)[0:10]

There are 1676 cases with charges eligible for expungement in the Middlesex district of MA
if we include 'attempts' as expungeable. This is 1.02% of all cases. 
 This is  28.82% of juvenile cases
1254 occured more than 3 years before 2020-09-01
36 occured more than 7 years before 2020-09-01


DISMISSED W/O  PREJUDICE                       580
PRE-TRIAL PROBATION                            370
CONTINUED W/O FINDING                          186
DISMISSED PRIOR TO ARRAIGNMENT                 157
DISMISSED BY COURT (PRIOR TO ARRAIGNMENT)       78
DISMISSED BY FINES                              43
DELINQUENT CHANGE OF PLEA                       30
GENERAL CONTINUANCE                             29
NOT RESPONSIBLE                                 22
DISMISSED W/O PREJUDICE LACK OF PROSECUTION     21
Name: Disposition Description, dtype: int64

## Question 2
- How many people (under age 21) would be eligible based on only having one incident (which could include multiple charges) that are not part of the list of ineligible offenses?


*We cannot answer this -- we do not have a person-level identifier or any proxy for an identifier. Instead, we can answer what percent of incidents heard in juvenile court are eligible, based on none of the charges being on the list of ineligible offenses.*


In [15]:
x = ms.loc[
    (ms['Inc_Expungeable_Attempts_Are'] == True) &
    (ms['juvenile'] == True)
]

People_eligible = x['Case Number'].nunique()

print(f"There are {People_eligible} cases with all charges eligible for expungement in the Middlesex district of MA") 
print(f"if we include 'attempts' as expungeable. This is {People_eligible/Number_Cases:.2%} of all cases. \n",
     f"This is {People_eligible/Number_Cases_Juvenile: .2%} of juvenile cases")   

date_range(x)

x['Disposition Description'].value_counts(dropna=False)[0:10]

There are 3210 cases with all charges eligible for expungement in the Middlesex district of MA
if we include 'attempts' as expungeable. This is 1.96% of all cases. 
 This is  55.19% of juvenile cases
2462 occured more than 3 years before 2020-09-01
59 occured more than 7 years before 2020-09-01


PRE-TRIAL PROBATION                            2185
DISMISSED W/O  PREJUDICE                       2167
CONTINUED W/O FINDING                          1166
DISMISSED PRIOR TO ARRAIGNMENT                  460
DELINQUENT CHANGE OF PLEA                       329
NOT RESPONSIBLE                                 194
DISMISSED BY COURT (PRIOR TO ARRAIGNMENT)       172
DISMISSED BY FINES                              112
GUILTY CHANGE OF PLEA                            99
DISMISSED W/O PREJUDICE LACK OF PROSECUTION      94
Name: Disposition Description, dtype: int64

## Question 3
- How many people (under age 21) would be eligible based on who has not been found guilty (given current offenses that are eligible for expungement)?

This needs more work: many dispositions are null, and many others indicate a case that is in process, transferred, etc. Those dispositions do not indicate "guilty", but they do not indicate "not guilty" either.

*Because we do not have an individual identifier, this is just a subset of Question 2. It removes any incident where at least 1 offense had a disposition indicating guilt (these appear to be mostly delinquent change of plea or guilty change of plea). If we had an identifier linking individuals across offenses, this criterion might increase the number of people eligible for expungement, because it would waive the single offense/incident criterion. Here, it reduces the number of eligible incidents, because it restricts to incidents with no guilty disposition.*
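The subset relationship can be illustrated on toy data: adding the no-guilty-disposition condition can only remove cases from the Question 2 set. The column names `case_ok` and `case_guilty` are illustrative stand-ins for the incident-level flags, not the notebook's own names:

```python
import pandas as pd

toy = pd.DataFrame({
    "Case Number": ["A", "B", "B", "C"],
    "case_ok": [True, True, True, False],       # all charges in the case are expungeable
    "case_guilty": [False, True, True, False],  # any charge in the case has a guilty disposition
})

# Question 2 style filter: every charge in the incident is expungeable.
q2_cases = set(toy.loc[toy["case_ok"], "Case Number"])

# Question 3 style filter: same, plus no guilty disposition in the incident.
q3_cases = set(toy.loc[toy["case_ok"] & ~toy["case_guilty"], "Case Number"])

print(sorted(q2_cases), sorted(q3_cases))
```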

In [16]:
x = ms.loc[
    (ms['Inc_Expungeable_Attempts_Are']) &
    (ms['juvenile'] == True) &
    (ms['Incident_Guilty'] != True)
]

People_eligible = x['Case Number'].nunique()

print(f"There are {People_eligible} cases with charges eligible for expungement in the Middlesex district of MA, where",
     "no charges in the incident had a guilty disposition") 
print(f"we include 'attempts' as expungeable. This is {People_eligible/Number_Cases:.2%} of all cases. \n", 
     f"This is {People_eligible/Number_Cases_Juvenile: .2%} of juvenile cases")   
      
date_range(x)

x['Disposition Description'].value_counts(dropna=False)[0:10]

There are 2969 cases with charges eligible for expungement in the Middlesex district of MA, where no charges in the incident had a guilty disposition
we include 'attempts' as expungeable. This is 1.81% of all cases. 
 This is  51.05% of juvenile cases
2261 occured more than 3 years before 2020-09-01
56 occured more than 7 years before 2020-09-01


PRE-TRIAL PROBATION                            2184
DISMISSED W/O  PREJUDICE                       2115
CONTINUED W/O FINDING                          1139
DISMISSED PRIOR TO ARRAIGNMENT                  459
NOT RESPONSIBLE                                 186
DISMISSED BY COURT (PRIOR TO ARRAIGNMENT)       172
DISMISSED BY FINES                              105
DISMISSED W/O PREJUDICE LACK OF PROSECUTION      92
GENERAL CONTINUANCE                              90
NOLLE PROSEQUI                                   85
Name: Disposition Description, dtype: int64

## Question 2b
- How many people (under age 21) would be eligible based on only having one incident if only sex-based offenses or murder were excluded from expungement?


*We cannot answer this -- we do not have a person-level identifier or any proxy for an identifier. Instead, we can answer what percent of incidents heard in juvenile court are eligible, based on none of the charges in the incident being related to sex or murder.*
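One caveat on the murder/sex flag: in the `ms.info()` output near the top (before CMR rows are dropped), `sex` and `murder` have fewer non-null values than the frame has rows. A comparison such as `== 1` evaluates to `False` for missing values, so those rows are treated as not related to sex or murder. Whether that is the intended treatment is not stated in the notebook. A sketch on toy data, with illustrative variable names:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Case Number": ["A", "B", "C"],
    "sex": [1.0, 0.0, np.nan],
    "murder": [0.0, 0.0, np.nan],
})

# Same comparison as the notebook's 'sm' column: NaN == 1 is False.
flagged = (toy["sex"] == 1) | (toy["murder"] == 1)

# Rows where the classification is actually unknown.
unknown = toy["sex"].isna() | toy["murder"].isna()

print(pd.DataFrame({"flagged": flagged, "unknown": unknown}))
```

Reporting the number of unknown rows alongside the eligible counts would make the effect of this default visible.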

In [17]:
x = ms.loc[
    (ms['Incident_Murder_Sex'] == False) &
    (ms['juvenile'] == True)
]

People_eligible = x['Case Number'].nunique()

print(f"There are {People_eligible} cases where no cases are related to murder or sex in the Middlesex district of MA") 
print(f"This is {People_eligible/Number_Cases:.2%} of all cases. \n",
     f"This is {People_eligible/Number_Cases_Juvenile: .2%} of juvenile cases")   

date_range(x)

x['Disposition Description'].value_counts(dropna=False)[0:10]

There are 5732 cases where no cases are related to murder or sex in the Middlesex district of MA
This is 3.50% of all cases. 
 This is  98.56% of juvenile cases
4281 occured more than 3 years before 2020-09-01
82 occured more than 7 years before 2020-09-01


DISMISSED W/O  PREJUDICE                       4237
PRE-TRIAL PROBATION                            3371
CONTINUED W/O FINDING                          2019
DISMISSED PRIOR TO ARRAIGNMENT                  742
DELINQUENT CHANGE OF PLEA                       665
NOT RESPONSIBLE                                 394
DISMISSED BY COURT (PRIOR TO ARRAIGNMENT)       338
DISMISSED W/O PREJUDICE LACK OF PROSECUTION     305
NOLLE PROSEQUI                                  302
GUILTY CHANGE OF PLEA                           208
Name: Disposition Description, dtype: int64

## Question 3b
- How many people (under age 21) would be eligible based on who has not been found guilty for all offenses except for murder or sex-based offenses?

*Because we do not have an individual identifier, this is just a subset of Question 2b. It removes any incident where at least 1 offense had a disposition indicating guilt (these appear to be mostly delinquent change of plea or guilty change of plea). If we had an identifier linking individuals across offenses, this criterion might increase the number of people eligible for expungement, because it would waive the single offense/incident criterion. Here, it reduces the number of eligible incidents, because it restricts to incidents with no guilty disposition.*

In [18]:
x = ms.loc[
    (ms['Incident_Murder_Sex'] == False) &
    (ms['juvenile'] == True) &
    (ms['Incident_Guilty'] != True)
]

People_eligible = x['Case Number'].nunique()

print(f"There are {People_eligible} cases where no cases are related to murder or sex in the Middlesex district of MA",
     "and no offenses had a disposition indicating guilty") 
print(f"This is {People_eligible/Number_Cases:.2%} of all cases. \n",
     f"This is {People_eligible/Number_Cases_Juvenile: .2%} of juvenile cases")   

date_range(x)

x['Disposition Description'].value_counts(dropna=False)[0:10]

There are 5246 cases where no cases are related to murder or sex in the Middlesex district of MA and no offenses had a disposition indicating guilty
This is 3.20% of all cases. 
 This is  90.20% of juvenile cases
3899 occured more than 3 years before 2020-09-01
75 occured more than 7 years before 2020-09-01


DISMISSED W/O  PREJUDICE                       4099
PRE-TRIAL PROBATION                            3365
CONTINUED W/O FINDING                          1942
DISMISSED PRIOR TO ARRAIGNMENT                  736
NOT RESPONSIBLE                                 364
DISMISSED BY COURT (PRIOR TO ARRAIGNMENT)       332
DISMISSED W/O PREJUDICE LACK OF PROSECUTION     300
NOLLE PROSEQUI                                  268
GENERAL CONTINUANCE                             175
DISMISSED ON COURT COSTS                        173
Name: Disposition Description, dtype: int64