# Clean Slate: Estimating offenses eligible for expungement under varying conditions
> Prepared by [Laura Feeney](https://github.com/laurafeeney) for Code for Boston's [Clean Slate project](https://github.com/codeforboston/clean-slate).

## Purpose
This notebook cleans and processes data from the Northwestern, Suffolk, and Middlesex districts to facilitate the creation of graphs, tables, and visualizations in Excel and other platforms. 

This draws on the cleaning steps taken in [How many expungable Middlesex](https://github.com/codeforboston/clean-slate/blob/master/analyses/notebooks/How%20many%20expungable-Middlesex.ipynb), [How many expungeable](https://github.com/codeforboston/clean-slate/blob/master/analyses/notebooks/How%20many%20expungable.ipynb) (which covers Northwestern), and [How many expungeable Suffolk](https://github.com/codeforboston/clean-slate/blob/master/analyses/notebooks/How%20many%20expungeable-Suffolk.ipynb). It makes minimal changes and collapses steps into loops where possible, so that the same steps are applied to all data sets. However, because of differences between data sets (for example, Suffolk does not have an individual identifier), this is not always possible.

### Step 0
Import data, programs, etc.

-----

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import numpy as np
import regex as re
import glob, os
import datetime 
from datetime import date 

In [2]:
def names():
    nw.name = 'Northwestern'
    ms.name = 'Middlesex'
    sf.name = 'Suffolk'

In [3]:
# Import dataframes from lightly-processed individual-level data from Northwestern ("NW"),
# Suffolk ("Suff"), and Middlesex ("MS") districts. These have already been merged with expungability indicators.

nw = pd.read_csv('../../data/processed/merged_nw.csv', encoding='utf8',
                    dtype={'Analysis notes':str, 'extra_criteria':str, 'Expungeable': str}) 

ms = pd.read_csv('../../data/processed/merged_ms.csv', encoding='utf8',
                    dtype={'Analysis notes':str, 'extra_criteria':str, 'Expungeable': str}, low_memory=False) 

sf = pd.read_csv('../../data/processed/merged_suff.csv', encoding='utf8')


In [4]:
names()
for x in [nw, ms, sf] :
    print(x.name, '\n===================================================\n')
    print(x.info(), '\n')
    

Northwestern 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75725 entries, 0 to 75724
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Person ID               75725 non-null  object 
 1   Court                   75725 non-null  object 
 2   Offense Date            74915 non-null  object 
 3   Age at Offense          74783 non-null  float64
 4   Filed                   75725 non-null  object 
 5   Status                  75725 non-null  object 
 6   Count                   75725 non-null  int64  
 7   Charge                  75725 non-null  object 
 8   Disposition             72259 non-null  object 
 9   Dispo Date              71881 non-null  object 
 10  CMRoffense              75725 non-null  object 
 11  Chapter                 72457 non-null  object 
 12  Section                 72588 non-null  object 
 13  sex                     75725 non-null  int64  
 14  murder                 

In [5]:
# drop columns that are mostly null

ms = ms.rename(columns={"Disposition Description": "Disposition"})
names()

col_to_drop = ['extra_criteria', 'Reason not expungeable', 'Analysis notes']
for x in [nw, ms, sf] :
    print(x.name, '\n===================================================\n')
    x.drop(columns = col_to_drop,  errors = "ignore", inplace=True)
    print(x.info(), '\n')
    


Northwestern 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75725 entries, 0 to 75724
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Person ID       75725 non-null  object 
 1   Court           75725 non-null  object 
 2   Offense Date    74915 non-null  object 
 3   Age at Offense  74783 non-null  float64
 4   Filed           75725 non-null  object 
 5   Status          75725 non-null  object 
 6   Count           75725 non-null  int64  
 7   Charge          75725 non-null  object 
 8   Disposition     72259 non-null  object 
 9   Dispo Date      71881 non-null  object 
 10  CMRoffense      75725 non-null  object 
 11  Chapter         72457 non-null  object 
 12  Section         72588 non-null  object 
 13  sex             75725 non-null  int64  
 14  murder          75725 non-null  int64  
 15  Expungeable     75724 non-null  object 
dtypes: float64(1), int64(3), object(12)
memory usage: 9.2+ MB
Non

## Step 0.5 - Prepare data

- Prepare dates, date since offense
- Drop CMR offenses
- Generate indicators for incidents, and code incidents as expungeable, sex-related etc
- Generate indicator for found guilty / not found guilty 

### Dates and ages. Drop negative and null ages, look at date ranges

**Northwestern**
The file name suggests this should be from 2014-2018: 

Prosecution Northwestern DA 2014-2018 RAW DATA - 2014 to early 2018.csv, and 
Prosecution Northwestern DA 2014-2018 RAW DATA - to end of 2018.csv

**Middlesex**
Downloaded from https://www.middlesexda.com/public-information/pages/prosecution-data-and-statistics, with description, "The following is data from our Damion Case Management System pertaining to prosecution statistics for the time period from January 1, 2014, through January 1, 2020."

**Suffolk**
We don't have access to the original file name. Based on the dates in the file, the majority of charges occurred between 2012 and 2020, though the oldest date back to 1935. It is possible that some of the oldest outlier dates are errors in the original data.

In [6]:
names()
reference_date = datetime.date(2020, 9, 1) # fixed date: using date.today() would not give stable results

#convert to datetime
for x in [nw, ms, sf] :
    x['Offense Date'] = pd.to_datetime(x['Offense Date']).dt.date
    x['years_since_offense'] = round((reference_date - x['Offense Date'])/pd.Timedelta(1, 'D')/365,1)

# Only NW has an age indicator. For this district, drop any rows where age is <= 0 or null.
number_invalid = len(nw['Age at Offense'][(nw['Age at Offense']<=0) | (nw['Age at Offense'].isnull()) ])
print("Dropping", number_invalid, "rows in NW where age is <0 or null.")
nw = nw[(nw['Age at Offense']>0) & ~(nw['Age at Offense'].isnull()) ]


#Middlesex, Suffolk Dropping rows where date is null
for x in [ms, sf] :
    print("Dropping", len(x[x['Offense Date'].isnull()]), "rows in", x.name, "where date is null")   

    if x.name == "Middlesex":
        ms = x[~x['Offense Date'].isnull()].copy()    
    else:
        sf = x[~x['Offense Date'].isnull()].copy()  



Dropping 1128 rows in NW where age is <0 or null.
Dropping 27 rows in Middlesex where date is null
Dropping 2393 rows in Suffolk where date is null
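
The elapsed-time arithmetic above can be checked on a small example. The sketch below is not part of the original pipeline: it uses made-up dates and keeps the column as `datetime64` (rather than converting to `date` objects) so that missing dates propagate as `NaN`. The 365-day year matches the approximation used in the cell above.

```python
import pandas as pd

# Fixed reference date, matching the one used in this notebook.
reference_date = pd.Timestamp(2020, 9, 1)

# Hypothetical offense dates, including one missing value.
toy = pd.DataFrame({'Offense Date': ['2019-09-02', '2015-03-01', None]})
toy['Offense Date'] = pd.to_datetime(toy['Offense Date'])

# Whole days elapsed, divided by a 365-day year, rounded to one decimal place.
elapsed_days = (reference_date - toy['Offense Date']).dt.days
toy['years_since_offense'] = (elapsed_days / 365).round(1)

# Rows with a missing date get NaN and can be dropped afterwards.
toy_valid = toy[toy['Offense Date'].notnull()].copy()
print(toy)
```

Dividing by 365 ignores leap days, so the result drifts by roughly a day every four years. At one-decimal precision this rarely matters, but it could near a statutory waiting-period boundary.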


**CMR** : Many offenses are violations of the Code of Massachusetts Regulations (CMR) rather than criminal offenses. These include some driving or boating infractions (e.g., not having headlights on) and not having a hunting/fishing license. Per conversations with Sana, we drop all CMR offenses.

In [7]:
# CMR: Drop all CMR offenses and drop the CMR-related column
names()

for x in [nw, ms, sf] :
    print(x.name, "\n===================")
    print(f'There are {x.shape[0]} total offenses including CMR.')
    print(x['CMRoffense'].value_counts(dropna=False))
    if x.name == "Middlesex":
        # Middlesex codes the CMR flag as a boolean; the other districts use 'yes'/'no'
        ms = x.loc[(x['CMRoffense'] == False)]
        ms = ms.drop(columns = ['CMRoffense'])
        remaining = ms.shape[0]
    elif x.name == "Suffolk":
        sf = x.loc[(x['CMRoffense'] == 'no')]
        sf = sf.drop(columns = ['CMRoffense'])
        remaining = sf.shape[0]
    else: 
        nw = x.loc[(x['CMRoffense'] == 'no')]
        nw = nw.drop(columns = ['CMRoffense'])
        remaining = nw.shape[0]

    # Report the size of the filtered frame; x still refers to the unfiltered one
    print(f'After we drop CMR, there are {remaining} total offenses.\n')


Northwestern 
There are 74597 total offenses including CMR.
no     74164
yes      433
Name: CMRoffense, dtype: int64
After we drop CMR, there are 74164 total offenses.

Middlesex 
There are 392578 total offenses including CMR.
False    387531
True       5047
Name: CMRoffense, dtype: int64
After we drop CMR, there are 387531 total offenses.

Suffolk 
There are 300877 total offenses including CMR.
no     298438
yes      2439
Name: CMRoffense, dtype: int64
After we drop CMR, there are 298438 total offenses.
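
The CMR flag is coded differently across districts (a boolean in Middlesex, `'yes'`/`'no'` in Northwestern and Suffolk), which is why the loop above branches on the district name. One possible way to apply the same filter to every frame is sketched below. The helper `is_cmr` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def is_cmr(flag: pd.Series) -> pd.Series:
    """Return True where a row is a CMR offense, for boolean or 'yes'/'no' coding."""
    if pd.api.types.is_bool_dtype(flag):
        return flag
    # Text coding: treat 'yes' (or a stringified True) as a CMR offense.
    return flag.astype(str).str.strip().str.lower().isin(['yes', 'true'])

# Hypothetical miniature versions of two district files.
toy_frames = {
    'Northwestern': pd.DataFrame({'Charge': ['a', 'b', 'c'],
                                  'CMRoffense': ['no', 'yes', 'no']}),
    'Middlesex': pd.DataFrame({'Charge': ['d', 'e'],
                               'CMRoffense': [False, True]}),
}

cleaned = {}
for name, df in toy_frames.items():
    # Keep non-CMR rows, then drop the flag column, which is no longer needed.
    kept = df.loc[~is_cmr(df['CMRoffense'])].drop(columns=['CMRoffense'])
    print(f'{name}: {len(df)} offenses before, {len(kept)} after dropping CMR')
    cleaned[name] = kept
```

Storing the results in a dictionary keyed by district avoids reassigning `nw`, `ms`, and `sf` inside the loop. It would require changes to the later cells, so it is shown only as an alternative.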



### Incidents vs offenses
An offense is a particular instance of a violation of law. An incident, also known as a case, is a set of offenses allegedly committed by an individual at a particular date / time. For example: Stealing a packet of gum would be an offense. If you stole a packet of gum and also assaulted the sales clerk in the process, you would have at least 2 offenses (theft and assault) within a single incident. If you assault multiple people, you may have multiple offenses of 'assault' within the same incident. It is technically possible to have two incidents on the same day (imagine, for example, robbing a store, taking a break, and then mugging someone unrelated).  

In Northwestern, we assume any offense committed on the same date by the same Person pertains to one incident. 
In Middlesex, we were given a Case Number which groups offenses within the same case (same as incident for our purposes). 

For some aspects of MA expungement law, eligibility for expungement depends on the number of offenses or the number of unique incidents. Within an incident, some offenses may be more or less "serious" than others; however, we are currently assuming that if any offense within the incident is not eligible for expungement, all offenses within that incident also become ineligible. 
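
As an illustration of these definitions, the sketch below counts offenses and incidents on a made-up four-row table and applies the "one ineligible offense makes the whole incident ineligible" rule. The data and the `offense_ok` / `incident_ok` column names are hypothetical. It uses `transform('all')`, which on a boolean column gives the same answer as taking the minimum.

```python
import pandas as pd

# Hypothetical data: person A has two offenses on one date and one on another.
toy = pd.DataFrame({
    'Person ID': ['A', 'A', 'A', 'B'],
    'Offense Date': ['2018-01-05', '2018-01-05', '2018-03-10', '2018-01-05'],
    'Expungeable': ['Yes', 'No', 'Yes', 'Yes'],
})

# Offenses per person: one row per offense, so count the rows in each group.
toy['offenses_per_person'] = toy.groupby('Person ID')['Person ID'].transform('count')

# Incidents per person: distinct offense dates for the same person.
toy['incidents_per_person'] = toy.groupby('Person ID')['Offense Date'].transform('nunique')

# An incident is eligible only if every offense within it is eligible.
toy['offense_ok'] = toy['Expungeable'] == 'Yes'
toy['incident_ok'] = toy.groupby(['Person ID', 'Offense Date'])['offense_ok'].transform('all')

print(toy)
```

Person A's two offenses on 2018-01-05 are both marked ineligible at the incident level, even though one of them is eligible on its own.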

In [8]:
#Incidents - Northwestern

#Number of incidents (separate offense dates per same ID)
nw['Incidents'] = (nw.groupby('Person ID')['Offense Date'].transform('nunique'))

In [9]:
names()

for x in [nw, sf] :
    print(x.name, "\n===================")
    
    # Add column: Total Offenses
    
    x['Total Offenses per Person'] = x.groupby('Person ID')['Person ID'].transform('count')
    print(x['Total Offenses per Person'].describe(), "\n")

    # Add column: Incidents
    x['Incidents per Person'] = x.groupby(['Person ID'])['Offense Date'].transform('nunique')
    print(x['Incidents per Person'].describe(), "\n")
    
print(ms.name, "\n===================")
ms['Offenses_per_case']=ms.groupby('Case Number')['Case Number'].transform('count')
print(ms['Offenses_per_case'].describe(), "\n")

Northwestern 
count    74164.000000
mean         9.309638
std         14.502255
min          1.000000
25%          3.000000
50%          5.000000
75%         11.000000
max        202.000000
Name: Total Offenses per Person, dtype: float64 

count    74164.000000
mean         2.864759
std          2.504922
min          1.000000
25%          1.000000
50%          2.000000
75%          4.000000
max         22.000000
Name: Incidents per Person, dtype: float64 

Suffolk 
count    298438.000000
mean          9.153861
std          17.742343
min           1.000000
25%           2.000000
50%           5.000000
75%          11.000000
max         352.000000
Name: Total Offenses per Person, dtype: float64 

count    298438.000000
mean          3.272452
std           3.868787
min           1.000000
25%           1.000000
50%           2.000000
75%           4.000000
max          45.000000
Name: Incidents per Person, dtype: float64 

Middlesex 
count    387531.000000
mean          4.895203
std       

### Attempts
Some offenses in the files are listed as "attempts" to commit an offense. Eligibility for expungement depends upon whether the attempted offense itself would be eligible, but often we do not have sufficient information to identify the attempted offense. 

To expedite analysis, we have assumed 'attempts' are eligible. They are highly infrequent, so reversing this decision would not have a major influence on the results. 
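
One way to check how much this assumption matters is to code eligibility both ways and count how many incidents change. The sketch below does this on made-up data. The column names `eligible_if_attempts_ok` and `eligible_if_attempts_not_ok` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data: case 1 contains an attempt, case 2 a non-expungeable offense.
toy = pd.DataFrame({
    'Case Number': [1, 1, 2, 2, 3],
    'Expungeable': ['Yes', 'Attempt', 'Yes', 'No', 'Yes'],
})

# Share of charges that are attempts.
share_attempts = (toy['Expungeable'] == 'Attempt').mean()

# Assumption used in this notebook: attempts count as eligible.
toy['eligible_if_attempts_ok'] = toy['Expungeable'].isin(['Yes', 'Attempt'])
# Alternative assumption: attempts count as not eligible.
toy['eligible_if_attempts_not_ok'] = toy['Expungeable'] == 'Yes'

# A case is eligible only if every offense in it is eligible.
by_case = toy.groupby('Case Number')[
    ['eligible_if_attempts_ok', 'eligible_if_attempts_not_ok']
].all()

# Number of cases whose eligibility depends on how attempts are treated.
changed_cases = int(
    (by_case['eligible_if_attempts_ok'] != by_case['eligible_if_attempts_not_ok']).sum()
)
print(f'{share_attempts:.0%} of charges are attempts; {changed_cases} case(s) change.')
```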

In [10]:
# If an incident includes one offense that is not expungeable, we mark the entire incident as not expungeable.

names()

for x in [nw, ms, sf] :
    print(x.name, "\n===================")
    print(x.Expungeable.value_counts(dropna = False), "\n")

Northwestern 
Yes        54147
No         19798
Attempt      219
Name: Expungeable, dtype: int64 

Middlesex 
Yes        273067
No         112554
Attempt      1519
NaN           383
m               8
Name: Expungeable, dtype: int64 

Suffolk 
Yes              214012
No                81049
m                  1902
Attempt            1474
NotApplicable         1
Name: Expungeable, dtype: int64 



In [11]:
for x in [nw, ms, sf] :
    print(x.name, "\n===================")
    
    attempts = len(x[x['Expungeable'] == 'Attempt'])
    total = len(x)
    print('There are', attempts, 'charges marked as attempts in', x.name, 'comprising', '{:.2%}'.format(attempts/total), 'of the data.')
    
    # New column = 1 if expungeable or attempt
    x['ExpAtt'] = (x['Expungeable']=="Yes") | (x['Expungeable']=="Attempt")
    
    # Column for sex or murder related offenses
    x['sm'] = (x['sex'] == 1) | (x['murder'] ==1)

Northwestern 
There are 219 charges marked as attempts in Northwestern comprising 0.30% of the data.
Middlesex 
There are 1519 charges marked as attempts in Middlesex comprising 0.39% of the data.
Suffolk 
There are 1474 charges marked as attempts in Suffolk comprising 0.49% of the data.


In [12]:
# code entire incident as expungeable or not, based on whether it contains a non-expungeable offense

#Northwestern
nw['Inc_Expungeable_Attempts_Are'] = nw.groupby(['Person ID', 'Offense Date'])['ExpAtt'].transform('min')

#Middlesex
ms['Inc_Expungeable_Attempts_Are'] = ms.groupby(['Case Number'])['ExpAtt'].transform('min')

#Suffolk
#No individual identifier in Suffolk

#drop unneeded calculation columns
nw = nw.drop(columns=['ExpAtt', 'sm'])
ms = ms.drop(columns=['ExpAtt', 'sm'])

### Dispositions and Guilty
Referencing this sheet to determine which to code as not found guilty vs found guilty.
https://docs.google.com/spreadsheets/d/1axzGGxgQFPwpTw7EbBlC519L43fOkqC5/edit#gid=487812267

There is a lot of variation in the dispositions used and whether they actually indicate a finding. For example, some dispositions in Middlesex indicate that a case was transferred to another district or court -- this has no relationship to guilty/not guilty. 

We assign a binary variable indicating whether an offense was "found guilty" based on being in a list of dispositions indicating guilty or responsible, below. Any offense with a disposition not in this list -- even if it is a transfer or other obvious phrase that indicates a case is still in progress -- is considered "not found guilty". 
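
A minimal sketch of this list-based coding, using a shortened, hypothetical list of dispositions rather than the full per-district lists defined in this notebook:

```python
import pandas as pd

# Abbreviated example list (an assumption for illustration only).
guilty_examples = ['Guilty', 'Guilty Filed', 'Responsible']

toy = pd.DataFrame({
    'Disposition': ['Guilty', 'Dismissed', 'Case Transferred', None, 'Responsible'],
})

# True only when the disposition is in the list. Anything else, including
# transfers and missing values, is coded False ("not found guilty").
toy['guilty'] = toy['Disposition'].isin(guilty_examples)

# Listing the dispositions that fall outside the list makes it easier to
# review whether any of them should have been included.
unlisted = sorted(toy.loc[~toy['guilty'], 'Disposition'].dropna().unique())
print(unlisted)
```

Because `isin` returns `False` for missing values, offenses without a disposition are coded "not found guilty" at this stage.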

In [13]:
names()

for x in [nw, ms, sf] :
    print(x.name, "\n===================")
    print(sorted(x['Disposition'].loc[x['Disposition'].notnull()].unique()), "\n")
    no_dispo =  x['Disposition'].isna().sum()
    total = len(x)
    print('There are', '{:,}'.format(no_dispo), 'charges without a disposition in', x.name, 'comprising', '{:.2%}'.format(no_dispo/total), 'of the data. \n')
    


Northwestern 
['Accord/Satisfaction', 'Agreed Plea', 'CLOSED-INDICTED', 'CLOSED-NO CHARGES', 'Case Transferred', 'Charge Handled as a Civil Charge', 'Continued w/o Finding', 'Continued/Valor Act', "DA's Complaint", 'DYS Committed', 'Delinquent', 'Delinquent Filed', 'Directed Verdict', 'Dismissed', 'Dismissed - Lack of Prosecution', 'Dismissed Prior to Arraignment', 'Dismissed at Request of Comm', 'Dismissed by Court', 'Dismissed on Payment', 'Dismissed prior to complaint', 'District Court Dispo', 'Found Incompetent', 'Guilty', 'Guilty Filed', 'Guilty on Lesser Included Offense', 'NGI', 'No Time to Reach', 'Nolle Prosequi', 'Not Guilty', 'Not Guilty by Reason of Mental Illness', 'Not Responsible', 'Required Finding of Not Guilty', 'Responsible', 'Responsible Filed', 'Unagreed Plea', 'Valor Act Dispo', 'Youthful Offender', 'c276s87 finding'] 

There are 3,285 charges without a disposition in Northwestern comprising 4.43% of the data. 

Middlesex 
['BOUND OVER/PROBABLE CAUSE FOUND', 'CONT

In [14]:
# Assign a binary guilty / not guilty disposition. 

guilty_dispos_nw = ['Agreed Plea', 'c276s87 finding', 
                 'Delinquent', 'Delinquent Filed', 'Guilty', 'Guilty Filed', 
                 'Guilty on Lesser Included Offense',
                'Responsible', 'Responsible Filed', 
                 'Unagreed Plea', 'Youthful Offender']
nw['guilty'] = nw['Disposition'].isin(guilty_dispos_nw)


guilty_dispos_ms = ['DELINQUENT BENCH TRIAL', 'DELINQUENT CHANGE OF PLEA', 
                'DELINQUENT CHANGE OF PLEA LESSER OFFENSE', 'DELINQUENT JURY TRIAL',
                'GUILTY BENCH TRIAL', 'GUILTY BENCH TRIAL LESSER INCLUDED',
                'GUILTY CHANGE OF PLEA', 'GUILTY CHANGE OF PLEA LESSER OFFENSE', 
                'GUILTY FILED', 'GUILTY FINES', 'GUILTY JURY TRIAL', 
                'GUILTY JURY TRIAL LESSER INCLUDED', 
                'Guilty Jury Trial (and Bench) Lesser Included', 'RESPONSIBLE']

ms['guilty'] = ms['Disposition'].isin(guilty_dispos_ms)

#Suffolk is more complicated: Two columns for dispositions

guilty_disposition_reasons = ['Guilty - Committed', 'Guilty - Probation', 'Dismissed for Agreed Plea', 
                    'Guilty - Fine', 'Guilty - Suspended Sentence', 'Guilty', 'Guilty - Filed', 
                    'Guilty - Split Sentence', 'Responsible - Fine C277S70', 'Guilty - Lesser Offense', 
                    'Delinquent - Committed', 'Delinquent - Probabtion', 'Delinquent - Filed', 
                    'Delinquent - Fine', 'Delinquent - Suspended', 'Delinquent']
guilty_dispositions = ['Plea']

sf['guilty'] = sf['Description Disposition Reason'].isin(guilty_disposition_reasons)
# When disposition reason is missing and disposition is not, use disposition to determine guilty or nonguilty
sf.loc[(sf['Description Disposition Reason'].isnull() & (sf['Disposition'].notnull())), 'guilty'] = 0
sf.loc[(sf['Description Disposition Reason'].isnull() & (sf['Disposition'].notnull()) & (sf['Disposition'].isin(guilty_dispositions))), 'guilty'] = 1


#### missing disposition
Some offenses have no associated disposition data. By coding the 'guilty' indicator (which is otherwise binary / boolean) as -1 for these missing cases, we can group them with non-guilty offenses in the analysis by grouping by case/person and offense date and transforming to the 'max' of the _guilty_ indicator. To group them with guilty instead, switch the -1 below to a 2.
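
The effect of the missing-value code can be seen on a small made-up example. The helper `incident_flag` below is hypothetical, and it treats only the literal disposition `'Guilty'` as guilty to keep the sketch short.

```python
import pandas as pd

# Hypothetical data: case 1 has a guilty finding plus a missing disposition,
# case 2 a dismissal plus a missing disposition, case 3 only a missing one.
toy = pd.DataFrame({
    'Case Number': [1, 1, 2, 2, 3],
    'Disposition': ['Guilty', None, 'Dismissed', None, None],
})

def incident_flag(df: pd.DataFrame, missing_code: int) -> pd.Series:
    """Code each offense 1/0, overwrite missing dispositions, take the case-level max."""
    flag = df['Disposition'].eq('Guilty').astype(int)
    flag = flag.mask(df['Disposition'].isnull(), missing_code)
    return flag.groupby(df['Case Number']).transform('max')

# With -1, a missing disposition never outranks a recorded one.
toy['missing_as_minus_1'] = incident_flag(toy, -1)
# With 2, any missing disposition dominates the whole case.
toy['missing_as_2'] = incident_flag(toy, 2)
print(toy)
```

With -1, a case takes the value -1 only when every offense in it lacks a disposition. With 2, a single missing disposition marks the entire case.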

In [15]:
sf.loc[(sf['Description Disposition Reason'].isnull() & (sf['Disposition'].isnull())), 'guilty'] = -1


# guilty is -1 if there is no disposition at all. Grouping by person/case and offense date and taking
# the max means a missing disposition never outranks a recorded one: an incident is coded -1 only
# when every offense in it lacks a disposition. To count missing dispositions as 'guilty' instead, use 2.

for x in [nw, ms] :
    x.loc[x.Disposition.isnull(), 'guilty'] = -1 

nw['Incident_Guilty_or_missing'] = nw.groupby(['Person ID', 'Offense Date'])['guilty'].transform('max')
ms['Incident_Guilty_or_missing'] = ms.groupby(['Case Number', 'Offense Date'])['guilty'].transform('max')

# Output files

In [17]:
names()
for x in [nw, ms, sf] :
    print(x.name, '\n===================================================\n')
    print(x.info(), '\n')
    

Northwestern 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74164 entries, 0 to 75724
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Person ID                     74164 non-null  object 
 1   Court                         74164 non-null  object 
 2   Offense Date                  74164 non-null  object 
 3   Age at Offense                74164 non-null  float64
 4   Filed                         74164 non-null  object 
 5   Status                        74164 non-null  object 
 6   Count                         74164 non-null  int64  
 7   Charge                        74164 non-null  object 
 8   Disposition                   70879 non-null  object 
 9   Dispo Date                    70513 non-null  object 
 10  Chapter                       70945 non-null  object 
 11  Section                       71074 non-null  object 
 12  sex                           74164 non-null 

In [18]:
nw.to_csv('../../data/cleaned/clean_northwestern.csv', index=False)
sf.to_csv('../../data/cleaned/clean_suffolk.csv', index=False)
ms.to_csv('../../data/cleaned/clean_middlesex.csv', index=False)