# Clean Slate: Estimating offenses eligible for expungement under varying conditions
> Prepared by [Laura Feeney](https://github.com/laurafeeney) for Code for Boston's [Clean Slate project](https://github.com/codeforboston/clean-slate).

## Purpose
This notebook is an alternative to [MA_Data-2_mergecharges](https://github.com/codeforboston/clean-slate/blob/master/analyses/notebooks/MA_Data-2_MergeCharges.ipynb). That notebook creates duplicate charges because chapter and section do not uniquely identify expungeability. Ideally we would create an updated "Master Crime List" and update the full data flow. This notebook is a stopgap in the meantime, since we have expungeability for all offenses in the NW and Suffolk data that we currently have.
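To illustrate the duplication problem, here is a minimal sketch on toy data. The charge names, chapter, section, and expungeability values below are hypothetical and are not taken from the real files. It compares a merge on chapter and section with a merge on the full charge text.

```python
import pandas as pd

# Hypothetical lookup table: two charges share a chapter and section
# but differ in expungeability.
charges = pd.DataFrame({
    "Charge": ["OFFENSE A", "OFFENSE B"],
    "Chapter": ["100", "100"],
    "Section": ["1", "1"],
    "Expungeable": ["Yes", "No"],
})

# Hypothetical individual-level records, one row per charged offense.
individuals = pd.DataFrame({
    "Person ID": ["P1", "P2"],
    "Charge": ["OFFENSE A", "OFFENSE B"],
    "Chapter": ["100", "100"],
    "Section": ["1", "1"],
})

# Merging on chapter and section matches every record to both lookup rows.
by_section = individuals.merge(
    charges[["Chapter", "Section", "Expungeable"]],
    on=["Chapter", "Section"],
    how="left",
)

# Merging on the full charge text keeps one row per record.
by_charge = individuals.merge(
    charges[["Charge", "Expungeable"]], on="Charge", how="left"
)

print(len(individuals), len(by_section), len(by_charge))
```

In this toy case the chapter and section merge doubles the row count, while the merge on `Charge` preserves it.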

This notebook will also clean up the individual level data.

-----

In [1]:
import pandas as pd
import numpy as np
import regex as re
import glob, os

In [2]:
# individual-level data from NW district. This is as raw as possible.

nw_ind = pd.read_csv('../../data/raw/nw.csv', encoding='cp1252') 
nw_ind.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75725 entries, 0 to 75724
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Person ID       75725 non-null  object 
 1   Court           75725 non-null  object 
 2   Offense Date    74915 non-null  object 
 3   Age at Offense  74783 non-null  float64
 4   Filed           75725 non-null  object 
 5   Status          75725 non-null  object 
 6   Count           75725 non-null  int64  
 7   Charge          75725 non-null  object 
 8   Disposition     72259 non-null  object 
 9   Dispo Date      71881 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 5.8+ MB


#### Add column for CMR
Create a column indicating whether the offense is a violation of the Code of Massachusetts Regulations (CMR) rather than a criminal violation. These are things like traffic violations, watershed violations, headlight issues, aftermarket tinting on a car, not having a hunting license, and car registration issues. It is unclear whether these should be included in the analysis.



In [3]:
# flag charges whose text mentions CMR; assigning the column directly avoids a chained in-place fillna
nw_ind['CMRoffense'] = np.where(nw_ind['Charge'].str.contains("CMR"), 'yes', 'no')
nw_ind.loc[nw_ind['CMRoffense']=='yes']['Charge'].value_counts()

STATE HWAYÂ—TRAFFIC VIOLATION * 720 CMR Â§9.06                   179
MDC WATERSHEDÂ—NON-MV VIOLATION 350 CMR Â§11.09                   61
HEADLIGHTS, FAIL DIM * 540 CMR Â§22.05(2)                         47
REGISTRATION STICKER MISSING * 540 CMR Â§2.05(6)(a)               36
STATE HWAYÂ—SIGNAL/SIGN/MARKINGS VIOL * 720 CMR Â§9.06            25
MDC WATERSHEDÂ—MV VIOLATION  350 CMR Â§11.09                      19
STATE HWAYÂ—WRONG WAY * 720 CMR Â§9.05                            13
STATE HWAYÂ—TRAFFIC VIOLATION * 720 CMR Â§9.07                     9
MOTOR CARRIER SAFETY VIOLATION 540 CMR Â§14.03                     9
AFTERMARKET LIGHTING, NONCOMPLIANT * 540 CMR Â§22.07               8
MOTOR VEH INSPECTION STATION VIOLATION 540 CMR Â§4.00              7
MBOAT OPERATION VIOLATION 323 CMR Â§2.07                           6
FISH/WILDLIFEÂ—HUNT/FISH VIOL 321 CMR Â§3.00                       5
NUMBER PLATE, MISUSE DEALER/REPAIR 540 CMR Â§18.04(2)              4
HEADLIGHTS, ALTERNATING FLASHING *

### Read in datasets with expungeability info

prosecution_charges_detailed is a processed file with some expungement information. It is based on the prosecution_northwestern and prosecution_suff files originally created by [MA_Data-2_mergecharges](https://github.com/codeforboston/clean-slate/blob/master/analyses/notebooks/MA_Data-2_MergeCharges.ipynb). We did further processing in R to clean up the expungeability column and remove duplicates.

It also has an extra_criteria column showing what, beyond chapter and section, is needed to determine expungeability, and dummy variables for sex and murder offenses to help with later analysis.

In [4]:
PCD = pd.read_csv('../../data/processed/prosecution_charges_detailed.csv', encoding='cp1252') 
PCD.rename(columns={"Expungeable.":"Expungeable"}, inplace=True)
columns = ['Charge', 'Chapter', 'Section', 'Expungeable', 'sex', 'murder', 'extra_criteria']
PCD = PCD[columns]
PCD.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1650 entries, 0 to 1649
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Charge          1650 non-null   object
 1   Chapter         1555 non-null   object
 2   Section         1562 non-null   object
 3   Expungeable     1189 non-null   object
 4   sex             1650 non-null   int64 
 5   murder          1650 non-null   int64 
 6   extra_criteria  41 non-null     object
dtypes: int64(2), object(5)
memory usage: 90.4+ KB


In [5]:
# This sheet fills in the missing information from the PCD file. This was manually filled in based on examination of the charges
# vs the statute, and confirmed in conversation with Sana from cfjj in August 2020

addtl_exp = pd.read_csv('../../data/raw/missing_expungeability_08-02.csv', encoding='cp1252') 
addtl_exp.rename(columns={"Expungeable.":"Expungeable"}, inplace=True)
columns = ['Charge', 'Expungeable', 
           'Reason not expungeable', 'Analysis notes']
addtl_exp = addtl_exp[columns]
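# standardize the labels; assign the result back rather than replacing in place on a column of a subset
addtl_exp['Expungeable'] = addtl_exp['Expungeable'].replace({'yes': 'Yes', 'no': 'No', 'na': 'NotApplicable'})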

## Merge NW data with expungeability

This step is the messiest. Hopefully updates to the data pipeline will simplify it.

1. Merge the individual-level data (nw_ind) with prosecution_charges_detailed (PCD), which has most of the data filled in.
2. Merge the resulting data set with the manually filled information on expungeability (addtl_exp).

At some point both steps will be replaced by merging the individual data with the Master Crime List, once that is the most up-to-date source of which offenses may be expunged.

Note: we cannot use the Suffolk County data to answer the questions because it does not have age.
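Because the point of merging on `Charge` is to avoid duplicated rows, the merge can be guarded with pandas' `validate` and `indicator` options. The sketch below uses small hypothetical stand-ins for nw_ind and PCD. It is one possible safeguard, not something the cells below do.

```python
import pandas as pd

# Hypothetical stand-ins for nw_ind and PCD.
ind = pd.DataFrame({
    "Person ID": ["P1", "P2", "P3"],
    "Charge": ["CHARGE A", "CHARGE B", "CHARGE C"],
})
pcd = pd.DataFrame({
    "Charge": ["CHARGE A", "CHARGE B"],
    "Expungeable": ["Yes", "No"],
})

# validate="many_to_one" raises MergeError if the lookup table repeats a Charge;
# indicator=True adds a _merge column showing which rows found a match.
merged = ind.merge(
    pcd, on="Charge", how="left", validate="many_to_one", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Charge"]

# A lookup table with a repeated Charge is rejected instead of silently
# duplicating individual-level rows.
pcd_with_dup = pd.concat([pcd, pcd.iloc[[0]]], ignore_index=True)
duplicate_detected = False
try:
    ind.merge(pcd_with_dup, on="Charge", how="left", validate="many_to_one")
except pd.errors.MergeError:
    duplicate_detected = True

print(unmatched.tolist(), duplicate_detected)
```

The `unmatched` series plays the same role as the "what is still missing?" check later in this section.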

In [6]:
nw_merged = nw_ind.merge(PCD, on='Charge', how='left')
nw_merged = nw_merged.merge(addtl_exp, on='Charge', how='left')

# combine the expungeability information from the two data sets:
# prefer the PCD value (_x) and fall back to the manually filled value (_y)

nw_merged['Expungeable'] = nw_merged['Expungeable_x'].fillna(nw_merged['Expungeable_y'])
nw_merged = nw_merged.drop(columns = ['Expungeable_x', 'Expungeable_y'])
nw_merged['Expungeable'].value_counts(dropna=False)

Yes              55053
No               20007
NotApplicable      439
NaN                226
Name: Expungeable, dtype: int64

In [7]:
pd.crosstab(nw_merged['Expungeable'], nw_merged['CMRoffense'])

CMRoffense        no  yes
Expungeable
No             20007    0
NotApplicable      0  439
Yes            55052    1


In [8]:
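# One CMR offense was matched as expungeable. Relabel it "NA" (note that this is a separate label from "NotApplicable").
nw_merged.loc[(nw_merged['Expungeable'] == "Yes") & (nw_merged['CMRoffense'] == "yes"), ['Expungeable']] = "NA"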

# what is still missing?

nw_merged.loc[nw_merged.Expungeable.isnull()]['Charge'].value_counts(dropna = False)

ATTEMPT TO COMMIT CRIME c274 Â§6              218
BURGLARY, UNARMED & ASSAULT c266 Â§14           6
ATTEMPT TO COMMIT CRIME, HABITUAL c274 Â§6      2
Name: Charge, dtype: int64

In [9]:
nw_merged.loc[(nw_merged['Charge'] == "BURGLARY, UNARMED & ASSAULT c266 Â§14"),['Expungeable']] = "No" # not expungeable

# We don't have enough info on what crime was attempted to determine expungeability. These are a small share of charges, and
# many may be filtered out by the other reasons an offense may not be expungeable.

nw_merged.loc[(nw_merged['Charge'] == "ATTEMPT TO COMMIT CRIME c274 Â§6"),['Expungeable']] = "Attempt" # need more info 
nw_merged.loc[(nw_merged['Charge'] == "ATTEMPT TO COMMIT CRIME, HABITUAL c274 Â§6"),['Expungeable']] = "Attempt" # need more info
nw_merged['Expungeable'].value_counts(dropna=False)

Yes              55052
No               20013
NotApplicable      439
Attempt            220
NA                   1
Name: Expungeable, dtype: int64
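To put a number on "a small share", the share of each label can be computed from the counts printed above. This is a small sketch that re-enters those counts by hand. On the real data, `nw_merged['Expungeable'].value_counts(normalize=True)` would give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "Yes": 55052,
    "No": 20013,
    "NotApplicable": 439,
    "Attempt": 220,
    "NA": 1,
})

# Percentage of all NW charges carrying each label.
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```

The "Attempt" rows come to about 0.29% of the 75,725 NW charges.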

In [10]:
nw_merged.to_csv('../../data/processed/merged_nw.csv', index=False)

## Merge Suffolk data with expungeability

Repeat the steps above for the Suffolk data. Note that we cannot use the Suffolk County data to answer the questions because it does not have age.
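Since the same steps run for both districts, they could be wrapped in a function. The following is a minimal sketch on toy data. The helper name `add_expungeability` and the sample charges are hypothetical, and the manual fixes for the attempt and burglary charges are left out.

```python
import numpy as np
import pandas as pd

def add_expungeability(ind, pcd, addtl):
    """Hypothetical helper: flag CMR offenses and attach expungeability."""
    out = ind.copy()
    # na=False treats a missing charge as not matching "CMR".
    is_cmr = out["Charge"].str.contains("CMR", na=False)
    out["CMRoffense"] = np.where(is_cmr, "yes", "no")
    out = out.merge(pcd, on="Charge", how="left")
    out = out.merge(addtl, on="Charge", how="left", suffixes=("_x", "_y"))
    # Prefer the PCD value, fall back to the manually filled sheet.
    out["Expungeable"] = out["Expungeable_x"].fillna(out["Expungeable_y"])
    return out.drop(columns=["Expungeable_x", "Expungeable_y"])

# Toy inputs standing in for the individual data, PCD, and addtl_exp.
toy_ind = pd.DataFrame({
    "Person ID": ["P1", "P2"],
    "Charge": ["CHARGE A", "CHARGE B 000 CMR"],
})
toy_pcd = pd.DataFrame({"Charge": ["CHARGE A"], "Expungeable": ["Yes"]})
toy_addtl = pd.DataFrame({
    "Charge": ["CHARGE B 000 CMR"],
    "Expungeable": ["NotApplicable"],
})

toy_merged = add_expungeability(toy_ind, toy_pcd, toy_addtl)
print(toy_merged)
```

With such a helper, the NW and Suffolk cells would differ only in the input file and the final manual fixes.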

In [11]:
# individual-level data from Suffolk County. This is as raw as possible.

suff_ind = pd.read_csv('../../data/raw/suff.csv', encoding='cp1252') 
suff_ind.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303270 entries, 0 to 303269
Data columns (total 12 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   Person ID                       303270 non-null  object
 1   Court                           303270 non-null  object
 2   Offense Date                    300877 non-null  object
 3   Filed                           303270 non-null  object
 4   Status                          303270 non-null  object
 5   Count                           303270 non-null  int64 
 6   Charge                          303270 non-null  object
 7   Type Crime                      303270 non-null  object
 8   Code Ucc Ctgry                  262955 non-null  object
 9   Disposition                     251710 non-null  object
 10  Description Disposition Reason  212684 non-null  object
 11  Dispo Date                      251300 non-null  object
dtypes: int64(1), object(11)
memory

In [12]:
# flag charges whose text mentions CMR; assigning the column directly avoids a chained in-place fillna
suff_ind['CMRoffense'] = np.where(suff_ind['Charge'].str.contains("CMR"), 'yes', 'no')
#suff_ind.loc[suff_ind['CMRoffense']=='yes']['Charge'].value_counts()

In [13]:
suff_merged = suff_ind.merge(PCD, on='Charge', how='left')
suff_merged = suff_merged.merge(addtl_exp, on='Charge', how='left')

# combine the expungeability information from the two data sets:
# prefer the PCD value (_x) and fall back to the manually filled value (_y)

suff_merged['Expungeable'] = suff_merged['Expungeable_x'].fillna(suff_merged['Expungeable_y'])
suff_merged = suff_merged.drop(columns = ['Expungeable_x', 'Expungeable_y'])
suff_merged['Expungeable'].value_counts(dropna=False)

Yes              215374
No                82003
NotApplicable      2446
m                  1909
NaN                1538
Name: Expungeable, dtype: int64

In [14]:
pd.crosstab(suff_merged['Expungeable'], suff_merged['CMRoffense'])

CMRoffense         no   yes
Expungeable
No              82003     0
NotApplicable       1  2445
Yes            215374     0
m                1909     0


In [15]:
# what is still missing?

suff_merged.loc[suff_merged.Expungeable.isnull()]['Charge'].value_counts(dropna = False)

ATTEMPT TO COMMIT CRIME c274 Â§6         1503
BURGLARY, UNARMED & ASSAULT c266 Â§14      35
Name: Charge, dtype: int64

In [16]:
suff_merged.loc[(suff_merged['Charge'] == "BURGLARY, UNARMED & ASSAULT c266 Â§14"),['Expungeable']] = "No" # not expungeable
suff_merged.loc[(suff_merged['Charge'] == "ATTEMPT TO COMMIT CRIME c274 Â§6"),['Expungeable']] = "Attempt" # need more info 
suff_merged['Expungeable'].value_counts(dropna=False)

Yes              215374
No                82038
NotApplicable      2446
m                  1909
Attempt            1503
Name: Expungeable, dtype: int64

In [17]:
suff_merged.to_csv('../../data/processed/merged_suff.csv', index=False)