# Clean Slate: MA Data
> Prepared by [Dawn Graham](https://github.com/dawngraham) for Code for Boston's [Clean Slate project](https://github.com/codeforboston/clean-slate).

## Purpose
Citizens for Juvenile Justice (CfJJ) received MA prosecution data thanks to the ACLU.

This notebook builds a list of all charges in the Northwestern and Suffolk data, then maps expungement eligibility to each charge.

The resulting datasets are written to separate .csv files that can be saved and shared in the Clean Slate GitHub repo.

-----


## Northwestern and Suffolk Prosecution Data
### Import data

In [1]:
import pandas as pd
import numpy as np
import regex as re  # third-party 'regex' package, used here in place of the standard-library re
import glob, os

In [2]:
nw = pd.read_csv('../data/raw/nw.csv')
nw.head()

Unnamed: 0,Person ID,Offense Date,Age at Offense,Filed,Status,Count,Charge,Disposition,Dispo Date
0,NW-0,2011-09-13,21.0,2014-10-28,Closed,1,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30
1,NW-0,2011-09-13,21.0,2014-10-28,Closed,2,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30
2,NW-0,2011-09-13,21.0,2014-10-28,Closed,3,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30
3,NW-0,2011-09-13,21.0,2014-10-28,Closed,4,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30
4,NW-0,2011-09-13,21.0,2014-10-28,Closed,5,ARSON OF DWELLING HOUSE c266 §1,Not Guilty,2016-03-30


In [3]:
suff = pd.read_csv('../data/raw/suff.csv')
suff.head()

Unnamed: 0,Person ID,Offense Date,Filed,Status,Count,Charge,Type Crime,Code Ucc Ctgry,Disposition,Description Disposition Reason,Dispo Date
0,SF-0,2015-11-04,2016-01-01,CL,1,"DRUG, DISTRIBUTE CLASS A, SUBSQ.OFF. c94C §32(b)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31
1,SF-0,2015-11-04,2016-01-01,CL,2,"COCAINE, DISTRIBUTE, SUBSQ.OFF. c94C §32A(d)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31
2,SF-0,2015-11-04,2016-01-01,CL,3,"DRUG, POSSESS TO DISTRIB CLASS A, SUBSQ. c94C ...",DR,F,Plea,Guilty - Lesser Offense,2017-01-31
3,SF-0,2015-11-04,2016-01-01,CL,4,"POSSESS TO DISTRIBUTE COCAINE, SUBSEQUENT. c94...",DR,F,Plea,Guilty - Lesser Offense,2017-01-31
4,SF-1,2014-10-23,2016-01-01,CL,1,A&B ON +60/DISABLED c265 §13K/F,AS,,Verdict - Jury Trial,Not Guilty,2016-08-02


----
## Get all charges
Count each unique charge in the Northwestern and Suffolk data, then combine the counts into a single table.

In [4]:
# Get counts for unique charges in each dataset
nwcharges = nw['Charge'].value_counts().rename_axis('Charge').reset_index(name='NW Counts')
sfcharges = suff['Charge'].value_counts().rename_axis('Charge').reset_index(name='SF Counts')

In [5]:
# Combine counts
charges = nwcharges.merge(sfcharges, on='Charge', how='outer')

# Fill null values with 0 and convert to integer
charges[['NW Counts', 'SF Counts']] = charges[['NW Counts', 'SF Counts']].fillna(0).astype(int)
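
To make the combine step concrete, here is a minimal sketch on made-up counts. The two toy frames and their charge names are invented for illustration and are not taken from the data. An outer merge keeps charges that appear in only one dataset, which is why the missing side has to be filled with 0.

```python
import pandas as pd

# Toy stand-ins for nwcharges and sfcharges (values invented for illustration)
nw_toy = pd.DataFrame({'Charge': ['CHARGE A', 'CHARGE B'], 'NW Counts': [4, 1]})
sf_toy = pd.DataFrame({'Charge': ['CHARGE B', 'CHARGE C'], 'SF Counts': [7, 2]})

# Outer merge keeps every charge from either side
toy = nw_toy.merge(sf_toy, on='Charge', how='outer')

# Charges missing from one side come back as NaN, so fill with 0 and cast to int
count_cols = ['NW Counts', 'SF Counts']
toy[count_cols] = toy[count_cols].fillna(0).astype(int)

print(toy.sort_values('Charge').reset_index(drop=True))
```

With `how='inner'` instead, only `CHARGE B` would survive, so charges unique to one county would silently drop out of the list.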

In [6]:
# Use regex to create new columns for Charge Description, Chapter, and Section
charges['Description'] = None
charges['Chapter'] = None
charges['Section'] = None

for i in range(len(charges)):
    # re.search returns None when nothing matches, so indexing the result raises TypeError
    try:
        charges.loc[i, 'Description'] = re.search(r'.+?(?=\sc)', charges.iloc[i]['Charge'])[0].upper()
    except TypeError:
        # No whitespace-plus-'c' marker found: keep the whole charge as the description
        charges.loc[i, 'Description'] = charges.iloc[i]['Charge'].upper()

    try:
        charges.loc[i, 'Chapter'] = re.search(r'(?<=[cC])\d.*?(?=[\s§s/S$])|(?<=c\.\s)\d.*?(?=[\s§s/S$])|(?<!)\d.*?(?=[\s§s/S])|(?<=c)\d.*?(?=$)|(?<=\s)\d.*?(?=\sCMR)', charges.iloc[i]['Charge'])[0]
    except TypeError:
        charges.loc[i, 'Chapter'] = None

    try:
        charges.loc[i, 'Section'] = re.search(r'(?<=s\.\s)(\d.*)|(?<=§)(\d.*)|(?<=/)(\d.*)|(?<=s)(\d.*)|(?<=S\s)(\d.*)|(?<=S)(\d.*)', charges.iloc[i]['Charge'])[0]
    except TypeError:
        charges.loc[i, 'Section'] = None
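
One thing to watch with the Description pattern: it stops at the first whitespace followed by a lowercase `c`, which is safe for all-caps charges but could cut a mixed-case description short. The sketch below compares it with a tighter variant that also requires a digit after the `c`. The tighter pattern and the mixed-case charge string are hypothetical, invented here to show the difference; they are not part of the notebook or the data. The standard-library `re` is used so the sketch runs without extra installs.

```python
import re

loose = re.compile(r'.+?(?=\sc)')    # Description pattern used in the loop above
tight = re.compile(r'.+?(?=\sc\d)')  # hypothetical variant: 'c' must be followed by a digit

# All-caps charge from the Northwestern preview: both patterns agree
charge = 'ASSAULT TO MURDER c265 §15'
print(loose.search(charge)[0])  # ASSAULT TO MURDER
print(tight.search(charge)[0])  # ASSAULT TO MURDER

# Hypothetical mixed-case charge, invented to show where the patterns diverge
mixed = 'Possession of cocaine c94C §34'
print(loose.search(mixed)[0])   # Possession of
print(tight.search(mixed)[0])   # Possession of cocaine
```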

In [7]:
# Order by Chapter and Section (both are strings, so the sort is lexicographic: '10' < '101' < '111B')
charges = charges.sort_values(by=['Chapter', 'Section']).reset_index(drop=True)

In [8]:
# View records that still need to have Chapter extracted
charges[charges['Chapter'].isnull() & charges['Section'].notnull()]

Unnamed: 0,Charge,NW Counts,SF Counts,Description,Chapter,Section
1555,UNLAWFUL POSSESSION FIREARM - HABITUAL 269/10(a),1,0,UNLAWFUL POSSESSION FIREARM - HABITUAL 269/10(A),,10(a)
1556,POSSESSION OR CONTROL OF INCENDIARY DEVICE OR ...,6,0,POSSESSION OR CONTROL OF INCENDIARY DEVICE OR ...,,102(c)
1557,PHOTOGRAPH SEXUAL OR INTIMATE PARTS W/OUT CONS...,3,0,PHOTOGRAPH SEXUAL OR INTIMATE PARTS W/OUT CONS...,,105(b)
1558,Assault and Battery (HABITUAL) 265/13A(a),1,0,ASSAULT AND BATTERY (HABITUAL) 265/13A(A),,13A(a)
1559,RECKLESS ENDANGERMENT OF CHILD 265§13L,117,0,RECKLESS ENDANGERMENT OF CHILD 265§13L,,13L
1560,OP MV W/ LICENSE REVOKED-HABITUAL TRAFFIC OFFE...,1,0,OP MV W/ LICENSE REVOKED-HABITUAL TRAFFIC OFFE...,,23
1561,"Ignition Interlock For Another, Bypass 90/24U(...",2,0,"IGNITION INTERLOCK FOR ANOTHER, BYPASS 90/24U(...",,24U(a)(1)
1562,SNOW/REC VEH - REFUSE STOP FOR POLICE 90B/26(c),2,0,SNOW/REC VEH - REFUSE STOP FOR POLICE 90B/26(C),,26(c)
1563,"SNOW/REC VEH - PUBLIC PROPERTY, ON 90B/26(e)",1,0,"SNOW/REC VEH - PUBLIC PROPERTY, ON 90B/26(E)",,26(e)
1564,TRAFFICKING COCAINE SECOND OR SUBSEQUENT OFFEN...,0,1,TRAFFICKING COCAINE SECOND OR SUBSEQUENT OFFEN...,,32(E)(a)
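
Several of the rows above write the citation as `chapter/section` or `chapter§section` with no leading `c`. A minimal sketch of a fallback for just those rows is shown below. The `FALLBACK` pattern and the `split_chapter_section` helper are hypothetical names introduced for illustration, and the pattern assumes the chapter is digits plus an optional capital letter (as in `90B`). It has only been checked against the three untruncated charges listed here.

```python
import re

# Hypothetical fallback pattern: chapter (digits plus optional capital letter),
# then '/' or '§', then the section running to the end of the string
FALLBACK = re.compile(r'(\d+[A-Z]?)\s*[/§]\s*(\d.*)$')

def split_chapter_section(charge):
    """Return (chapter, section), or (None, None) if the pattern does not match."""
    match = FALLBACK.search(charge)
    return match.groups() if match else (None, None)

samples = [
    'UNLAWFUL POSSESSION FIREARM - HABITUAL 269/10(a)',
    'RECKLESS ENDANGERMENT OF CHILD 265§13L',
    'SNOW/REC VEH - REFUSE STOP FOR POLICE 90B/26(c)',
]
for charge in samples:
    print(split_chapter_section(charge))
# ('269', '10(a)')
# ('265', '13L')
# ('90B', '26(c)')
```

Applying it only where `Chapter` is null and `Section` is not would leave the rows that already parsed correctly untouched.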


### Preview

In [9]:
charges.head(10)

Unnamed: 0,Charge,NW Counts,SF Counts,Description,Chapter,Section
0,"LOTTERY TICKET, UTTER OR PASS FALSE c10 §30",4,0,"LOTTERY TICKET, UTTER OR PASS FALSE",10,30
1,PEDDLING WITHOUT A LICENSE c101 §14,0,7,PEDDLING WITHOUT A LICENSE,101,14
2,PEDDLING VIOLATION c101 §14,0,3,PEDDLING VIOLATION,101,14
3,"BOAT, TRESPASS ON c102 §1A",0,2,"BOAT, TRESPASS ON",102,1A
4,AIR POLLUTION ORDER VIOL c111 §142A,1,0,AIR POLLUTION ORDER VIOL,111,142A
5,"INSPECTION CERTIFICATE, IMPROPER MV c111 §142M",0,2,"INSPECTION CERTIFICATE, IMPROPER MV",111,142M
6,TRASH TREATMENT FACILITY REGULATION VIOL c111 ...,4,0,TRASH TREATMENT FACILITY REGULATION VIOL,111,150A
7,"ALCOHOL DETOX PROG, UNLICENSED/DENY INSPECTION...",0,1,"ALCOHOL DETOX PROG, UNLICENSED/DENY INSPECTION...",111B,6
8,"PROFESSIONAL LIC SUSPENDED,PRACTICE WITH c112 §65",0,2,"PROFESSIONAL LIC SUSPENDED,PRACTICE WITH",112,65
9,"NURSING, UNAUTH PRACTICE OF PRACTICAL c112 §80A",6,0,"NURSING, UNAUTH PRACTICE OF PRACTICAL",112,80A


----

## Map expungement eligibility

### Import data from Master List
This is the data from the `Added FBI Cat. and Expunge` tab of the **Master Crime List offense with Expunge categories** spreadsheet provided by CfJJ.

In [10]:
expunge = pd.read_csv('../data/raw/ExpungeCategories.csv')

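# Get only needed columns (copy so the assignment below does not write into a slice)
expunge = expunge[['Untruncated Offense', 'Expungeable?']].copy()

# Standardize values for 'no': trim whitespace, then map 'NO' to 'No'
expunge['Expungeable?'] = expunge['Expungeable?'].str.strip().replace('NO', 'No')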

In [11]:
# Use regex to create new columns for Charge Description, Chapter, and Section
expunge['Description'] = None
expunge['Chapter'] = None
expunge['Section'] = None

for i in range(len(expunge)):
    # re.search returns None when nothing matches, so indexing the result raises TypeError
    try:
        expunge.loc[i, 'Description'] = re.search(r'.+?(?=\sCh.\s)', expunge.iloc[i]['Untruncated Offense'])[0].upper()
    except TypeError:
        # No 'Ch.' marker found: keep the whole offense text as the description
        expunge.loc[i, 'Description'] = expunge.iloc[i]['Untruncated Offense'].upper()

    try:
        expunge.loc[i, 'Chapter'] = re.search(r'(?<=Ch.\s)\d.*?(?=\sS)', expunge.iloc[i]['Untruncated Offense'])[0]
    except TypeError:
        expunge.loc[i, 'Chapter'] = None

    try:
        expunge.loc[i, 'Section'] = re.search(r'(?<=\sS\s)(\d.*)', expunge.iloc[i]['Untruncated Offense'])[0]
    except TypeError:
        expunge.loc[i, 'Section'] = None
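
For comparison, the same split can be sketched without a row-by-row loop using `Series.str.extract`. The sample strings below are hypothetical: they assume the `<description> Ch. <chapter> S <section>` layout that the patterns above imply, and they are not copied from the Master List. The single pattern is also a simplification (it does not allow whitespace inside the chapter), so treat it as a sketch rather than a drop-in replacement.

```python
import pandas as pd

# Hypothetical rows in the 'Ch. <chapter> S <section>' layout the loop above expects
toy = pd.DataFrame({'Untruncated Offense': [
    'ASSAULT TO MURDER Ch. 265 S 15',
    'ARSON OF DWELLING HOUSE Ch. 266 S 1',
    'OFFENSE WITH NO CITATION',
]})

# Named groups become column names; rows that do not match get NaN in every column
pattern = r'^(?P<Description>.+?)\sCh\.\s(?P<Chapter>\d\S*)\sS\s(?P<Section>\d.*)$'
parts = toy['Untruncated Offense'].str.extract(pattern)

print(parts)
```

Unlike the loop, unmatched rows end up with a missing Description rather than the full offense text, so a follow-up `fillna` would be needed to reproduce that fallback.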

In [12]:
# Merge with charges
expungeable = charges.merge(expunge, on=['Chapter', 'Section'], how='left')

# Get only records with a value for Chapter or Section
expungeable = expungeable[expungeable['Chapter'].notnull() | expungeable['Section'].notnull()].drop_duplicates()

# Preview
expungeable.head()

Unnamed: 0,Charge,NW Counts,SF Counts,Description_x,Chapter,Section,Untruncated Offense,Expungeable?,Description_y
0,"LOTTERY TICKET, UTTER OR PASS FALSE c10 §30",4,0,"LOTTERY TICKET, UTTER OR PASS FALSE",10,30,,,
1,PEDDLING WITHOUT A LICENSE c101 §14,0,7,PEDDLING WITHOUT A LICENSE,101,14,,,
2,PEDDLING VIOLATION c101 §14,0,3,PEDDLING VIOLATION,101,14,,,
3,"BOAT, TRESPASS ON c102 §1A",0,2,"BOAT, TRESPASS ON",102,1A,,,
4,AIR POLLUTION ORDER VIOL c111 §142A,1,0,AIR POLLUTION ORDER VIOL,111,142A,,,


In [13]:
# Append Expungeable? to charges
charges = charges.merge(expungeable[['Charge', 'Expungeable?']].drop_duplicates(), on='Charge', how='left')

# Check counts
charges['Expungeable?'].value_counts(dropna=False)

NaN    915
Yes    492
No     284
Name: Expungeable?, dtype: int64
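
To see which charges account for the `NaN` bucket, one option is pandas' `indicator=True` merge flag, which records whether each left-hand row found a match. The sketch below uses two small toy frames. The charge strings come from the previews above, but the frames themselves are assumptions built for illustration.

```python
import pandas as pd

# Toy frames; the keys mirror the Chapter/Section columns built above
charges_toy = pd.DataFrame({
    'Charge': ['ASSAULT TO MURDER c265 §15', 'PEDDLING VIOLATION c101 §14'],
    'Chapter': ['265', '101'],
    'Section': ['15', '14'],
})
expunge_toy = pd.DataFrame({
    'Chapter': ['265'],
    'Section': ['15'],
    'Expungeable?': ['No'],
})

# indicator=True adds a '_merge' column: 'both' for matches, 'left_only' for misses
checked = charges_toy.merge(expunge_toy, on=['Chapter', 'Section'],
                            how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Charge']

print(unmatched.tolist())
```

Listing the `left_only` charges this way gives a concrete worklist for comparing against the Master List.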

In [14]:
# Preview
charges.head()

Unnamed: 0,Charge,NW Counts,SF Counts,Description,Chapter,Section,Expungeable?
0,"LOTTERY TICKET, UTTER OR PASS FALSE c10 §30",4,0,"LOTTERY TICKET, UTTER OR PASS FALSE",10,30,
1,PEDDLING WITHOUT A LICENSE c101 §14,0,7,PEDDLING WITHOUT A LICENSE,101,14,
2,PEDDLING VIOLATION c101 §14,0,3,PEDDLING VIOLATION,101,14,
3,"BOAT, TRESPASS ON c102 §1A",0,2,"BOAT, TRESPASS ON",102,1A,
4,AIR POLLUTION ORDER VIOL c111 §142A,1,0,AIR POLLUTION ORDER VIOL,111,142A,


### Map expungement eligibility to Northwestern data

In [15]:
nw = nw.merge(charges[['Charge', 'Expungeable?']], on='Charge', how='left')

# Preview
nw.head()

Unnamed: 0,Person ID,Offense Date,Age at Offense,Filed,Status,Count,Charge,Disposition,Dispo Date,Expungeable?
0,NW-0,2011-09-13,21.0,2014-10-28,Closed,1,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30,No
1,NW-0,2011-09-13,21.0,2014-10-28,Closed,2,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30,No
2,NW-0,2011-09-13,21.0,2014-10-28,Closed,3,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30,No
3,NW-0,2011-09-13,21.0,2014-10-28,Closed,4,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30,No
4,NW-0,2011-09-13,21.0,2014-10-28,Closed,5,ARSON OF DWELLING HOUSE c266 §1,Not Guilty,2016-03-30,Yes


In [16]:
# Check counts
nw['Expungeable?'].value_counts(dropna=False)

Yes    41702
NaN    20770
No     15793
Name: Expungeable?, dtype: int64

### Map expungement eligibility to Suffolk data

In [17]:
suff = suff.merge(charges[['Charge', 'Expungeable?']], on='Charge', how='left')

# Preview
suff.head()

Unnamed: 0,Person ID,Offense Date,Filed,Status,Count,Charge,Type Crime,Code Ucc Ctgry,Disposition,Description Disposition Reason,Dispo Date,Expungeable?
0,SF-0,2015-11-04,2016-01-01,CL,1,"DRUG, DISTRIBUTE CLASS A, SUBSQ.OFF. c94C §32(b)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes
1,SF-0,2015-11-04,2016-01-01,CL,2,"COCAINE, DISTRIBUTE, SUBSQ.OFF. c94C §32A(d)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes
2,SF-0,2015-11-04,2016-01-01,CL,3,"DRUG, POSSESS TO DISTRIB CLASS A, SUBSQ. c94C ...",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes
3,SF-0,2015-11-04,2016-01-01,CL,4,"POSSESS TO DISTRIBUTE COCAINE, SUBSEQUENT. c94...",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes
4,SF-1,2014-10-23,2016-01-01,CL,1,A&B ON +60/DISABLED c265 §13K/F,AS,,Verdict - Jury Trial,Not Guilty,2016-08-02,


In [18]:
# Check counts
suff['Expungeable?'].value_counts(dropna=False)

Yes    176762
No      70339
NaN     69047
Name: Expungeable?, dtype: int64
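
As a rough way to size the gap, the counts can be turned into shares. The sketch below rebuilds the Suffolk counts printed above as a plain Series; in the notebook itself, `value_counts(normalize=True, dropna=False)` on the column should give the same proportions directly.

```python
import pandas as pd

# Suffolk counts copied from the output above
counts = pd.Series({'Yes': 176762, 'No': 70339, 'NaN': 69047})

# Share of rows in each bucket, rounded to three decimals
shares = (counts / counts.sum()).round(3)

print(shares)
```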

-----
## Export new data files
These will be saved in the `clean-slate/data/processed/` folder.

In [19]:
nw.to_csv('../data/processed/prosecution_northwestern.csv', index=False)
suff.to_csv('../data/processed/prosecution_suffolk.csv', index=False)
charges.to_csv('../data/processed/prosecution_charges.csv', index=False)
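
`to_csv` does not create missing folders, so on a fresh clone the export can fail if `data/processed/` is absent. A minimal sketch of guarding against that with `pathlib` follows. It writes a one-row stand-in frame to a temporary directory (an assumption made so the example leaves no files behind); in the notebook the target would be `../data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frame; in the notebook this would be nw, suff, or charges
toy = pd.DataFrame({'Charge': ['PEDDLING VIOLATION c101 §14'], 'SF Counts': [3]})

# A temporary directory stands in for ../data/processed/ so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'data' / 'processed'
    out_dir.mkdir(parents=True, exist_ok=True)  # creates missing parents, no error if present
    out_path = out_dir / 'prosecution_charges.csv'
    toy.to_csv(out_path, index=False)
    written = out_path.read_text(encoding='utf-8')

print(written)
```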

-----
## Considerations & Possible Next Steps
- This notebook does not include a closer look at, cleaning of, or exploratory analysis of the data.
- Date issues in the Northwestern dataset should be addressed before answering age-related questions. A count of cases where age < 21 is not reliable as-is, because it would include rows whose 'Age at Offense' reflects an invalid Date of Birth, Offense Date, or both. My recommendation is to exclude rows where 'Age at Offense' is < 1 before getting counts where age < 21.
- Suffolk data does not include any indicators of age.
- See the `Get all charges` section above. Some charges still need to have the value for Chapter extracted.
- `Expungeable?` is still missing for many rows in both outputs (20,770 of the 78,265 Northwestern rows and 69,047 of the 316,148 Suffolk rows, per the counts above). These will require a closer look and comparison with the Master List data.
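
The age recommendation above can be sketched as follows. The toy rows are invented to include one negative and one missing age; only the `Age at Offense` column name comes from the Northwestern data.

```python
import pandas as pd

# Toy rows; the ages are invented so that invalid values are present
toy = pd.DataFrame({
    'Person ID': ['NW-0', 'NW-1', 'NW-2', 'NW-3'],
    'Age at Offense': [21.0, 17.0, -3.0, None],
})

# Keep only plausible ages; missing values compare False, so they drop out too
valid = toy[toy['Age at Offense'] >= 1]

# Count the remaining rows where age is under 21
under_21 = (valid['Age at Offense'] < 21).sum()

print(under_21)
```

Filtering with `>= 1` rather than dropping `< 1` is a deliberate choice here: it also removes rows where the age is missing, which would otherwise need separate handling.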