# Clean Slate: MA Data
> Prepared by [Dawn Graham](https://github.com/dawngraham) for Code for Boston's [Clean Slate project](https://github.com/codeforboston/clean-slate).

## Purpose
Citizens for Juvenile Justice received MA prosecution data thanks to the ACLU.

The purpose of this notebook is to anonymize the data and provide only fields required to answer the following questions (updated 6/28/20):

1. How many people (under age 21) are eligible for expungement today? This would be people with only **one charge** that is not part of the list of ineligible offenses (per section 100J). 


2. How many people (under age 21) would be eligible based on only having **one incident** (which could include multiple charges), where none of the charges is on the list of ineligible offenses?
 - How many people (under age 21) would be eligible based on only having **one incident** if only sex-based offenses or murder were excluded from expungement?
 

3. How many people (under age 21) would be eligible based on who has **not been found guilty** (given current offenses that are eligible for expungement)?
 - How many people (under age 21) would be eligible based on who has **not been found guilty** for all offenses except for murder or sex-based offenses?


Prior questions:
- How many people (under age 21) are ineligible because they have a charge that is on the list of ineligible offenses (per section 100J)?
- How many people (under age 21) have only one offense on their record?
- How many people have only expungeable offenses on their record but have more than one of them?

The resulting datasets will be output as separate .csv files that can be saved and shared in the Clean Slate GitHub repo.

-----


## Northwestern DA Prosecution Data
### Import data

In [1]:
import pandas as pd
import numpy as np
import regex as re
import glob, os

In [2]:
nw2014 = pd.read_csv('../data/Prosecution Northwestern DA 2014-2018 RAW DATA - 2014 to early 2018.csv')
nw2018 = pd.read_csv('../data/Prosecution Northwestern DA 2014-2018 RAW DATA - to end of 2018.csv')

# Combine the two datasets
nw = pd.concat([nw2014, nw2018])



### Quick look at data

In [3]:
# Get names, # of non-null values, and the data types for each column
nw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75725 entries, 0 to 10429
Data columns (total 40 columns):
Filed:                                    75725 non-null object
Age at Case Filing                        75725 non-null int64
Status:                                   75725 non-null object
Court:                                    75725 non-null object
Date of Birth:                            75595 non-null object
Gender:                                   74003 non-null object
Race:                                     68741 non-null object
Ethnicity:                                21321 non-null object
Count:                                    75725 non-null int64
Charge:                                   75725 non-null object
Disposition:                              72259 non-null object
Dispo Date:                               71900 non-null object
Offense Date:                             74944 non-null object
Bail Requested                            60 non-null object


In [4]:
# Get the # of null values in each column
nw.isnull().sum()

Filed:                                         0
Age at Case Filing                             0
Status:                                        0
Court:                                         0
Date of Birth:                               130
Gender:                                     1722
Race:                                       6984
Ethnicity:                                 54404
Count:                                         0
Charge:                                        0
Disposition:                                3466
Dispo Date:                                 3825
Offense Date:                                781
Bail Requested                             75665
Bail Set                                   74746
Imposed Facility:                          68952
Sentence Type:                             69467
Sentence Date:                             70059
Term Min:                                  73191
Term Min Unit:                             73191
Term Max:           

### Keep only desired columns
Many of the columns consist primarily of null values and do not provide info relevant to our questions, so they are not useful for this analysis. We will keep only the columns with relevant info.

In [5]:
# Save dataframe with only columns we want to keep
columns = ['Date of Birth:', 'Offense Date:', 'Filed:', 'Age at Case Filing', 'Status:',
       'Gender:', 'Race:', 'Ethnicity:', 'Count:', 'Charge:', 'Disposition:',
       'Dispo Date:', 'Sentence Type:', 'Sentence Date:']

nw = nw[columns]

# Remove ":" from end of column names
nw.columns = nw.columns.str.rstrip(':')

In [6]:
# Convert dates to datetime values
nw['Filed'] = pd.to_datetime(nw['Filed'], errors='coerce').dt.date
nw['Date of Birth'] = pd.to_datetime(nw['Date of Birth'], errors='coerce').dt.date
nw['Dispo Date'] = pd.to_datetime(nw['Dispo Date'], errors='coerce').dt.date
nw['Offense Date'] = pd.to_datetime(nw['Offense Date'], errors='coerce').dt.date
nw['Sentence Date'] = pd.to_datetime(nw['Sentence Date'], errors='coerce').dt.date
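
Because `errors='coerce'` silently turns unparseable values into missing values, it can help to count how many were lost. The following is a minimal sketch on made-up sample values (the `raw` series is hypothetical, not taken from the DA files):

```python
import pandas as pd

# Hypothetical sample values; the real columns come from the DA files
raw = pd.Series(['2014-01-23', '2018-01-19', 'not a date', None])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors='coerce')

# Values that were present in the raw data but could not be parsed
newly_missing = int((raw.notna() & parsed.isna()).sum())
print(newly_missing)
```

Running the same comparison on each date column before overwriting it would show whether the conversion dropped any non-null values.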

### Get Age at Offense Date
Because 'Date of Birth' and 'Offense Date' have some null values, this will not provide an age for all records.

In [7]:
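# Subtract "Date of Birth" from "Offense Date" to get "Age at Offense" in whole years.
# 365.2425 days is the average year length that pd.Timedelta(1, 'Y') represented;
# recent pandas versions no longer accept the 'Y' unit.
nw['Age at Offense'] = (nw['Offense Date'] - nw['Date of Birth'])/pd.Timedelta(days=365.2425)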
nw['Age at Offense'] = np.floor(nw['Age at Offense']).astype('Int64')
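
Dividing by an average year length can be off by one on dates close to a birthday. As an alternative, here is a small sketch of a calendar-based age calculation. This is a swapped-in technique, not what the notebook uses, and `age_on` is a hypothetical helper name:

```python
from datetime import date

import pandas as pd


def age_on(birth, when):
    """Whole years between `birth` and `when`; None if either is missing."""
    if pd.isna(birth) or pd.isna(when):
        return None
    # Subtract one year if the birthday has not yet occurred in `when`'s year
    before_birthday = (when.month, when.day) < (birth.month, birth.day)
    return when.year - birth.year - int(before_birthday)


# The day before a 21st birthday still counts as age 20
print(age_on(date(1992, 3, 8), date(2013, 3, 7)))
```

It could be applied row by row, for example with a list comprehension over `zip(nw['Date of Birth'], nw['Offense Date'])`, at the cost of being slower than the vectorized version.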

#### Date Issues
186 records have invalid values for Date of Birth, Offense Date, or both.

In [8]:
dateissues = nw.loc[nw['Age at Offense'] < 1,['Date of Birth', 'Offense Date', 'Age at Offense',
                                              'Age at Case Filing', 'Filed']]

In [9]:
dateissues.sort_values(by='Age at Offense').head()

Unnamed: 0,Date of Birth,Offense Date,Age at Offense,Age at Case Filing,Filed
61884,1992-03-08,1750-12-03,-242,26,2018-01-19
61883,1992-03-08,1750-12-03,-242,26,2018-01-19
845,1979-12-08,1882-11-30,-98,34,2014-01-23
846,1979-12-08,1882-11-30,-98,34,2014-01-23
847,1979-12-08,1882-11-30,-98,34,2014-01-23


### Create unique ID
We will create a "Person ID" for each distinct combination of Date of Birth, Gender, Race, and Ethnicity. This is an approximation: one person could appear under more than one combination (for example, if Race was recorded differently at different times), and more than one person could share the same combination. It is still close enough to let us link multiple records belonging to one person.

In [10]:
# Create a unique ID for each distinct combination
identifiers = ['Date of Birth', 'Gender', 'Race', 'Ethnicity']

unique = nw[identifiers].drop_duplicates().reset_index(drop=True)
unique['Person ID'] = unique.index

# Add a prefix to help distinguish between datasets
unique['Person ID'] = 'NW-' + unique['Person ID'].astype(str)

# Add unique ID to dataset
nw = nw.merge(unique, on=identifiers)
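
To illustrate the linking step, here is a minimal sketch on a made-up three-row frame (all values are hypothetical). It follows the same drop-duplicates-then-merge pattern and adds `validate='many_to_one'`, which is an optional safeguard and not part of the original cell:

```python
import pandas as pd

# Hypothetical records: the first and third rows share all four identifiers
records = pd.DataFrame({
    'Date of Birth': ['1990-01-01', '1985-05-05', '1990-01-01'],
    'Gender': ['M', 'F', 'M'],
    'Race': ['W', 'B', 'W'],
    'Ethnicity': ['N', 'N', 'N'],
    'Charge': ['charge a', 'charge b', 'charge c'],
})
identifiers = ['Date of Birth', 'Gender', 'Race', 'Ethnicity']

# One row per distinct combination, numbered by its position
unique = records[identifiers].drop_duplicates().reset_index(drop=True)
unique['Person ID'] = unique.index
unique['Person ID'] = 'NW-' + unique['Person ID'].astype(str)

# validate='many_to_one' raises if the lookup table has duplicate keys
linked = records.merge(unique, on=identifiers, validate='many_to_one')
print(linked[['Person ID', 'Charge']])
```

The two rows with identical identifiers receive the same ID, and the merge leaves the number of rows unchanged.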

In [11]:
# Save dataset without identifiers
nw = nw[['Person ID', 'Offense Date', 'Age at Offense', 'Filed', 'Status',
         'Count', 'Charge', 'Disposition', 'Dispo Date']]

### Preview

In [12]:
nw.head()

Unnamed: 0,Person ID,Offense Date,Age at Offense,Filed,Status,Count,Charge,Disposition,Dispo Date
0,NW-0,2011-09-13,21,2014-10-28,Closed,1,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30
1,NW-0,2011-09-13,21,2014-10-28,Closed,2,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30
2,NW-0,2011-09-13,21,2014-10-28,Closed,3,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30
3,NW-0,2011-09-13,21,2014-10-28,Closed,4,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30
4,NW-0,2011-09-13,21,2014-10-28,Closed,5,ARSON OF DWELLING HOUSE c266 §1,Not Guilty,2016-03-30


-------
## Suffolk County Prosecution Data 2013 - 2018

In [13]:
# Concatenate all Suffolk County files
suff = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', '../data/Suffolk County Prosecution*.csv'))), sort=False)



### Quick look at the data

In [14]:
# Get names, # of non-null values, and the data types for each column
suff.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 303270 entries, 0 to 53717
Data columns (total 42 columns):
Case #                            303270 non-null object
Defendant #                       303270 non-null int64
Count #                           303270 non-null int64
Description Charge                303270 non-null object
Type Crime                        303270 non-null object
Code Ucc Ctgry                    262955 non-null object
Date Crime                        301015 non-null object
Date Filed                        303270 non-null object
Date Dspstn                       251489 non-null object
Description Dspstn                251710 non-null object
Code Dspstn                       251710 non-null object
Description Disposition Reason    181924 non-null object
Type Chrg Orgn                    205 non-null object
SC Docket                         0 non-null float64
DC Docket                         0 non-null float64
Race/Eth                          259048 non-nul

In [15]:
# Get the # of null values in each column
suff.isnull().sum()

Case #                                 0
Defendant #                            0
Count #                                0
Description Charge                     0
Type Crime                             0
Code Ucc Ctgry                     40315
Date Crime                          2255
Date Filed                             0
Date Dspstn                        51781
Description Dspstn                 51560
Code Dspstn                        51560
Description Disposition Reason    121346
Type Chrg Orgn                    303065
SC Docket                         303270
DC Docket                         303270
Race/Eth                           44222
Gender                              3051
Code Ofcr Agncy                    27656
Court                                  0
Judge                              72790
Nmbr Day Jail Rcmnd                    0
Nmbr Day Jl Rcmnd Unt              17525
Nmbr Day Jl Rcmnd Min                  0
Nmbr Day Jl Rcmnd Min Unt          17510
Nmbr Day Jl Imps

### Combine 'Description Disposition Reason' and 'Description Reason'
These contain the same information - likely due to one file having a different field name.

In [16]:
suff['Description Disposition Reason'].unique()

array(['Guilty - Lesser Offense', 'Not Guilty', nan, 'Guilty - Committed',
       'Guilty - Probation', 'DWOP - no evidence', 'DWOP - no victim',
       'Dismissed by Commonwealth', 'Dismissed to Subsequent Indictment',
       'Guilty - Filed', 'Dismissed for Agreed Plea',
       'Dismissed by Court', 'Guilty', 'DWOP - no witness',
       'Guilty - Split Sentence', 'Dismissed for Community Service',
       'CWF (ASF)', 'Dismissed WO Prejudice', 'Guilty - Fine',
       'Dismissed WO Prosecution', 'Guilty - Suspended Sentence',
       'defense motion for dismissal allowed',
       'Dismissed Upon Payment of Court Costs',
       'Dismissed Prior to Arraignment', 'Delinquent - Committed',
       'Duplicate Charge', 'DWOP - no police', 'No Probable Cause',
       'Directed Verdict Not Guilty', 'Dismissed Defendant Deceased',
       'Pretrial Probation', 'Remanded To Clerks Hearing',
       'motion to suppress evidence allowed',
       'Responsible - Fine C277S70', 'Dismissed Transfer to BJC

In [17]:
suff['Description Reason'].unique()

array([nan, 'Dismissed Prior to Arraignment', 'DWOP - no victim',
       'Dismissed Upon Payment of Court Costs', 'CWF (ASF)',
       'Dismissed for Agreed Plea', 'Dismissed WO Prosecution',
       'Guilty - Probation', 'Dismissed by Commonwealth',
       'Guilty - Filed', 'Dismissed by Court',
       'Dismissed to Subsequent Indictment',
       'defense motion for dismissal allowed',
       'Dismissed for Community Service', 'Guilty - Committed',
       'Dismissed WO Prejudice', 'Guilty - Suspended Sentence', 'Guilty',
       'DWOP - no evidence', 'Not Guilty', 'DWOP - no witness',
       'Guilty - Split Sentence', 'Guilty - Fine', 'DWOP - no police',
       'Dismissed for Time Served in Custody', 'Guilty - Lesser Offense',
       'No Probable Cause', 'Dismissed Defendant Deceased',
       'Duplicate Charge', 'Delinquent - Committed', 'Pretrial Probation',
       'motion to suppress evidence allowed',
       'Directed Verdict Not Guilty', 'Lack of Jurisdiction',
       'Delinquent - P

In [18]:
# Fill missing values in 'Description Disposition Reason' with values from 'Description Reason'
suff['Description Disposition Reason'] = suff['Description Disposition Reason'].fillna(suff['Description Reason'])
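
Before combining, it may be worth confirming that the two columns never disagree on rows where both are filled in. A minimal sketch with hypothetical stand-in values (`primary` and `secondary` are illustrative names, not columns from the data):

```python
import pandas as pd

# Hypothetical values standing in for the two Suffolk columns
primary = pd.Series(['Guilty', None, 'Not Guilty'])
secondary = pd.Series([None, 'Guilty - Fine', 'Not Guilty'])

# Rows where both columns are filled in but hold different values
both = primary.notna() & secondary.notna()
conflicts = int((both & (primary != secondary)).sum())
print(conflicts)

# fillna only touches rows where the primary column is missing
combined = primary.fillna(secondary)
print(combined.tolist())
```

A conflict count of zero would support the assumption that the two fields carry the same information.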

### Create Unique ID
This dataset already has 'Defendant #', which is a unique ID for each person. To help protect privacy, we will generate a new one to use.

In [19]:
# Create a unique ID for each distinct combination
identifiers = ['Defendant #']

unique = suff[identifiers].drop_duplicates().reset_index(drop=True)
unique['Person ID'] = unique.index

# Add a prefix to help distinguish between datasets
unique['Person ID'] = 'SF-' + unique['Person ID'].astype(str)

# Add unique ID to dataset
suff = suff.merge(unique, on=identifiers)

### Keep only desired columns
As with the Northwestern data, many of the columns consist primarily of null values and do not provide info relevant to our questions. We will keep only the columns with relevant info.

In [20]:
# Save dataframe with only columns we want to keep
columns = ['Person ID', 'Date Crime', 'Date Filed', 'Status', 'Count #', 'Description Charge',
           'Type Crime', 'Code Ucc Ctgry', 'Description Dspstn', 'Description Disposition Reason', 'Date Dspstn']

suff = suff[columns]

In [21]:
# Rename columns to match names used in nw dataset
newnames={'Date Crime': 'Offense Date',
          'Date Filed': 'Filed',
          'Count #': 'Count',
          'Description Charge': 'Charge',
          'Description Dspstn': 'Disposition',
          'Date Dspstn': 'Dispo Date'
         }

suff = suff.rename(columns=newnames)

In [22]:
# Convert dates to datetime values
suff['Offense Date'] = pd.to_datetime(suff['Offense Date'], errors='coerce').dt.date
suff['Filed'] = pd.to_datetime(suff['Filed'], errors='coerce').dt.date
suff['Dispo Date'] = pd.to_datetime(suff['Dispo Date'], errors='coerce').dt.date

### Preview

In [23]:
suff.head()

Unnamed: 0,Person ID,Offense Date,Filed,Status,Count,Charge,Type Crime,Code Ucc Ctgry,Disposition,Description Disposition Reason,Dispo Date
0,SF-0,2015-11-04,2016-01-01,CL,1,"DRUG, DISTRIBUTE CLASS A, SUBSQ.OFF. c94C §32(b)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31
1,SF-0,2015-11-04,2016-01-01,CL,2,"COCAINE, DISTRIBUTE, SUBSQ.OFF. c94C §32A(d)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31
2,SF-0,2015-11-04,2016-01-01,CL,3,"DRUG, POSSESS TO DISTRIB CLASS A, SUBSQ. c94C ...",DR,F,Plea,Guilty - Lesser Offense,2017-01-31
3,SF-0,2015-11-04,2016-01-01,CL,4,"POSSESS TO DISTRIBUTE COCAINE, SUBSEQUENT. c94...",DR,F,Plea,Guilty - Lesser Offense,2017-01-31
4,SF-1,2014-10-23,2016-01-01,CL,1,A&B ON +60/DISABLED c265 §13K/F,AS,,Verdict - Jury Trial,Not Guilty,2016-08-02


----
## Get all charges
Show all unique charges for both Northwestern and Suffolk data.

In [24]:
# Get counts for unique charges in each dataset
nwcharges = nw['Charge'].value_counts().rename_axis('Charge').reset_index(name='NW Counts')
sfcharges = suff['Charge'].value_counts().rename_axis('Charge').reset_index(name='SF Counts')

In [25]:
# Combine counts
charges = nwcharges.merge(sfcharges, on='Charge', how='outer')

# Fill null values with 0 and convert to integer
charges[['NW Counts', 'SF Counts']] = charges[['NW Counts', 'SF Counts']].fillna(0).astype(int)

In [26]:
# Use regex to create new columns for Charge Description, Chapter, and Section
charges['Description'] = None
charges['Chapter'] = None
charges['Section'] = None

# re.search returns None when nothing matches, so indexing with [0] raises TypeError.
# Raw strings (r'...') keep backslashes such as \s and \d intact for the regex engine.
for i in range(len(charges)):
    try:
        # Description: everything before the first whitespace followed by "c"
        charges.loc[i, 'Description'] = re.search(r'.+?(?=\sc)', charges.iloc[i]['Charge'])[0].upper()
    except TypeError:
        charges.loc[i, 'Description'] = charges.iloc[i]['Charge'].upper()
        
    try:
        charges.loc[i, 'Chapter'] = re.search(r'(?<=[cC])\d.*?(?=[\s§s/S$])|(?<=c\.\s)\d.*?(?=[\s§s/S$])|(?<!)\d.*?(?=[\s§s/S])|(?<=c)\d.*?(?=$)|(?<=\s)\d.*?(?=\sCMR)', charges.iloc[i]['Charge'])[0]
    except TypeError:
        charges.loc[i, 'Chapter'] = None
        
    try:
        charges.loc[i, 'Section'] = re.search(r'(?<=s\.\s)(\d.*)|(?<=§)(\d.*)|(?<=/)(\d.*)|(?<=s)(\d.*)|(?<=S\s)(\d.*)|(?<=S)(\d.*)', charges.iloc[i]['Charge'])[0]
    except TypeError:
        charges.loc[i, 'Section'] = None

In [27]:
# Order by Chapter and Section
charges = charges.sort_values(by=['Chapter', 'Section']).reset_index(drop=True)
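
For the most common charge format (`DESCRIPTION c<chapter> §<section>`), the row-by-row extraction could also be written as a single vectorized call. The sketch below is an alternative under that assumption, not the notebook's method. The pattern is illustrative and covers only that one format, and `Series.str.extract` uses the standard library `re` engine, not the `regex` package imported above:

```python
import pandas as pd

sample = pd.Series([
    'ASSAULT TO MURDER c265 §15',
    'BOAT, TRESPASS ON c102 §1A',
    'RECKLESS ENDANGERMENT OF CHILD 265§13L',  # no "c" prefix: not matched
])

# Named groups become the column names of the result
pattern = r'^(?P<Description>.+?)\s+c(?P<Chapter>\d+[A-Z]?)\s+§(?P<Section>\S+)'
parts = sample.str.extract(pattern)
parts['Description'] = parts['Description'].str.upper()
print(parts)
```

Rows that do not fit the pattern come back as missing values, so they remain easy to find for a second pass.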

In [28]:
# View records that still need to have Chapter extracted
charges[charges['Chapter'].isnull() & charges['Section'].notnull()]

Unnamed: 0,Charge,NW Counts,SF Counts,Description,Chapter,Section
1555,UNLAWFUL POSSESSION FIREARM - HABITUAL 269/10(a),1,0,UNLAWFUL POSSESSION FIREARM - HABITUAL 269/10(A),,10(a)
1556,POSSESSION OR CONTROL OF INCENDIARY DEVICE OR ...,6,0,POSSESSION OR CONTROL OF INCENDIARY DEVICE OR ...,,102(c)
1557,PHOTOGRAPH SEXUAL OR INTIMATE PARTS W/OUT CONS...,3,0,PHOTOGRAPH SEXUAL OR INTIMATE PARTS W/OUT CONS...,,105(b)
1558,Assault and Battery (HABITUAL) 265/13A(a),1,0,ASSAULT AND BATTERY (HABITUAL) 265/13A(A),,13A(a)
1559,RECKLESS ENDANGERMENT OF CHILD 265§13L,117,0,RECKLESS ENDANGERMENT OF CHILD 265§13L,,13L
1560,OP MV W/ LICENSE REVOKED-HABITUAL TRAFFIC OFFE...,1,0,OP MV W/ LICENSE REVOKED-HABITUAL TRAFFIC OFFE...,,23
1561,"Ignition Interlock For Another, Bypass 90/24U(...",2,0,"IGNITION INTERLOCK FOR ANOTHER, BYPASS 90/24U(...",,24U(a)(1)
1562,SNOW/REC VEH - REFUSE STOP FOR POLICE 90B/26(c),2,0,SNOW/REC VEH - REFUSE STOP FOR POLICE 90B/26(C),,26(c)
1563,"SNOW/REC VEH - PUBLIC PROPERTY, ON 90B/26(e)",1,0,"SNOW/REC VEH - PUBLIC PROPERTY, ON 90B/26(E)",,26(e)
1564,TRAFFICKING COCAINE SECOND OR SUBSEQUENT OFFEN...,0,1,TRAFFICKING COCAINE SECOND OR SUBSEQUENT OFFEN...,,32(E)(a)
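
Several of the rows above end in `<chapter>/<section>` or `<chapter>§<section>` with no leading `c`. One possible second pass for such rows is sketched below. The `FALLBACK` pattern and the `chapter_section` helper are hypothetical, and the pattern should only be applied to rows where Chapter is still null, since it can misread charges that already parse correctly:

```python
import re

# Assumed format: "<chapter>/<section>" or "<chapter>§<section>" at the end of
# the string, where the chapter is digits plus an optional capital letter
FALLBACK = re.compile(r'(?P<chapter>\d+[A-Z]?)\s*[/§]\s*(?P<section>\S+)$')


def chapter_section(charge):
    """Return (chapter, section), or (None, None) if the format does not fit."""
    m = FALLBACK.search(charge)
    return (m['chapter'], m['section']) if m else (None, None)


print(chapter_section('UNLAWFUL POSSESSION FIREARM - HABITUAL 269/10(a)'))
print(chapter_section('RECKLESS ENDANGERMENT OF CHILD 265§13L'))
```

Charges that are truncated in the preview would need to be checked against their full text in the data.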


### Preview

In [29]:
charges.head(10)

Unnamed: 0,Charge,NW Counts,SF Counts,Description,Chapter,Section
0,"LOTTERY TICKET, UTTER OR PASS FALSE c10 §30",4,0,"LOTTERY TICKET, UTTER OR PASS FALSE",10,30
1,PEDDLING WITHOUT A LICENSE c101 §14,0,7,PEDDLING WITHOUT A LICENSE,101,14
2,PEDDLING VIOLATION c101 §14,0,3,PEDDLING VIOLATION,101,14
3,"BOAT, TRESPASS ON c102 §1A",0,2,"BOAT, TRESPASS ON",102,1A
4,AIR POLLUTION ORDER VIOL c111 §142A,1,0,AIR POLLUTION ORDER VIOL,111,142A
5,"INSPECTION CERTIFICATE, IMPROPER MV c111 §142M",0,2,"INSPECTION CERTIFICATE, IMPROPER MV",111,142M
6,TRASH TREATMENT FACILITY REGULATION VIOL c111 ...,4,0,TRASH TREATMENT FACILITY REGULATION VIOL,111,150A
7,"ALCOHOL DETOX PROG, UNLICENSED/DENY INSPECTION...",0,1,"ALCOHOL DETOX PROG, UNLICENSED/DENY INSPECTION...",111B,6
8,"PROFESSIONAL LIC SUSPENDED,PRACTICE WITH c112 §65",0,2,"PROFESSIONAL LIC SUSPENDED,PRACTICE WITH",112,65
9,"NURSING, UNAUTH PRACTICE OF PRACTICAL c112 §80A",6,0,"NURSING, UNAUTH PRACTICE OF PRACTICAL",112,80A


----

## Map expungement eligibility

### Import data from Master List
This is the data from the `Added FBI Cat. and Expunge` tab of the **Master Crime List offense with Expunge categories** spreadsheet provided by CfJJ.

In [30]:
expunge = pd.read_csv('../data/ExpungeCategories.csv')

# Get only needed columns
expunge = expunge[['Untruncated Offense', 'Expungeable?']]

# Standardize values for 'no'
expunge['Expungeable?'] = expunge['Expungeable?'].str.strip().replace('NO', 'No')

In [31]:
# Use regex to create new columns for Charge Description, Chapter, and Section
expunge['Description'] = None
expunge['Chapter'] = None
expunge['Section'] = None

for i in range(len(expunge)):
    try:
        expunge.loc[i, 'Description'] = re.search('.+?(?=\sCh.\s)', expunge.iloc[i]['Untruncated Offense'])[0].upper()
    except:
        expunge.loc[i, 'Description'] = expunge.iloc[i]['Untruncated Offense'].upper()
        
    try:
        expunge.loc[i, 'Chapter'] = re.search('(?<=Ch.\s)\d.*?(?=\sS)', expunge.iloc[i]['Untruncated Offense'])[0]
    except:
        expunge.loc[i, 'Chapter'] = None
        
    try:
        expunge.loc[i, 'Section'] = re.search('(?<=\sS\s)(\d.*)', expunge.iloc[i]['Untruncated Offense'])[0]
    except:
        expunge.loc[i, 'Section'] = None

In [32]:
# Merge with charges
expungeable = charges.merge(expunge, on=['Chapter', 'Section'], how='left')

# Get only records with a value for Expungeable? and Chapter
expungeable = expungeable[expungeable['Expungeable?'].notnull() & expungeable['Chapter'].notnull()]

In [33]:
#Preview
expungeable.head()

Unnamed: 0,Charge,NW Counts,SF Counts,Description_x,Chapter,Section,Untruncated Offense,Expungeable?,Description_y
16,"CHILD UNDER 10, ABANDON c119 §39",2,4,"CHILD UNDER 10, ABANDON",119,39,"Child under 10, abandon Ch. 119 S 39",Yes,"CHILD UNDER 10, ABANDON"
19,CONTRIBUTE TO DELINQUENCY OF CHILD c119 §63,14,10,CONTRIBUTE TO DELINQUENCY OF CHILD,119,63,Contribute to delinquency of child Ch. 119 S 63,Yes,CONTRIBUTE TO DELINQUENCY OF CHILD
20,CIVIL RIGHTS ORDER VIOLATION c12 §11J,0,1,CIVIL RIGHTS ORDER VIOLATION,12,11J,Civil rights order violation Ch. 12 S 11J,Yes,CIVIL RIGHTS ORDER VIOLATION
21,CIVIL RIGHTS ORDER VIOLATION c12 §11J,0,1,CIVIL RIGHTS ORDER VIOLATION,12,11J,Civil rights order violation with injury Ch. ...,Yes,CIVIL RIGHTS ORDER VIOLATION WITH INJURY
25,A&B ON CORRECTIONAL OFFICER c127 §38B,25,117,A&B ON CORRECTIONAL OFFICER,127,38B,A&B on correction officer Ch. 127 S 38B,Yes,A&B ON CORRECTION OFFICER


### Map expungement eligibility to Northwestern data

In [34]:
nw = nw.merge(expungeable[['Charge', 'Expungeable?']], on='Charge', how='left')

# Preview
nw.head()

Unnamed: 0,Person ID,Offense Date,Age at Offense,Filed,Status,Count,Charge,Disposition,Dispo Date,Expungeable?
0,NW-0,2011-09-13,21,2014-10-28,Closed,1,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30,No
1,NW-0,2011-09-13,21,2014-10-28,Closed,2,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30,No
2,NW-0,2011-09-13,21,2014-10-28,Closed,3,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30,No
3,NW-0,2011-09-13,21,2014-10-28,Closed,4,ASSAULT TO MURDER c265 §15,Not Guilty,2016-03-30,No
4,NW-0,2011-09-13,21,2014-10-28,Closed,5,ARSON OF DWELLING HOUSE c266 §1,Not Guilty,2016-03-30,Yes


In [35]:
# Check counts
nw['Expungeable?'].value_counts(dropna=False)

Yes    192738
No      74802
NaN     20770
Name: Expungeable?, dtype: int64

### Map expungement eligibility to Suffolk data

In [36]:
suff = suff.merge(expungeable[['Charge', 'Expungeable?']], on='Charge', how='left')

# Preview
suff.head()

Unnamed: 0,Person ID,Offense Date,Filed,Status,Count,Charge,Type Crime,Code Ucc Ctgry,Disposition,Description Disposition Reason,Dispo Date,Expungeable?
0,SF-0,2015-11-04,2016-01-01,CL,1,"DRUG, DISTRIBUTE CLASS A, SUBSQ.OFF. c94C §32(b)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes
1,SF-0,2015-11-04,2016-01-01,CL,2,"COCAINE, DISTRIBUTE, SUBSQ.OFF. c94C §32A(d)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes
2,SF-0,2015-11-04,2016-01-01,CL,2,"COCAINE, DISTRIBUTE, SUBSQ.OFF. c94C §32A(d)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes
3,SF-0,2015-11-04,2016-01-01,CL,2,"COCAINE, DISTRIBUTE, SUBSQ.OFF. c94C §32A(d)",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes
4,SF-0,2015-11-04,2016-01-01,CL,3,"DRUG, POSSESS TO DISTRIB CLASS A, SUBSQ. c94C ...",DR,F,Plea,Guilty - Lesser Offense,2017-01-31,Yes


In [37]:
# Check counts
suff['Expungeable?'].value_counts(dropna=False)

Yes    724768
No     266851
NaN     69047
Name: Expungeable?, dtype: int64
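
The counts above sum to 1,060,666 rows for Suffolk and the Northwestern counts sum to 288,310, while `info()` reported 303,270 and 75,725 entries respectively. The Suffolk preview also shows the same count repeated. This suggests the left merges multiply rows when one charge text matches several rows in `expungeable`. A minimal sketch of one way to check for and avoid this, using hypothetical data, is shown below. Leaving charges with conflicting Yes/No matches unmapped for manual review is an assumption, not a decision from the source:

```python
import pandas as pd

# Hypothetical data: one charge text matches two Master List rows
cases = pd.DataFrame({
    'Person ID': ['NW-0', 'NW-1'],
    'Charge': ['CIVIL RIGHTS ORDER VIOLATION c12 §11J', 'OTHER c1 §1'],
})
lookup = pd.DataFrame({
    'Charge': ['CIVIL RIGHTS ORDER VIOLATION c12 §11J'] * 2 + ['OTHER c1 §1'],
    'Expungeable?': ['Yes', 'Yes', 'No'],
})

# Merging directly repeats the first case once per matching lookup row
naive = cases.merge(lookup, on='Charge', how='left')

# Collapse the lookup to one row per (Charge, Expungeable?) pair
deduped = lookup.drop_duplicates()
# Charges that still appear twice have conflicting values; set them aside
ambiguous = deduped['Charge'][deduped['Charge'].duplicated()].unique()
deduped = deduped[~deduped['Charge'].isin(ambiguous)]

# validate='many_to_one' raises if any charge is still duplicated
safe = cases.merge(deduped, on='Charge', how='left', validate='many_to_one')
print(len(cases), len(naive), len(safe))
```

Comparing `len()` before and after each merge is a quick way to confirm that the row count is unchanged.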

-----
## Export new data files
These will be saved in the `clean-slate/data/processed/` folder.

In [38]:
nw.to_csv('../data/processed/prosecution_northwestern.csv', index=False)
suff.to_csv('../data/processed/prosecution_suffolk.csv', index=False)
charges.to_csv('../data/processed/prosecution_charges.csv', index=False)

-----
## Considerations & Possible Next Steps
- This notebook does not include a detailed inspection, cleaning, or exploratory analysis of the data.
- Date issues in the Northwestern dataset should be addressed before answering questions related to age. A simple count of cases where age < 21 would include records with invalid values for Date of Birth, Offense Date, or both. My recommendation is to exclude the rows where 'Age at Offense' is < 1 before getting counts where age < 21.
- Suffolk data does not include any indicators of age.
- See the `Get all charges` section above. Some charges still need to have the value for Chapter extracted.
- There are still missing values for `Expungeable?` for both NW & Suffolk datasets that will require a closer look and comparison with the Master List data.
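
As a rough illustration of the age recommendation above, the sketch below filters out invalid ages and then counts people for the first question (one charge only, and that charge expungeable). The rows are hypothetical, and the logic is a simplification: it only looks at charges from before age 21 and ignores missing `Expungeable?` values.

```python
import pandas as pd

# Hypothetical rows in the shape of the exported Northwestern file
df = pd.DataFrame({
    'Person ID':      ['NW-1', 'NW-2', 'NW-2', 'NW-3', 'NW-4'],
    'Age at Offense': [19, 20, 20, -98, 35],
    'Expungeable?':   ['Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

# Drop rows with implausible ages before applying the age cutoff
valid = df[df['Age at Offense'] >= 1]
under21 = valid[valid['Age at Offense'] < 21]

# Per person: number of charges, and whether every charge is expungeable
per_person = under21.groupby('Person ID').agg(
    charges=('Expungeable?', 'size'),
    all_expungeable=('Expungeable?', lambda s: (s == 'Yes').all()),
)

eligible = per_person[(per_person['charges'] == 1) & per_person['all_expungeable']]
print(list(eligible.index))
```

Without the first filter, the record with age -98 would be counted as under 21.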