# Clean Slate: MA Data
> Prepared by [Dawn Graham](https://github.com/dawngraham) for Code for Boston's [Clean Slate project](https://github.com/codeforboston/clean-slate).

## Purpose
Citizens for Juvenile Justice received MA prosecution data thanks to the ACLU.

The purpose of this notebook is to anonymize the data and provide only fields required to answer the following questions (updated 6/28/20):

1. How many people (under age 21) are eligible for expungement today? This would be people with only **one charge** that is not part of the list of ineligible offenses (per section 100J). 


2. How many people (under age 21) would be eligible based on having only **one incident** (which could include multiple charges), none of which is on the list of ineligible offenses?
 - How many people (under age 21) would be eligible based on only having **one incident** if only sex-based offenses or murder were excluded from expungement?
 

3. How many people (under age 21) would be eligible based on who has **not been found guilty** (given current offenses that are eligible for expungement)?
 - How many people (under age 21) would be eligible based on who has **not been found guilty** for all offenses except for murder or sex-based offenses?


Prior questions:
- How many people (under age 21) are ineligible because they have a charge that is on the list of ineligible offenses (per section 100J)?
- How many people (under age 21) have only one offense on their record?
- How many people have only expungable offenses on their record but have more than 1 of them?

The resulting datasets will be output as separate .csv files that can be saved and shared in the Clean Slate GitHub repo.

-----


## Northwestern DA Prosecution Data
### Import data

In [1]:
import pandas as pd
import numpy as np
import regex as re
import glob, os

In [2]:
nw2014 = pd.read_csv('../data/Prosecution Northwestern DA 2014-2018 RAW DATA - 2014 to early 2018.csv')
nw2018 = pd.read_csv('../data/Prosecution Northwestern DA 2014-2018 RAW DATA - to end of 2018.csv')

# Combine the two datasets
nw = pd.concat([nw2014, nw2018])
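
The `nw.info()` output further down reports an index running "0 to 10429" for 75725 entries, which suggests that the combined frame keeps each file's original row labels. The sketch below, on two made-up toy frames, shows how `ignore_index=True` would renumber the rows instead. It is an optional variation, not what the notebook does.

```python
import pandas as pd

# Two toy frames standing in for the two raw files (values are made up)
first = pd.DataFrame({'Charge': ['x', 'y']})
second = pd.DataFrame({'Charge': ['z']})

# Default behaviour: each frame keeps its own row labels, so labels repeat
stacked = pd.concat([first, second])

# With ignore_index=True the result gets a fresh 0..n-1 index
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the steps in this notebook, but a fresh index avoids surprises with label-based lookups such as `.loc`.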


### Quick look at data

In [3]:
# Get names, # of non-null values, and the data types for each column
nw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75725 entries, 0 to 10429
Data columns (total 40 columns):
Filed:                                    75725 non-null object
Age at Case Filing                        75725 non-null int64
Status:                                   75725 non-null object
Court:                                    75725 non-null object
Date of Birth:                            75595 non-null object
Gender:                                   74003 non-null object
Race:                                     68741 non-null object
Ethnicity:                                21321 non-null object
Count:                                    75725 non-null int64
Charge:                                   75725 non-null object
Disposition:                              72259 non-null object
Dispo Date:                               71900 non-null object
Offense Date:                             74944 non-null object
Bail Requested                            60 non-null object


In [4]:
# Get the # of null values in each column
nw.isnull().sum()

Filed:                                         0
Age at Case Filing                             0
Status:                                        0
Court:                                         0
Date of Birth:                               130
Gender:                                     1722
Race:                                       6984
Ethnicity:                                 54404
Count:                                         0
Charge:                                        0
Disposition:                                3466
Dispo Date:                                 3825
Offense Date:                                781
Bail Requested                             75665
Bail Set                                   74746
Imposed Facility:                          68952
Sentence Type:                             69467
Sentence Date:                             70059
Term Min:                                  73191
Term Min Unit:                             73191
Term Max:           

### Keep only desired columns
Because many of the columns consist primarily of null values and also do not provide info relevant to our questions, they will not be useful for any analysis. We will only keep the columns with relevant info.

In [5]:
# Save dataframe with only columns we want to keep
columns = ['Court:', 'Date of Birth:', 'Offense Date:', 'Filed:', 'Age at Case Filing', 'Status:',
       'Gender:', 'Race:', 'Ethnicity:', 'Count:', 'Charge:', 'Disposition:',
       'Dispo Date:', 'Sentence Type:', 'Sentence Date:']

# Copy the selection so later column assignments modify an independent dataframe
nw = nw[columns].copy()

# Remove ":" from end of column names
nw.columns = nw.columns.str.rstrip(':')
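
As a rough illustration of how "primarily null" columns could be identified, the sketch below computes the share of missing values per column on a made-up toy frame. The 0.5 cutoff and the toy values are assumptions for illustration only; the column list above was chosen by hand.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data (values are made up)
toy = pd.DataFrame({
    'Charge:': ['A', 'B', 'C', 'D'],
    'Bail Requested': [np.nan, np.nan, np.nan, '100'],
    'Gender:': ['F', 'M', np.nan, 'M'],
})

# Fraction of missing values per column, highest first
null_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cutoff: list the columns that are more than half empty
mostly_null = null_share[null_share > 0.5].index.tolist()
print(mostly_null)
```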

In [6]:
# Convert dates to datetime values
nw['Filed'] = pd.to_datetime(nw['Filed'], errors='coerce').dt.date
nw['Date of Birth'] = pd.to_datetime(nw['Date of Birth'], errors='coerce').dt.date
nw['Dispo Date'] = pd.to_datetime(nw['Dispo Date'], errors='coerce').dt.date
nw['Offense Date'] = pd.to_datetime(nw['Offense Date'], errors='coerce').dt.date
nw['Sentence Date'] = pd.to_datetime(nw['Sentence Date'], errors='coerce').dt.date
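
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be useful to count how many non-null values were lost in the conversion. A minimal sketch on made-up strings:

```python
import pandas as pd

# Toy column with one unparseable value and one missing value (made up)
raw = pd.Series(['2014-01-23', 'not a date', None, '2018-01-19'])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors='coerce')

# Values that were present in the raw data but could not be parsed
newly_missing = int((parsed.isna() & raw.notna()).sum())
print(newly_missing)
```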

### Get Age at Offense Date
Because 'Date of Birth' and 'Offense Date' contain some null values, an age cannot be computed for every record.

In [7]:
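# Subtract "Date of Birth" from "Offense Date" to get "Age at Offense"
# (365.2425 days is the average year length that pd.Timedelta(1, 'Y') stood for;
# pandas 2.0 and later no longer accept the 'Y' unit)
nw['Age at Offense'] = (nw['Offense Date'] - nw['Date of Birth'])/pd.Timedelta(days=365.2425)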
nw['Age at Offense'] = np.floor(nw['Age at Offense']).astype('Int64')
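
Dividing the elapsed time by an average year length can be off by one for dates that fall on or very near a birthday. As an alternative sketch (not the method used in this notebook), age in completed years can be derived by comparing calendar fields. The helper name `age_in_years` and the toy dates are hypothetical.

```python
import pandas as pd

def age_in_years(birth, event):
    """Completed years between two date-like Series; <NA> where either is missing."""
    birth = pd.to_datetime(birth, errors='coerce')
    event = pd.to_datetime(event, errors='coerce')
    years = event.dt.year - birth.dt.year
    # Subtract one year when the birthday has not yet occurred in the event year
    before_birthday = (event.dt.month < birth.dt.month) | (
        (event.dt.month == birth.dt.month) & (event.dt.day < birth.dt.day)
    )
    return (years - before_birthday.astype(int)).astype('Int64')

# Toy data (made up): the day before an 18th birthday, the birthday itself,
# and a record with a missing date of birth
toy = pd.DataFrame({
    'Date of Birth': ['2000-06-15', '2000-06-15', None],
    'Offense Date': ['2018-06-14', '2018-06-15', '2018-01-01'],
})
ages = age_in_years(toy['Date of Birth'], toy['Offense Date'])
print(ages)
```

Since the research questions hinge on an age cutoff of 21, a calendar-based calculation may be worth considering for records close to that boundary.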

#### Date Issues
186 records have invalid values for Date of Birth, Offense Date, or both. The cell below looks for them by selecting records whose computed Age at Offense is less than 1.

In [8]:
dateissues = nw.loc[nw['Age at Offense'] < 1,['Date of Birth', 'Offense Date', 'Age at Offense',
                                              'Age at Case Filing', 'Filed']]

In [9]:
dateissues.sort_values(by='Age at Offense').head()

| | Date of Birth | Offense Date | Age at Offense | Age at Case Filing | Filed |
|---|---|---|---|---|---|
| 61884 | 1992-03-08 | 1750-12-03 | -242 | 26 | 2018-01-19 |
| 61883 | 1992-03-08 | 1750-12-03 | -242 | 26 | 2018-01-19 |
| 845 | 1979-12-08 | 1882-11-30 | -98 | 34 | 2014-01-23 |
| 846 | 1979-12-08 | 1882-11-30 | -98 | 34 | 2014-01-23 |
| 847 | 1979-12-08 | 1882-11-30 | -98 | 34 | 2014-01-23 |
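
A more explicit way to flag this kind of problem is to compare the two dates directly. The sketch below uses the first two date pairs from the preview above plus one made-up valid row. It is an illustrative check, not part of the original notebook.

```python
import pandas as pd

# First two rows mirror the preview above; the third row is made up
toy = pd.DataFrame({
    'Date of Birth': pd.to_datetime(['1992-03-08', '1979-12-08', '1990-01-01']),
    'Offense Date': pd.to_datetime(['1750-12-03', '1882-11-30', '2014-05-01']),
})

# An offense cannot predate the person's birth
offense_before_birth = toy['Offense Date'] < toy['Date of Birth']
suspect = toy[offense_before_birth]
print(len(suspect))
```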


### Create unique ID
We will create a unique "Person ID" for each distinct combination of Date of Birth, Gender, Race, and Ethnicity. This is an approximation: one person could appear under more than one combination (for example, if Race was recorded differently at different times), and two or more people could share the same combination. Even so, it lets us link multiple records that most likely belong to one person.

In [10]:
# Create a unique ID for each distinct combination
identifiers = ['Date of Birth', 'Gender', 'Race', 'Ethnicity']

unique = nw[identifiers].drop_duplicates().reset_index(drop=True)
unique['Person ID'] = unique.index

# Add a prefix to help distinguish between datasets
unique['Person ID'] = 'NW-' + unique['Person ID'].astype(str)

# Add unique ID to dataset
nw = nw.merge(unique, on=identifiers)
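
Many records have missing Race or Ethnicity, so it matters that both `drop_duplicates` and `merge` treat missing identifier values as equal to each other. The sketch below repeats the same steps on a made-up toy frame to show that every row still receives a Person ID.

```python
import numpy as np
import pandas as pd

identifiers = ['Date of Birth', 'Gender', 'Race', 'Ethnicity']

# Toy records (made up); rows 'a' and 'b' share identifiers, including a missing Ethnicity
toy = pd.DataFrame({
    'Date of Birth': ['1990-01-01', '1990-01-01', '1985-05-05', '1990-01-01'],
    'Gender': ['M', 'M', 'F', 'M'],
    'Race': ['W', 'W', np.nan, 'W'],
    'Ethnicity': [np.nan, np.nan, np.nan, 'H'],
    'Charge': ['a', 'b', 'c', 'd'],
})

# Same approach as the cell above: one ID per distinct combination
unique = toy[identifiers].drop_duplicates().reset_index(drop=True)
unique['Person ID'] = 'NW-' + unique.index.astype(str)

# Missing values in the key columns are matched against each other
linked = toy.merge(unique, on=identifiers)
ids_by_charge = linked.set_index('Charge')['Person ID'].to_dict()
print(ids_by_charge)
```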

In [11]:
# Save dataset without identifiers
nw = nw[['Person ID', 'Court', 'Offense Date', 'Age at Offense', 'Filed', 'Status',
         'Count', 'Charge', 'Disposition', 'Dispo Date']]

### Preview

In [12]:
nw.head()

| | Person ID | Court | Offense Date | Age at Offense | Filed | Status | Count | Charge | Disposition | Dispo Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NW-0 | Franklin Superior Court | 2011-09-13 | 21 | 2014-10-28 | Closed | 1 | ASSAULT TO MURDER c265 §15 | Not Guilty | 2016-03-30 |
| 1 | NW-0 | Franklin Superior Court | 2011-09-13 | 21 | 2014-10-28 | Closed | 2 | ASSAULT TO MURDER c265 §15 | Not Guilty | 2016-03-30 |
| 2 | NW-0 | Franklin Superior Court | 2011-09-13 | 21 | 2014-10-28 | Closed | 3 | ASSAULT TO MURDER c265 §15 | Not Guilty | 2016-03-30 |
| 3 | NW-0 | Franklin Superior Court | 2011-09-13 | 21 | 2014-10-28 | Closed | 4 | ASSAULT TO MURDER c265 §15 | Not Guilty | 2016-03-30 |
| 4 | NW-0 | Franklin Superior Court | 2011-09-13 | 21 | 2014-10-28 | Closed | 5 | ARSON OF DWELLING HOUSE c266 §1 | Not Guilty | 2016-03-30 |


-------
## Suffolk County Prosecution Data 2013 - 2018

In [13]:
# Concatenate all Suffolk County files
suff = pd.concat(map(pd.read_csv, glob.glob('../data/Suffolk County Prosecution*.csv')), sort=False)


### Quick look at the data

In [14]:
# Get names, # of non-null values, and the data types for each column
suff.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 303270 entries, 0 to 53717
Data columns (total 42 columns):
Case #                            303270 non-null object
Defendant #                       303270 non-null int64
Count #                           303270 non-null int64
Description Charge                303270 non-null object
Type Crime                        303270 non-null object
Code Ucc Ctgry                    262955 non-null object
Date Crime                        301015 non-null object
Date Filed                        303270 non-null object
Date Dspstn                       251489 non-null object
Description Dspstn                251710 non-null object
Code Dspstn                       251710 non-null object
Description Disposition Reason    181924 non-null object
Type Chrg Orgn                    205 non-null object
SC Docket                         0 non-null float64
DC Docket                         0 non-null float64
Race/Eth                          259048 non-nul

In [15]:
# Get the # of null values in each column
suff.isnull().sum()

Case #                                 0
Defendant #                            0
Count #                                0
Description Charge                     0
Type Crime                             0
Code Ucc Ctgry                     40315
Date Crime                          2255
Date Filed                             0
Date Dspstn                        51781
Description Dspstn                 51560
Code Dspstn                        51560
Description Disposition Reason    121346
Type Chrg Orgn                    303065
SC Docket                         303270
DC Docket                         303270
Race/Eth                           44222
Gender                              3051
Code Ofcr Agncy                    27656
Court                                  0
Judge                              72790
Nmbr Day Jail Rcmnd                    0
Nmbr Day Jl Rcmnd Unt              17525
Nmbr Day Jl Rcmnd Min                  0
Nmbr Day Jl Rcmnd Min Unt          17510
Nmbr Day Jl Imps

### Combine 'Description Disposition Reason' and 'Description Reason'
These two columns contain the same kind of information, most likely because one of the source files used a different field name.

In [16]:
suff['Description Disposition Reason'].unique()

array(['Guilty - Lesser Offense', 'Not Guilty', nan, 'Guilty - Committed',
       'Guilty - Probation', 'DWOP - no evidence', 'DWOP - no victim',
       'Dismissed by Commonwealth', 'Dismissed to Subsequent Indictment',
       'Guilty - Filed', 'Dismissed for Agreed Plea',
       'Dismissed by Court', 'Guilty', 'DWOP - no witness',
       'Guilty - Split Sentence', 'Dismissed for Community Service',
       'CWF (ASF)', 'Dismissed WO Prejudice', 'Guilty - Fine',
       'Dismissed WO Prosecution', 'Guilty - Suspended Sentence',
       'defense motion for dismissal allowed',
       'Dismissed Upon Payment of Court Costs',
       'Dismissed Prior to Arraignment', 'Delinquent - Committed',
       'Duplicate Charge', 'DWOP - no police', 'No Probable Cause',
       'Directed Verdict Not Guilty', 'Dismissed Defendant Deceased',
       'Pretrial Probation', 'Remanded To Clerks Hearing',
       'motion to suppress evidence allowed',
       'Responsible - Fine C277S70', 'Dismissed Transfer to BJC

In [17]:
suff['Description Reason'].unique()

array([nan, 'Dismissed Prior to Arraignment', 'DWOP - no victim',
       'Dismissed Upon Payment of Court Costs', 'CWF (ASF)',
       'Dismissed for Agreed Plea', 'Dismissed WO Prosecution',
       'Guilty - Probation', 'Dismissed by Commonwealth',
       'Guilty - Filed', 'Dismissed by Court',
       'Dismissed to Subsequent Indictment',
       'defense motion for dismissal allowed',
       'Dismissed for Community Service', 'Guilty - Committed',
       'Dismissed WO Prejudice', 'Guilty - Suspended Sentence', 'Guilty',
       'DWOP - no evidence', 'Not Guilty', 'DWOP - no witness',
       'Guilty - Split Sentence', 'Guilty - Fine', 'DWOP - no police',
       'Dismissed for Time Served in Custody', 'Guilty - Lesser Offense',
       'No Probable Cause', 'Dismissed Defendant Deceased',
       'Duplicate Charge', 'Delinquent - Committed', 'Pretrial Probation',
       'motion to suppress evidence allowed',
       'Directed Verdict Not Guilty', 'Lack of Jurisdiction',
       'Delinquent - P

In [18]:
# Fill missing values in 'Description Disposition Reason' with values from 'Description Reason'
suff['Description Disposition Reason'] = suff['Description Disposition Reason'].fillna(suff['Description Reason'])
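
Before filling one column from the other, it can help to confirm that the two never disagree on the same row. The sketch below runs that check on made-up values. It is an optional sanity check, not something the original notebook performs.

```python
import numpy as np
import pandas as pd

# Toy frame (made up) where each row has at most one of the two fields populated
toy = pd.DataFrame({
    'Description Disposition Reason': ['Guilty', np.nan, np.nan, 'Not Guilty'],
    'Description Reason': [np.nan, 'Not Guilty', np.nan, np.nan],
})
primary = toy['Description Disposition Reason']
fallback = toy['Description Reason']

# Rows where both columns are populated but hold different values
both_present = primary.notna() & fallback.notna()
conflicts = int((both_present & (primary != fallback)).sum())

# With no conflicts, filling from the fallback loses no information
combined = primary.fillna(fallback)
print(conflicts, combined.tolist())
```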

### Create Unique ID
This dataset already has 'Defendant #', which is a unique ID for each person. To help protect privacy, we will generate a new one to use.

In [19]:
# Create a new unique ID for each distinct 'Defendant #'
identifiers = ['Defendant #']

unique = suff[identifiers].drop_duplicates().reset_index(drop=True)
unique['Person ID'] = unique.index

# Add a prefix to help distinguish between datasets
unique['Person ID'] = 'SF-' + unique['Person ID'].astype(str)

# Add unique ID to dataset
suff = suff.merge(unique, on=identifiers)
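
With a single identifier column, `pd.factorize` offers a compact way to assign the same kind of sequential label, numbered in order of first appearance. This is an alternative sketch on made-up defendant numbers, not the notebook's own approach.

```python
import pandas as pd

# Toy defendant numbers (made up)
toy = pd.DataFrame({'Defendant #': [9041, 9041, 1277, 5530, 1277]})

# codes holds one integer per row, numbered by first appearance of each value
codes, originals = pd.factorize(toy['Defendant #'])
toy['Person ID'] = 'SF-' + pd.Series(codes, index=toy.index).astype(str)
print(toy['Person ID'].tolist())
```

Either way, the new label carries no trace of the original 'Defendant #' once that column is dropped.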

### Keep only desired columns
Because many of the columns consist primarily of null values and also do not provide info relevant to our questions, they will not be useful for any analysis. We will only keep the columns with relevant info.

In [20]:
# Save dataframe with only columns we want to keep
columns = ['Person ID', 'Court', 'Date Crime', 'Date Filed', 'Status', 'Count #', 'Description Charge',
           'Type Crime', 'Code Ucc Ctgry', 'Description Dspstn', 'Description Disposition Reason', 'Date Dspstn']

suff = suff[columns]

In [21]:
# Rename columns to match names used in nw dataset
newnames={'Date Crime': 'Offense Date',
          'Date Filed': 'Filed',
          'Count #': 'Count',
          'Description Charge': 'Charge',
          'Description Dspstn': 'Disposition',
          'Date Dspstn': 'Dispo Date'
         }

suff = suff.rename(columns=newnames)
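
To see how closely the renamed Suffolk columns line up with the Northwestern ones, the two column lists from this notebook can be compared directly. This is a small illustrative check; the variable names `nw_columns`, `suff_columns`, `only_nw`, and `only_suff` are introduced here for the example.

```python
# Final column list of the nw dataset, as saved earlier in this notebook
nw_columns = ['Person ID', 'Court', 'Offense Date', 'Age at Offense', 'Filed', 'Status',
              'Count', 'Charge', 'Disposition', 'Dispo Date']

# Suffolk columns kept above, before renaming
suff_columns = ['Person ID', 'Court', 'Date Crime', 'Date Filed', 'Status', 'Count #',
                'Description Charge', 'Type Crime', 'Code Ucc Ctgry', 'Description Dspstn',
                'Description Disposition Reason', 'Date Dspstn']

newnames = {'Date Crime': 'Offense Date',
            'Date Filed': 'Filed',
            'Count #': 'Count',
            'Description Charge': 'Charge',
            'Description Dspstn': 'Disposition',
            'Date Dspstn': 'Dispo Date'}

# Apply the mapping, leaving unmapped names unchanged
renamed = [newnames.get(name, name) for name in suff_columns]

# Columns present in one dataset but not the other
only_nw = [name for name in nw_columns if name not in renamed]
only_suff = [name for name in renamed if name not in nw_columns]
print(only_nw)
print(only_suff)
```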

In [22]:
# Convert dates to datetime values
suff['Offense Date'] = pd.to_datetime(suff['Offense Date'], errors='coerce').dt.date
suff['Filed'] = pd.to_datetime(suff['Filed'], errors='coerce').dt.date
suff['Dispo Date'] = pd.to_datetime(suff['Dispo Date'], errors='coerce').dt.date

### Preview

In [23]:
suff.head()

| | Person ID | Court | Offense Date | Filed | Status | Count | Charge | Type Crime | Code Ucc Ctgry | Disposition | Description Disposition Reason | Dispo Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SF-0 | SUP | 2015-11-04 | 2016-01-01 | CL | 1 | DRUG, DISTRIBUTE CLASS A, SUBSQ.OFF. c94C §32(b) | DR | F | Plea | Guilty - Lesser Offense | 2017-01-31 |
| 1 | SF-0 | SUP | 2015-11-04 | 2016-01-01 | CL | 2 | COCAINE, DISTRIBUTE, SUBSQ.OFF. c94C §32A(d) | DR | F | Plea | Guilty - Lesser Offense | 2017-01-31 |
| 2 | SF-0 | SUP | 2015-11-04 | 2016-01-01 | CL | 3 | DRUG, POSSESS TO DISTRIB CLASS A, SUBSQ. c94C ... | DR | F | Plea | Guilty - Lesser Offense | 2017-01-31 |
| 3 | SF-0 | SUP | 2015-11-04 | 2016-01-01 | CL | 4 | POSSESS TO DISTRIBUTE COCAINE, SUBSEQUENT. c94... | DR | F | Plea | Guilty - Lesser Offense | 2017-01-31 |
| 4 | SF-1 | SUP | 2014-10-23 | 2016-01-01 | CL | 1 | A&B ON +60/DISABLED c265 §13K/F | AS | | Verdict - Jury Trial | Not Guilty | 2016-08-02 |


-----
## Export new data files
These will be saved in the `clean-slate/data/raw/` folder.

In [24]:
nw.to_csv('../data/raw/nw.csv', index=False)
suff.to_csv('../data/raw/suff.csv', index=False)
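
Since the goal is to share anonymized files, a final check that no identifier column survives the export may be worthwhile. The sketch below writes a made-up toy frame to a temporary folder and reads it back. The `identifier_columns` set is assembled from column names seen earlier in this notebook, and the check itself is an assumption, not part of the original workflow.

```python
import os
import tempfile

import pandas as pd

# Columns that should never appear in a shared file (names taken from the raw data above)
identifier_columns = {'Date of Birth', 'Gender', 'Race', 'Ethnicity', 'Race/Eth', 'Defendant #'}

# Toy frame standing in for an anonymized dataset (values are made up)
toy = pd.DataFrame({'Person ID': ['NW-0'], 'Charge': ['EXAMPLE CHARGE'], 'Count': [1]})

# Write to a temporary folder and read the file back, as a recipient would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    toy.to_csv(path, index=False)
    exported = pd.read_csv(path)

# Any overlap here would mean an identifier leaked into the export
leaked = identifier_columns.intersection(exported.columns)
print(leaked)
```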