# Clean Slate: Estimating offenses eligible for expungement under varying conditions
> Prepared by [Laura Feeney](https://github.com/laurafeeney) for Code for Boston's [Clean Slate project](https://github.com/codeforboston/clean-slate).

## Purpose & Notes
This notebook begins to process the Middlesex DA data. This data was sourced from the Middlesex DA website: https://www.middlesexda.com/public-information/pages/prosecution-data-and-statistics

Description from website: "The following is data from our Damion Case Management System pertaining to prosecution statistics for the time period from January 1, 2014, through January 1, 2020."

The download is available as an Excel file. Reading the Excel file directly in Python was too slow, so I manually converted it to CSV and imported that instead.
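
The conversion above was done by hand. As an alternative, here is a minimal sketch of a one-time conversion done in pandas, so the slow Excel read happens only on the first run. The helper name `load_with_csv_cache` is hypothetical and not part of this notebook, and `pd.read_excel` assumes an Excel engine such as `openpyxl` is installed.

```python
import os
import pandas as pd

def load_with_csv_cache(xlsx_path, csv_path):
    """Read csv_path if it exists; otherwise convert xlsx_path to CSV once."""
    if not os.path.exists(csv_path):
        # Slow path, taken only on the first run.
        # read_excel needs an Excel engine (for example openpyxl) to be installed.
        pd.read_excel(xlsx_path).to_csv(csv_path, index=False)
    # Fast path: every later run reads the cached CSV.
    return pd.read_csv(csv_path)
```

Note that a CSV round trip stores dates as plain strings, so date columns still need `pd.to_datetime` afterwards, as this notebook does for `Offense Date`.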

Note: This data set does not include age or date of birth (DOB).

The Middlesex DA site says this should be prosecutions for 2014 - 2019. However, not all offense dates or disposition dates fall within this period.

-----

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 200)
import numpy as np
import regex as re
import glob, os
import datetime 
from datetime import date 

#print(os.getcwd())
os.chdir("../../data/raw")
#print(os.getcwd())

In [2]:
#ms_raw = pd.read_excel('damion_database_2014-2019_6.30.xlsx') # too slow to do this way
ms_raw = pd.read_csv('damion_database_2014-2019_6.30.csv') 
columns = ['Case Number', 'Offense Date', 'Date of Filing', 'Court Location', 
           'Charge/Crime Code', 'Charge/Crime Description', 'Charge/Crime Type',
           'Disposition Code', 'Disposition Description', 'Disposition Date']
ms = ms_raw[columns].copy()  # work on a copy so later column assignments do not touch ms_raw
ms.head()

| | Case Number | Offense Date | Date of Filing | Court Location | Charge/Crime Code | Charge/Crime Description | Charge/Crime Type | Disposition Code | Disposition Description | Disposition Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14-01-479818 | 12/30/2013 | 1/2/2014 | SOM | 90/23/D | LICENSE SUSPENDED, OP MV WITH c90 §23 | Drugs/Distribution/Possession with Intent | GLF | GUILTY FILED | 6/2/2014 |
| 1 | 14-01-479818 | 12/30/2013 | 1/2/2014 | SOM | 94C/32C/C | DRUG, POSSESS TO DISTRIB CLASS D c94C §32C(a) | Drugs/Distribution/Possession with Intent | NOP | NOLLE PROSEQUI | 6/2/2014 |
| 2 | 14-01-479819 | 12/31/2013 | 1/2/2014 | SOM | 94C/34/C | DRUG, POSSESS CLASS B c94C §34 | Drugs/Possession | DWO | DISMISSED W/O PREJUDICE | 5/9/2014 |
| 3 | 14-01-479819 | 12/31/2013 | 1/2/2014 | SOM | 90/17/A | SPEEDING * c90 §17 | Drugs/Possession | RES | RESPONSIBLE | 5/9/2014 |
| 4 | 14-01-479819 | 12/31/2013 | 1/2/2014 | SOM | 89/4A | MARKED LANES VIOLATION * c89 §4A | Drugs/Possession | RES | RESPONSIBLE | 5/9/2014 |


### Cleaning and variable prep

In [3]:
ms.rename(columns={"Charge/Crime Description":"Charge"}, inplace=True)

# Label CMR offenses (Code of Massachusetts Regulations)
ms['CMRoffense'] = np.where(ms['Charge'].str.contains("CMR", na=False), 'yes', 'no')

# Extract Chapter, Section, and Paragraph (I think the third part is the paragraph; it isn't always populated)
chsec = ms['Charge/Crime Code'].str.split("/", n = 2, expand = True) 
ms['Chapter'] = chsec[0]
ms['Section'] = chsec[1]
ms['Paragraph'] = chsec[2]

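# Remove the stray 'Â' character (an encoding artifact), then create a version with no spaces
# and no non-word characters. This file has different spacing than the NW and Suffolk files
# or the Master Crime List descriptions.

ms['Charge'] = ms['Charge'].str.replace('Â', '', regex=False)
# regex=True is stated explicitly: pandas 2.0 changed the str.replace default to regex=False
ms['Charge_alnum'] = ms['Charge'].str.replace(r'\W+', '', regex=True)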

# Proxy for age -- using a juvenile court
ms['JuvenileC'] =  "no"
ms.loc[ms['Court Location'].str.contains("JU"), 'JuvenileC'] = "yes" 
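
To illustrate what the code split and the `Charge_alnum` normalization produce, here is a small sketch on toy rows modeled on the sample output above. The `toy` DataFrame is illustrative only and is not the real data set.

```python
import pandas as pd

# Toy rows modeled on the sample output; not the real data set.
toy = pd.DataFrame({
    "Charge/Crime Code": ["90/23/D", "94C/32C/C", "89/4A"],
    "Charge": [
        "LICENSE SUSPENDED, OP MV WITH c90 §23",
        "DRUG, POSSESS TO DISTRIB CLASS D c94C §32C(a)",
        "MARKED LANES VIOLATION * c89 §4A",
    ],
})

# Split on at most two slashes; a code with no third part gets a missing value.
parts = toy["Charge/Crime Code"].str.split("/", n=2, expand=True)
toy["Chapter"], toy["Section"], toy["Paragraph"] = parts[0], parts[1], parts[2]

# Drop every run of non-word characters (spaces, commas, '*', '§', parentheses).
toy["Charge_alnum"] = toy["Charge"].str.replace(r"\W+", "", regex=True)

print(toy[["Chapter", "Section", "Paragraph", "Charge_alnum"]])
```

The third row (`89/4A`) has no paragraph, so its `Paragraph` value is missing, which matches the note that this field isn't always populated.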

### Dates (supposed to be 2014-2019)

In [4]:
reference_date = datetime.date(2020, 9, 1) # fixed reference date; using date.today() wouldn't be stable across runs

ms['Offense Date'] = pd.to_datetime(ms['Offense Date'], errors='coerce').dt.date
ms['years_since_offense'] = (reference_date - ms['Offense Date'])/pd.Timedelta(1, 'D')/365

print("The earliest offense date is:    ", min(ms['Offense Date']))
print("The max offense date is:         ", max(ms['Offense Date']), "\n")

print("Distribution of years since offense:", "\n", ms['years_since_offense'].describe(), "\n")

before_2013 = ms['Case Number'][ms['Offense Date']<datetime.date(2013,1,1)].nunique()
before_2014 = ms['Case Number'][ms['Offense Date']<datetime.date(2014,1,1)].nunique()
after_2014 = ms['Case Number'][ms['Offense Date']>=datetime.date(2014,1,1)].nunique()
after_2013 = ms['Case Number'][ms['Offense Date']>=datetime.date(2013,1,1)].nunique()


print("There are", before_2014, "cases with offense date prior to Jan 1, 2014",
     "and", before_2013, "cases before 2013")

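# Note: the denominators are the cases on or after each cutoff, so these figures are
# ratios of earlier cases to later cases, not shares of all cases.
print("Percent of cases before 2014:",round(before_2014*100/after_2014,2), "\n"
     "Percent before 2013:", round(before_2013*100/after_2013,2))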

The earliest offense date is:     1951-06-30
The max offense date is:          2019-12-30 

Distribution of years since offense: 
 count    343072.000000
mean          4.251889
std           2.426659
min           0.673973
25%           2.619178
50%           4.230137
75%           5.641096
max          69.221918
Name: years_since_offense, dtype: float64 

There are 9965 cases with offense date prior to Jan 1, 2014 and 2963 cases before 2013
Percent of cases before 2014: 6.48 
Percent before 2013: 1.84
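
The percentages above divide the early cases by the cases on or after each cutoff. If a share of all cases is wanted instead, the denominator would be the total number of unique cases. A minimal sketch on toy data follows; the helper name `share_of_cases_before` and the toy case numbers are hypothetical, and the results for the real data set are not computed here.

```python
import datetime
import pandas as pd

# Toy data: five charges across four hypothetical case numbers.
toy = pd.DataFrame({
    "Case Number": ["A", "A", "B", "C", "D"],
    "Offense Date": ["12/30/2013", "12/30/2013", "1/5/2014", "6/1/2012", "3/3/2015"],
})
toy["Offense Date"] = pd.to_datetime(toy["Offense Date"], format="%m/%d/%Y").dt.date

def share_of_cases_before(df, cutoff):
    """Percent of all unique cases with at least one offense dated before cutoff."""
    total = df["Case Number"].nunique()
    early = df.loc[df["Offense Date"] < cutoff, "Case Number"].nunique()
    return round(early * 100 / total, 2)

print(share_of_cases_before(toy, datetime.date(2014, 1, 1)))  # 50.0 (cases A and C of 4)
print(share_of_cases_before(toy, datetime.date(2013, 1, 1)))  # 25.0 (case C of 4)
```

Using the total as the denominator also avoids counting a case twice when its charges fall on both sides of the cutoff.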


In [5]:
ms.to_csv('../../data/raw/ms.csv', index=False)