Purpose:
* Create a single mapping of criminal charges and characteristics, including:
    * Charge: _A text-based description of the crime. This is recorded differently in different districts._
    * Massachusetts General Law (Where in MGL is this crime codified)
        * Chapter
        * Section
        * Paragraph (if applicable)
    * Misdemeanor / Felony
    * Sex offense per 100J
    * Murder offense
    * Additional criteria required to determine expungement eligibility
    
The additional criteria may be incomplete and are subject to revision: they do not exist in a 'coded' state and will likely need to be extracted using regex.

This file will start with the charges from the Master Crime List, and will be built out over time to include alternative or additional charges from each prosecution dataset from each district. 

This process may need to be re-run as new datasets reveal new charges, new text descriptions of charges, new charge-extra criteria combinations, etc.
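As a rough illustration of the regex extraction mentioned above, the sketch below flags a few criteria in free-text charge descriptions. The labels and patterns in `CRITERIA_PATTERNS` are hypothetical and chosen only because similar phrases appear in the Master Crime List text; they are not the project's actual eligibility rules.

```python
import re

# Hypothetical criteria labels and patterns, for illustration only.
CRITERIA_PATTERNS = {
    "subsequent_offense": re.compile(r"\bsubsq\.?\s*off", re.IGNORECASE),
    "victim_under_14": re.compile(r"\bunder\s+14\b", re.IGNORECASE),
    "bodily_injury": re.compile(r"\bbodily\s+injury\b", re.IGNORECASE),
}

def extract_criteria(charge_text):
    """Return the sorted labels of all patterns found in the charge text."""
    return sorted(
        label
        for label, pattern in CRITERIA_PATTERNS.items()
        if pattern.search(charge_text)
    )

examples = [
    "A&B on child to coerce criminal conspiracy, subsq. off. Ch. 265 S 44",
    "A&B on child under 14, substantial bodily injury Ch. 265 S 13J",
    "A&b",
]
criteria = {text: extract_criteria(text) for text in examples}
```

Keeping the patterns in one dictionary would make it easier to revise them as new districts introduce new wordings.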

In [1]:
import pandas as pd
import regex as re

### Import data from Master List
This is the data from the `Added FBI Cat. and Expunge` tab of the **[Master Crime List offense with Expunge categories](https://docs.google.com/spreadsheets/d/11iD3ilejUW28NE6DdUaUkkp3PoPauhCj/edit#gid=579055210)** spreadsheet provided by CfJJ.

In [2]:
MCL = pd.read_csv('../data/raw/ExpungeCategories.csv')
# rename columns
MCL = MCL.rename(columns={"Now Legal?\n": "Now Legal", "Expungeable?": "Expungeable", 
                    "If no, why not?" :"Why not eligible", "Untruncated Offense" : "MCL_Offense"})
MCL.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1037 entries, 0 to 1036
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Code                 1037 non-null   int64  
 1   MCL_Offense          1037 non-null   object 
 2   FBI Category         1037 non-null   object 
 3   MOST_SER_RANK        1025 non-null   float64
 4   OFFENSE_TYPE         1037 non-null   object 
 5   GRID                 1037 non-null   object 
 6   Penalty Type         1037 non-null   object 
 7   Mandatory Gun        15 non-null     object 
 8   Police Notification  35 non-null     object 
 9   SORB                 53 non-null     object 
 10  SDP                  53 non-null     object 
 11  Now Legal            0 non-null      float64
 12  Expungeable          1037 non-null   object 
 13  Why not eligible     318 non-null    object 
dtypes: float64(2), int64(1), object(11)
memory usage: 113.5+ KB


In [3]:
# Standardize values for 'no'
MCL['Expungeable'] = MCL['Expungeable'].str.strip().replace('NO', 'No')

In [4]:
# check that values for other columns are fairly consistent
for column in MCL.columns[7:]:
    print(MCL[column].value_counts(), '\n')


Yes                                14
Only if YO w/ an adult sentence     1
Name: Mandatory Gun, dtype: int64 

Yes    35
Name: Police Notification, dtype: int64 

Yes    53
Name: SORB, dtype: int64 

Yes    53
Name: SDP, dtype: int64 

Series([], Name: Now Legal, dtype: int64) 

Yes    719
No     318
Name: Expungeable, dtype: int64 

Ch 265 Felony                      151
178C Sex Offense                    30
Death or serious injury             20
Armed w dangerous weapon            18
Death or serious bodily injury      17
Ch 90 S 24 Violation                16
Ch 269 S10 a-d violation            15
S 121-131Q of Ch 140                14
Elderly/Disabled                     8
Armed w a dangerous weapon           6
Ch 6 S 178C Sex Offense              4
Ch 269 S 10E violation               3
Ch 123A S 1 Offense                  3
Order pursuant Ch 209A               3
Elderly                              3
Order Pursuan Ch 258E                1
Order pursuant Ch 258E               1


In [5]:
# View some offenses. Next we are going to split these into description, chapter, section, and paragraph.

pd.options.display.max_rows = 999
pd.options.display.max_colwidth = 500
MCL[['MCL_Offense']][:100]

Unnamed: 0,MCL_Offense
0,A&b
1,A&B Ch. 265 S 13A(a)
2,A&b on a corrections officer
3,A&b on a public servant
4,A&B on child to coerce criminal conspiracy Ch. 265 S 44
5,"A&B on child to coerce criminal conspiracy, subsq. off. Ch. 265 S 44"
6,"A&B on child under 14, bodily injury Ch. 265 S 13J"
7,"A&B on child under 14, substantial bodily injury Ch. 265 S 13J"
8,A&b on child with injury
9,A&b on child with substantial injury


In [6]:
# Use regex to create new columns for Charge Description, Chapter, and Section
MCL['Description'] = None
MCL['Chapter'] = None
MCL['Section'] = None

for i in range(len(MCL)):
    offense = MCL.iloc[i]['MCL_Offense']

    # Description: everything before " Ch. "; fall back to the full text if there is no citation
    match = re.search(r'.+?(?=\sCh.\s)', offense)
    MCL.loc[i, 'Description'] = (match[0] if match else offense).upper()

    # Chapter: the text between "Ch. " and " S"
    match = re.search(r'(?<=Ch.\s)\d.*?(?=\sS)', offense)
    MCL.loc[i, 'Chapter'] = match[0] if match else None

    # Section: everything after " S " (may still include a paragraph such as "(a)")
    match = re.search(r'(?<=\sS\s)(\d.*)', offense)
    MCL.loc[i, 'Section'] = match[0] if match else None

# Split a trailing "(...)" off Section into its own Paragraph column
section_paragraph = MCL['Section'].str.split("(", n=1, expand=True)
MCL['Section'] = section_paragraph[0]
MCL['Paragraph'] = '(' + section_paragraph[1]

In [15]:

pd.options.display.max_rows = 999
pd.options.display.max_colwidth = 100
MCL[['Description', 'Chapter', 'Section', 'Paragraph']][:100]


Unnamed: 0,Description,Chapter,Section,Paragraph
0,A&B,,,
1,A&B,265,13A,(a)
2,A&B ON A CORRECTIONS OFFICER,,,
3,A&B ON A PUBLIC SERVANT,,,
4,A&B ON CHILD TO COERCE CRIMINAL CONSPIRACY,265,44,
5,"A&B ON CHILD TO COERCE CRIMINAL CONSPIRACY, SUBSQ. OFF.",265,44,
6,"A&B ON CHILD UNDER 14, BODILY INJURY",265,13J,
7,"A&B ON CHILD UNDER 14, SUBSTANTIAL BODILY INJURY",265,13J,
8,A&B ON CHILD WITH INJURY,,,
9,A&B ON CHILD WITH SUBSTANTIAL INJURY,,,
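The row-by-row parsing above could also be written as a single vectorized call. The sketch below is an alternative, not the notebook's method: it uses `Series.str.extract` with one pattern and named groups, and it assumes every citation has the form `Ch. <chapter> S <section>[(paragraph)]` at the end of the string. The sample strings are taken from the listing above.

```python
import pandas as pd

offenses = pd.Series([
    "A&b",
    "A&B Ch. 265 S 13A(a)",
    "A&B on child to coerce criminal conspiracy, subsq. off. Ch. 265 S 44",
    "A&B on child under 14, bodily injury Ch. 265 S 13J",
])

# Description is always captured; the whole citation group is optional.
pattern = (
    r"^(?P<Description>.+?)"
    r"(?:\s+Ch\.\s+(?P<Chapter>\d\S*)"
    r"\s+S\s+(?P<Section>\d[^(]*)"
    r"(?P<Paragraph>\(.*)?)?$"
)

parsed = offenses.str.extract(pattern)
parsed["Description"] = parsed["Description"].str.upper()
```

Groups that do not participate in a match come back as `NaN`, so charges without a citation keep an empty Chapter, Section, and Paragraph, as in the table above. Any row whose citation deviates from the assumed form would end up entirely in `Description`, so the result should be spot-checked against the loop version.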


In [19]:
# groupby drops rows whose key is NaN, so fill missing paragraphs with a placeholder
MCL['Paragraph'] = MCL['Paragraph'].fillna('x')

In [23]:
x = MCL.groupby(['Chapter', 'Section', 'Paragraph', 'Expungeable']).size().unstack(fill_value=0)
x

Unnamed: 0_level_0,Unnamed: 1_level_0,Expungeable,No,Yes
Chapter,Section,Paragraph,Unnamed: 3_level_1,Unnamed: 4_level_1
119,1,x,0,3
119,39,x,0,1
119,63,x,0,1
12,11J,x,0,2
120,26,x,0,1
127,38B,x,0,1
131,43,x,0,1
131,58,x,0,1
131,66,x,0,1
131,67,x,0,1


In [24]:
x.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 401 entries, ('119', '1', 'x') to ('94C', '40', 'x')
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   No      401 non-null    int64
 1   Yes     401 non-null    int64
dtypes: int64(2)
memory usage: 10.3+ KB


In [22]:
MCL.groupby(['Chapter', 'Section', 'Paragraph', 'Penalty Type']).size().unstack(fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Penalty Type,CIV,Felony,Misdemeanor
Chapter,Section,Paragraph,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
119,1,x,0,0,3
119,39,x,0,0,1
119,63,x,0,0,1
12,11J,x,0,1,1
120,26,x,0,0,1
127,38B,x,0,1,0
131,43,x,0,0,1
131,58,x,0,0,1
131,66,x,0,0,1
131,67,x,0,0,1


In [None]:
MCL.to_csv('MCLtest.csv', index=False)

PCD = pd.read_csv('../data/processed/prosecution_charges_detailed.csv', encoding='cp1252') 
PCD.rename(columns={"Expungeable.":"Expungeable"}, inplace=True)
columns = ['Charge', 'Chapter', 'Section', 'Expungeable']
PCD = PCD[columns]

df = MCL.merge(PCD, on=['Chapter', 'Section'], how='left', indicator=True)
# print explicitly: a bare expression in the middle of a cell is not displayed
print(df['_merge'].value_counts())

df = df[['MCL_Offense', 'Chapter', 'Section', 'Paragraph', 'Expungeable_x', 'Charge',  'Expungeable_y']]
df = df.drop_duplicates()

df.to_csv('mergetest.csv', index=False)

The merge on chapter and section is very coarse. Dawn didn't keep the original charge text from the MCL: the MCL was stripped down to chapter, section, and expungeable, and duplicates were dropped. While many chapter+section pairs contain only expungeable or only non-expungeable offenses, some pairs encompass both.
See `x = MCL.groupby(['Chapter', 'Section', 'Paragraph', 'Expungeable']).size().unstack(fill_value=0)` above.

The same holds for misdemeanor/felony.

We identified problems where the same prosecution_ data row linked to both "yes" and "no" for expungeable. 

Who's to say whether the matches that have only one expungeable value per chapter+section were actually correct? 
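One way to make the ambiguity concrete is to list the chapter+section pairs that contain both expungeable and non-expungeable offenses. The sketch below does this on a toy table; the chapter and section values are placeholders and the `Expungeable` values are made up, so only the technique carries over to the real `MCL` frame.

```python
import pandas as pd

# Toy stand-in for the parsed MCL table (placeholder values, not real statutes)
mcl = pd.DataFrame({
    "Chapter": ["900", "900", "900", "901", "902", "902"],
    "Section": ["1", "1", "2", "3", "4", "4"],
    "Expungeable": ["Yes", "No", "No", "Yes", "Yes", "Yes"],
})

# Count Yes/No offenses per chapter+section
counts = (
    mcl.groupby(["Chapter", "Section", "Expungeable"])
       .size()
       .unstack(fill_value=0)
)

# A pair is ambiguous when it has at least one "Yes" and at least one "No"
ambiguous = counts[(counts["Yes"] > 0) & (counts["No"] > 0)]
```

Any prosecution row that merges onto a pair in `ambiguous` cannot be classified from chapter and section alone.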

## Next steps

validate current mapping:

1. get mapping of unique chapter/section/paragraph/expungeable
1. get mapping of unique chapter/section/paragraph/misdemeanor_felony
1. counts of most common chapter/section -- verify top % manually
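The validation steps above might look roughly like the following. This is a sketch on placeholder data: the column names follow the parsed MCL table, and the 90% cutoff for manual review is an arbitrary assumption.

```python
import pandas as pd

# Placeholder rows standing in for charges with parsed citations
charges = pd.DataFrame({
    "Chapter": ["900", "900", "900", "901", "901", "902"],
    "Section": ["1", "1", "1", "2", "2", "3"],
    "Paragraph": ["x"] * 6,
    "Expungeable": ["Yes", "Yes", "No", "Yes", "Yes", "No"],
})

# Unique chapter/section/paragraph/expungeable combinations
unique_map = charges.drop_duplicates().reset_index(drop=True)

# Frequency of each chapter/section, most common first
freq = (
    charges.groupby(["Chapter", "Section"])
           .size()
           .sort_values(ascending=False)
           .rename("n")
           .to_frame()
)

# Cumulative share of rows covered, used to pick the pairs to verify by hand
freq["cum_share"] = freq["n"].cumsum() / freq["n"].sum()
to_verify = freq[freq["cum_share"] <= 0.9]
```

The same pattern applies to the misdemeanor/felony mapping by swapping `Expungeable` for `Penalty Type`.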

try new methods:

1. see how the new complaint file stands up -- what is the match rate of this to the prosecution data, matching by string?
1. try to build a crosswalk of chapters and sections that are not expungeable, building directly from the 100J statutes
1. try chapter/section/paragraph/extra_criteria
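For the string-matching idea in the first item, a match rate could be computed along these lines. The two frames and their `Charge` column are hypothetical stand-ins, since the layout of the new complaint file is not shown here, and the normalization (uppercase, collapsed whitespace) is only one possible choice.

```python
import pandas as pd

def normalize(text):
    """Uppercase, trim, and collapse internal whitespace."""
    return " ".join(str(text).upper().split())

# Hypothetical stand-ins for the complaint file and the prosecution data
complaint = pd.DataFrame({
    "Charge": ["A&b", "A&b on a public servant", "A&b on child with injury"],
})
prosecution = pd.DataFrame({
    "Charge": ["A&B", "A&B ON A  PUBLIC SERVANT ", "A&B ON A CORRECTIONS OFFICER", "A&B"],
})

# A prosecution row matches if its normalized text appears in the complaint file
known = set(complaint["Charge"].map(normalize))
prosecution["matched"] = prosecution["Charge"].map(normalize).isin(known)

# Share of prosecution rows with an exact match after normalization
match_rate = prosecution["matched"].mean()
```

Rows left unmatched are the ones that would need fuzzy matching or a manual crosswalk.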

or, stick with the mapping created in MA_Data-2_MergeCharges_alt, and just build from there

1. merge in misdemeanor/felony, merging on the text description