# Clean Slate: Labelling Nonexpungeable Sex Offenses

> Prepared by Joel Tennyson for Code for Boston's [Clean Slate Project](https://github.com/codeforboston/clean-slate)

# Purpose
Whether or not a charge relates to a non-expungeable sex offense is deeply relevant to parts of the data analysis we are performing to answer the questions posed by Citizens for Juvenile Justice. 

Previous efforts to label these sex offenses in our data have relied solely on keyword matching. This notebook explores the impact of a new labelling strategy that incorporates offense chapter and section data as well as keyword matching.

**This notebook is used to apply the new sex offense labelling strategy to the merged_nw.csv, merged_suff.csv, and merged_ms.csv files.** Only values in the 'sex' column are updated; the files are otherwise unchanged.

Mass general laws chapter 276 article 100J indicates chapter/section pairings that are likely (but not certain) to indicate nonexpungeable sex offenses in subsections 6, 7, 8, and 10. The associated chapter and section pairings are found in [this crosswalk document.](https://docs.google.com/spreadsheets/d/1eM7lXkUKruWl9cRg20vtXEfoD47CD2KGPPPMg_0AoH4/edit#gid=1023772964)


---

# Import Libraries and Fetch Data

Pandas dataframes are created from the Northwest, Suffolk, and Middlesex datasets. A listing of chapter/section pairings is created from the sex offense crosswalk document.

In [1]:
import requests
import pandas as pd
import os

In [2]:
# Create dataframes for Northwest, Suffolk, and Middlesex
nw = pd.read_csv('../../data/processed/merged_nw.csv', encoding='cp1252')
sf = pd.read_csv('../../data/processed/merged_suff.csv', encoding='cp1252')
ms = pd.read_csv('../../data/processed/merged_ms.csv', encoding='cp1252')
pd.set_option("display.max.columns", None)
print("Dataframes created for Northwest, Suffolk, and Middlesex")

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Dataframes created for Northwest, Suffolk, and Middlesex


  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [3]:
# Create the list of chaper/section pairings related to nonexpungable sex offenses
# Each nonexpungeable sex offense has three values: chapter, section, and boolean for 'only on repeat offense'
download_url = 'https://docs.google.com/spreadsheets/d/1eM7lXkUKruWl9cRg20vtXEfoD47CD2KGPPPMg_0AoH4/gviz/tq?usp=sharing&tqx=out:csv&sheet={Crosswalk}'
target_csv_path = 'so_crosswalk.csv'

response = requests.get(download_url)
response.raise_for_status()
with open(target_csv_path, 'wb') as f:
    f.write(response.content)
# Create dataframe from the crosswalk, then a list containing that data
so = pd.read_csv('so_crosswalk.csv')
sex_offenses = list(so.to_records(index=False))
print('Sex Offenses Crosswalk List:\n', sex_offenses)

# Delete the downloaded crosswalk file from the local directory
os.remove('so_crosswalk.csv')

Sex Offenses Crosswalk List:
 [(265, '13B', 0), (265, '13B1/2', 0), (265, '13B3/4', 1), (265, '13F', 0), (265, '13H', 0), (265, '22', 0), (265, '22A', 0), (265, '22B', 0), (265, '22C', 1), (265, '23', 0), (265, '23A', 0), (265, '23B', 1), (265, '24', 0), (265, '24B', 0), (265, '26', 0), (265, '26C', 0), (265, '26D', 0), (265, '50', 0), (265, '52', 1), (272, '2', 0), (272, '3', 0), (272, '4A', 0), (272, '4B', 0), (272, '16', 0), (272, '17', 0), (272, '28', 0), (272, '29A', 0), (272, '29B', 0), (272, '29C', 0), (272, '30D', 0), (272, '35A', 0), (272, '53', 0), (272, '77C', 0), (277, '39', 0)]


# Determine Offenses With Chapter/Section Pairings Matching the Sex Offense Crosswalk

The chapters and sections of the offenses identified as matching the sex offense crosswalk are listed for manual inspection.

In [4]:
# To help determine which keywords will cover all subsequent charges (i.e. 'subsequent', 'second', etc)
# display list of unique charges associated with crosswalk entrys that are only sex offenses on a subsequent offense
all_datasets = pd.concat([sf, nw, ms])
only_subsequents = all_datasets.iloc[0:0]
for listing in sex_offenses:
  if listing[2] == 1:
    x = all_datasets.loc[
    (all_datasets['Chapter'].notna()) &
    (all_datasets['Chapter'].str.contains(str(listing[0]), case=False)) &
    (all_datasets['Section'].notna()) &
    (all_datasets['Section'].str.contains(str(listing[1]), case=False))
    ]
    only_subsequents = pd.concat([only_subsequents, x]) 

only_subsequents['Charge'].unique()

array(['RAPE OF CHILD WITH FORCE, SUBSQ.OFF. c265 Â§22A',
       'RAPE OF CHILD, STATUTORY, AFTER CERTAIN OFFENSES c265 Â§23B'],
      dtype=object)

In [5]:
# This list is determined by manual inspection of the above, and should be revisited if the underlying data changes significantly
subsequent_keywords = ['SUBSQ.OFF.', 'AFTER CERTAIN OFFENSES']

In [6]:
# This function takes in a dataframe and a crosswalk entry, and returns a slice of that dataframe containing charges that match the crosswalk entry
def match_crosswalk(df, offense):
  return df.loc[
    (df['Chapter'].notna()) &
    (df['Chapter'].str.contains(str(offense[0]), case=False)) &
    (df['Section'].notna()) &
    (df['Section'].str.contains(str(offense[1]), case=False)) &
    (offense[2] == 0 | df['Charge'].str.contains('|'.join(subsequent_keywords), case=False))
  ]

In [7]:
# This function takes in a dataframe and a crosswalk entry, and changes the value of the 'sex' column to 1 wherever a charge matches the crosswalk entry
# and that charge is a sex offense (based on the sex offense keyword list)
def update_sex(df, offense):
  df.loc[
    (df['Chapter'].notna()) &
    (df['Chapter'].str.contains(str(offense[0]), case=False)) &
    (df['Section'].notna()) &
    (df['Section'].str.contains(str(offense[1]), case=False)) &
    (df['Charge'].str.contains('|'.join(so_keywords), case=False)) &
    (offense[2] == 0 | df['Charge'].str.contains('|'.join(subsequent_keywords), case=False)),
    'sex'
  ] = 1

In [8]:
# Create so_sf, a dataframe containing all Suffolk charges that match the chapter/section pairs in the sex offenses list
so_sf = sf.iloc[0:0]

for listing in sex_offenses:
  x = match_crosswalk(sf, listing)
  so_sf = pd.concat([so_sf, x])

so_sf.groupby(['Chapter', 'Section']).size()

Chapter  Section
265      13B          797
         13F           54
         13H          891
         13H/C         11
         22(a)        203
         22(b)        398
         22A          208
         23           456
         23A/A        308
         24           114
         24B           60
         26           672
         26A           28
         26B            1
         50           126
272      12             5
         16           626
         17            16
         2              5
         24            16
         28           176
         29            41
         29A(a)       128
         29A(b)         8
         29B(a)       158
         29B(b)        34
         29C          390
         3              3
         34             4
         35             2
         35A            4
         38             1
         42             6
         42A            1
         43            33
         43A            7
         4A             9
         4B          

In [9]:
# Create so_nw, a dataframe containing all Northwest charges that match the chapter/section pairs in the sex offenses list
so_nw = nw.iloc[0:0]

for listing in sex_offenses:
  x = match_crosswalk(nw, listing)
  so_nw = pd.concat([so_nw, x])

so_nw.groupby(['Chapter', 'Section']).size()

Chapter  Section        
265      13B                 258
         13F                  17
         13H                 168
         22(a)                15
         22(b)                77
         22A                 112
         22B                   8
         23                   62
         23A(a)              148
         23A(b)               42
         24                   10
         24B                  12
         26                   82
         26A                   4
         50(a)                19
265,     22(b)                 1
         23                    1
         23A                   2
272      16                   64
         17                    5
         24                   11
         28                   82
         29                   14
         29A(a)               66
         29A(b)                8
         29B(a)               74
         29B(b)               18
         29C                 292
         3                     2
         35       

In [10]:
# Create so_ms, a dataframe containing all Northwest charges that match the chapter/section pairs in the sex offenses list
so_ms = ms.iloc[0:0]

for listing in sex_offenses:
  x = match_crosswalk(ms, listing)
  so_ms = pd.concat([so_ms, x])

so_ms.groupby(['Chapter', 'Section']).size()

Chapter  Section
265      13B          421
         13B12          4
         13F           56
         13H            8
         22           370
         22B           52
         22C           15
         23            77
         23A           66
         23B            5
         24            92
         24B           58
         26           520
         26A           26
         26C           72
         26D           10
         50           108
272      12            20
         16           462
         17             1
         2              3
         24             8
         28           114
         29            26
         29A           58
         29B           88
         29C          416
         3              6
         34             3
         35            23
         42            24
         42A            1
         43            11
         4A             1
         4B             1
         53         13312
         53A          788
         73          

# Determine Sex Offense Related Keywords
Some of the chapter/section pairings in the sex offense crosswalk can pertain to nonsexual offenses as well.

A careful manual review of the list of unique incidents that match the chapter/section pairings from the sex offense crosswalk sheet is used to determine a list of keywords that cover all of the sex-related offenses, and none of the non-sex related offenses.

Because this is a manual review, significant changes to the underlying data will require the keyword list to be revisited.

In [11]:
so_keywords = ['INDECENT', 'RAPE', 'SEX', 'LEWDNESS', 'OBSCENE', 'PORNOGRAPHY', 'PROSTITUTION', 'LEWD', 'NUDE', 'NIGHTWALKER', 'STREETWALKER', 'SODOMY', 'INCEST', 'UNNATURAL', 'BESTIALITY']

In [12]:
# Display all unique charges from all three datasets that match the chapter/section pairs in the sex offenses list
print('ALL OFFENSES MATCHING THE SEX OFFENSE CROSSWALK (SUFFOLK, NORTHWEST, MIDDLESEX):\n')
so_all = pd.concat([so_sf, so_nw, so_ms])
so_all['Charge'].unique()

ALL OFFENSES MATCHING THE SEX OFFENSE CROSSWALK (SUFFOLK, NORTHWEST, MIDDLESEX):



array(['INDECENT A&B ON CHILD UNDER 14 c265 Ã‚Â§13B',
       'INDECENT A&B ON CHILD UNDER 14, SUBSQ. c265 Ã‚Â§13B',
       'A&B ON RETARDED PERSON c265 Ã‚Â§13F',
       'INDECENT A&B ON RETARDED PERSON c265 Ã‚Â§13F',
       'INDECENT A&B ON PERSON 14 OR OVER c265 Ã‚Â§13H',
       'INDECENT A&B ON DISABLED PERSON OVER 60 c265 Ã‚Â§13H/C',
       'INDECENT A&B ON PERSON 14 OR OVER c. 265 s. 13H',
       'RAPE OF CHILD WITH FORCE c265 Ã‚Â§22A',
       'RAPE, AGGRAVATED c265 Ã‚Â§22(a)', 'RAPE c265 Ã‚Â§22(b)',
       'RAPE, AGGRAVATED FIREARM-ARMED c265 Ã‚Â§22(a)',
       'RAPE c. 265 s. 22(b)', 'RAPE, FIREARM-ARMED c265 Ã‚Â§22(b)',
       'RAPE OF CHILD WITH FORCE c. 265 s. 22A',
       'RAPE, AGGRAVATED c. 265 s. 22(a)',
       'RAPE OF CHILD, STATUTORY c265 Ã‚Â§23',
       'RAPE OF CHILD, AGGRAVATED, TEN YEAR AGE DIFF c265 Ã‚Â§23',
       'RAPE OF CHILD, STATUTORY AGGRAVATED c265 Ã‚Â§23A/A',
       'RAPE OF CHILD, STATUTORY c. 265 s. 23',
       'ASSAULT TO RAPE c265 Ã‚Â§24',
       'ASSA

In [13]:
# Display the offenses that contain at least one of the chosen sex offense keywords (all of these should be sex offenses)
print('SEX OFFENSES MATCHING THE SEX OFFENSE CROSSWALK (SUFFOLK, NORTHWEST, MIDDLESEX):\n')
so_all.loc[
  (so_all['Charge'].str.contains('|'.join(so_keywords), case=False))
]['Charge'].unique()

SEX OFFENSES MATCHING THE SEX OFFENSE CROSSWALK (SUFFOLK, NORTHWEST, MIDDLESEX):



array(['INDECENT A&B ON CHILD UNDER 14 c265 Ã‚Â§13B',
       'INDECENT A&B ON CHILD UNDER 14, SUBSQ. c265 Ã‚Â§13B',
       'INDECENT A&B ON RETARDED PERSON c265 Ã‚Â§13F',
       'INDECENT A&B ON PERSON 14 OR OVER c265 Ã‚Â§13H',
       'INDECENT A&B ON DISABLED PERSON OVER 60 c265 Ã‚Â§13H/C',
       'INDECENT A&B ON PERSON 14 OR OVER c. 265 s. 13H',
       'RAPE OF CHILD WITH FORCE c265 Ã‚Â§22A',
       'RAPE, AGGRAVATED c265 Ã‚Â§22(a)', 'RAPE c265 Ã‚Â§22(b)',
       'RAPE, AGGRAVATED FIREARM-ARMED c265 Ã‚Â§22(a)',
       'RAPE c. 265 s. 22(b)', 'RAPE, FIREARM-ARMED c265 Ã‚Â§22(b)',
       'RAPE OF CHILD WITH FORCE c. 265 s. 22A',
       'RAPE, AGGRAVATED c. 265 s. 22(a)',
       'RAPE OF CHILD, STATUTORY c265 Ã‚Â§23',
       'RAPE OF CHILD, AGGRAVATED, TEN YEAR AGE DIFF c265 Ã‚Â§23',
       'RAPE OF CHILD, STATUTORY AGGRAVATED c265 Ã‚Â§23A/A',
       'RAPE OF CHILD, STATUTORY c. 265 s. 23',
       'ASSAULT TO RAPE c265 Ã‚Â§24',
       'ASSAULT TO RAPE CHILD c265 Ã‚Â§24B',
       'ASSAU

In [14]:
# Display the offenses that contain none of the chosen sex offense keywords (none of these should be sex offenses)
print('NONSEXUAL OFFENSES MATCHING THE SEX OFFENSE CROSSWALK (SUFFOLK, NORTHWEST, MIDDLESEX):\n')
so_all.loc[
  (~so_all['Charge'].str.contains('|'.join(so_keywords), case=False))
]['Charge'].unique()

NONSEXUAL OFFENSES MATCHING THE SEX OFFENSE CROSSWALK (SUFFOLK, NORTHWEST, MIDDLESEX):



array(['A&B ON RETARDED PERSON c265 Ã‚Â§13F', 'KIDNAPPING c265 Ã‚Â§26',
       'KIDNAPPING MINOR BY RELATIVE c265 Ã‚Â§26A',
       'KIDNAPPING, FIREARM-ARMED c265 Ã‚Â§26',
       'KIDNAPPING WITH SERIOUS BODILY INJURY, ARMED c265 Ã‚Â§26',
       'KIDNAPPING OF CHILD c265 Ã‚Â§26',
       'KIDNAPPING & ENDANGER MINOR BY RELATIVE c265 Ã‚Â§26A',
       'KIDNAPPING PERSON IN CUSTODY c265 Ã‚Â§26A',
       'KIDNAPPING FOR EXTORTION, FIREARM-ARMED c265 Ã‚Â§26',
       'DRUG TO CONFINE c265 Ã‚Â§26B',
       'FUNERAL PROCESSION, DISTURB c272 Ã‚Â§42',
       'FUNERAL SERVICE, DISTURB c272 Ã‚Â§42A',
       'DISORDERLY CONDUCT c272 Ã‚Â§53',
       'DISTURBING THE PEACE c272 Ã‚Â§53',
       'NOISY & DISORDERLY HOUSE, KEEP c272 Ã‚Â§53',
       'DISORDERLY CONDUCT ON PUB CONVEY,3RD OFF c272 Ã‚Â§43',
       'DISORDERLY CONDUCT ON PUBLIC CONVEYANCE c272 Ã‚Â§43',
       'SMOKING ON MBTA c272 Ã‚Â§43A',
       'DISORDERLY CONDUCT, SUBSQ. OFF. c272 Ã‚Â§53',
       'DISTURBING THE PEACE, SUBSQ. OFF. c272 Ã‚Â

# Comparison of Previous Sex Offense Labeling Strategy with New Strategy

The Suffolk, Northwest, and Middlesex datasets already contain a boolean column 'sex' to mark nonexpungeable sex offenses. This column was derived using keywords, but without factoring in chapter and section data.

In [15]:
# Create copies of the Suffolk, Northwest, and Middlesex datasets which will have updated 'sex' columns
sf_new = sf.copy()
nw_new = nw.copy()
ms_new = ms.copy()
sf_new.rename(columns={'sex': 'sex_old'}, inplace=True)
nw_new.rename(columns={'sex': 'sex_old'}, inplace=True)
ms_new.rename(columns={'sex': 'sex_old'}, inplace=True)
sf_new.insert(loc=sf_new.columns.get_loc("sex_old"), column='sex', value=0)
nw_new.insert(loc=nw_new.columns.get_loc("sex_old"), column='sex', value=0)
ms_new.insert(loc=ms_new.columns.get_loc("sex_old"), column='sex', value=0)

In [16]:
# Use the sex offenses crosswalk and keyword list to mark nonexpungeable sex offenses
for listing in sex_offenses:
  update_sex(sf_new, listing)
  update_sex(nw_new, listing)
  update_sex(ms_new, listing)

**Manually Address Invalid False Positives**

Some nonexpungeable sex offenses have not been automatically matched with the sex offense crosswalk because of errorneous or omitted chapter and section data. Here, those offenses are manually labeled as nonexpungeable sex offenses. 

Additionally, though charges related to a sex offender failing to register or provide information do not have chapter/section pairs matching entries in the sex offenses crosswalk, it seems certain that all individuals with such a charge must also have at least one charge this is considered a nonexpungeable sex offense. The Clean Slate data team has determined that these sex offender registration charges should be considered as nonexpungeable sex offenses for all data analysis purposes.

In [17]:
invalid_false_positives = ['SEX OFFENDER', 'LEWDNESS, OPEN AND GROSS', 'RAPE', 'SEXUAL INTERCOURSE, INDUCE CHASTE MINOR', 'INDECENT A&B', 'PHOTOGRAPH SEXUAL OR INTIMATE PARTS OF CHILD']

In [18]:
# Display charges that will be manually marked as nonexpungeable sex offenses
x = sf_new.loc[
  (sf_new['sex'] == 0) &
  (sf_new['Charge'].str.contains('|'.join(invalid_false_positives), case=False))
]
print('Suffolk charges to be manually marked as nonexpungeable sex offenses:\n', x.groupby(['Charge']).size().sort_values(ascending=False))

x = nw_new.loc[
  (nw_new['sex'] == 0) &
  (nw_new['Charge'].str.contains('|'.join(invalid_false_positives), case=False))
]
print('\nNorthwest charges to be manually marked as nonexpungeable sex offenses:\n', x.groupby(['Charge']).size().sort_values(ascending=False))

x = ms_new.loc[
  (ms_new['sex'] == 0) &
  (ms_new['Charge'].str.contains('|'.join(invalid_false_positives), case=False))
]
print('\nMiddlesex charges to be manually marked as nonexpungeable sex offenses:\n', x.groupby(['Charge']).size().sort_values(ascending=False))

Suffolk charges to be manually marked as nonexpungeable sex offenses:
 Charge
SEX OFFENDER FAIL TO REGISTER c6 Ã‚Â§178H(a)-(c)                                1074
SEX OFFENDER FAIL TO REGISTER, SUBSQ.OFF. c6 Ã‚Â§178H(a)                         282
SEX OFFENDER FAIL TO REGISTER, LEVEL 2 or 3 c6 Ã‚Â§178H(a)(1)                    233
SEX OFFENDER FAIL TO REGISTER, HOMELESS, 2ND OFF. c6 Ã‚Â§178H(c)                  50
SEX OFFENDER FAIL TO REGISTER, HOMELESS, 3RD OFF. c6 Ã‚Â§178H(c)                  49
LEWDNESS, OPEN AND GROSS, SUBSQ.OFF. c272 Ã‚Â§16                                  21
SEX OFFENDER FAIL TO REGISTER, SUBSQ.OFF. LEVEL 2 OR 3 c6 Ã‚Â§178H(a)             20
RAPE OF CHILD, STATUTORY, SUBSQ.OFF. c265 Ã‚Â§23                                  16
RAPE, SUBSQ.OFF. c265 Ã‚Â§22(b)                                                   15
RAPE, AGGRAVATED, SUBSQ.OFF. c265 Ã‚Â§22(a)                                        8
INDECENT A&B ON PERSON 14 OR OVER AFTER CERTAIN OFFENSES c265 Ã‚Â§13H   

In [19]:
# Mark the above charges as nonexpungeable sex offenses
sf_new.loc[
  (sf_new['sex'] == 0) &
  (sf_new['Charge'].str.contains('|'.join(invalid_false_positives), case=False)),
  'sex'
] = 1

nw_new.loc[
  (nw_new['sex'] == 0) &
  (nw_new['Charge'].str.contains('|'.join(invalid_false_positives), case=False)),
  'sex'
] = 1

ms_new.loc[
  (ms_new['sex'] == 0) &
  (ms_new['Charge'].str.contains('|'.join(invalid_false_positives), case=False)),
  'sex'
] = 1

In [20]:
# Compare sex offense labelling strategies for Suffolk
missing_sex_offenses = sf_new.loc[
  (sf_new['sex'] == 1) &
  (sf_new['sex_old'] == 0)
]

false_positives = sf_new.loc[
  (sf_new['sex'] == 0) &
  (sf_new['sex_old'] == 1)
]

strategy_matches_so = sf_new.loc[
  (sf_new['sex'] == 1) &
  (sf_new['sex_old'] == 1)
]

strategy_matches_not_so = sf_new.loc[
  (sf_new['sex'] == 0) &
  (sf_new['sex_old'] == 0)
]

previous_so_count = len(sf_new.loc[(sf_new['sex_old'] == 1)])
new_so_count = len(sf_new.loc[(sf_new['sex'] == 1)])

print('In Suffolk, the previous labelling strategy found', previous_so_count, 'sex offenses. The new strategy found', new_so_count, 'sex offenses.')
print('Both labelling strategies agree on', len(strategy_matches_so), 'offenses marked as sex offenses, and', len(strategy_matches_not_so), 'offenses not marked as sex offenses.')
print('The new strategy determined that there are', len(missing_sex_offenses), 'offenses not previously marked as sex offenses that should be (missed sex offenses).')
print('The new strategy determined that there are', len(false_positives), 'offenses previously marked as sex offenses that should not be (false positives).')

print('\nSuffolk list of unique charges among the missed sex offenses:\n')
print(missing_sex_offenses.groupby(['Charge']).size().sort_values(ascending=False))

print('\nSuffolk list of unique charges among the false positives:\n')
#print(false_positives['Charge'].unique())
print(false_positives.groupby(['Charge']).size().sort_values(ascending=False))

In Suffolk, the previous labelling strategy found 5771 sex offenses. The new strategy found 7652 sex offenses.
Both labelling strategies agree on 5765 offenses marked as sex offenses, and 295612 offenses not marked as sex offenses.
The new strategy determined that there are 1887 offenses not previously marked as sex offenses that should be (missed sex offenses).
The new strategy determined that there are 6 offenses previously marked as sex offenses that should not be (false positives).

Suffolk list of unique charges among the missed sex offenses:

Charge
INDECENT A&B ON PERSON 14 OR OVER c265 Ã‚Â§13H                           890
INDECENT A&B ON CHILD UNDER 14 c265 Ã‚Â§13B                              784
NIGHTWALKER, COMMON c272 Ã‚Â§53                                           75
OBSCENE MATTER, DISTRIBUTE c272 Ã‚Â§29                                    41
INDECENT A&B ON RETARDED PERSON c265 Ã‚Â§13F                              21
PROSTITUTION, KEEP HOUSE OF c272 Ã‚Â§24              

In [21]:
# Compare sex offense labelling strategies for Northwest
missing_sex_offenses = nw_new.loc[
  (nw_new['sex'] == 1) &
  (nw_new['sex_old'] == 0)
]

false_positives = nw_new.loc[
  (nw_new['sex'] == 0) &
  (nw_new['sex_old'] == 1)
]

strategy_matches_so = nw_new.loc[
  (nw_new['sex'] == 1) &
  (nw_new['sex_old'] == 1)
]

strategy_matches_not_so = nw_new.loc[
  (nw_new['sex'] == 0) &
  (nw_new['sex_old'] == 0)
]

previous_so_count = len(nw_new.loc[(nw_new['sex_old'] == 1)])
new_so_count = len(nw_new.loc[(nw_new['sex'] == 1)])

print('In Northwest, the previous labelling strategy found', previous_so_count, 'sex offenses. The new strategy found', new_so_count, 'sex offenses.')
print('Both labelling strategies agree on', len(strategy_matches_so), 'offenses marked as sex offenses, and', len(strategy_matches_not_so), 'offenses not marked as sex offenses.')
print('The new strategy determined that there are', len(missing_sex_offenses), 'offenses not previously marked as sex offenses that should be (missed sex offenses).')
print('The new strategy determined that there are', len(false_positives), 'offenses previously marked as sex offenses that should not be (false positives).')

print('\nNorthwest list of unique charges among the missed sex offenses:\n')
print(missing_sex_offenses.groupby(['Charge']).size().sort_values(ascending=False))

print('\nNorthwest list of unique charges among the false positives:\n')
#print(false_positives['Charge'].unique())
print(false_positives.groupby(['Charge']).size().sort_values(ascending=False))

In Northwest, the previous labelling strategy found 1092 sex offenses. The new strategy found 1566 sex offenses.
Both labelling strategies agree on 1089 offenses marked as sex offenses, and 74156 offenses not marked as sex offenses.
The new strategy determined that there are 477 offenses not previously marked as sex offenses that should be (missed sex offenses).
The new strategy determined that there are 3 offenses previously marked as sex offenses that should not be (false positives).

Northwest list of unique charges among the missed sex offenses:

Charge
INDECENT A&B ON CHILD UNDER 14 c265 Ã‚Â§13B                               258
INDECENT A&B ON PERSON 14 OR OVER c265 Ã‚Â§13H                            167
OBSCENE MATTER, DISTRIBUTE c272 Ã‚Â§29                                     14
INDECENT A&B ON A PERSON WITH AN INTELLECTUAL DISABILITY  c265 Ã‚Â§13F     12
PROSTITUTION, KEEP HOUSE OF c272 Ã‚Â§24                                    11
Fail to Register as Sex Offender, Level 2 or 3

In [22]:
# Compare sex offense labelling strategies for Middlesex
missing_sex_offenses = ms_new.loc[
  (ms_new['sex'] == 1) &
  (ms_new['sex_old'] == 0)
]

false_positives = ms_new.loc[
  (ms_new['sex'] == 0) &
  (ms_new['sex_old'] == 1)
]

strategy_matches_so = ms_new.loc[
  (ms_new['sex'] == 1) &
  (ms_new['sex_old'] == 1)
]

strategy_matches_not_so = ms_new.loc[
  (ms_new['sex'] == 0) &
  (ms_new['sex_old'] == 0)
]

previous_so_count = len(ms_new.loc[(ms_new['sex_old'] == 1)])
new_so_count = len(ms_new.loc[(ms_new['sex'] == 1)])

print('In Middlesex, the previous labelling strategy found', previous_so_count, 'sex offenses. The new strategy found', new_so_count, 'sex offenses.')
print('Both labelling strategies agree on', len(strategy_matches_so), 'offenses marked as sex offenses, and', len(strategy_matches_not_so), 'offenses not marked as sex offenses.')
print('The new strategy determined that there are', len(missing_sex_offenses), 'offenses not previously marked as sex offenses that should be (missed sex offenses).')
print('The new strategy determined that there are', len(false_positives), 'offenses previously marked as sex offenses that should not be (false positives).')

print('\nMiddlesex list of unique charges among the missed sex offenses:\n')
print(missing_sex_offenses.groupby(['Charge']).size().sort_values(ascending=False))

print('\nMiddlesex list of unique charges among the false positives:\n')
#print(false_positives['Charge'].unique())
print(false_positives.groupby(['Charge']).size().sort_values(ascending=False))

In Middlesex, the previous labelling strategy found 3059 sex offenses. The new strategy found 4884 sex offenses.
Both labelling strategies agree on 3059 offenses marked as sex offenses, and 355960 offenses not marked as sex offenses.
The new strategy determined that there are 1317 offenses not previously marked as sex offenses that should be (missed sex offenses).
The new strategy determined that there are 0 offenses previously marked as sex offenses that should not be (false positives).

Middlesex list of unique charges among the missed sex offenses:

Charge
INDECENT A&B ON PERSON 14 OR OVER c265 Â§13H          785
INDECENT A&B ON CHILD UNDER 14 c265 Â§13B             402
NIGHTWALKER, COMMON c272 Â§53                          30
OBSCENE MATTER, DISTRIBUTE c272 Â§29                   25
UNNATURAL ACT c272 Â§35                                23
PROSTITUTION, PROCURE PERSON TO PRACTICE c272 Â§12     20
INDECENT A&B ON CHILD UNDER 14, SUBSQ. c265 Â§13B      19
PROSTITUTION, KEEP HOUSE OF 

# Apply the New Labelling Strategy to the Data Spreadsheets

The 'sex' column will be updated in these three files: **merged_suff.csv**, **merged_nw.csv**, and **merged_ms.csv**.

No other columns will be altered.

In [23]:
# Drop the uneeded 'sex_old' column from the dataframes
sf_output = sf_new.drop(['sex_old'], axis=1)
nw_output = nw_new.drop(['sex_old'], axis=1)
ms_output = ms_new.drop(['sex_old'], axis=1)

# Save the updated dataframes as csv files, overwriting them in the processed data folder
sf_file = sf_output.to_csv('../../data/processed/merged_suff.csv', index=False)
nw_file = nw_output.to_csv('../../data/processed/merged_nw.csv', index=False)
ms_file = ms_output.to_csv('../../data/processed/merged_ms.csv', index=False)