# Data Cleaning and Formatting of Massachusetts Sex Offender Data

Apurva Raman

In [1]:
import pandas as pd
import numpy as np
from nameparser import HumanName

All data is from the [Massachusetts Sex Offender Registry Board search](https://sorb.chs.state.ma.us/sorbpublic/standardSearchforSexOffenders.action?_p=Kp-9CXg7hgePSFQg2AdJa_qkaBWD68EinvdMMRYNNu4), up to date as of September 29, 2018.


Note that the data here covers only sex offenders registered since July 12, 2013; for earlier registrations, you apparently need to go to your local police station.

There is no comprehensive list of all the sex offenders in the state. However, there is a county search. I attempted to scrape the data using requests and Beautiful Soup, but the table is populated by JavaScript, so the data is not in the HTML that requests receives. When I tried [Selenium](https://selenium-python.readthedocs.io/) to get the data, I was presented with a captcha, which prevented me from proceeding further (I was a robot).
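To illustrate why a plain HTML parse comes back empty in this situation, here is a minimal sketch using only the standard library. The markup in `STATIC_HTML` is hypothetical (it is not copied from the registry site), and `CellCounter` is a helper name made up for this example. It only shows the general symptom: the table element exists, but it contains no cells until a script runs in a browser.

```python
from html.parser import HTMLParser

# Hypothetical static HTML: the table shell is present, but its rows
# would only be added by a script at runtime, so no <td> cells exist yet.
STATIC_HTML = """
<html><body>
  <table id="results"><tbody></tbody></table>
  <script>/* rows are added here at runtime */</script>
</body></html>
"""


class CellCounter(HTMLParser):
    """Count the <table> and <td> start tags seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
        elif tag == "td":
            self.cells += 1


counter = CellCounter()
counter.feed(STATIC_HTML)
print(counter.tables, counter.cells)  # 1 0
```

A table count of 1 with a cell count of 0 is a hint that the rows are filled in client-side.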

I ended up collecting the data manually: I searched each county and exported the results as a .csv using the [Scraper Chrome extension](https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en).

I then construct a list of the file names (there's a .csv for each county).

In [2]:
counties = ["barnstable", "berkshire", "bristol", "dukes", "essex", "franklin", "hampden", "hampshire", "middlesex", "nantucket", "norfolk", "plymouth", "suffolk", "worchester"]
file_string = "sex offenders massachusetts - "
files = [file_string + county + ".csv" for county in counties]

In [3]:
print(files)

['sex offenders massachusetts - barnstable.csv', 'sex offenders massachusetts - berkshire.csv', 'sex offenders massachusetts - bristol.csv', 'sex offenders massachusetts - dukes.csv', 'sex offenders massachusetts - essex.csv', 'sex offenders massachusetts - franklin.csv', 'sex offenders massachusetts - hampden.csv', 'sex offenders massachusetts - hampshire.csv', 'sex offenders massachusetts - middlesex.csv', 'sex offenders massachusetts - nantucket.csv', 'sex offenders massachusetts - norfolk.csv', 'sex offenders massachusetts - plymouth.csv', 'sex offenders massachusetts - suffolk.csv', 'sex offenders massachusetts - worchester.csv']
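Because the file names are assembled from strings, each one has to match the file on disk exactly, spelling included. A quick existence check before reading can catch a mismatch early. This is a small sketch; `missing_files` and the folder name are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def missing_files(filenames, folder="."):
    """Return the names in `filenames` that are not files under `folder`."""
    base = Path(folder)
    return [name for name in filenames if not (base / name).is_file()]


# Hypothetical usage with two of the county names, checked against a
# folder that does not exist, so both names are reported as missing.
example_files = ["sex offenders massachusetts - " + c + ".csv" for c in ["dukes", "essex"]]
print(missing_files(example_files, folder="no_such_folder"))
```

An empty list from this check means every expected .csv is present.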


Then I read the files into a list of dataframes and concatenate the dataframes in the list.

In [4]:
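dfs = []
for f in files:
    dfs.append(pd.read_csv(f, header=None))  # The files have no header row

print(dfs[0])  # Preview the first county once, after the loop

# ignore_index=True gives the combined dataframe a fresh 0..n-1 index
mass_data = pd.concat(dfs, ignore_index=True)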


                              0
0               ADAMS, ROBERT H
1               ADAMS, ROBERT H
2              ANDRADE, PHILLIP
3                ANTONE, STEVEN
4              BATTLES, EVERETT
5              BATTLES, EVERETT
6               BELOIN, PETER R
7               BELOIN, PETER R
8            BLOWERS, PHILLIP A
9              BOESSE, EDWARD D
10             BOESSE, EDWARD D
11               BROWN, TYLER T
12               BROWN, TYLER T
13         CALLIONTZIS, PETER C
14             CANAVIN, PETER A
15             CANAVIN, PETER A
16             CANAVIN, PETER A
17   CASANOVA, CHRISTOPHER JOHN
18   CASANOVA, CHRISTOPHER JOHN
19             CASSIDY, SCOTT C
20             CATERINO, ERIC W
21      CHESTNUT, BRIAN JEFFREY
22           CHRISTIAN, PETER A
23             COBB, DAVID OWEN
24             COBB, DAVID OWEN
25               COLON, ALBERTO
26            DAMIANI, ROBERT T
27            DAMIANI, ROBERT T
28         DEANE, PETER CHARLES
29         DEANE, PETER CHARLES
..      

We can see that the data is last name first, all caps, and alphabetized by last name. There are duplicates of the names as well; each sex offender appears once for each of their addresses. I remove the duplicates (although there is a chance I am removing two distinct people with the exact same name, I ignore this for convenience). 

In [5]:
mass_data = mass_data.drop_duplicates()
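As a rough illustration of what this step does, the sketch below concatenates two tiny made-up frames and counts the repeats before dropping them. The names are placeholders, not registry data, and the variable names are only for this example.

```python
import pandas as pd

# Made-up placeholder rows, not taken from the registry data.
county_a = pd.DataFrame(["DOE, JOHN A", "DOE, JOHN A", "ROE, RICHARD"])
county_b = pd.DataFrame(["POE, SAM", "POE, SAM", "POE, SAM"])

combined = pd.concat([county_a, county_b], ignore_index=True)

# duplicated() marks every repeat after the first occurrence of a row.
n_repeats = int(combined.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = combined.drop_duplicates().reset_index(drop=True)
print(n_repeats, len(deduped))  # 3 3
```

Counting the repeats first gives a sense of how many rows the deduplication removes.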

I then label the dataframe's single column "names".

In [6]:
mass_data.columns = ["names"]

Now I format the names. 

I tried title case on the names, which was okay at first glance, but then I realized it messed up the roman numerals (Firstname M Lastname Iii). 
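The problem is easy to reproduce with Python's built-in `str.title()`. The second half of this sketch is a hand-rolled workaround shown only for contrast; `ROMAN_SUFFIXES` and `title_keep_suffix` are hypothetical names, and the suffix set is an assumption that covers just a few cases.

```python
raw = "FIRSTNAME M LASTNAME III"

# str.title() capitalizes only the first letter of each word,
# so the roman numeral is mangled.
print(raw.title())  # Firstname M Lastname Iii

# Assumption: a small illustrative set of suffixes to leave untouched.
ROMAN_SUFFIXES = {"II", "III", "IV"}


def title_keep_suffix(name):
    """Title-case each word except the listed roman-numeral suffixes."""
    words = name.split()
    return " ".join(w if w in ROMAN_SUFFIXES else w.title() for w in words)


print(title_keep_suffix(raw))  # Firstname M Lastname III
```

A fixed list like this gets brittle quickly, which is a good reason to look for an existing package.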

I didn't particularly feel like writing a parser, and I was pretty sure someone had written a package to handle name formatting. I did a little digging and found [nameparser](https://travis-ci.org/derek73/python-nameparser), which worked wonderfully!

In [7]:
mass_data_series = mass_data["names"]

name_list = []  # Construct a list for the formatted names (I think appending to the Series might be less optimized)
for name in mass_data_series:
    # Split at the first comma: the last name comes before it, the rest of the name after.
    # (If there is no comma, the whole string is treated as the last name.)
    last_name, _, rest_of_name = name.partition(",")
    name = rest_of_name.strip() + " " + last_name.strip()  # Concat the two pieces, first name first
    name = HumanName(name)  # Feed it into the glorious HumanName
    name.capitalize(force=True)  # Capitalize the name (modifies value in place)
    name_list.append(str(name))  # Append to a list

name_list.sort()  # Alphabetize by first name

formatted_names = pd.Series(name_list)  # Back to a series!

print(formatted_names)

0              Aaron A Lastowski
1                  Aaron A Perry
2                 Aaron G Alpern
3                Aaron J Lussier
4                  Aaron Kincaid
5                Aaron M Navarro
6             Aaron Phillip Hall
7                  Aaron S Emery
8            Aaron Shakai Miller
9                 Aaron Sullivan
10                  Aaron Sutton
11                 Abby M Shocik
12        Abdullah Hakeem Mummin
13      Abdullah Muhammed Sayyid
14                 Abner Brandao
15         Abraham Kasparian III
16                Abraham Molina
17            Adalberto Martinez
18                 Adam D Nelson
19                Adam G Hopkins
20           Adam J Kachadoorian
21               Adam J Medeiros
22              Adam Joel Lancey
23              Adam M Andrejack
24                  Adam Mercado
25                    Adam Moore
26          Adam Nathanial Frueh
27                   Adam R Koch
28                   Adam Rapoza
29             Adam S Polakowski
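
As an aside, the reordering step alone (without the capitalization that nameparser handles) could also be done with pandas string methods. This is an alternative sketch, not the approach used in the cell above, and the names are placeholders.

```python
import pandas as pd

# Placeholder names in the same "LAST, FIRST MIDDLE" layout as the raw data.
raw = pd.Series(["LASTNAME, FIRSTNAME M", "OTHERLAST, OTHERFIRST"])

# Split at the first comma only: column 0 is the last name, column 1 is the rest.
parts = raw.str.split(",", n=1, expand=True)

# Strip the stray whitespace and put the first name(s) in front.
reordered = parts[1].str.strip() + " " + parts[0].str.strip()
print(reordered.tolist())  # ['FIRSTNAME M LASTNAME', 'OTHERFIRST OTHERLAST']
```

The result is still all caps, so the capitalization step would need to follow.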
          

I write the Series of formatted names to a .csv. 

In [9]:
formatted_names.to_csv("sex_offenders.csv")
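
By default, `Series.to_csv` also writes the index as the first column. If only the names are wanted in the file, one option is to pass `index=False`. The sketch below uses placeholder names and an in-memory buffer in place of a file on disk; the column name "name" is an assumption.

```python
import io

import pandas as pd

# Placeholder names; "name" is a hypothetical column label.
names = pd.Series(["Firstname A Lastname", "Firstname B Lastname"], name="name")

buffer = io.StringIO()  # Stands in for a file on disk
names.to_csv(buffer, index=False, header=True)  # Write the names only, with a header row

buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip["name"].tolist())
```

Reading the file back like this is a quick way to confirm that it contains what was intended.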