# Cleaning Names

The purpose of this notebook is to clean the names of individuals. All the problems that we aim to fix in this notebook are listed [here](https://docs.google.com/document/d/1pcSQfWNll6K9tl-_rB4lztN0TsZsclU9vOnbyQob-Zs/edit).

## Cleaning Table Columns

Not all tables share the same columns, so it's important to clean them first. We standardize the column names across all the debt tables, then add a state column to each table. The state column will be useful when we merge all these tables together at the end.
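
As a rough illustration of the naming rule applied below, this sketch flattens two-level headers into lowercase `top | sub` labels. It uses plain Python tuples rather than a pandas `MultiIndex`, and the function name `flatten_header` and the sample headers are hypothetical, not taken from the real spreadsheets.

```python
def flatten_header(columns):
    """Turn (top, sub) header tuples into lowercase 'top | sub' labels."""
    flat = []
    for top, sub in columns:
        label = top.lower() + ' | ' + sub.lower()
        # a column with an empty second level (like the added state column)
        # would otherwise end up as 'state | '
        flat.append('state' if label == 'state | ' else label)
    return flat


# two hypothetical tables whose headers differ only in capitalization
ny_cols = [('To Whom Due', 'First Name'), ('To Whom Due', 'Last Name'), ('state', '')]
ct_cols = [('To whom due', 'First name'), ('To whom due', 'Last name'), ('state', '')]

print(flatten_header(ny_cols))
print(flatten_header(ny_cols) == flatten_header(ct_cols))
```

Once every table carries the same flattened labels, the tables can be stacked without columns drifting apart.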

In [None]:
# import all the necessary packages
import pandas as pd 
import numpy as np
import nltk
import re
from nameparser import HumanName

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

In [None]:
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

In [None]:
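def clean_table(table, drp_cols):
    # drop bookkeeping columns (page, JPEG number, strike-through notes, ...)
    table.drop(columns=drp_cols, inplace=True)
    # flatten the two-level header into (top, sub) tuples
    table.columns = table.columns.to_flat_index()
    # join each tuple as 'top | sub' and lowercase it
    table.rename(columns=lambda x: x[0].lower() + ' | ' + x[1].lower(), inplace=True)
    # the added state column has an empty second level, so it flattens to 'state | '
    table.rename(columns={'state | ': 'state'}, inplace=True)
    return table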

In [None]:
# handle the liquidated debt certificates first for each file and merge into 1 dataframe
ct_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_CT.xlsx", header=[10,11])
de_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_DE.xlsx", header=[9,10])
ma_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_MA.xlsx", header=[10,11])
nh_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_NH.xlsx", header=[10,11])
nj_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_NJ.xlsx", header=[9,10])
ny_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_NY.xlsx", header=[10,11])
pa_stelle_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_PA_stelle.xlsx", header=[10,11])
pa_story_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_PA_story.xlsx", header=[10,11])
ri_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_RI.xlsx", header=[10,11])

# add a state column to each dataframe
ct_debt['state'] = 'ct'
de_debt['state'] = 'de'
ma_debt['state'] = 'ma'
nh_debt['state'] = 'nh'
nj_debt['state'] = 'nj'
ny_debt['state'] = 'ny'
pa_stelle_debt['state'] = 'pa'
pa_story_debt['state'] = 'pa'
ri_debt['state'] = 'ri'

ny_drp_cols = ['Page', 'JPEG number', 'Number', 'Letter', 'Line Strike Through?']
ny_debt = clean_table(ny_debt, ny_drp_cols)
ct_drp_cols = ['Register Page', 'JPEG number', 'Number', 'Letter', 'Line Strike Thorugh?']
ct_debt = clean_table(ct_debt, ct_drp_cols)
de_drp_cols = ['Unnamed: 0_level_0', 'Unnamed: 1_level_0', 'Unnamed: 2_level_0', ('Amount', 'Strike Through'),
            ('Amount', 'Note')]
de_debt = clean_table(de_debt, de_drp_cols)
ma_drp_cols = ['Unnamed: 0_level_0', 'Unnamed: 1_level_0', 'Unnamed: 2_level_0', 'Unnamed: 3_level_0', 
               ('Time when the Debt\nbecame due', 'Line Strike Thorugh?'),
               ('Time when the Debt\nbecame due', 'Note'),
               ('Time when the Debt\nbecame due', 'Note.1')]
ma_debt = clean_table(ma_debt, ma_drp_cols)
ma_debt.rename(columns={'to whome due | first name':'to whom due | first name'}, inplace=True)
nj_drp_cols = ['Unnamed: 0_level_0', 'Unnamed: 1_level_0', 'Unnamed: 2_level_0', ('Time when the debt became Due', 'Strike Through Number'),
            ('Time when the debt became Due', 'Note'),
            ('Time when the debt became Due', 'Note.1')]
nj_debt = clean_table(nj_debt, nj_drp_cols)
nj_debt.rename(columns={'date of the certificate | title':'to whom due | title', 'date of the certificate | first name':'to whom due | first name',
                    'date of the certificate | last name':'to whom due | last name', 'date of the certificate | title.1':'to whom due | title.1'}, 
               inplace=True)
pa_stelle_drp = ['Register Page', 'JPEG number', 'No.', 'Letter', 'Line Strike Through?']
pa_stelle_debt = clean_table(pa_stelle_debt, pa_stelle_drp)

In [None]:
print(nj_debt.columns)

In [None]:
print(ny_debt.dtypes)

## Company Names

There are multiple kinds of companies. 

```James Vernon & Co.``` is a typical example, and these are simple to handle. If the first name column contains '& co', '& others', or '& several others' anywhere in the string, the entry is most likely a company. We keep only the part of the string that comes before that marker.
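
The rule above can be sketched with a regular expression, under a few assumptions. The names `COMPANY_MARKER` and `split_company` are hypothetical, and the word-boundary check is stricter than a plain substring test, so a surname such as "Cooper" is not mistaken for "& co".

```python
import re

# assumption: a marker is '&' followed by 'co', 'others', or 'several others'
COMPANY_MARKER = re.compile(r'\s*&\s*(?:co\b|(?:several\s+)?others\b)', re.IGNORECASE)


def split_company(name):
    """Return (text before the company marker, True) or (name, False)."""
    match = COMPANY_MARKER.search(name)
    if match is None:
        return name, False
    return name[:match.start()].strip(), True


print(split_company('James Vernon & Co.'))   # ('James Vernon', True)
print(split_company('Smith & Cooper'))       # ('Smith & Cooper', False)
```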

In [None]:
# dictionary of manual changes I have to make
changes = {
    'Henry Mc Clellen & Henry & co' : 'Henry Mc Clellen & Co'
}

In [None]:
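def handle_comp_name(row):
    fname = row['to whom due | first name']
    
    # apply manual corrections before looking for company markers
    if fname in changes:
        print(fname)
        fname = changes[fname]
    
    fname_c = str(fname).lower()
    if ('& co' in fname_c) or ('& others' in fname_c) or ('& several others' in fname_c):
        # strip the marker; '& several others' needs its own replace because
        # '& others' is not a substring of it, and '& co.' goes before '& co'
        # so that no stray period is left behind
        fname_c = (fname_c.replace('& several others', '').replace('& others', '')
                          .replace('& co.', '').replace('& co', ''))
        name = HumanName(fname_c)
        row['to whom due | first name'] = name.first
        row['to whom due | last name'] = name.last
        row['under company'] = True # note that the original debt entry was held by a company 
        print(row)
    
    return row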

ny_debt['under company'] = np.nan
ny_debt[['to whom due | first name', 'to whom due | last name', 'under company']] = ny_debt[['to whom due | first name', 
                                                                            'to whom due | last name', 'under company']].apply(lambda row: handle_comp_name(row), axis=1)

## Cleaning Entries with Two Names

There are debt entries that have two names in a single cell: ```NY_2422: Messes Williamson & Beckman```. The plan is to split the name across the first name and last name columns.  
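
A minimal sketch of the splitting step, assuming the two names are separated by a standalone `&` or `and`. The helper names `split_two_names` and `join_names` are hypothetical. Requiring whitespace around the separator keeps names that merely contain "and", such as "Alexander", in one piece. The sketch leaves out the first/last name parsing done with `HumanName`.

```python
import re

# split on a standalone '&' or 'and' surrounded by whitespace
SEPARATOR = re.compile(r'\s+(?:&|and)\s+', re.IGNORECASE)


def split_two_names(name):
    """Return the names found in a cell, at most two."""
    return [part.strip() for part in SEPARATOR.split(name, maxsplit=1)]


def join_names(name):
    """Join the split names with ' | ', the separator used in this notebook."""
    return ' | '.join(split_two_names(name))


print(join_names('Williamson & Beckman'))     # Williamson | Beckman
print(split_two_names('Alexander Anderson'))  # ['Alexander Anderson']
```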

In [None]:
changes = {
    'van zandt & kittletas' : ['', 'van zandt | kittletas'],
    'trustees of & davids church':['trustees of & davids church', '']
}

In [None]:
def handle_two_name(row):
    name = str(row['to whom due | first name']).lower()
    if (' & ' in name) or (' and ' in name):
        # split only on a standalone '&' or 'and', never inside a word such as 'alexander'
        persons = re.split(r'\s+(?:&|and)\s+', name)
        person1 = persons[0].strip()
        person2 = persons[1].strip()
        human_name_1 = HumanName(person1)
        human_name_2 = HumanName(person2)
        
        if name not in changes:
            if human_name_1.first != '' and human_name_2.first != '':
                row['to whom due | first name'] = human_name_1.first + " | " + human_name_2.first
            else: 
                row['to whom due | first name'] = human_name_1.first + human_name_2.first

            if human_name_1.last != '' and human_name_2.last != '':
                row['to whom due | last name'] = human_name_1.last + " | " + human_name_2.last
            else:
                row['to whom due | last name'] = human_name_1.last + human_name_2.last
        else:
            row['to whom due | first name'] = changes[name][0]
            row['to whom due | last name'] = changes[name][1]
        
        # flag only this row, not the whole column
        row['multiple persons'] = True
            
        print("old: " + name)
        print("new fn: " + row['to whom due | first name'])
        print("new ln: " + row['to whom due | last name'] +"\n")
        
    return row

ny_debt['multiple persons'] = np.nan
# apply returns a new dataframe, so assign it back to keep the changes
ny_debt = ny_debt.apply(lambda row: handle_two_name(row), axis=1)

## Handle Abbreviations of a Name

Some debt entries record a handwritten abbreviation instead of the full first name. We fix these with a dictionary of abbreviations: if an entry's first name is a key in the dictionary, we replace it with the full name.
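
The lookup can be sketched as follows. The table here is a small hypothetical subset, and stripping surrounding spaces and a trailing period before the lookup is an assumption about how abbreviations were transcribed, not something the data is known to require.

```python
# a small hypothetical subset of the abbreviation table
ABBREVIATIONS = {'Geo': 'George', 'Tho': 'Thomas', 'Jn': 'John'}


def expand_abbreviation(first_name, table=ABBREVIATIONS):
    """Return the full name for a known abbreviation, else the input unchanged."""
    # assumption: abbreviations may carry surrounding spaces or a trailing period
    key = first_name.strip().rstrip('.')
    return table.get(key, first_name)


print(expand_abbreviation('Geo.'))  # George
print(expand_abbreviation('Mary'))  # Mary
```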

In [None]:
abbreviations = {
    'And':'Andrew', 'Ant':'Anthony', 'Bart':'Bartholomew', 'Cha':'Charles', 'Dor':'Dorothy', 'Dot':'Dorothy', 'Doth':'Dorothy',
    'Edw':'Edward', 'Eliz':'Elizabeth', 'Geo':'George', 'H':'Henry', 'Herb':'Herbert', 'Ja':'James', 'Jn':'John', 'Marg':'Margaret', 
    'Mich':'Michael', 'Pat': 'Patrick', 'Rich':'Richard', 'Tho':'Thomas', 'W':'William'
}

In [None]:
def handle_abbreviations(row):
    fn = str(row['to whom due | first name'])
    if fn in abbreviations:
        row['to whom due | first name'] = abbreviations[fn]
    
    return row

# test on new jersey dataset for now; assign the result back so the changes are kept
nj_debt = nj_debt.apply(lambda row: handle_abbreviations(row), axis=1)

## Standardizing Names

Different spellings of a name can refer to the same person. We will use a phonetics library and Ancestry to reconcile them.
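
To make the comparison idea concrete, here is a standard-library sketch that is deliberately not the approach used in this notebook. It swaps in `difflib.SequenceMatcher` for `fuzzywuzzy`, and it skips both the phonetic encoding and the Ancestry lookup. The function name `similar_pairs` and the 0.9 threshold are assumptions. Names are grouped by their initials first, so each plausible pair is compared only once.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.9):
    """Return (name_a, name_b, ratio) for distinct names that look alike."""
    # block by (first initial, last initial) so only plausible pairs are compared
    blocks = defaultdict(list)
    for name in dict.fromkeys(names):  # drop exact duplicates, keep order
        parts = name.split()
        if len(parts) < 2:
            continue  # need both a first and a last name
        blocks[(parts[0][0].lower(), parts[-1][0].lower())].append(name)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio > threshold:
                pairs.append((a, b, round(ratio, 2)))
    return pairs


names = ['William Barker', 'Benjamin Moore', 'William Baker', 'William Barker']
print(similar_pairs(names))
```

A flagged pair is only a candidate: "William Barker" and "William Baker" may well be two different people, so each match still needs a manual or external check.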

In [None]:
# import necessary fuzzy string libraries 
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from phonetics import metaphone
from fuzzywuzzy import fuzz
from jellyfish import soundex

In [None]:
# options
options = Options()
options.add_argument('--headless')
options.add_argument("--window-size=1000,1000")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('--no-sandbox')   

In [169]:
def only_f(fn, ln, crow):
    # True when the candidate row shares both initials with the reference name;
    # slicing with [:1] avoids an IndexError on empty strings
    return (crow['to whom due | first name'][:1] == fn[:1]
            and crow['to whom due | last name'][:1] == ln[:1])

def fuzzy_similarity(name, row):
    cname = row['to whom due | first name'] + ' ' + row['to whom due | last name']
    code1 = metaphone(name)   # phonetic encoding of each full name
    code2 = metaphone(cname)
    ratio = fuzz.ratio(name, cname)   # similarity of the raw spellings
    score = fuzz.ratio(code1, code2)  # similarity of the phonetic codes
    # report near matches only; a phonetic score of 100 is skipped
    if score > 90 and score != 100 and ratio > 90:
        print('Name: ' + name)
        print('Cname: ' + cname)
        print('M Score: ' + str(score))
        print('Ratio: ' + str(ratio))
        print('--------------------------------------------------')
    
def determine_similarities(df, row):
    fn = row['to whom due | first name']
    ln = row['to whom due | last name'] 
    # restrict the comparison to rows whose initials match the current row
    onlyfc_df = df[df.apply(lambda crow: only_f(fn, ln, crow), axis=1)]
    
    if len(onlyfc_df) > 0:
        onlyfc_df.apply(lambda crow: fuzzy_similarity(fn + ' ' + ln, crow), axis=1)
    # print("first name=" + fn + " - last name=" + ln + " - length=" + str(len(onlyfc_df)))
    
ny_debt['to whom due | first name'] = ny_debt['to whom due | first name'].astype(str)
ny_debt['to whom due | last name'] = ny_debt['to whom due | last name'].astype(str)
ny_debt.apply(lambda row: determine_similarities(ny_debt, row), axis=1)

Name: Nathniel Tuttle
Cname: Nathaniel Tulttle
M Score: 93
Ratio: 94
--------------------------------------------------
Name: Benjamin Moore
Cname: Benjamin Morse
M Score: 93
Ratio: 93
--------------------------------------------------
Name: Benjamin Moore
Cname: Benjamin Morse
M Score: 93
Ratio: 93
--------------------------------------------------
Name: William Barker
Cname: William Baker
M Score: 92
Ratio: 96
--------------------------------------------------
Name: Jacob Swartwout
Cname: Jacobus Swartwout
M Score: 93
Ratio: 94
--------------------------------------------------
Name: Jacob Swartwout
Cname: Jacobus Swartwout
M Score: 93
Ratio: 94
--------------------------------------------------
Name: Nathaniel Seely
Cname: Nathaniel Slely
M Score: 92
Ratio: 93
--------------------------------------------------
Name: William Badell
Cname: William Bell
M Score: 91
Ratio: 92
--------------------------------------------------
Name: Inerease Carpenter
Cname: Increase Carpenter
M Score: 9

KeyboardInterrupt: 