# Cleaning Names

The purpose of this notebook is to clean the names of individuals. All the problems that we aim to fix in this notebook are listed [here](https://docs.google.com/document/d/1pcSQfWNll6K9tl-_rB4lztN0TsZsclU9vOnbyQob-Zs/edit).

## Cleaning Table Columns

Not all tables share the same columns, so it's important to take time to clean them. We standardize the column names across all the debt tables and then add a state column to each table. This column is useful when we merge the tables together at the end.
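As an illustration of the standardization step, the sketch below renames column variants through one shared mapping and tags each table with its state. The variants in `COLUMN_ALIASES` and the helper `standardize_table` are hypothetical, and plain dictionaries are used so that it runs without the data files. With pandas, the same mapping could be passed to `DataFrame.rename(columns=...)`.

```python
# Hypothetical mapping from column-name variants to one standard name.
COLUMN_ALIASES = {
    'first_name': 'to whom due | first name',
    'First Name': 'to whom due | first name',
    'last_name': 'to whom due | last name',
}

def standardize_table(rows, state):
    """Rename known column variants and add a 'state' key to every row."""
    cleaned = []
    for row in rows:
        new_row = {COLUMN_ALIASES.get(col, col): value for col, value in row.items()}
        new_row['state'] = state
        cleaned.append(new_row)
    return cleaned

# Two toy tables whose columns are spelled differently.
nh_rows = [{'first_name': 'Elizabeth', 'last_name': 'Lowell'}]
ny_rows = [{'First Name': 'Henry', 'last_name': 'Wisner'}]

# After standardization both tables share the same columns and can be stacked.
merged = standardize_table(nh_rows, 'nh') + standardize_table(ny_rows, 'ny')
```

Keeping the aliases in a single dictionary means a newly discovered column variant only needs one extra entry.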

In [87]:
# import all the necessary packages
import pandas as pd 
import numpy as np
import re
from nameparser import HumanName

In [88]:
# import aggregated debt file
agg_debt = pd.read_csv('data/final_agg_debt.csv')


In [89]:
print(agg_debt.dtypes)

Unnamed: 0                                 int64
letter                                    object
date of the certificate | month          float64
date of the certificate | day            float64
date of the certificate | year           float64
to whom due | first name                  object
to whom due | last name                   object
to whom due | title                       object
time when the debt became due | month    float64
time when the debt became due | day       object
time when the debt became due | year      object
amount | dollars                         float64
amount | 90th                             object
line strike through? | yes?              float64
line strike through? | note               object
notes                                     object
state                                     object
org_file                                  object
org_index                                  int64
to whom due | title.1                     object
to whom due | first 

In [90]:
# record changes in this dataframe
name_changes = pd.DataFrame({'title_org': pd.Series(dtype='str'),
                       'title_new': pd.Series(dtype='str'),
                       'first_name_org': pd.Series(dtype='str'),
                       'last_name_org': pd.Series(dtype='str'),
                       'first_name_new': pd.Series(dtype='str'),
                       'last_name_new': pd.Series(dtype='str'),
                       'cleaning case': pd.Series(dtype='int'),
                       'file_loc': pd.Series(dtype='str'),
                       'org_index': pd.Series(dtype='int')})

In [91]:
# cleaning case: objective number
# 'company': 2
# 'two names': 3
# 'abbrev': 5
# 'standardize': 6

In [92]:
def add_changes(title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index):
    name_changes.loc[len(name_changes.index)] = [title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index]
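
Growing a DataFrame one row at a time with `.loc` can become slow on a table of this size. A possible alternative, sketched below, collects each change as a dictionary in a plain list and builds the table once at the end. The names `CHANGE_FIELDS`, `change_records` and `log_change` are hypothetical and not part of the notebook.

```python
# Column names copied from the name_changes dataframe.
CHANGE_FIELDS = ['title_org', 'title_new', 'first_name_org', 'last_name_org',
                 'first_name_new', 'last_name_new', 'cleaning case',
                 'file_loc', 'org_index']

change_records = []  # hypothetical list-based change log

def log_change(*values):
    """Store one change as a dict keyed by the name_changes column names."""
    if len(values) != len(CHANGE_FIELDS):
        raise ValueError('expected %d values, got %d' % (len(CHANGE_FIELDS), len(values)))
    change_records.append(dict(zip(CHANGE_FIELDS, values)))

log_change(None, None, 'Henry Wisner & Co', None, 'henry', 'wisner', 2,
           'liquidated_debt_certificates_NY.xlsx', 491)
# pd.DataFrame(change_records) would then build the log table in one step.
```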

## Company Names

There are multiple kinds of companies. 

The simplest kind looks like ```James Vernon & Co.```. If the first name column contains '& co', '& others', or '& several others', the entry is most likely a company. In that case we keep only the part of the string before the marker and parse it as a person's name.
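
The detection rule can also be written as one regular expression. The sketch below is an assumption about how that might look, not the notebook's method: it only matches a marker at the end of the string, whereas the notebook checks anywhere in the string. `COMPANY_SUFFIX` and `strip_company_suffix` are hypothetical names.

```python
import re

# Matches '& co', '& co.', '& others' or '& several others' at the end of the string.
COMPANY_SUFFIX = re.compile(r'\s*&\s*(?:co\.?|(?:several\s+)?others)\s*$', re.IGNORECASE)

def strip_company_suffix(name):
    """Return (person_part, is_company) for a raw first-name cell."""
    stripped, n_subs = COMPANY_SUFFIX.subn('', name)
    return stripped.strip(), n_subs > 0

company = strip_company_suffix('James Vernon & Co.')
partners = strip_company_suffix('Smith & Cooper')  # '& Cooper' is not a company marker
```

Anchoring the pattern at the end avoids treating a surname that starts with "Co" as a company marker, at the cost of missing markers that appear mid-string.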

In [93]:
# dictionary of manual corrections applied before parsing
changes = {
    'Henry Mc Clellen & Henry & co' : 'Henry Mc Clellen & Co'
}

In [94]:
def handle_comp_name(row):
    org_fname = row['to whom due | first name']
    org_lname = row['to whom due | last name']
    fname = org_fname

    # apply manual corrections first
    if fname in changes:
        fname = changes[fname]

    fname_c = str(fname).lower()
    # longest marker first, so '& several others' is removed as a whole
    markers = ['& several others', '& others', '& co']
    if any(marker in fname_c for marker in markers):
        for marker in markers:
            fname_c = fname_c.replace(marker, '')
        name = HumanName(fname_c)
        row['to whom due | first name'] = name.first
        row['to whom due | last name'] = name.last

        # record change
        add_changes(row['to whom due | title'], row['to whom due | title'], org_fname, org_lname,
                    row['to whom due | first name'], row['to whom due | last name'], 2, row['org_file'], row['org_index'])

    return row

agg_debt = agg_debt.apply(handle_comp_name, axis=1)

In [95]:
# checkup on name_changes
name_changes.head()

| | title_org | title_new | first_name_org | last_name_org | first_name_new | last_name_new | cleaning case | file_loc | org_index |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | | Henry Wisner & Co | | henry | wisner | 2 | liquidated_debt_certificates_NY.xlsx | 491 |
| 1 | | | Henry Wisner & Co | | henry | wisner | 2 | liquidated_debt_certificates_NY.xlsx | 492 |
| 2 | | | Henry Wisner & Co | | henry | wisner | 2 | liquidated_debt_certificates_NY.xlsx | 493 |
| 3 | | | Henry Wisner & Co | | henry | wisner | 2 | liquidated_debt_certificates_NY.xlsx | 494 |
| 4 | | | Henry Wisner & Co | | henry | wisner | 2 | liquidated_debt_certificates_NY.xlsx | 495 |


## Cleaning Entries with Two Names

Some debt entries have two names in a single cell: ```NY_2422: Messes Williamson & Beckman```. We split such a cell into its two people, parse each one, and join their first names and their last names with ` | ` in the respective columns.
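
The splitting step is sketched below under one assumption: the separator is either `&` or the whole word `and`, surrounded by whitespace. Requiring the whitespace keeps names that merely contain the letters "and" in one piece. `NAME_SEPARATOR` and `split_two_names` are hypothetical names used only for illustration.

```python
import re

# Split only on '&' or the whole word 'and' when surrounded by whitespace.
NAME_SEPARATOR = re.compile(r'\s+(?:&|and)\s+', re.IGNORECASE)

def split_two_names(cell):
    """Return the list of name parts found in one cell."""
    return [part.strip() for part in NAME_SEPARATOR.split(cell) if part.strip()]

pair = split_two_names('Hoov and Harrison')
single = split_two_names('Alexander Anderson')  # contains 'and' but is one person
```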

In [96]:
changes = {
    'van zandt & kittletas' : ['', 'van zandt | kittletas'],
    'trustees of & davids church':['trustees of & davids church', '']
}

In [97]:
def handle_two_name(row):
    org_fn = row['to whom due | first name']
    org_ln = row['to whom due | last name']
    name = str(row['to whom due | first name']).lower()
    if (' & ' in name) or (' and ' in name):
        # split on ' & ' or the whole word ' and ', so that names that merely
        # contain 'and' (e.g. 'alexander') are not cut apart
        parts = re.split(r'\s+(?:&|and)\s+', name)
        human_name_1 = HumanName(parts[0].strip())
        human_name_2 = HumanName(parts[1].strip())
        
        if name not in changes:
            if human_name_1.first != '' and human_name_2.first != '':
                row['to whom due | first name'] = human_name_1.first + " | " + human_name_2.first
            else: 
                row['to whom due | first name'] = human_name_1.first + human_name_2.first

            if human_name_1.last != '' and human_name_2.last != '':
                row['to whom due | last name'] = human_name_1.last + " | " + human_name_2.last
            else:
                row['to whom due | last name'] = human_name_1.last + human_name_2.last
        else:
            row['to whom due | first name'] = changes[name][0]
            row['to whom due | last name'] = changes[name][1]
                
        # record change
        add_changes(row['to whom due | title'], row['to whom due | title'], org_fn, org_ln, 
                   row['to whom due | first name'], row['to whom due | last name'], 3, row['org_file'], row['org_index'])
        
    return row

# the cleaned rows are returned and displayed below; agg_debt is not reassigned here
agg_debt.apply(lambda row: handle_two_name(row), axis=1)

Unnamed: 0.1,Unnamed: 0,letter,date of the certificate | month,date of the certificate | day,date of the certificate | year,to whom due | first name,to whom due | last name,to whom due | title,time when the debt became due | month,time when the debt became due | day,...,amount | 10th,exchange,amount in specie | dollars,amount in specie | cents,amount | 8th,delivered | month,delivered | day,delivered | year,total dollars | notes,total dollars | notes.1
0,11,C,8.0,27.0,1783.0,Elizabeth,Lowell,,4.0,16.0,...,,,,,,,,,,
1,12,L,8.0,27.0,1783.0,Joshua,Brackett,Jun,4.0,16.0,...,,,,,,,,,,
2,13,G,8.0,27.0,1783.0,Joshua,Brackett,,4.0,16.0,...,,,,,,,,,,
3,14,C,9.0,2.0,1783.0,Phillips,White,Esq,4.0,16.0,...,,,,,,,,,,
4,15,B,9.0,2.0,1783.0,William,White,,4.0,16.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203066,94463,N,5.0,26.0,1789.0,William,Smith,,6.0,1.0,...,,,,,,,,,,
203067,94464,P,5.0,28.0,1789.0,James,Odell,,1.0,1.0,...,,,,,,,,,Reg 6768,
203068,94465,H,4.0,28.0,1789.0,hoov | harrison,,,6.0,1.0,...,,,,,,,,,,
203069,94466,,,,,,,,,,...,,,,,,,,,,


In [98]:
# checkup on name_changes
name_changes.tail()

| | title_org | title_new | first_name_org | last_name_org | first_name_new | last_name_new | cleaning case | file_loc | org_index |
|---|---|---|---|---|---|---|---|---|---|
| 2546 | | | Isaac & Thoroughgood Smith | | isaac \| thoroughgood | smith | 3 | loan_office_certificates_9_states.xlsx | 80909 |
| 2547 | | | Isaac & Thoroughgood Smith | | isaac \| thoroughgood | smith | 3 | loan_office_certificates_9_states.xlsx | 80910 |
| 2548 | | | Baker Blow & Oldham | | baker \| oldham | blow | 3 | loan_office_certificates_9_states.xlsx | 80912 |
| 2549 | | | Moses Bush & Sons | | moses \| sons | bush | 3 | Marine_Liquidated_Debt_Certificates.xlsx | 175 |
| 2550 | | | Hoov and Harrison | | hoov \| harrison | | 3 | Marine_Liquidated_Debt_Certificates.xlsx | 764 |


## Handle Abbreviations of a Name

Some debt entries record a handwritten abbreviation of a first name (for example, `Geo` for George). We keep a dictionary that maps each abbreviation to its full name and replace any first name that appears in the dictionary.
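
A minimal sketch of the lookup follows. It assumes that some abbreviations may carry a trailing period in the raw data, which the notebook does not state, and it uses a hypothetical helper `expand_abbreviation` with only a few dictionary entries.

```python
# A few entries for illustration only.
ABBREVIATIONS = {'Geo': 'George', 'Tho': 'Thomas', 'Rich': 'Richard'}

def expand_abbreviation(first_name):
    """Return the full name for a known abbreviation, else the input unchanged."""
    # Assumption: ignore surrounding spaces and a trailing period when looking up.
    key = first_name.strip().rstrip('.')
    return ABBREVIATIONS.get(key, first_name)

expanded = expand_abbreviation('Geo.')
untouched = expand_abbreviation('Richard')  # already a full name
```

Because the lookup requires the whole cell to match a key, a full name such as Richard is never rewritten just because it starts with `Rich`.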

In [99]:
abbreviations = {
    'And':'Andrew', 'Ant':'Anthony', 'Bart':'Bartholomew', 'Cha':'Charles', 'Dor':'Dorothy', 'Dot':'Dorothy', 'Doth':'Dorothy',
    'Edw':'Edward', 'Eliz':'Elizabeth', 'Geo':'George', 'H':'Henry', 'Herb':'Herbert', 'Ja':'James', 'Jn':'John', 'Marg':'Margaret', 
    'Mich':'Michael', 'Pat': 'Patrick', 'Rich':'Richard', 'Tho':'Thomas', 'W':'William'
}

In [100]:
def handle_abbreviations(row):
    fn = str(row['to whom due | first name'])
    if fn in abbreviations:
        row['to whom due | first name'] = abbreviations[fn]
        # record changes
        add_changes(row['to whom due | title'], row['to whom due | title'], fn, 
                    row['to whom due | last name'], row['to whom due | first name'], 
                    row['to whom due | last name'], 5, row['org_file'], row['org_index'])
    
    return row

# apply to the full aggregated dataset; the result is displayed but not assigned back to agg_debt
agg_debt.apply(lambda row: handle_abbreviations(row), axis=1)

Unnamed: 0.1,Unnamed: 0,letter,date of the certificate | month,date of the certificate | day,date of the certificate | year,to whom due | first name,to whom due | last name,to whom due | title,time when the debt became due | month,time when the debt became due | day,...,amount | 10th,exchange,amount in specie | dollars,amount in specie | cents,amount | 8th,delivered | month,delivered | day,delivered | year,total dollars | notes,total dollars | notes.1
0,11,C,8.0,27.0,1783.0,Elizabeth,Lowell,,4.0,16.0,...,,,,,,,,,,
1,12,L,8.0,27.0,1783.0,Joshua,Brackett,Jun,4.0,16.0,...,,,,,,,,,,
2,13,G,8.0,27.0,1783.0,Joshua,Brackett,,4.0,16.0,...,,,,,,,,,,
3,14,C,9.0,2.0,1783.0,Phillips,White,Esq,4.0,16.0,...,,,,,,,,,,
4,15,B,9.0,2.0,1783.0,William,White,,4.0,16.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203066,94463,N,5.0,26.0,1789.0,William,Smith,,6.0,1.0,...,,,,,,,,,,
203067,94464,P,5.0,28.0,1789.0,James,Odell,,1.0,1.0,...,,,,,,,,,Reg 6768,
203068,94465,H,4.0,28.0,1789.0,Hoov and Harrison,,,6.0,1.0,...,,,,,,,,,,
203069,94466,,,,,,,,,,...,,,,,,,,,,


In [101]:
# checkup on name_changes
name_changes.tail()

| | title_org | title_new | first_name_org | last_name_org | first_name_new | last_name_new | cleaning case | file_loc | org_index |
|---|---|---|---|---|---|---|---|---|---|
| 2643 | | | Pat | Quaelly | Patrick | Quaelly | 5 | loan_office_certificates_9_states.xlsx | 61503 |
| 2644 | | | Rich | Thompson | Richard | Thompson | 5 | loan_office_certificates_9_states.xlsx | 69947 |
| 2645 | | | Rich | Thompson | Richard | Thompson | 5 | loan_office_certificates_9_states.xlsx | 69948 |
| 2646 | | | Rich | Thompson | Richard | Thompson | 5 | loan_office_certificates_9_states.xlsx | 69949 |
| 2647 | | | Rich | Thompson | Richard | Thompson | 5 | loan_office_certificates_9_states.xlsx | 69950 |


## Standardizing Names

Different spellings of a name can refer to the same person. We flag candidate pairs with a phonetic encoding (metaphone) and a fuzzy string ratio, then check each pair against Ancestry search results.
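
To make the pairing idea concrete, here is a standard-library sketch. It is not the notebook's implementation: `consonant_key` is a crude stand-in for a real phonetic code such as metaphone, `difflib.SequenceMatcher` replaces the fuzzy ratio, and the threshold of 90 is carried over as an assumption. All function names are hypothetical.

```python
from difflib import SequenceMatcher

def consonant_key(name):
    """Crude phonetic stand-in: drop vowels and 'y', then collapse adjacent repeats."""
    key = []
    for ch in name.lower():
        if ch.isalpha() and ch not in 'aeiouy' and (not key or key[-1] != ch):
            key.append(ch)
    return ''.join(key)

def similarity(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def likely_same_person(name_a, name_b, threshold=90):
    """Flag two different spellings that are close both as text and as sound keys."""
    if name_a == name_b:
        return False  # identical spellings need no reconciliation
    return (similarity(name_a, name_b) > threshold
            and similarity(consonant_key(name_a), consonant_key(name_b)) > threshold)

flagged = likely_same_person('James Gallaway', 'James Galloway')
```

Requiring both scores to pass keeps pairs that look alike on paper but sound different from being sent to the slower Ancestry check.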

In [105]:
# import browser automation and fuzzy string matching libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.expected_conditions import element_to_be_clickable, presence_of_element_located
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from phonetics import metaphone
from fuzzywuzzy import fuzz
from jellyfish import soundex
import getpass
import time

In [15]:
# options
options = Options()
options.add_argument('--headless')
options.add_argument("--window-size=1000,1000")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('--no-sandbox')   
options.add_argument(r'--user-data-dir=C:/Users/david/AppData/Local/Google/Chrome/User Data')

In [16]:
# install driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
wait = WebDriverWait(driver, 30)


In [17]:
# login to emory ancestry 
driver.get('https://guides.libraries.emory.edu/ALE')
login_btn_xpath = '/html/body/main/div/div/div/a'
wait.until(element_to_be_clickable((By.XPATH, login_btn_xpath))).click()

# input login information and click 'login'
netid_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[1]/input'
password_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[2]/input'
username = input('username: ')
password = getpass.getpass(prompt='password: ')
netid_input = wait.until(element_to_be_clickable((By.XPATH, netid_xpath)))
netid_input.click()
netid_input.send_keys(username)
pass_input = wait.until(element_to_be_clickable((By.XPATH, password_xpath)))
pass_input.click()
pass_input.send_keys(password)

login_btn_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[3]/button'
wait.until(element_to_be_clickable((By.XPATH, login_btn_xpath))).click()
time.sleep(1)

driver.get('https://www.ancestrylibrary.com/search/collections/5058/')


In [None]:
# only search name up once 
# what if there are no results
# find unique urls (x)
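
One way to address the first two notes is to cache each lookup, including lookups that return nothing. The sketch below uses `functools.lru_cache`. `slow_lookup` is a hypothetical stand-in for the Selenium search and its `known` table is made up for the example.

```python
from functools import lru_cache

lookup_calls = []  # records each real lookup, for illustration only

def slow_lookup(first_name, last_name, state):
    """Hypothetical stand-in for the browser search; returns None when nothing is found."""
    lookup_calls.append((first_name, last_name, state))
    known = {('James', 'Galloway', 'ny'): ('James', 'Galloway')}
    return known.get((first_name, last_name, state))

@lru_cache(maxsize=None)
def cached_lookup(first_name, last_name, state):
    """Run slow_lookup at most once per distinct (first, last, state) triple."""
    return slow_lookup(first_name, last_name, state)

cached_lookup('James', 'Galloway', 'ny')
cached_lookup('James', 'Galloway', 'ny')  # served from the cache
cached_lookup('James', 'Gallaway', 'ny')  # no result: None is cached as well
```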

In [102]:
agg_debt.state.unique()

array(['nh', 'nj', 'ny', 'ma', 'de', 'ct', nan, 'va', 'pa', 'md', 'nc',
       'cs', 'ga', 'ri', 'f'], dtype=object)

In [104]:
# ancestry has unique urls for each state
ancestry_state_urls = {
    'nh':'_new+hampshire-usa_32',
    'ny':'_new+york-usa_35',
    'ma':'_massachusetts-usa_24',
    'ct':'_connecticut-usa_9',
    'pa':'_pennsylvania-usa_41',
    'md':'_maryland-usa_23',
    'nc':'_north+carolina-usa_36',
    'ri':'_rhode+island-usa_42'
}
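
The mapping can be used to build the search URL per state. The sketch below copies the URL pattern from the hard-coded address used later in the notebook. Dropping the residence filter for states without an entry is an assumption, and `SEARCH_BASE`, `STATE_FRAGMENTS` and `build_search_url` are hypothetical names.

```python
SEARCH_BASE = 'https://www.ancestrylibrary.com/search/collections/5058/'

# Two entries copied from ancestry_state_urls, enough for the example.
STATE_FRAGMENTS = {
    'ny': '_new+york-usa_35',
    'ma': '_massachusetts-usa_24',
}

def build_search_url(first_name, last_name, state):
    """Build a search URL, adding a residence filter only for mapped states."""
    url = SEARCH_BASE + '?name=' + first_name + '_' + last_name
    fragment = STATE_FRAGMENTS.get(state)
    if fragment is not None:
        url += '&residence=' + fragment + '&residence_x=_1-0'
    return url

ny_url = build_search_url('James', 'Galloway', 'ny')
nj_url = build_search_url('James', 'Galloway', 'nj')  # 'nj' has no mapping
```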

In [28]:
from selenium.common.exceptions import TimeoutException

# name cell of the first row in the search results table
RESULT_XPATH = '/html/body/div[3]/div/div/div/section[1]/div[1]/table/tbody/tr[2]/td[2]/span'
SEARCH_URL = 'https://www.ancestrylibrary.com/search/collections/5058/?name='

def first_result(fn, ln, fallback):
    # the residence filter is hard-coded to New York
    driver.get(SEARCH_URL + fn + '_' + ln + '&residence=_new+york-usa_35&residence_x=_1-0')
    try:
        # wait.until raises TimeoutException when the element never appears
        result_fn = wait.until(presence_of_element_located((By.XPATH, RESULT_XPATH + '/span[1]'))).text
        result_ln = wait.until(presence_of_element_located((By.XPATH, RESULT_XPATH + '/span[2]'))).text
        return result_fn + ' ' + result_ln
    except (TimeoutException, NoSuchElementException):
        # no search result: fall back to the name as written
        return fallback

def access_ancestry(fn1, ln1, fn2, ln2, state):
    # state is accepted but not used yet
    name1 = fn1 + ' ' + ln1
    name2 = fn2 + ' ' + ln2

    result1 = first_result(fn1, ln1, name1)
    print('first ancestry result: ' + result1)
    print(driver.current_url)

    result2 = first_result(fn2, ln2, name2)
    print(driver.current_url)
    print('second ancestry result: ' + result2)

    if result1 == name1 and result2 == name1:
        print('result1 == name1 and result2 == name1')
        return (fn1, ln1)
    elif result1 == name2 and result2 == name2:
        print('result1 == name2 and result2 == name2')
        return (fn2, ln2)
    else:
        return (fn1, ln1, fn2, ln2)

def only_f(fn, ln, crow):
    if crow['to whom due | first name'][0] == fn[0] and crow['to whom due | last name'][0] == ln[0]:
        return crow

def fuzzy_similarity(row, row0):
    name = row0['to whom due | first name'] + ' ' + row0['to whom due | last name']
    cname = row['to whom due | first name'] + ' ' + row['to whom due | last name']
    code1 = metaphone(name)
    code2 = metaphone(cname)
    ratio = fuzz.ratio(name, cname)
    score = fuzz.ratio(code1, code2)
    if score > 90 and ratio > 90 and name != cname:
        print('Name: ' + name)
        print('Cname: ' + cname)
        print('M Score: ' + str(score))
        print('Ratio: ' + str(ratio))
        correct_name = access_ancestry(row['to whom due | first name'], row['to whom due | last name'], 
                        row0['to whom due | first name'], row0['to whom due | last name'], abbrev_to_us_state[row['state']])
        
        if len(correct_name) == 2:
            row['to whom due | first name'] = correct_name[0]
            row['to whom due | last name'] = correct_name[1]
            row0['to whom due | first name'] = correct_name[0]
            row0['to whom due | last name'] = correct_name[1]
        else: 
            row['to whom due | first name'] = correct_name[0]
            row['to whom due | last name'] = correct_name[1]
            row0['to whom due | first name'] = correct_name[2]
            row0['to whom due | last name'] = correct_name[3]
        
        print('--------------------------------------------------')
        # print('Name=' + name + ' - CName=' + cname + ' - M Score=' + str(score) + )
    
def determine_similarities(df, row0):
    fn = row0['to whom due | first name']
    ln = row0['to whom due | last name'] 
    onlyfc_df = df.apply(lambda row: only_f(fn, ln, row), axis=1).dropna()
    
    if len(onlyfc_df) > 0:
        onlyfc_df.apply(lambda row: fuzzy_similarity(row, row0))
    
agg_debt['to whom due | first name'] = agg_debt['to whom due | first name'].astype(str)
agg_debt['to whom due | last name'] = agg_debt['to whom due | last name'].astype(str)
agg_debt.apply(lambda row: determine_similarities(agg_debt, row), axis=1)

Name: James Gallaway
Cname: James Galloway
M Score: 100
Ratio: 93
first ancestry result: James Galloway
https://www.ancestrylibrary.com/search/collections/5058/?name=James_Galloway&residence=_new+york-usa_35&residence_x=_1-0
https://www.ancestrylibrary.com/search/collections/5058/?name=James_Gallaway&residence=_new+york-usa_35&residence_x=_1-0
second ancestry result: James Galloway
result1 == name1 and result2 == name1
--------------------------------------------------
Name: James Gallaway
Cname: James Galloway
M Score: 100
Ratio: 93
first ancestry result: James Galloway
https://www.ancestrylibrary.com/search/collections/5058/?name=James_Galloway&residence=_new+york-usa_35&residence_x=_1-0
https://www.ancestrylibrary.com/search/collections/5058/?name=James_Gallaway&residence=_new+york-usa_35&residence_x=_1-0
second ancestry result: James Galloway
result1 == name1 and result2 == name1
--------------------------------------------------
Name: Nathniel Tuttle
Cname: Nathaniel Tulttle
M Sco

KeyboardInterrupt: 