# Cleaning Names

The purpose of this notebook is to clean the names of individuals. All the problems that we aim to fix in this notebook are listed [here](https://docs.google.com/document/d/1pcSQfWNll6K9tl-_rB4lztN0TsZsclU9vOnbyQob-Zs/edit).

In [None]:
# import all the necessary packages
from nameparser import HumanName
import pandas as pd 
import numpy as np
import re
import csv
import ast

In [None]:
# import aggregated debt file
agg_debt = pd.read_csv('data/final_agg_debt.csv')

In [None]:
print(agg_debt.dtypes)

## Documenting Changes

<b>Goal: </b> We need to document changes we make to ```agg_debt.csv``` in a separate dataframe: ```name_changes```. This way, we can double-check whether those changes were appropriate. 

<b>Steps</b>
1. Create an empty dataframe. Here are the column names:
    - ```title_org```: The original title of the individual (Mr., Ms., etc.)
    - ```title_new```: The new title of the individual (Mr., Ms., etc.) 
    - ```first_name_org```: The original first name of the individual from the unchanged ```agg_debt.csv```
    - ```last_name_org```: The original last name of the individual from the unchanged ```agg_debt.csv``` 
    - ```first_name_new``` : If first name changed, record it here. Otherwise, this entry will still be the old name. 
    - ```last_name_new```: If last name changed, record it here. Otherwise, this entry will still be the old name. 
    - ```cleaning case```: This corresponds with the task number in the objectives document linked above. 
    - ```file_loc```: The name of the individual state file that the row came from 
    - ```org_index```: The original index/row of the debt entry within ```file_loc``` 
2. Create a function that adds a new row to the dataframe. This function will be called while we are cleaning. 

**Cleaning case = Objective number** 
- Clean company names = 2,
- Handle two names = 3,
- Handle abbreviations = 5,
- Standardize names (Ancestry) = 6

In [None]:
# record changes in this dataframe
name_changes = pd.DataFrame({'title_org': pd.Series(dtype='str'),
                       'title_new': pd.Series(dtype='str'),
                       'first_name_org': pd.Series(dtype='str'),
                       'last_name_org': pd.Series(dtype='str'),
                       'first_name_new': pd.Series(dtype='str'),
                       'last_name_new': pd.Series(dtype='str'),
                       'cleaning case': pd.Series(dtype='int'),
                       'file_loc': pd.Series(dtype='str'),
                       'org_index': pd.Series(dtype='int')})

In [None]:
def add_changes(title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index):
    name_changes.loc[len(name_changes.index)] = [title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index]
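
The logging pattern above can also be sketched without pandas. The version below is a minimal illustration, not the notebook's method: it collects each change as a dict in a plain list, which avoids growing a dataframe one row at a time. The names `COLUMNS`, `change_log` and `log_change`, and the example values `'example.csv'` and `17`, are hypothetical.

```python
# Column order copied from the name_changes dataframe above.
COLUMNS = ['title_org', 'title_new', 'first_name_org', 'last_name_org',
           'first_name_new', 'last_name_new', 'cleaning case', 'file_loc', 'org_index']

change_log = []  # hypothetical stand-in for the name_changes dataframe


def log_change(title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index):
    """Append one change to the log as a dict keyed by column name."""
    values = [title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index]
    change_log.append(dict(zip(COLUMNS, values)))


# made-up example row: a company name cleaned under case 2
log_change('Mr.', 'Mr.', 'James Vernon & Co.', '', 'James', 'Vernon', 2, 'example.csv', 17)
print(change_log[0]['first_name_new'], change_log[0]['last_name_new'])
```

If this approach were used, the list could be converted once at the end with `pd.DataFrame(change_log, columns=COLUMNS)`.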

## Company Names

Some debt entries name a company or a group of people rather than a single individual (example: ```James Vernon & Co.```). 

<b>Goal: </b> Strip the company or group key word from these entries and split what remains into a first name and a last name. 

<b>Steps: </b>
1. Use string parsing to check whether a debt entry ends with a company key word such as '& co' or '& others'. Note: these company names only appear in the first name column, so I do <b>not</b> run this check on the last name column.
2. Remove the key word from the name and split the remainder on whitespace. A single word goes into the last name column; with two or three words, the first word is the first name and the rest is the last name.
3. Names with more than three words, and names that start with a key word (example: 'corporation for the relief of...'), are corrected manually. Manual corrections are cached in ```manual_corrects``` so each one is only entered once.
4. Record name change in ``name_changes``.
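
The key-word stripping in these steps can be sketched as a small standalone function. This is a simplified illustration under stated assumptions: `strip_company_suffix` and `CORP_SUFFIXES` are hypothetical names, and, unlike the cell below, names longer than three words are not sent to manual correction; every word after the first simply goes into the last name.

```python
# Key words taken from the corp_key_words tuple used in this notebook ('corporation' omitted).
CORP_SUFFIXES = (' and co', ' and coy', ' and others', ' and several others',
                 ' and heirs', ' and comp')


def strip_company_suffix(raw_name):
    """Return (first, last) with a trailing company key word removed, or None if none matches."""
    # normalize the same way as the notebook: '&' becomes 'and', periods are dropped
    name = raw_name.replace('&', 'and').replace('.', '').lower().strip()
    for suffix in CORP_SUFFIXES:
        if name.endswith(suffix):
            words = name[:-len(suffix)].split()
            if not words:
                return '', ''
            if len(words) == 1:
                # a single word is treated as a last name
                return '', words[0].capitalize()
            # first word is the first name; everything else goes to the last name
            return words[0].capitalize(), ' '.join(w.capitalize() for w in words[1:])
    return None


print(strip_company_suffix('James Vernon & Co.'))
```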

In [None]:
# retrieve manual corrections from csv file if they exist 
manual_corrects = {}
try:
    # first column holds the original name; keep blank cells as '' instead of NaN
    manual_corrects_df = pd.read_csv('data/manual_corrections.csv', index_col=0, keep_default_na=False)
except FileNotFoundError:
    manual_corrects_df = pd.DataFrame(columns=['new first name', 'new last name'])

# add manual corrections to manual_corrects dictionary: {original name: [new first name, new last name]}
for org_name, correction in manual_corrects_df.iterrows():
    manual_corrects[org_name] = [correction['new first name'], correction['new last name']]

print(manual_corrects)

In [None]:
# dictionary of manual changes i have to make 
changes = {
    'Henry Mc Clellen & Henry & co' : 'Henry Mc Clellen & Co'
}

conn_words = [' for ', ' of ', ' and '] # these are connector key words
corp_key_words = ('corporation', ' and co', ' and coy', ' and others', ' and several others', ' and heirs', ' and comp') # these are corporation key words

In [None]:
# note: the spaCy model is loaded here, but none of the cells below use it
import spacy
english_nlp = spacy.load('en_core_web_sm')

def handle_comp_name(row):        
    org_fname = str(row['to whom due | first name'])
    org_lname = str(row['to whom due | last name'])
    
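    fname = org_fname
    
    # apply manual overrides first: the keys in `changes` still contain '&'
    if fname in changes:
        fname = changes[fname]
    
    # normalize '&' to 'and' and drop periods so the key words below can match
    fname = fname.replace('&', 'and')
    fname = fname.replace('.', '')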
    
    fname_l = str(fname).lower()
    
    # check if string ends with co, coy, or others; if so, delete 
    for key_word in corp_key_words:
        if fname_l.endswith(key_word):
            print('index=' + str(row['Unnamed: 0']))
            print('old name=' + str(fname_l))      
            fname_corr = fname_l.split(key_word)
            print('corrected name=' + str(fname_corr[0])) 
            fname_corr = fname_corr[0]
            fname_sp = fname_corr.split()
            
            # only one name; put name into last name column 
            if len(fname_sp) == 1:
                row['to whom due | first name'] = ''
                row['to whom due | last name'] = fname_sp[0].capitalize()
                print('corrected name=' + str(fname_sp[0])) 
                print('new last name=' + str(fname_sp[0].capitalize()))
                
            # if there is only a first name and a last name, put them into their respective columns
            elif len(fname_sp) == 2:
                row['to whom due | first name'] = fname_sp[0].capitalize()
                row['to whom due | last name'] = fname_sp[1].capitalize()
                print('new first name=' + str(fname_sp[0].capitalize()))
                print('new last name=' + str(fname_sp[1].capitalize()))
                
            # handles middle names; put middle names in last name column 
            elif len(fname_sp) == 3:
                row['to whom due | first name'] = fname_sp[0].capitalize() 
                row['to whom due | last name'] = fname_sp[1].capitalize() + ' ' + fname_sp[2].capitalize()
                print('new first name=' + str(fname_sp[0].capitalize()))
                print('new last name=' + str(fname_sp[1].capitalize() + ' ' + fname_sp[2].capitalize()))  
            # manually clean debt entries that have long names 
            else: 
                # check if name has already been manually cleaned
                if fname_corr in manual_corrects:
                    new_fname = manual_corrects[fname_corr][0]
                    new_lname = manual_corrects[fname_corr][1]
                else:
                    new_fname = input('new first name: ')
                    new_lname = input('new last name: ') 
                    manual_corrects[fname_corr] = [new_fname, new_lname]
                
                row['to whom due | first name'] = new_fname.capitalize()
                row['to whom due | last name'] = new_lname.capitalize()
                    
                print('new first name=' + str(new_fname.capitalize()))
                print('new last name=' + str(new_lname.capitalize()))  
                
            # record change 
            add_changes(row['to whom due | title'], row['to whom due | title'], org_fname, org_lname, 
                   row['to whom due | first name'], row['to whom due | last name'], 2, row['org_file'], row['org_index'])
            
            print('+------------------------------+')
        # if the name starts with any keyword: 'corporation for the relief of...'; manually change these names
        elif fname_l.startswith(key_word): 
            print('index=' + str(row['Unnamed: 0']))
            print('old name=' + str(fname_l))      
            
            # check if name has already been manually cleaned
            if fname_l in manual_corrects:
                new_fname = str(manual_corrects[fname_l][0])
                new_lname = str(manual_corrects[fname_l][1])
            else:
                new_fname = input('new first name: ')
                new_lname = input('new last name: ') 
                manual_corrects[fname_l] = [new_fname, new_lname]

            row['to whom due | first name'] = new_fname.capitalize()
            row['to whom due | last name'] = new_lname.capitalize()

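            print('new first name=' + str(new_fname.capitalize()))
            print('new last name=' + str(new_lname.capitalize()))  
            
            # record change (this branch previously changed the row without logging it)
            add_changes(row['to whom due | title'], row['to whom due | title'], org_fname, org_lname, 
                   row['to whom due | first name'], row['to whom due | last name'], 2, row['org_file'], row['org_index'])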
    
    return row

agg_debt = agg_debt.apply(lambda row: handle_comp_name(row), axis=1)

In [None]:
# checkup on name_changes
name_changes.head()

## Cleaning Entries with Two Names

<b>Goal: </b>There are debt entries that have two names in a single cell: ```NY_2422: Messes Williamson & Beckman```. The plan is to split the names across the first name and last name columns. Note: I still have to check naming conventions during the 1700s. 

<b>Steps: </b>
1. Use string parsing to check if the name contains '&' or 'and' and split the string accordingly. 
2. Use the human name parser library to determine the first name and last names. 
3. Put each person's first name and last name in the respective columns, split by ```|``` to separate both individuals' names. 
4. Record change in ```name_changes```.
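
The splitting rules in these steps can be sketched as a pure function that returns the two column values. This is a hedged illustration covering only the simplest cases; `split_two_names` is a hypothetical name, and anything it does not recognize is returned as `None` for manual review rather than guessed at.

```python
def split_two_names(raw_name):
    """Return (first name column, last name column) for 'A and B' entries, else None."""
    name = raw_name.lower().replace('&', 'and')
    if ' and ' not in name:
        return None
    left, right = name.split(' and ', 1)
    p1 = [w.capitalize() for w in left.split()]
    p2 = [w.capitalize() for w in right.split()]

    if len(p1) == 1 and len(p2) == 1:
        # two bare surnames, e.g. 'Williamson & Beckman'
        return '', p1[0] + ' | ' + p2[0]
    if len(p1) == 1 and len(p2) == 2:
        # shared surname, e.g. 'peter and isaac wikoff'
        return p1[0] + ' | ' + p2[0], p2[1]
    if len(p1) == 2 and len(p2) == 2:
        # two full names, e.g. 'john doe and james hill'
        return p1[0] + ' | ' + p2[0], p1[1] + ' | ' + p2[1]
    return None  # middle names, societies, etc. are left for the full function


print(split_two_names('Williamson & Beckman'))
print(split_two_names('peter and isaac wikoff'))
```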

In [None]:
changes = {
    'van zandt & kittletas' : ['', 'van zandt | kittletas'],
    'trustees of & davids church':['trustees of & davids church', '']
}

In [None]:
def handle_two_name(row):
    org_fn = row['to whom due | first name']
    org_ln = row['to whom due | last name']
    
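    # normalize '&' to 'and' so both spellings are caught by the ' and ' checks below
    org_fn_l = str(org_fn).lower().replace('&', 'and')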
        
    # remove extraneous information like 'for the estates of...'
    org_fn_l = org_fn_l.split(' for ')[0]

    # remove extraneous information like 'of the heirs of...'
    org_fn_l = org_fn_l.split(' of ')[0]

    # remove occupations: guardians, etc. 
    org_fn_l = org_fn_l.replace(' guardian', '')
    
    # check if there are two individuals, but check if there are more than 7 words (most likely a society)
    if ' and ' in org_fn_l and len(org_fn_l.split()) <= 7:   
        print('original name= ' + org_fn_l)
        
        # cleaning extraneous information can reveal there to be only one name
        #if ' and ' in org_fn_l:
        person1 = org_fn_l.split(' and ')[0]
        person2 = org_fn_l.split(' and ')[1]
        person1_sp = person1.split() 
        person2_sp = person2.split()

        # recapitalize people's names (rebuild the lists; reassigning the loop variable has no effect)
        person1_sp = [word.capitalize() for word in person1_sp]
        person2_sp = [word.capitalize() for word in person2_sp]

        # if both individuals only have a last name; put both last names into last name column  ex. edward and joseph
        if len(person1_sp) == 1 and len(person2_sp) == 1:
            row['to whom due | last name'] = person1_sp[0] + ' | ' + person2_sp[0]
            print('new last name col=' + person1_sp[0] + ' | ' + person2_sp[0])
        # if there are three separate last names; put all three into last name column: ex. vance caldwell and vance
        elif len(person1_sp) == 2 and len(person2_sp) == 1:
            row['to whom due | last name'] = person1_sp[0] + ' | ' + person1_sp[1] + ' | ' + person2_sp[0]
            print('new last name col=' + person1_sp[0] + ' | ' + person1_sp[1] + ' | ' + person2_sp[0])
        # if both individuals belong to the same family; put names into respective cols: ex. peter and isaac wikoff  
        elif len(person1_sp) == 1 and len(person2_sp) == 2:
            row['to whom due | first name'] = person1_sp[0] + ' | ' + person2_sp[0]
            row['to whom due | last name'] = person2_sp[1]
            print('new first name col=' + person1_sp[0] + ' | ' + person2_sp[0])
            print('new last name col=' + person2_sp[1])
        # if both individuals are two completely different people with full names; ex. john doe and james hill
        elif len(person1_sp) == 2 and len(person2_sp) == 2:
            row['to whom due | first name'] = person1_sp[0] + ' | ' + person2_sp[0]
            row['to whom due | last name'] = person1_sp[1] + ' | ' + person2_sp[1]
            print('new first name col=' + person1_sp[0] + ' | ' + person2_sp[0])
            print('new last name col=' + person1_sp[1] + ' | ' + person2_sp[1])
        # if either individual has a middle name; group middle names with the last name; ex. john hill doe and james madison hill
        elif len(person1_sp) == 3 or len(person2_sp) == 3:
            row['to whom due | first name'] = person1_sp[0] + ' | ' + person2_sp[0]
            # both individuals have a middle name (checked first, otherwise this case is never reached)
            if len(person1_sp) == 3 and len(person2_sp) == 3:
                row['to whom due | last name'] = person1_sp[1] + ' ' + person1_sp[2] + ' | ' + person2_sp[1] + ' ' + person2_sp[2]
                print('new last name col=' + person1_sp[1] + ' ' + person1_sp[2] + ' | ' + person2_sp[1] + ' ' + person2_sp[2]) 
            # only the first individual has a middle name
            elif len(person1_sp) == 3:
                person2_ln = ''
                if len(person2_sp) > 1:
                    person2_ln = person2_sp[1]
                
                row['to whom due | last name'] = person1_sp[1] + ' ' + person1_sp[2] + ' | ' + person2_ln
                print('new last name col=' + person1_sp[1] + ' ' + person1_sp[2] + ' | ' + person2_ln)
            # only the second individual has a middle name
            else:
                person1_ln = ''
                if len(person1_sp) > 1:
                    person1_ln = person1_sp[1]
                
                row['to whom due | last name'] = person1_ln + ' | ' + person2_sp[1] + ' ' + person2_sp[2]
                print('new last name col=' + person1_ln + ' | ' + person2_sp[1] + ' ' + person2_sp[2]) 
            
        print('+------------------------------+')
    # might be a corporation or many names; manually fix
    elif ' and ' in org_fn_l and len(org_fn_l.split()) > 7:
        print('original name= ' + org_fn_l)
         # check if name has already been manually cleaned
        if org_fn in manual_corrects:
            new_fname = str(manual_corrects[org_fn][0])
            new_lname = str(manual_corrects[org_fn][1])
        else:
            new_fname = input('new first name: ')
            new_lname = input('new last name: ') 
            manual_corrects[org_fn] = [new_fname, new_lname]

        row['to whom due | first name'] = new_fname.capitalize()
        row['to whom due | last name'] = new_lname.capitalize()

        print('new first name col=' + row['to whom due | first name'])
        print('new last name col=' + row['to whom due | last name'])

        print('+------------------------------+')
        
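    # record change (cleaning case 3) if either name column was modified
    if str(row['to whom due | first name']) != str(org_fn) or str(row['to whom due | last name']) != str(org_ln):
        add_changes(row['to whom due | title'], row['to whom due | title'], org_fn, org_ln,
                    row['to whom due | first name'], row['to whom due | last name'], 3, row['org_file'], row['org_index'])
    
    return row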

# assign the result back; otherwise the cleaned rows are discarded
agg_debt = agg_debt.apply(lambda row: handle_two_name(row), axis=1)

In [None]:
# save manual corrections 
manual_corrects_df = pd.DataFrame.from_dict(manual_corrects, orient='index') 
manual_corrects_df.columns = ['new first name', 'new last name']
manual_corrects_df.to_csv('data/manual_corrections.csv')

In [None]:
# checkup on name_changes
name_changes.tail()

## Handle Abbreviations of a Name

<b>Goal: </b>Some debt entries record a handwritten abbreviation of a first name; these should be expanded to the full name. Chris found a website that lists these [abbreviations](https://hull-awe.org.uk/index.php/Conventional_abbreviations_for_forenames). 

<b>Steps: </b>
1. Copy and paste the name abbreviations from the website into a dictionary. 
2. Iterate through each row in the dataframe.
3. Check if the name is an abbreviation and change accordingly. 
4. Record changes. 
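
The lookup in step 3 can be sketched as a small helper. The tolerance for a trailing period and for different capitalization is an assumption added for illustration; the cell below only matches exact keys. `expand_abbreviation` is a hypothetical name, and `ABBREVIATIONS` holds only a subset of the dictionary defined below.

```python
# Subset of the abbreviations dictionary used in this notebook.
ABBREVIATIONS = {'Edw': 'Edward', 'Geo': 'George', 'Jn': 'John', 'Tho': 'Thomas'}


def expand_abbreviation(first_name):
    """Return the full forename for a known abbreviation, else the name unchanged."""
    # assumption: ignore surrounding spaces, a trailing period, and letter case
    key = first_name.strip().rstrip('.').capitalize()
    return ABBREVIATIONS.get(key, first_name)


print(expand_abbreviation('Geo.'))
print(expand_abbreviation('George'))
```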


In [None]:
abbreviations = {
    'And':'Andrew', 'Ant':'Anthony', 'Bart':'Bartholomew', 'Cha':'Charles', 'Dor':'Dorothy', 'Dot':'Dorothy', 'Doth':'Dorothy',
    'Edw':'Edward', 'Eliz':'Elizabeth', 'Geo':'George', 'H':'Henry', 'Herb':'Herbert', 'Ja':'James', 'Jn':'John', 'Marg':'Margaret', 
    'Mich':'Michael', 'Pat': 'Patrick', 'Rich':'Richard', 'Tho':'Thomas', 'W':'William'
}

In [None]:
def handle_abbreviations(row):
    fn = str(row['to whom due | first name'])
    if fn in abbreviations:
        row['to whom due | first name'] = abbreviations[fn]
        # record changes
        add_changes(row['to whom due | title'], row['to whom due | title'], fn, 
                    row['to whom due | last name'], row['to whom due | first name'], 
                    row['to whom due | last name'], 5, row['org_file'], row['org_index'])
    
    return row

# assign the result back; otherwise the expanded names are discarded
agg_debt = agg_debt.apply(lambda row: handle_abbreviations(row), axis=1)

In [None]:
# checkup on name_changes
name_changes.tail()

## Standardizing Names

<b>Goal: </b>Different spellings of a name can refer to the same person. We will use fuzzy string matching, a phonetics library, and Ancestry to find these cases and settle on one spelling. 

<b>Steps: </b>
1. Login to Emory's Ancestry subscription 
2. Iterate through ```agg_debt```, through each debt entry. 
3. Use a combination of phonetics fuzzy string matching and normal fuzzy string matching to determine if two names from a state are similar.  
4. Search each name in Ancestry: Edit URL (state and person's name). 
5. Check if there are any results for both names:
    - Yes: Check if one spelling of the name appears for both individuals (that's most likely the correct spelling of that name) 
    - No: Leave entries as two separate entries. 
6. Record name change in ```fixes``` list (save ```fixes``` as ```out.csv``` too). 
7. Run ```agg_debt``` through ```fixes```, making changes as necessary. 
8. Save ```agg_debt``` as a new .csv file.

<b style="color: red;">Note: Runtime is long. This is due to the fact there are over 200,000 debt entries and accessing Ancestry takes time too. </b>
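
The candidate filter in step 3 can be sketched with the standard library alone. This is a stand-in, not the notebook's implementation: `difflib.SequenceMatcher` replaces `fuzz.ratio`, the phonetic (metaphone) comparison is left out, and the names `similarity` and `is_candidate_pair` are hypothetical. The threshold of 90 mirrors the value used in the cells below.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings on a 0-100 scale (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def is_candidate_pair(name, other, state, other_state, threshold=90):
    """True when two different names from the same state are similar enough to look up."""
    if name == other or state != other_state:
        return False
    # cheap pre-filter: same initial on the first word and on the last word
    n, o = name.split(), other.split()
    if not n or not o or n[0][0] != o[0][0] or n[-1][0] != o[-1][0]:
        return False
    return similarity(name, other) > threshold


print(is_candidate_pair('William Thompson', 'William Thomson', 'ny', 'ny'))
print(is_candidate_pair('William Thompson', 'Walter Thomas', 'ny', 'ny'))
```

Only pairs that pass a filter like this need a (slow) Ancestry lookup, which is where most of the runtime goes.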


In [None]:
# import selenium (for browsing Ancestry) and the fuzzy string libraries 
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.expected_conditions import element_to_be_clickable, presence_of_element_located
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from phonetics import metaphone
from fuzzywuzzy import fuzz
import getpass
import time

In [None]:
# options
options = Options()
options.add_argument('--headless')
options.add_argument("--window-size=1000,1000")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('--no-sandbox')   
options.add_argument(r'--user-data-dir=C:/Users/david/AppData/Local/Google/Chrome/User Data')

In [None]:
# install driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
wait = WebDriverWait(driver, 30)

In [None]:
# login to emory ancestry 
driver.get('https://guides.libraries.emory.edu/ALE')
login_btn_xpath = '/html/body/main/div/div/div/a'
wait.until(element_to_be_clickable((By.XPATH, login_btn_xpath))).click()

# input login information and click 'login'
netid_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[1]/input'
password_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[2]/input'
username = input('username: ')
password = getpass.getpass(prompt='password: ')
netid_input = wait.until(element_to_be_clickable((By.XPATH, netid_xpath)))
netid_input.click()
netid_input.send_keys(username)
pass_input = wait.until(element_to_be_clickable((By.XPATH, password_xpath)))
pass_input.click()
pass_input.send_keys(password)

login_btn_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[3]/button'
wait.until(element_to_be_clickable((By.XPATH, login_btn_xpath))).click()
time.sleep(1)

driver.get('https://www.ancestrylibrary.com/search/collections/5058/')

In [None]:
'''
Find out what states are in the agg_debt dataframe - Helps with finding Ancestry urls 
Note:
- CS most likely stands for congress: Hazen's regiment 
- F probably stands for 'foreign' officers: most likely France
'''
agg_debt.state.unique()

In [None]:
# ancestry has unique urls for each state
ancestry_state_urls = {
    'nh':'_new+hampshire-usa_32',
    'ny':'_new+york-usa_35',
    'ma':'_massachusetts-usa_24',
    'ct':'_connecticut-usa_9',
    'pa':'_pennsylvania-usa_41',
    'md':'_maryland-usa_23',
    'nc':'_north+carolina-usa_36',
    'ri':'_rhode+island-usa_42'
}
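
The search URLs used later in this notebook follow one pattern, which can be sketched as a helper. `build_search_url` is a hypothetical name, and encoding spaces inside a name as `+` via `quote_plus` is an assumption (chosen because the state codes above use `+` the same way); it has not been checked against Ancestry's own behavior.

```python
from urllib.parse import quote_plus

BASE_URL = 'https://www.ancestrylibrary.com/search/collections/5058/'
# subset of the ancestry_state_urls dictionary above
STATE_CODES = {'ny': '_new+york-usa_35', 'pa': '_pennsylvania-usa_41'}


def build_search_url(first_name, last_name, state):
    """Build the collection search URL for one person and one state."""
    # assumption: spaces within a name part are sent as '+'
    name = quote_plus(first_name) + '_' + quote_plus(last_name)
    return BASE_URL + '?name=' + name + '&residence=' + STATE_CODES[state] + '&residence_x=_1-0'


print(build_search_url('John', 'Doe', 'ny'))
```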

In [None]:
fixes = [] # record necessary name changes here

In [None]:
c_checked = [] #multiple debt entries for the same person: don't search these names again when comparing

In [None]:
checked0 = [] #multiple debt entries for the same person: don't search these names again

In [None]:
rerun_rows = [] #ancestry crashed trying to find these names

In [None]:
def access_ancestry(fn0, ln0, c_fn, c_ln, state, row0, c_row):
    name0 = fn0 + ' ' + ln0 # static
    c_name = c_fn + ' ' + c_ln # changing
    
    driver.get('https://www.ancestrylibrary.com/search/collections/5058/?name=' + fn0 + '_' + ln0 + '&residence=' + ancestry_state_urls[state] + '&residence_x=_1-0')
    print(driver.current_url)
    
    try:
        try:
            check_exists = driver.find_element(By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/div/div')
            result1 = name0
            result2 = c_name
        except NoSuchElementException:  
            result = wait.until(presence_of_element_located((By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/table/tbody/tr[2]/td[2]/span'))).text
            result_name = HumanName(result)

            if result_name.middle != '':
                result1 = result_name.first + ' ' + result_name.middle + ' ' + result_name.last
            else:
                result1 = result_name.first + ' ' + result_name.last
        
        print('index: ' + str(row0['index1']))
        print('first ancestry result: ' + result1)

        driver.get('https://www.ancestrylibrary.com/search/collections/5058/?name=' + c_fn + '_' + c_ln + '&residence=' + ancestry_state_urls[state] + '&residence_x=_1-0')
        print(driver.current_url)

        try:
            check_exists = driver.find_element(By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/div/div')
            result1 = name0
            result2 = c_name
        except NoSuchElementException:
            result = wait.until(presence_of_element_located((By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/table/tbody/tr[2]/td[2]/span'))).text
            result_name = HumanName(result)

            if result_name.middle != '':
                result2 = result_name.first + ' ' + result_name.middle + ' ' + result_name.last
            else:
                result2 = result_name.first + ' ' + result_name.last

        print('second ancestry result: ' + result2) 

        if result1 == name0 and result2 == name0: # name0 is the correct spelling of the name
            fixes.append([state, {c_name:name0}])
            
            # record changes
            add_changes(c_row['to whom due | title'], c_row['to whom due | title'], c_row['to whom due | first name'],
                        c_row['to whom due | last name'], fn0, ln0, 6, c_row['org_file'], c_row['org_index'])
        elif result1 == c_name and result2 == c_name: # c_name is the correct spelling of the name
            fixes.append([state, {name0:c_name}])
            
            # record changes
            add_changes(row0['to whom due | title'], row0['to whom due | title'], row0['to whom due | first name'],
                        row0['to whom due | last name'], c_fn, c_ln, 6, row0['org_file'], row0['org_index'])
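    except Exception:
        # lookup failed (timeout, page layout change, etc.); keep the pair so it can be rerun later
        rerun_rows.append([row0, c_row])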

def only_f(row0, row):
    try:
        if row['to whom due | first name'][0] == row0['to whom due | first name'][0] and row['to whom due | last name'][0] == row0['to whom due | last name'][0] and row['state'] == row0['state']:
            return row
    except:
        return 

def fuzzy_similarity(c_row, row0):
    name = row0['to whom due | first name'] + ' ' + row0['to whom due | last name']
    cname = c_row['to whom due | first name'] + ' ' + c_row['to whom due | last name']
    
    # check fuzzy string ratio and metaphone ratio
    code1 = metaphone(name)
    code2 = metaphone(cname)
    ratio = fuzz.ratio(name, cname)
    score = fuzz.ratio(code1, code2)

    # only search ancestry when both the fuzzy ratio and the metaphone ratio are greater than 90,
    # the names don't equal each other, and we haven't checked the current name already
    if score > 90 and ratio > 90 and name != cname and (cname not in c_checked):
        print('name: ' + name)
        print('comparing to name: ' + cname)
        
        correct_name = access_ancestry(row0['to whom due | first name'], row0['to whom due | last name'],
                                       c_row['to whom due | first name'], c_row['to whom due | last name'], row0['state'], row0, c_row)
            
        print('--------------------------------------------------')
        c_checked.append(cname)
    
def determine_similarities(row0):
    current_name = row0['to whom due | first name'] + ' ' + row0['to whom due | last name']
    
    # only search ancestry when ancestry has records and if we have not checked name already
    if (row0['state'] in ancestry_state_urls) and ([row0['state'], current_name] not in checked0):
        
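        # shorten table to only include names that share first letter of first name and last name and come from the same state
        fn_col = agg_debt['to whom due | first name']
        ln_col = agg_debt['to whom due | last name']
        same_initials = (fn_col.str[:1] == row0['to whom due | first name'][:1]) & (ln_col.str[:1] == row0['to whom due | last name'][:1])
        short_table = agg_debt[same_initials & (agg_debt['state'] == row0['state'])]

        if len(short_table) > 0:
            c_checked.clear()
            # axis=1 so each candidate row (not each column) is compared with row0
            short_table.apply(lambda row: fuzzy_similarity(row, row0), axis=1)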
            
        checked0.append([row0['state'], current_name])
                        
    if len(fixes) % 5 == 0 and len(fixes) > 0:
        print(fixes[len(fixes) - 1]) 
    
agg_debt['to whom due | first name'] = agg_debt['to whom due | first name'].astype(str)
agg_debt['to whom due | last name'] = agg_debt['to whom due | last name'].astype(str)
agg_debt.sort_values('to whom due | last name', inplace=True)
agg_debt.reset_index(inplace=True)
agg_debt['index1'] = agg_debt.index
# determine_similarities takes a single row; agg_debt is read from the notebook's global scope
agg_debt.apply(lambda row: determine_similarities(row), axis=1)

In [None]:
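# save fixes as csv file: 'out.csv' (one row per fix: state, {old name: new name})
with open('data/out.csv', 'w', newline='') as write_obj:
    csv.writer(write_obj).writerows(fixes)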

In [None]:
with open('data/out.csv', 'r') as read_obj:

    csv_reader = csv.reader(read_obj)
  
    # convert string to list
    entries = list(csv_reader)
    
    # remove empty lists
    entries = [entry for entry in entries if entry != []]
  
    print(entries)

In [None]:
# reorganize entries and group by state
entries_dict = {}
for entry in entries:
    if entry[0] not in entries_dict:
        entries_dict[entry[0]] = [entry]
    else:
        entries_dict[entry[0]] += [entry]

print(entries_dict)
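
The save, reload, and group-by-state round trip can be sketched in memory. This assumes `out.csv` holds one `[state, {old name: new name}]` row per fix, as steps 6 and 7 describe; the sample fixes are made up, `io.StringIO` stands in for the file, and merging each state's fixes into a single dictionary (`by_state`) is a variation on the list-of-entries layout used above.

```python
import ast
import csv
import io
from collections import defaultdict

# made-up fixes in the same [state, {old name: new name}] shape as the fixes list
fixes = [['ny', {'Jon Smith': 'John Smith'}],
         ['pa', {'Wm Penn': 'William Penn'}],
         ['ny', {'Jas Doe': 'James Doe'}]]

buffer = io.StringIO()  # stands in for data/out.csv
csv.writer(buffer).writerows(fixes)  # each dict is written as its string form
buffer.seek(0)

by_state = defaultdict(dict)
for state, mapping in csv.reader(buffer):
    # literal_eval safely turns the stored string back into a dict
    by_state[state].update(ast.literal_eval(mapping))

print(dict(by_state))
```

With one dictionary per state, looking up a full name becomes a single `in` check instead of a loop over entries.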

In [None]:
# implement changes to agg_debt
def implement_name_changes(row):  
    if str(row['state']) not in entries_dict:
        return row 
    
    # only select the part of the list that belongs to this row's state 
    entries_red = entries_dict[row['state']]
    
    # loop through list
    # check if name matches 
    full_name = str(row['to whom due | first name']) + ' ' + str(row['to whom due | last name'])
    
    for entry in entries_red:
        name_dict = ast.literal_eval(entry[1])
        
        if full_name in name_dict:
            new_name = name_dict[full_name]
            new_name_l = new_name.split()
            new_fn = ''
            new_ln = ''
            
            # account for middle initials and middle names
            if len(new_name_l) >= 3:
                name = HumanName(new_name)
                
                if len(name.middle) == 1: # middle initial
                    new_fn = name.first + ' ' + name.middle 
                    new_ln = name.last
                elif len(name.middle) > 1: # middle name
                    new_fn = name.first
                    new_ln = name.middle + ' ' + name.last
                else: # no middle name 
                    new_fn = name.first
                    new_ln = name.last
                    
            # if there is only a first name and a last name 
            else:
                new_fn = new_name_l[0]
                new_ln = new_name_l[1] 
            
            # record changes
            add_changes(row['to whom due | title'], row['to whom due | title'], row['to whom due | first name'], row['to whom due | last name'], 
                   new_fn, new_ln, 6, row['org_file'], row['org_index'])
            
            # remove unnecessary spaces at the ends of the strings (strip() returns a new string)
            new_fn = new_fn.strip()
            new_ln = new_ln.strip()
            
            row['to whom due | first name'] = new_fn 
            row['to whom due | last name'] = new_ln
            
            print('old name=' + full_name)
            print('new first name=' + new_fn)
            print('new last name=' + new_ln)
            print('name_changes status=' + str(len(name_changes)))
            print('------------------')
            
    return row

agg_debt = agg_debt.apply(lambda row: implement_name_changes(row), axis=1)

In [None]:
agg_debt.head()

In [None]:
agg_debt.to_csv('data/agg_debt_clean_david.csv')

In [None]:
name_changes.to_csv('data/name_changes_david.csv')