# Cleaning Names

The purpose of this notebook is to clean the names of individuals. All the problems that we aim to fix in this notebook are listed [here](https://docs.google.com/document/d/1pcSQfWNll6K9tl-_rB4lztN0TsZsclU9vOnbyQob-Zs/edit).

In [24]:
# import all the necessary packages
import pandas as pd 
import numpy as np
import re
import csv
import ast

In [36]:
# import aggregated debt file
agg_debt = pd.read_csv('data/final_agg_debt.csv')

  agg_debt = pd.read_csv('data/final_agg_debt.csv')


In [37]:
print(agg_debt.dtypes)

Unnamed: 0                                 int64
letter                                    object
date of the certificate | month          float64
date of the certificate | day            float64
date of the certificate | year           float64
to whom due | first name                  object
to whom due | last name                   object
to whom due | title                       object
time when the debt became due | month    float64
time when the debt became due | day       object
time when the debt became due | year      object
amount | dollars                         float64
amount | 90th                             object
line strike through? | yes?              float64
line strike through? | note               object
notes                                     object
state                                     object
org_file                                  object
org_index                                  int64
to whom due | title.1                     object
to whom due | first 

## Documenting Changes

<b>Goal: </b> We need to document changes we make to ```agg_debt.csv``` in a separate dataframe: ```name_changes```. This way, we can double-check whether those changes were appropriate. 

<b>Steps</b>
1. Create an empty dataframe. Here are the column names:
    - ```title_org```: The original title of the individual (Mr., Ms., etc.)
    - ```title_new```: The new title of the individual (Mr., Ms., etc.) 
    - ```first_name_org```: The original first name of the individual from the unchanged ```agg_debt.csv```
    - ```last_name_org```: The original last name of the individual from the unchanged ```agg_debt.csv``` 
    - ```first_name_new``` : If first name changed, record it here. Otherwise, this entry will still be the old name. 
    - ```last_name_new```: If last name changed, record it here. Otherwise, this entry will still be the old name. 
    - ```cleaning case```: This corresponds with the task number in the objectives document linked above. 
    - ```file_loc```: The individual state filename in which the row came from 
    - ```org_index```: The original index/row that the debt entry can be found in ```file_loc``` 
2. Create a function that adds a new row to the dataframe. This function will be called while we are cleaning. 

**Cleaning case = Objective number** 
- Clean company names = 2,
- Handle two names = 3,
- Handle abbreviations = 5,
- Standardize names (Ancestry) = 6

In [27]:
# record changes in this dataframe
name_changes = pd.DataFrame({'title_org': pd.Series(dtype='str'),
                       'title_new': pd.Series(dtype='str'),
                       'first_name_org': pd.Series(dtype='str'),
                       'last_name_org': pd.Series(dtype='str'),
                       'first_name_new': pd.Series(dtype='str'),
                       'last_name_new': pd.Series(dtype='str'),
                       'cleaning case': pd.Series(dtype='int'),
                       'file_loc': pd.Series(dtype='str'),
                       'org_index': pd.Series(dtype='int')})

In [28]:
print(name_changes.to_markdown())

| title_org   | title_new   | first_name_org   | last_name_org   | first_name_new   | last_name_new   | cleaning case   | file_loc   | org_index   |
|-------------|-------------|------------------|-----------------|------------------|-----------------|-----------------|------------|-------------|


In [29]:
def add_changes(title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index):
    new_row = [title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index]
    name_changes.loc[len(name_changes.index)] = new_row

## Company Names

<b>Goal: </b> Some debt entries are actually company names or represent a group of people (example: ```James Vernon & Co.```). 

<b>Steps: </b>
1. Use string parsing to find if a debt entry has '& co' or '& others' in it's name. Note: I noticed these company names appear in the first name column. I do <b>not</b> run this program on the last name column.
2. I remove the '& co' or '& others' from the name. I use a human name parser library. This library can find out what parts of the name are the first name versus last name. 
3. I put the first name and last name in their own respective columns. 
4. Record name change in ``name_changes``.

In [65]:
# retrieve manual corrections from csv file if they exist 
manual_corrects_df = pd.read_csv('data/manual_corrections.csv')
manual_corrects_dict = manual_corrects_df.to_dict(orient='index')
manual_corrects = {}
# add manual corrections to manual_corrects dictionary 
for correction in manual_corrects_dict.keys():
    manual_corrects[manual_corrects_dict[correction]['Unnamed: 0']] = [manual_corrects_dict[correction]['new first name'], manual_corrects_dict[correction]['new last name']]

print(manual_corrects)

{'henry mc clellen and henry': ['henry', 'mc clellen | henry'], 'william allison ex of mat mease': ['william', 'allison'], 'samuel ely and michael gellington esq': ['samuel | michael', 'ely | gellington'], 'michael schubert for st michael and zeus ': ['michael', 'schubert'], 'corporation for the relief of poor and distressed presbyterian ministers': ['corporation for the relief of poor and distressed presbyterian ministers', nan], 'corporation for relief of poor and distressed presbyterian ministers': ['corporation for relief of poor and distressed presbyterian ministers', nan], 'Nathaniel Appleton and other trustees of Judah Monis Legasy': ['Nathaniel', 'Appleton'], 'Gilbert Sutton John Voorhees Lawrence Vandiveer and Jacobus Vandiveer': ['Gilbert | John | Lawrence | Jacobus', 'Sutton | Voorhees | Vandiveer | Vandiveer'], 'Society for Relief of Poor Masters of ships  widows and children': ['Society for Relief of Poor Masters of ships  widows and children', nan], 'Robert Morris and Joh

In [31]:
# dictionary of manual changes i have to make 
changes = {
    'Henry Mc Clellen & Henry & co' : 'Henry Mc Clellen & Co'
}

conn_words = [' for ', ' of ', ' and '] # these are connector key words
corp_key_words = ('corporation', ' and co', ' and coy', ' and others', ' and several others', ' and heirs', ' and comp', ' and other trustees') # these are corporation key words

In [43]:
print(agg_debt['to whom due | first name'].loc[[5776, 8879]].to_markdown())

|      | to whom due | first name   |
|-----:|:---------------------------|
| 5776 | Henry Wisner & Co          |
| 8879 | James Mc Farlane & others  |


In [33]:
def handle_comp_name(row):        
    org_fname = str(row['to whom due | first name'])
    org_lname = str(row['to whom due | last name'])
    
    fname = str(row['to whom due | first name'])
    fname = fname.replace('&', 'and')
    fname = fname.replace('.', '')
    
    if fname in changes:
        fname = changes[fname]
    
    fname_l = str(fname).lower().strip()
    
    # check if string ends with co, coy, or others; if so, delete 
    for key_word in corp_key_words:
        if fname_l.endswith(key_word):
            print('index=' + str(row['Unnamed: 0']))
            print('old name=' + str(org_fname))      
            fname_corr = fname_l.split(key_word)
            print('corrected name=' + str(fname_corr[0])) 
            fname_corr = fname_corr[0]
            fname_sp = fname_corr.split()
            
            # only one name; put name into last name column 
            if len(fname_sp) == 1:
                row['to whom due | first name'] = ''
                row['to whom due | last name'] = fname_sp[0].capitalize()
                print('corrected name=' + str(fname_sp[0])) 
                print('new last name=' + str(fname_sp[0].capitalize()))
                
            # if there are is only a first name and a last name, put into respective columns
            elif len(fname_sp) == 2:
                row['to whom due | first name'] = fname_sp[0].capitalize()
                row['to whom due | last name'] = fname_sp[1].capitalize()
                print('new first name=' + str(fname_sp[0].capitalize()))
                print('new last name=' + str(fname_sp[1].capitalize()))
                
            # handles middle names; put middle names in last name column 
            elif len(fname_sp) == 3:
                row['to whom due | first name'] = fname_sp[0].capitalize() 
                row['to whom due | last name'] = fname_sp[1].capitalize() + ' ' + fname_sp[2].capitalize()
                print('new first name=' + str(fname_sp[0].capitalize()))
                print('new last name=' + str(fname_sp[1].capitalize() + ' ' + fname_sp[2].capitalize()))  
            # manually clean debt entries that have long names 
            else: 
                # check if name has already been manually cleaned
                if fname_corr in manual_corrects:
                    new_fname = manual_corrects[fname_corr][0]
                    new_lname = manual_corrects[fname_corr][1]
                else:
                    new_fname = input('new first name: ')
                    new_lname = input('new last name: ') 
                    manual_corrects[fname_corr] = [new_fname, new_lname]
                
                row['to whom due | first name'] = new_fname.capitalize()
                row['to whom due | last name'] = new_lname.capitalize()
                    
                print('new first name=' + str(new_fname.capitalize()))
                print('new last name=' + str(new_lname.capitalize()))  
                
            # record change 
            add_changes(row['to whom due | title'], row['to whom due | title'], org_fname, org_lname, 
                   row['to whom due | first name'], row['to whom due | last name'], 2, row['org_file'], row['org_index'])
            
            print('+------------------------------+')
        # if the name starts with any keyword: 'corporation for the relief of...'; manually change these names
        elif fname_l.startswith(key_word): 
            print('index=' + str(row['Unnamed: 0']))
            print('old name=' + str(fname_l))      
            
            # check if name has already been manually cleaned
            if fname_l in manual_corrects:
                new_fname = str(manual_corrects[fname_l][0])
                new_lname = str(manual_corrects[fname_l][1])
            else:
                new_fname = input('new first name: ')
                new_lname = input('new last name: ') 
                manual_corrects[fname_l] = [new_fname, new_lname]

            row['to whom due | first name'] = new_fname.capitalize()
            row['to whom due | last name'] = new_lname.capitalize()
            
            # record change 
            add_changes(row['to whom due | title'], row['to whom due | title'], org_fname, org_lname, 
                   row['to whom due | first name'], row['to whom due | last name'], 2, row['org_file'], row['org_index'])

            print('new first name=' + str(new_fname.capitalize()))
            print('new last name=' + str(new_lname.capitalize()))  
    
    return row

agg_debt = agg_debt.apply(lambda row: handle_comp_name(row), axis=1)

index=5776
old name=Henry Wisner & Co
corrected name=henry wisner
new first name=Henry
new last name=Wisner
+------------------------------+
index=5777
old name=Henry Wisner & Co
corrected name=henry wisner
new first name=Henry
new last name=Wisner
+------------------------------+
index=5778
old name=Henry Wisner & Co
corrected name=henry wisner
new first name=Henry
new last name=Wisner
+------------------------------+
index=5779
old name=Henry Wisner & Co
corrected name=henry wisner
new first name=Henry
new last name=Wisner
+------------------------------+
index=5780
old name=Henry Wisner & Co
corrected name=henry wisner
new first name=Henry
new last name=Wisner
+------------------------------+
index=5781
old name=Henry Wisner & Co
corrected name=henry wisner
new first name=Henry
new last name=Wisner
+------------------------------+
index=5782
old name=Henry Wisner & Co
corrected name=henry wisner
new first name=Henry
new last name=Wisner
+------------------------------+
index=5783
ol

In [34]:
# checkup on name_changes
name_changes.tail()

Unnamed: 0,title_org,title_new,first_name_org,last_name_org,first_name_new,last_name_new,cleaning case,file_loc,org_index
685,,,Mark Nephew & Co,,Mark,Nephew,2,loan_office_certificates_9_states.xlsx,80889
686,,,Mark Nephew & Co,,Mark,Nephew,2,loan_office_certificates_9_states.xlsx,80890
687,,,Mark Nephew & Co,,Mark,Nephew,2,loan_office_certificates_9_states.xlsx,80891
688,,,Mark Nephew & Co,,Mark,Nephew,2,loan_office_certificates_9_states.xlsx,80892
689,,,J Mc Nesbitt & Co,,J,Mc Nesbitt,2,Marine_Liquidated_Debt_Certificates.xlsx,728


In [54]:
print(name_changes[['first_name_org', 'last_name_org', 'first_name_new', 'last_name_new']].head(1).to_markdown())

|    | first_name_org    |   last_name_org | first_name_new   | last_name_new   |
|---:|:------------------|----------------:|:-----------------|:----------------|
|  0 | Henry Wisner & Co |             nan | Henry            | Wisner          |


In [42]:
agg_debt['Unnamed: 0'] = agg_debt.index

In [43]:
agg_debt.rename(columns={'Unnamed: 0' : 'index'}, inplace=True)

## Cleaning Entries with Two Names

<b>Goal: </b>There are debt entries that have two names in a single cell: ```NY_2422: Messes Williamson & Beckman```. The plan is to split the name across the first name and last name columns. Note: I have to check naming conventions during thre 1700s. 

<b>Steps: </b>
1. Use string parsing to check if the name contains '&' or 'and' and split the string accordingly. 
2. Use the human name parser library to determine the first name and last names. 
3. Put each person's first name and last name in the respective columns, split by ```|``` to separate both individuals' names. 
4. Record change in ```name_changes```.

<b>Examples of different formats</b>
- James and Ash 
- William Miller and John Gamble

In [71]:
changes = {
    'van zandt & kittletas' : ['', 'van zandt | kittletas'],
    'trustees of & davids church':['trustees of & davids church', '']
}

In [72]:
# make sure all names are of type: str
agg_debt[['to whom due | first name', 'to whom due | last name']] = agg_debt[['to whom due | first name', 'to whom due | last name']].astype(str)

In [73]:
print(agg_debt['to whom due | first name'].loc[[182, 178682]].to_markdown())

|        | to whom due | first name            |
|-------:|:------------------------------------|
|    182 | Furman & Hunt                       |
| 178682 | William Rigden and Edward Middleton |


In [74]:
# function to convert
def listToString(s):
 
    # initialize an empty string
    str1 = " "
 
    # return string
    return (str1.join(s))

In [75]:
def handle_two_name(row):
    org_fn = row['to whom due | first name']
    org_ln = row['to whom due | last name']
    
    org_fn_l = str(org_fn).lower()
        
    # remove extraneous information like 'for the estates of...'
    org_fn_l = org_fn_l.split(' for ')[0]

    # remove extraneous information like 'of the heirs of...'
    org_fn_l = org_fn_l.split(' of ')[0]

    # remove occupations: guardians, etc. 
    org_fn_l = org_fn_l.replace(' guardian', '')
    
    # check if there are two individuals, but check if there are more than 7 words (most likely a society)
    if ' and ' in org_fn_l and len(org_fn_l.split()) <= 7:   
        print('original name= ' + org_fn_l)
        
        # cleaning extraneous information can reveal there to be only one name
        #if ' and ' in org_fn_l:
        person1 = org_fn_l.split(' and ')[0]
        person2 = org_fn_l.split(' and ')[1]
        person1_sp = person1.split() 
        person2_sp = person2.split()

        # recapitalize people's names
        person1_sp = [i.title() for i in person1_sp]
        person2_sp = [i.title() for i in person2_sp]

        # if both individuals only have a last name; put both last names into last name column  ex. edward and joseph
        if len(person1_sp) == 1 and len(person2_sp) == 1:
            row['to whom due | first name'] = ''
            row['to whom due | last name'] = [person1_sp[0], person2_sp[0]] 
            
            print('new last name col (org)=' + listToString(row['to whom due | last name']))
        # if there are three separate last names; put all three into last name column: ex. vance caldwell and vance
        elif len(person1_sp) == 2 and len(person2_sp) == 1:
            row['to whom due | first name'] = ''
            row['to whom due | last name'] = [person1_sp[0], person1_sp[1], person2_sp[0]]
            print('new last name col=' + listToString(row['to whom due | last name']))
        # if both individuals belong to the same family; put names into respective cols: ex. peter and isaac wikoff  
        elif len(person1_sp) == 1 and len(person2_sp) == 2:
            row['to whom due | first name'] = [person1_sp[0], person2_sp[0]]
            row['to whom due | last name'] = person2_sp[1]
            print('new first name col=' + listToString(row['to whom due | first name']))
            print('new last name col=' + listToString(row['to whom due | last name']))
        # if both individuals are two completely different people with full names; ex. john doe and james hill
        elif len(person1_sp) == 2 and len(person2_sp) == 2:
            row['to whom due | first name'] = [person1_sp[0], person2_sp[0]]
            row['to whom due | last name'] = [person1_sp[1], person2_sp[1]]
            print('new first name col=' + listToString(row['to whom due | first name']))
            print('new last name col=' + listToString(row['to whom due | last name']))
        # if either individual has a middle name; group middle names with the last name; ex. john hill doe and james madison hill
        elif len(person1_sp) == 3 or len(person2_sp) == 3:
            row['to whom due | first name'] = [person1_sp[0], person2_sp[0]]
            # determine which individual has the middle name
            if len(person1_sp) == 3:
                person2_ln = ''
                if len(person2_sp) > 1:
                    person2_ln = person2_sp[1]
                
                row['to whom due | last name'] = [person1_sp[1] + ' ' + person1_sp[2], person2_ln]
                print('new last name col=' + listToString(row['to whom due | last name']))
            elif len(person2_sp) == 3:
                person1_ln = ''
                if len(person1_sp) > 1:
                    person1_ln = person1_sp[1]
                
                row['to whom due | last name'] = [person1_ln, person2_sp[1] + ' ' + person2_sp[2]]
                print('new last name col=' + listToString(row['to whom due | last name']))
            # both individuals have a middle name 
            else:
                row['to whom due | last name'] = [person1_sp[1] + ' ' + person1_sp[2], person2_sp[1] + ' ' + person2_sp[2]]
                print('new last name col=' + listToString(row['to whom due | last name']))
        
        # handle all other types of names manually
        else:
            if org_fn in manual_corrects:
                new_fname = str(manual_corrects[org_fn][0])
                new_lname = str(manual_corrects[org_fn][1])
            else:
                new_fname = input('new first name: ')
                new_lname = input('new last name: ') 
                manual_corrects[org_fn] = [new_fname, new_lname]

            row['to whom due | first name'] = new_fname.capitalize()
            row['to whom due | last name'] = new_lname.capitalize()
        
        # record change 
        add_changes(row['to whom due | title'], row['to whom due | title'], org_fn, org_ln, 
                row['to whom due | first name'], row['to whom due | last name'], 3, row['org_file'], row['org_index'])
            
        print('+------------------------------+')
    # might be a corporation or many names; manually fix
    elif ' and ' in org_fn_l and len(org_fn_l.split()) > 7:
        print('original name= ' + org_fn_l)
         # check if name has already been manually cleaned
        if org_fn in manual_corrects:
            new_fname = str(manual_corrects[org_fn][0])
            new_lname = str(manual_corrects[org_fn][1])
        else:
            new_fname = input('new first name: ')
            new_lname = input('new last name: ') 
            manual_corrects[org_fn] = [new_fname, new_lname]

        row['to whom due | first name'] = new_fname.capitalize()
        row['to whom due | last name'] = new_lname.capitalize()
        
        # record change 
        add_changes(row['to whom due | title'], row['to whom due | title'], org_fn, org_ln, 
                row['to whom due | first name'], row['to whom due | last name'], 3, row['org_file'], row['org_index'])

        print('new first name col=' + listToString(row['to whom due | first name']))
        print('new last name col=' + listToString(row['to whom due | last name']))

        print('+------------------------------+')
    
    # capitalize the names properly 
    row['to whom due | first name'] = row['to whom due | first name']
    row['to whom due | last name'] = row['to whom due | last name']
        
    return row

agg_debt = agg_debt.apply(lambda row: handle_two_name(row), axis=1)

original name= daniel and john
new last name col (org)=Daniel John
+------------------------------+
original name= hermon and brimmer
new last name col (org)=Hermon Brimmer
+------------------------------+
original name= gordon and mcclemments
new last name col (org)=Gordon Mcclemments
+------------------------------+
original name= gordon and mcclemments
new last name col (org)=Gordon Mcclemments
+------------------------------+
original name= wilson and fury
new last name col (org)=Wilson Fury
+------------------------------+
original name= wilson and fury
new last name col (org)=Wilson Fury
+------------------------------+
original name= wilson and fury
new last name col (org)=Wilson Fury
+------------------------------+
original name= wilson and fury
new last name col (org)=Wilson Fury
+------------------------------+
original name= edward and joseph
new last name col (org)=Edward Joseph
+------------------------------+
original name= james glenn and co
new last name col=James Glen

new first name:  samuel | michael
new last name:  ely | gellington


new first name col=S a m u e l   |   m i c h a e l
new last name col=E l y   |   g e l l i n g t o n
+------------------------------+
original name= mess eastern and co
new last name col=Mess Eastern Co
+------------------------------+
original name= ferrason and pocy
new last name col (org)=Ferrason Pocy
+------------------------------+
original name= ferrason and pocy
new last name col (org)=Ferrason Pocy
+------------------------------+
original name= ferrason and pocy
new last name col (org)=Ferrason Pocy
+------------------------------+
original name= terrason and pory
new last name col (org)=Terrason Pory
+------------------------------+
original name= terrason and pory
new last name col (org)=Terrason Pory
+------------------------------+
original name= ferrason and pocy
new last name col (org)=Ferrason Pocy
+------------------------------+
original name= sam and rob
new last name col (org)=Sam Rob
+------------------------------+
original name= henry and thomas brown
new first 

In [76]:
# save manual corrections 
manual_corrects_df = pd.DataFrame.from_dict(manual_corrects, orient='index') 
manual_corrects_df.columns = ['new first name', 'new last name']
manual_corrects_df.to_csv('data/manual_corrections.csv')

In [77]:
# if there are debt entries with multiple individuals, split them into their own rows
agg_debt = agg_debt.explode('to whom due | first name')
agg_debt = agg_debt.explode('to whom due | last name')
# reindex
agg_debt['index'] = agg_debt.index

In [78]:
# checkup on name_changes
name_changes.tail()

Unnamed: 0,title_org,title_new,first_name_org,last_name_org,first_name_new,last_name_new,cleaning case,file_loc,org_index
3624,,,SD George and Comp.,,,"[Sd, George, Comp.]",3,loan_office_certificates_9_states.xlsx,79210
3625,,,SD George and Comp.,,,"[Sd, George, Comp.]",3,loan_office_certificates_9_states.xlsx,79213
3626,,,SD George and Comp.,,,"[Sd, George, Comp.]",3,loan_office_certificates_9_states.xlsx,79214
3627,,,SD George and Comp.,,,"[Sd, George, Comp.]",3,loan_office_certificates_9_states.xlsx,79215
3628,,,Hoov and Harrison,,,"[Hoov, Harrison]",3,Marine_Liquidated_Debt_Certificates.xlsx,764


In [84]:
print(name_changes.loc[name_changes['first_name_org'] == 'Furman and Hunt'][['first_name_org', 'last_name_org', 'first_name_new', 'last_name_new']].head(1).to_markdown())

|      | first_name_org   |   last_name_org | first_name_new   | last_name_new      |
|-----:|:-----------------|----------------:|:-----------------|:-------------------|
| 1119 | Furman and Hunt  |             nan |                  | ['Furman', 'Hunt'] |


## Handle Abbreviations of a Name

<b>Goal: </b>There are individuals who have a handwritten abbreviation of a name in their debt entry. Thanks to Chris, he found a website with all these [abbreviations](https://hull-awe.org.uk/index.php/Conventional_abbreviations_for_forenames). 

<b>Steps: </b>
1. Copy and past the name abbreviations from the website into a dictionary. 
2. Iterate through each row in the dataframe.
3. Check if the name is an abbreviation and change accordingly. 
4. Record changes. 


In [89]:
abbreviations = {
    'And':'Andrew', 'Ant':'Anthony', 'Bart':'Bartholomew', 'Cha':'Charles', 'Dor':'Dorothy', 'Dot':'Dorothy', 'Doth':'Dorothy',
    'Edw':'Edward', 'Eliz':'Elizabeth', 'Geo':'George', 'H':'Henry', 'Herb':'Herbert', 'Ja':'James', 'Jn':'John', 'Marg':'Margaret', 
    'Mich':'Michael', 'Pat': 'Patrick', 'Rich':'Richard', 'Tho':'Thomas', 'W':'William', 'Will\'m':'William'
}

In [90]:
print(agg_debt.loc[agg_debt['to whom due | first name'] == 'And'][['to whom due | first name', 'to whom due | last name']].head(1).to_markdown())

|        | to whom due | first name   | to whom due | last name   |
|-------:|:---------------------------|:--------------------------|
| 102117 | And                        | Wardleberger              |


In [91]:
def handle_abbreviations(row):
    fn = str(row['to whom due | first name'])
    if fn in abbreviations:
        row['to whom due | first name'] = abbreviations[fn]
        # record changes
        add_changes(row['to whom due | title'], row['to whom due | title'], fn, 
                    row['to whom due | last name'], row['to whom due | first name'], 
                    row['to whom due | last name'], 5, row['org_file'], row['org_index'])
    
    return row

agg_debt = agg_debt.apply(lambda row: handle_abbreviations(row), axis=1)

In [92]:
# checkup on name_changes
name_changes.tail()

Unnamed: 0,title_org,title_new,first_name_org,last_name_org,first_name_new,last_name_new,cleaning case,file_loc,org_index
3758,,,Jn,Gallaher Esq,John,Gallaher Esq,5,loan_office_certificates_9_states.xlsx,62583
3759,,,Rich,Thompson,Richard,Thompson,5,loan_office_certificates_9_states.xlsx,69947
3760,,,Rich,Thompson,Richard,Thompson,5,loan_office_certificates_9_states.xlsx,69948
3761,,,Rich,Thompson,Richard,Thompson,5,loan_office_certificates_9_states.xlsx,69949
3762,,,Rich,Thompson,Richard,Thompson,5,loan_office_certificates_9_states.xlsx,69950


In [95]:
print(name_changes.loc[(name_changes['first_name_org'] == 'And') | (name_changes['last_name_org'] == 'Wardleberger')][['first_name_org', 'last_name_org', 'first_name_new', 'last_name_new']].head(1).to_markdown())

|      | first_name_org   | last_name_org   | first_name_new   | last_name_new   |
|-----:|:-----------------|:----------------|:-----------------|:----------------|
| 3683 | And              | Wardleberger    | Andrew           | Wardleberger    |


## Grouping Consecutive Names - Liam

In [54]:
og_df = agg_debt

In [57]:
#Loop that covers Objective 1 (aggregate data files)
def element_to_int(ele): # handles all kinds of Nans (returns 0 for nans)
    if type(ele) == np.float64:
        ele = round(ele)
    if ele == np.nan: return 0
    if str(ele) == "nan": return 0
    return round(np.float64(ele))

def get_dollar(row): #gets the dollar from a row by checking both dollar columns
    dollar = 0
    ninety = 0
    #if dollar (90th) is a decimal, then split it
    if '.' in str(element_to_int(row[11])): # "amount | dollar"
        split = str(element_to_int(row[11])).split(".")
        dollar, ninety = element_to_int(split[0]), element_to_int(split[1])
    elif str(row[11]) == "": # "amount | dollar"
        if '.' in str(element_to_int(row[24])): # "amount | specie"
            split = str(element_to_int(row[24])).split() # "amount | specie"
            dollar, ninety = element_to_int(split[0]), element_to_int(split[1])
    else:
        dollar = element_to_int(row[11]) # "amount | dollar"
        ninety = element_to_int(row[12]) # "amount | 90th"
    return float(str(dollar) + "." + str(ninety))

def new_tup(old_row, new_dol, new_ninety, new_title): # returns a new tuple, specifically for totaled debt amounts (since you can't assign new values in tuples)
    return (old_row[0], old_row[1], old_row[2], old_row[3], old_row[4], old_row[5], old_row[6],
            new_title, old_row[8], old_row[9], old_row[10], new_dol, new_ninety, old_row[13],
            old_row[14], old_row[15], old_row[16], old_row[17], old_row[18], old_row[19],
            old_row[20], old_row[21], old_row[22], old_row[23], old_row[24], old_row[25],
            old_row[26], old_row[27], old_row[28], old_row[29], old_row[30], old_row[31])

agg_df = pd.DataFrame(columns=og_df.columns)
last_f, last_l, last_t = "", "", ""
last_row = None
#save the sum of money
current_sum = 0
for row in og_df.itertuples(name=None, index=False): #main processing function
    fname, lname = str(row[5]).strip(), str(row[6]).strip()
    last_t = last_t if str(row[7]).strip().lower() == "nan" else str(row[7]).strip()
    if fname == last_f and lname == last_l: #If the next name is the same as the last one, add onto the amount
        dol = get_dollar(row)
        print(f"adding {dol} to {fname} {lname}'s total")
        current_sum += dol
    else: #If the next name is not the same as the last one:
        if current_sum > 0: #If the sum is more than 0 (ie. this is the end of consecutive same-name entries), then only add this on
            print(f"{last_row[5]} {last_row[6]} is consecutively owed {current_sum}")
            #consecutive has ended
            split = str(current_sum).split(".")
            agg_df.loc[len(agg_df.index)] = new_tup(last_row, int(split[0]), int(split[1]), last_t if last_t != "" else "")
            current_sum = 0
        else: #If the sum is not more than 0 (ie. this is one unique entry, add it on now)
            #Normal
            agg_df.loc[len(agg_df.index)] = last_row
    last_f, last_l = fname, lname
    last_row = row

adding 27.45 to Joshua Brackett's total
Joshua Brackett is consecutively owed 27.45
adding 75.67 to Henry Dearborn's total
Henry Dearborn is consecutively owed 75.67
adding 7.45 to Amos Morrill's total
Amos Morrill is consecutively owed 7.45
adding 590.65 to Joseph Wells's total
adding 186.8 to Joseph Wells's total
Joseph Wells is consecutively owed 777.45
adding 186.45 to Joseph Wells's total
adding 329.5 to Joseph Wells's total
Joseph Wells is consecutively owed 515.95
adding 64.67 to Joseph Wells's total
Joseph Wells is consecutively owed 64.67
adding 1108.6 to Joseph Milnor's total
adding 339.6 to Joseph Milnor's total
Joseph Milnor is consecutively owed 1448.1999999999998
adding 277.48 to Furman & Hunt nan's total
adding 111.6 to Furman & Hunt nan's total
Furman & Hunt nan is consecutively owed 389.08000000000004
adding 119.54 to Abner Hunt's total
Abner Hunt is consecutively owed 119.54
adding 297.2 to Stacey Potts's total
Stacey Potts is consecutively owed 297.2
adding 26.0 to W

In [59]:
agg_debt = agg_df

In [60]:
agg_debt.loc[agg_debt['org_file'] == 'Pierce_Certs_cleaned_2019.xlsx'].tail()

Unnamed: 0,index,letter,date of the certificate | month,date of the certificate | day,date of the certificate | year,to whom due | first name,to whom due | last name,to whom due | title,time when the debt became due | month,time when the debt became due | day,...,amount | 10th,exchange,amount in specie | dollars,amount in specie | cents,amount | 8th,delivered | month,delivered | day,delivered | year,total dollars | notes,total dollars | notes.1
51140,108595.0,,,,,Christop,Zrerinious,,,,...,,,,,,,,,,
51141,108597.0,,,,,Peter,Zwears,Jun,,,...,,,,,,,,,,
51142,108598.0,,,,,Henry,True,,,,...,,,,,,,,,,
51143,108599.0,,,,,Nero,True,,,,...,,,,,,,,,,
51144,108602.0,,,,,Zebulon,True,Jun,,,...,,,,,,,,,,


In [61]:
agg_debt.to_csv('data/agg_debt_grouped.csv') # save 

In [62]:
name_changes.to_csv('data/name_changes_david.csv') # save

## Old Ancestry Cleaning

<p style="color: red;"><b>Don't run this code. I'm keeping it here to reuse some code.</b></p>

In [None]:
def access_ancestry(fn0, ln0, c_fn, c_ln, state, row0, c_row):
    name0 = fn0 + ' ' + ln0 # static
    c_name = c_fn + ' ' + c_ln # changing
    
    driver.get('https://www.ancestrylibrary.com/search/collections/5058/?name=' + fn0 + '_' + ln0 + '&residence=' + ancestry_state_urls[state] + '&residence_x=10-0-0_1-0')
    print(driver.current_url)
    
    try:
        try:
            check_exists = driver.find_element(By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/div/div')
            result1 = name0
            result2 = c_name
        except NoSuchElementException:  
            result = wait.until(presence_of_element_located((By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/table/tbody/tr[2]/td[2]/span'))).text
            result_name = HumanName(result)

            if result_name.middle != '':
                result1 = result_name.first + ' ' + result_name.middle + ' ' + result_name.last
            else:
                result1 = result_name.first + ' ' + result_name.last
        
        print('index: ' + str(row0['index1']))
        print('first ancestry result: ' + result1)

        driver.get('https://www.ancestrylibrary.com/search/collections/5058/?name=' + c_fn + '_' + c_ln + '&residence=' + ancestry_state_urls[state] + '&residence_x=_1-0')
        print(driver.current_url)

        try:
            check_exists = driver.find_element(By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/div/div')
            result1 = name0
            result2 = c_name
        except NoSuchElementException:
            result = wait.until(presence_of_element_located((By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/table/tbody/tr[2]/td[2]/span'))).text
            result_name = HumanName(result)

            if result_name.middle != '':
                result2 = result_name.first + ' ' + result_name.middle + ' ' + result_name.last
            else:
                result2 = result_name.first + ' ' + result_name.last

        print('second ancestry result: ' + result2) 

        if result1 == name0 and result2 == name0: # name0 is the correct spelling of the name
            fixes.append([state, {c_name:name0}])
            
            # record changes
            add_changes(c_row['to whom due | title'], c_row['to whom due | title'], c_row['to whom due | first name'],
                        c_row['to whom due | last name'], fn0, ln0, 6, c_row['org_file'], c_row['org_index'])
        elif result1 == c_name and result2 == c_name: # c_name is the correct spelling of the name
            fixes.append([state, {name0:c_name}])
            
            # record changes
            add_changes(row0['to whom due | title'], row0['to whom due | title'], row0['to whom due | first name'],
                        row0['to whom due | last name'], c_fn, c_ln, 6, row0['org_file'], row0['org_index'])
    except:
        rerun_rows.append([row0, c_row])

def only_f(row0, row):
    try:
        if row['to whom due | first name'][0] == row0['to whom due | first name'][0] and row['to whom due | last name'][0] == row0['to whom due | last name'][0] and row['state'] == row0['state']:
            return row
    except:
        return 

def fuzzy_similarity(c_row, row0):
    name = row0['to whom due | first name'] + ' ' + row0['to whom due | last name']
    cname = c_row['to whom due | first name'] + ' ' + c_row['to whom due | last name']
    
    # check fuzzy string ratio and metaphone ratio
    code1 = metaphone(name)
    code2 = metaphone(cname)
    ratio = fuzz.ratio(name, cname)
    score = fuzz.ratio(code1, code2)

    # only search ancestry when the two names have a ratio greater than 90, the names don't equal each other, 
    # and we haven't checled current name already
    if score > 90 and ratio > 90 and name != cname and (cname not in c_checked):
        print('name: ' + name)
        print('comparing to name: ' + cname)
        
        correct_name = access_ancestry(row0['to whom due | first name'], row0['to whom due | last name'],
                                       c_row['to whom due | first name'], c_row['to whom due | last name'], row0['state'], row0, c_row)
            
        print('--------------------------------------------------')
        c_checked.append(cname)
    
def determine_similarities(row0):
    current_name = row0['to whom due | first name'] + ' ' + row0['to whom due | last name']
    
    # only search ancestry when ancestry has records and if we have not checked name already
    if (row0['state'] in ancestry_state_urls) and ([row0['state'], current_name] not in checked0):
        
        # shorten table to only include names that share first letter of first name and last name and come from the same state
        short_table = agg_debt.apply(lambda row: only_f(row0, row), axis=1).dropna()

        if len(short_table) > 0:
            c_checked.clear()
            short_table.apply(lambda row: fuzzy_similarity(row, row0))
            
        checked0.append([row0['state'], current_name])
                        
    if len(fixes) % 5 == 0 and len(fixes) > 0:
        print(fixes[len(fixes) - 1]) 
    
agg_debt['to whom due | first name'] = agg_debt['to whom due | first name'].astype(str)
agg_debt['to whom due | last name'] = agg_debt['to whom due | last name'].astype(str)
agg_debt.sort_values('to whom due | last name', inplace=True)
agg_debt.reset_index(inplace=True)
agg_debt['index1'] = agg_debt.index
agg_debt.apply(lambda row: determine_similarities(agg_debt, row), axis=1)

In [None]:
# save agg_debt as csv file: 'out.csv'
agg_debt.to_csv('data/agg_debt_clean.csv')

In [None]:
with open('data/out.csv', 'r') as read_obj:

    csv_reader = csv.reader(read_obj)
  
    # convert string to list
    entries = list(csv_reader)
    
    # remove empty lists
    entries = [entry for entry in entries if entry != []]
  
    print(entries)

In [None]:
# reorganize entries and group by state
entries_dict = {}
for entry in entries:
    if entry[0] not in entries_dict:
        entries_dict[entry[0]] = [entry]
    else:
        entries_dict[entry[0]] += [entry]

print(entries_dict)

In [None]:
# implement changes to agg_debt
def implement_name_changes(row):  
    if str(row['state']) not in entries_dict:
        return row 
    
    # only select part of the list that belongs to one person 
    entries_red = entries_dict[row['state']]
    
    # loop through list
    # check if name matches 
    full_name = str(row['to whom due | first name']) + ' ' + str(row['to whom due | last name'])
    
    for entry in entries_red:
        name_dict = ast.literal_eval(entry[1])
        
        if full_name in name_dict:
            new_name = name_dict[full_name]
            new_name_l = new_name.split()
            new_fn = ''
            new_ln = ''
            
            # account for middle initials and middle names
            if len(new_name_l) >= 3:
                name = HumanName(new_name)
                
                if len(name.middle) == 1: # middle initial
                    new_fn = name.first + ' ' + name.middle 
                    new_ln = name.last
                elif len(name.middle) > 1: # middle name
                    new_fn = name.first
                    new_ln = name.middle + ' ' + name.last
                else: # no middle name 
                    new_fn = name.first
                    new_ln = name.last
                    
            # if there is only a first name and a last name 
            else:
                new_fn = new_name_l[0]
                new_ln = new_name_l[1] 
            
            # record changes
            add_changes(row['to whom due | title'], row['to whom due | title'], row['to whom due | first name'], row['to whom due | last name'], 
                   new_fn, new_ln, 6, row['org_file'], row['org_index'])
            
            # remove unncessary spaces at the end of the string 
            new_fn.strip()
            new_ln.strip()
            
            row['to whom due | first name'] = new_fn 
            row['to whom due | last name'] = new_ln
            
            print('old name=' + full_name)
            print('new first name=' + new_fn)
            print('new last name=' + new_ln)
            print('name_changes status=' + str(len(name_changes)))
            print('------------------')
            
    return row

agg_debt = agg_debt.apply(lambda row: implement_name_changes(row), axis=1)

In [None]:
agg_debt.head()

In [15]:
checked0 = open('checked0.pickle', 'r')
checked0.close()