# Cells order
 - imports
 - load/create dataframes
 - helper functions
 - any manual corrections / manually dropping invalid rows
 - Standardizing Town/State/Estate/Heir of (obj. 4 & 8)
 - Standardizing names containing 'of' entirely in the first name column
 - Companies (obj 2)
 - Entries with 2 names (obj 3)
 - Names that are entirely in the first or last name column (obj 9)
 - Filling in blank columns (obj 7)
 - Deceased individuals (obj 12)
 - abbreviations (obj 5)
 - Group consecutive names (obj 1)
 - Ancestry code
# Objectives

The objectives this notebook addresses are listed [here](https://docs.google.com/document/d/1pcSQfWNll6K9tl-_rB4lztN0TsZsclU9vOnbyQob-Zs/edit).

In [None]:
#Imports
import pandas as pd
import datetime
import numpy as np
import json
import os
from fuzzywuzzy import fuzz

import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
import ssl

from nameparser import HumanName
#---
import re
import csv
import ast

In [None]:
#Dataframes
agg_debt = pd.read_csv('data/final_agg_debt.csv')

name_changes = pd.DataFrame({'title_org': pd.Series(dtype='str'),
                       'title_new': pd.Series(dtype='str'),
                       'first_name_org': pd.Series(dtype='str'),
                       'last_name_org': pd.Series(dtype='str'),
                       'first_name_new': pd.Series(dtype='str'),
                       'last_name_new': pd.Series(dtype='str'),
                       'cleaning case': pd.Series(dtype='int'),
                       'file_loc': pd.Series(dtype='str'),
                       'org_index': pd.Series(dtype='int')})

# retrieve manual corrections from csv file if they exist 
manual_corrects = {}
if os.path.exists('data/manual_corrections.csv'):
    manual_corrects_df = pd.read_csv('data/manual_corrections.csv')
    # add manual corrections to manual_corrects dictionary: original name -> [new first name, new last name]
    for _, correction in manual_corrects_df.iterrows():
        manual_corrects[correction['Unnamed: 0']] = [correction['new first name'], correction['new last name']]

# Documenting Changes

<b>Goal: </b> We need to document changes we make to ```agg_debt.csv``` in a separate dataframe: ```name_changes```. This way, we can double-check whether those changes were appropriate. 

<b>Steps</b>
1. Create an empty dataframe. Here are the column names:
    - ```title_org```: The original title of the individual (Mr., Ms., etc.)
    - ```title_new```: The new title of the individual (Mr., Ms., etc.) 
    - ```first_name_org```: The original first name of the individual from the unchanged ```agg_debt.csv```
    - ```last_name_org```: The original last name of the individual from the unchanged ```agg_debt.csv``` 
    - ```first_name_new``` : If first name changed, record it here. Otherwise, this entry will still be the old name. 
    - ```last_name_new```: If last name changed, record it here. Otherwise, this entry will still be the old name. 
    - ```cleaning case```: This corresponds with the task number in the objectives document linked above. 
    - ```file_loc```: The individual state file from which the row came 
    - ```org_index```: The original index/row at which the debt entry can be found in ```file_loc``` 
2. Create a function that adds a new row to the dataframe. This function will be called while we are cleaning. 

**Cleaning case = Objective number** 
- Combine multiple consecutive debt entries (optional) = 1
- Clean company names = 2
- Handle two names = 3
- Remove "Estate of" = 4
- Handle abbreviations = 5
- Standardize names (Ancestry) = 6
- Fill in blank name columns = 7
- Remove "Heirs of" prefixes = 8
- Names that are entirely in first or last name column = 9
- Occupations in the name = 10
- Entry on behalf of someone else = 11
- Mark deceased people = 12
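
The two steps above can be sketched as follows. This is a minimal sketch rather than the notebook's implementation: `ChangeLog`, `CASE_LABELS`, and `COLUMNS` are hypothetical names, the file name in the usage line is a placeholder, and records are collected in a plain list so the example runs without pandas.

```python
# Hypothetical sketch: ChangeLog, CASE_LABELS and COLUMNS are illustrative names,
# not part of the notebook.
CASE_LABELS = {
    1: "combine consecutive debt entries",
    2: "clean company names",
    3: "handle two names",
    4: 'remove "Estate of"',
    5: "handle abbreviations",
    6: "standardize names (Ancestry)",
    7: "fill in blank name columns",
    8: 'remove "Heirs of"',
    9: "name entirely in first or last name column",
    10: "occupation in the name",
    11: "entry on behalf of someone else",
    12: "mark deceased people",
}

COLUMNS = ["title_org", "title_new", "first_name_org", "last_name_org",
           "first_name_new", "last_name_new", "cleaning case", "file_loc", "org_index"]


class ChangeLog:
    """Collects one record per change, keyed by the name_changes column names."""

    def __init__(self):
        self.rows = []

    def add(self, title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file_loc, org_index):
        # Reject case numbers that are not in the objectives list.
        if case not in CASE_LABELS:
            raise ValueError(f"unknown cleaning case: {case}")
        values = [title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file_loc, org_index]
        self.rows.append(dict(zip(COLUMNS, values)))


log = ChangeLog()
# "example_state.csv" and the index 17 are placeholder values.
log.add("", "", "Geo", "Smith", "George", "Smith", 5, "example_state.csv", 17)
print(log.rows[0]["first_name_new"], "-", CASE_LABELS[log.rows[0]["cleaning case"]])
# prints: George - handle abbreviations
```

If a dataframe is needed at the end, `pd.DataFrame(log.rows, columns=COLUMNS)` would build it in one step instead of growing it row by row.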

In [None]:
#Helper functions
def add_changes(title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index):
    name_changes.loc[len(name_changes.index)] = [title_org, title_new, fn_org, ln_org, fn_new, ln_new, case, file, index]

#Download the necessary NLTK models for the function below
#The SSL workaround below is enabled for cases where downloads don't work; change True to False to disable it
if True:
    try:
        _unverified = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _unverified
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
def get_tags(text):
    nltk_results = ne_chunk(pos_tag(word_tokenize(text)))
    tags = {}
    for nltk_result in nltk_results:
        if type(nltk_result) == Tree:
            name = ''
            for nltk_result_leaf in nltk_result.leaves():
                name += nltk_result_leaf[0] + ' '
            tags[name] = nltk_result.label()
    return tags

In [None]:
#Remove rows whose first or last name is a long string (more than 10 words)
#Fast method: build a boolean mask with apply instead of looping over the rows and adding them to a new dataset
agg_debt = agg_debt[agg_debt['to whom due | first name'].apply(lambda name: len(str(name).split()) <= 10)]
agg_debt = agg_debt[agg_debt['to whom due | last name'].apply(lambda name: len(str(name).split()) <= 10)]

# Heirs of & Estate of

<b>Goal:</b> Remove "Estate of", "Heirs of", and "State of" prefixes from an entry, and mark "State of" and "Town of" entries as organizations

<b>Steps:</b>

1. Check if a first name entry is longer than 2 words. If it is, run fuzzy checks to see whether it begins with "State of", "Town of", "Estate of", or "Heirs of" (fuzzy checks account for typos, which are quite frequent)
2. For "State of" and "Town of" matches, set the first name to "State" or "Town" respectively, set the last name to the name of the state or town, and mark the entry as an organization
3. For "Estate of" and "Heirs of" matches, drop the prefix, then make the first remaining word the first name and everything after it the last name
4. Record any changes in ```name_changes```

<b>Notes:</b>

1. Sometimes "Estate of" is abbreviated to "State of", which confuses the matching (an example is the first manual correction)
2. The "State of" fuzzy ratio threshold is higher than the "Estate of" threshold, and the "State of" check runs first, so that "State of" is caught as reliably as possible; the two prefixes differ by only one letter.
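
The matching order described in the steps and notes can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: `ratio` uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, `classify_prefix` is a hypothetical helper, and the first two words are joined with a space before comparison.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity on a 0-100 scale; a stdlib stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def classify_prefix(first_name):
    """Return 'state', 'town', 'estate', 'heirs', or None (hypothetical helper)."""
    words = first_name.split()
    if len(words) <= 2:
        return None
    prefix = " ".join(words[:2]).lower()
    # "state of" is checked first with the stricter threshold; the "est"
    # test keeps "estate of" from being picked up by this branch.
    if ratio(prefix, "state of") >= 88 and "est" not in prefix:
        return "state"
    if ratio(prefix, "town of") >= 88:
        return "town"
    if ratio(prefix, "estate of") >= 85 and "est" in prefix:
        return "estate"
    if max(ratio(prefix, "heir of"), ratio(prefix, "heirs of")) >= 85:
        return "heirs"
    return None


# The two prefixes are close enough that the threshold alone cannot separate them.
print(ratio("state of", "estate of"))           # 94
print(classify_prefix("State of New York"))     # state
print(classify_prefix("Estate of John Smith"))  # estate
print(classify_prefix("Heirs of Thomas Lee"))   # heirs
```

Because "state of" and "estate of" score 94 against each other, both thresholds pass for either prefix, which is why the substring test on "est" does the real disambiguation.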

In [None]:

agg_debt["organization?"] = False

manual_corrections = [
    {"og_fname": "State of William Sweet",
     "new_title": "",
     "new_fname": "William", 
     "new_lname": "Sweet"},
    {"og_fname": "Estateof Doct James Front",
     "new_title": "Doct",
     "new_fname": "James",
     "new_lname": "Front"},
    {"og_fname": "Estate of Capt John Williams",
     "new_title": "Capt",
     "new_fname": "John",
     "new_lname": "Williams"},
    {"og_fname": "Estate ofJon Bowman",
     "new_title": "",
     "new_fname": "Jon",
     "new_lname": "Bowman"},
    {"og_fname": "Esatate of Matthew Fentom",
     "new_title": "",
     "new_fname": "Matthew",
     "new_lname": "Fentom"}
]

def handle_ofs(row):
    og_fname = str(row["to whom due | first name"])
    og_lname = str(row["to whom due | last name"])
    title = str(row["to whom due | title"])
    
    for c in manual_corrections:
        if c["og_fname"] == og_fname:
            row["to whom due | first name"] = c["new_fname"]
            row["to whom due | last name"] = c["new_lname"]
            row["to whom due | title"] = c["new_title"]
            return row
    
    og_fname = og_fname.replace("the ", "").replace("The ", "")
    og_lname = og_lname.replace("the ", "").replace("The ", "")
    
    if len(og_fname.split()) > 2:
        prefix = og_fname.split()[0] + og_fname.split()[1]
        prefix = prefix.lower()
        if fuzz.ratio(prefix, "state of") >= 88 and "est" not in prefix: #"not in" so that this one won't pick up "Estate of"
            lname =  " ".join(og_fname.split()[2:])
            fname = "State"
            add_changes(title, title, row["to whom due | first name"], row["to whom due | last name"], fname, lname, 8, row["org_file"], row["org_index"])
            #save_manual_correction(title, row["to whom due | first name"], row["to whom due | last name"], title, fname, lname, 8, row["org_file"], row["org_index"], is_manual=False)
            row["to whom due | first name"] = fname
            row["to whom due | last name"] = lname
            row["organization?"] = True
        elif fuzz.ratio(prefix, "town of") >= 88:
            lname =  " ".join(og_fname.split()[2:])
            fname = "Town"
            add_changes(title, title, row["to whom due | first name"], row["to whom due | last name"], fname, lname, 8, row["org_file"], row["org_index"])
            #save_manual_correction(title, row["to whom due | first name"], row["to whom due | last name"], title, fname, lname, 8, row["org_file"], row["org_index"], is_manual=False)
            row["to whom due | first name"] = fname
            row["to whom due | last name"] = lname
            row["organization?"] = True
        elif (fuzz.ratio(prefix, "estate of") >= 85 or fuzz.ratio(prefix, "est of") >= 85) and "est" in prefix: #"in prefix" so that this one won't pick up "State of"; prefix is lowercased, so the comparison strings must be lowercase too
            #print(og_fname.split()[2:])
            name = " ".join(og_fname.split()[2:])
            fname =  name.split()[0]
            lname = name.split()[1:] if len(name.split()) > 1 else ""
            if len(lname) == 0 and row["to whom due | last name"] != "": lname = row["to whom due | last name"]
            if type(lname) == list: lname = " ".join(lname)
            add_changes(title, title, row["to whom due | first name"], row["to whom due | last name"], fname, lname, 4, row["org_file"], row["org_index"])
            #save_manual_correction(title, row["to whom due | first name"], row["to whom due | last name"], title, fname, lname, 4, row["org_file"], row["org_index"], is_manual=False)
            row["to whom due | first name"] = fname
            row["to whom due | last name"] = lname
        elif fuzz.ratio(prefix, "heir of") >= 85 or fuzz.ratio(prefix, "heirs of") >= 85:
            name = " ".join(og_fname.split()[2:])
            fname =  name.split()[0]
            lname = " ".join(name.split()[1:]) #join the remaining words so the last name is a string rather than a list
            add_changes(title, title, row["to whom due | first name"], row["to whom due | last name"], fname, lname, 8, row["org_file"], row["org_index"])
            #save_manual_correction(title, row["to whom due | first name"], row["to whom due | last name"], title, fname, lname, 4, row["org_file"], row["org_index"], is_manual=False)
            row["to whom due | first name"] = fname
            row["to whom due | last name"] = lname
    return row

agg_debt = agg_debt.apply(lambda row: handle_ofs(row), axis=1)

In [None]:
#Standardizing names containing 'of' entirely in the first name column
manual_corrections = {
    "School Committee of Derbey": ["School Committee", "Derbey"],
    "Trusts of Wilmington Academy": ["Trusts", "Wilmington Academy"],
    "Trusts of Wilmington": ["Trusts", "Wilmington"],
    "Ruten of Chais": ["Ruten", ""]
}

def handle_all_orgs(row):
    og_fname = str(row["to whom due | first name"])
    og_lname = str(row["to whom due | last name"])
    title = row["to whom due | title"]
    
    for og, correction in manual_corrections.items():
        if og == og_fname:
            row["organization?"] = True
            row["to whom due | first name"] = correction[0]
            row["to whom due | last name"] = correction[1]
            return row
    
    fname, lname = "", ""
    if len(og_fname.split()) > 2 and (("of " in og_fname) or (" of" in og_fname)):
        tags = get_tags(og_fname)
        is_org = True
        for token, tag in tags.items():
            if tag == "PERSON": #tagged as a person's name, so not an organization
                is_org = False
        print(f"{og_fname} {tags}: {is_org}")
        if not is_org: return row
        row["organization?"] = True
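        #split only at the first standalone " of " so that several "of"s, or "of" inside a word, don't raise an unpacking error
        before_of, _, after_of = og_fname.partition(" of ")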
        fname = before_of.strip().replace("-", "")
        lname = after_of.strip().replace("-", "")
        add_changes(title, title, og_fname, og_lname, fname, lname, 14, row["org_file"], row["org_index"])
        #save_manual_correction(title, og_fname, og_lname, title, fname, lname, 14, row["org_file"], row["org_index"], is_manual=False)
        row["to whom due | first name"] = fname
        row["to whom due | last name"] = lname
    return row

agg_debt = agg_debt.apply(lambda row: handle_all_orgs(row), axis=1)

In [None]:
# Company names
# retrieve manual corrections from csv file if they exist 
manual_corrects_df = pd.read_csv('data/manual_corrections.csv')
manual_corrects_dict = manual_corrects_df.to_dict(orient='index')
manual_corrects = {}
# add manual corrections to manual_corrects dictionary 
for correction in manual_corrects_dict.keys():
    manual_corrects[manual_corrects_dict[correction]['Unnamed: 0']] = [manual_corrects_dict[correction]['new first name'], manual_corrects_dict[correction]['new last name']]

# dictionary of manual changes i have to make 
changes = {
    'Henry Mc Clellen & Henry & co' : 'Henry Mc Clellen & Co'
}

conn_words = [' for ', ' of ', ' and '] # these are connector key words
corp_key_words = ('corporation', ' and co', ' and coy', ' and others', ' and several others', ' and heirs', ' and comp', ' and other trustees') # these are corporation key words

def handle_comp_name(row):        
    org_fname = str(row['to whom due | first name'])
    org_lname = str(row['to whom due | last name'])
    
    fname = str(row['to whom due | first name'])
    
    # apply manual changes before normalizing, since the keys in `changes` still contain '&'
    if fname in changes:
        fname = changes[fname]
    
    fname = fname.replace('&', 'and')
    fname = fname.replace('.', '')
    
    fname_l = str(fname).lower().strip()
    
    # check if string ends with co, coy, or others; if so, delete 
    for key_word in corp_key_words:
        if fname_l.endswith(key_word):
            print('index=' + str(row['Unnamed: 0']))
            print('old name=' + str(fname_l))      
            fname_corr = fname_l.split(key_word)
            print('corrected name=' + str(fname_corr[0])) 
            fname_corr = fname_corr[0]
            fname_sp = fname_corr.split()
            
            # only one name; put name into last name column 
            if len(fname_sp) == 1:
                row['to whom due | first name'] = ''
                row['to whom due | last name'] = fname_sp[0].capitalize()
                print('corrected name=' + str(fname_sp[0])) 
                print('new last name=' + str(fname_sp[0].capitalize()))
                
            # if there is only a first name and a last name, put them into their respective columns
            elif len(fname_sp) == 2:
                row['to whom due | first name'] = fname_sp[0].capitalize()
                row['to whom due | last name'] = fname_sp[1].capitalize()
                print('new first name=' + str(fname_sp[0].capitalize()))
                print('new last name=' + str(fname_sp[1].capitalize()))
                
            # handles middle names; put middle names in last name column 
            elif len(fname_sp) == 3:
                row['to whom due | first name'] = fname_sp[0].capitalize() 
                row['to whom due | last name'] = fname_sp[1].capitalize() + ' ' + fname_sp[2].capitalize()
                print('new first name=' + str(fname_sp[0].capitalize()))
                print('new last name=' + str(fname_sp[1].capitalize() + ' ' + fname_sp[2].capitalize()))  
            # manually clean debt entries that have long names 
            else: 
                # check if name has already been manually cleaned
                if fname_corr in manual_corrects:
                    new_fname = manual_corrects[fname_corr][0]
                    new_lname = manual_corrects[fname_corr][1]
                else:
                    new_fname = input('new first name: ')
                    new_lname = input('new last name: ') 
                    manual_corrects[fname_corr] = [new_fname, new_lname]
                
                row['to whom due | first name'] = new_fname.capitalize()
                row['to whom due | last name'] = new_lname.capitalize()
                    
                print('new first name=' + str(new_fname.capitalize()))
                print('new last name=' + str(new_lname.capitalize()))  
                
            # record change 
            add_changes(row['to whom due | title'], row['to whom due | title'], org_fname, org_lname, 
                   row['to whom due | first name'], row['to whom due | last name'], 2, row['org_file'], row['org_index'])
            
            print('+------------------------------+')
        # if the name starts with any keyword: 'corporation for the relief of...'; manually change these names
        elif fname_l.startswith(key_word): 
            print('index=' + str(row['Unnamed: 0']))
            print('old name=' + str(fname_l))      
            
            # check if name has already been manually cleaned
            if fname_l in manual_corrects:
                new_fname = str(manual_corrects[fname_l][0])
                new_lname = str(manual_corrects[fname_l][1])
            else:
                new_fname = input('new first name: ')
                new_lname = input('new last name: ') 
                manual_corrects[fname_l] = [new_fname, new_lname]

            row['to whom due | first name'] = new_fname.capitalize()
            row['to whom due | last name'] = new_lname.capitalize()
            
            # record change 
            add_changes(row['to whom due | title'], row['to whom due | title'], org_fname, org_lname, 
                   row['to whom due | first name'], row['to whom due | last name'], 2, row['org_file'], row['org_index'])

            print('new first name=' + str(new_fname.capitalize()))
            print('new last name=' + str(new_lname.capitalize()))  
    
    return row

agg_debt = agg_debt.apply(lambda row: handle_comp_name(row), axis=1)
agg_debt['Unnamed: 0'] = agg_debt.index
agg_debt.rename(columns={'Unnamed: 0' : 'index'}, inplace=True)
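
The suffix handling in the cell above can be isolated into a small pure function. This is a sketch under assumptions: `strip_company_suffix` is a hypothetical helper, the example names are made up, and names longer than three words are simply split into first word and remainder instead of being sent to manual correction as the notebook does.

```python
# Keyword list copied from the cell above; strip_company_suffix is a hypothetical helper.
CORP_KEY_WORDS = ('corporation', ' and co', ' and coy', ' and others', ' and several others',
                  ' and heirs', ' and comp', ' and other trustees')


def strip_company_suffix(raw_name):
    """Return (first, last) with a trailing company keyword removed, or None if none matches."""
    name = raw_name.replace('&', 'and').replace('.', '').strip()
    lowered = name.lower()
    # Try longer keywords first so the most specific suffix wins.
    for key_word in sorted(CORP_KEY_WORDS, key=len, reverse=True):
        if lowered.endswith(key_word):
            words = name[:len(name) - len(key_word)].split()
            if not words:
                return None
            if len(words) == 1:
                # A single remaining word goes into the last name column.
                return ('', words[0].capitalize())
            # The first word is the first name; middle names go with the last name.
            return (words[0].capitalize(), ' '.join(w.capitalize() for w in words[1:]))
    return None


print(strip_company_suffix('John Delafield & Co.'))  # ('John', 'Delafield')
print(strip_company_suffix('Smith and others'))      # ('', 'Smith')
print(strip_company_suffix('John Smith'))            # None
```

Keeping this step free of `print` and `input` calls makes it possible to check the splitting rules on a handful of names before applying them to the whole dataframe.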

In [None]:
# Entries with 2 names
changes = {
    'van zandt & kittletas' : ['', 'van zandt | kittletas'],
    'trustees of & davids church':['trustees of & davids church', '']
}
# make sure all names are of type: str
agg_debt[['to whom due | first name', 'to whom due | last name']] = agg_debt[['to whom due | first name', 'to whom due | last name']].astype(str)
# Function to convert
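def listToString(s):
    # a plain string is returned unchanged (joining it would put a space between every character)
    if isinstance(s, str):
        return s
    # a list of names is joined with single spaces
    return " ".join(s)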

def handle_two_name(row):
    org_fn = row['to whom due | first name']
    org_ln = row['to whom due | last name']
    
    org_fn_l = str(org_fn).lower()
        
    # remove extraneous information like 'for the estates of...'
    org_fn_l = org_fn_l.split(' for ')[0]

    # remove extraneous information like 'of the heirs of...'
    org_fn_l = org_fn_l.split(' of ')[0]

    # remove occupations: guardians, etc. 
    org_fn_l = org_fn_l.replace(' guardian', '')
    
    # check if there are two individuals, but check if there are more than 7 words (most likely a society)
    if ' and ' in org_fn_l and len(org_fn_l.split()) <= 7:   
        print('original name= ' + org_fn_l)
        
        # cleaning extraneous information can reveal there to be only one name
        #if ' and ' in org_fn_l:
        person1 = org_fn_l.split(' and ')[0]
        person2 = org_fn_l.split(' and ')[1]
        person1_sp = person1.split() 
        person2_sp = person2.split()

        # recapitalize people's names
        person1_sp = [i.title() for i in person1_sp]
        person2_sp = [i.title() for i in person2_sp]

        # if both individuals only have a last name; put both last names into last name column  ex. edward and joseph
        if len(person1_sp) == 1 and len(person2_sp) == 1:
            row['to whom due | first name'] = ''
            row['to whom due | last name'] = [person1_sp[0], person2_sp[0]] 
            
            print('new last name col (org)=' + listToString(row['to whom due | last name']))
        # if there are three separate last names; put all three into last name column: ex. vance caldwell and vance
        elif len(person1_sp) == 2 and len(person2_sp) == 1:
            row['to whom due | first name'] = ''
            row['to whom due | last name'] = [person1_sp[0], person1_sp[1], person2_sp[0]]
            print('new last name col=' + listToString(row['to whom due | last name']))
        # if both individuals belong to the same family; put names into respective cols: ex. peter and isaac wikoff  
        elif len(person1_sp) == 1 and len(person2_sp) == 2:
            row['to whom due | first name'] = [person1_sp[0], person2_sp[0]]
            row['to whom due | last name'] = person2_sp[1]
            print('new first name col=' + listToString(row['to whom due | first name']))
            print('new last name col=' + listToString(row['to whom due | last name']))
        # if both individuals are two completely different people with full names; ex. john doe and james hill
        elif len(person1_sp) == 2 and len(person2_sp) == 2:
            row['to whom due | first name'] = [person1_sp[0], person2_sp[0]]
            row['to whom due | last name'] = [person1_sp[1], person2_sp[1]]
            print('new first name col=' + listToString(row['to whom due | first name']))
            print('new last name col=' + listToString(row['to whom due | last name']))
        # if either individual has a middle name; group middle names with the last name; ex. john hill doe and james madison hill
        elif len(person1_sp) == 3 or len(person2_sp) == 3:
            row['to whom due | first name'] = [person1_sp[0], person2_sp[0]]
            # both individuals have a middle name
            if len(person1_sp) == 3 and len(person2_sp) == 3:
                row['to whom due | last name'] = [person1_sp[1] + ' ' + person1_sp[2], person2_sp[1] + ' ' + person2_sp[2]]
            # only the first individual has a middle name
            elif len(person1_sp) == 3:
                person2_ln = person2_sp[1] if len(person2_sp) > 1 else ''
                row['to whom due | last name'] = [person1_sp[1] + ' ' + person1_sp[2], person2_ln]
            # only the second individual has a middle name
            else:
                person1_ln = person1_sp[1] if len(person1_sp) > 1 else ''
                row['to whom due | last name'] = [person1_ln, person2_sp[1] + ' ' + person2_sp[2]]
            print('new last name col=' + listToString(row['to whom due | last name']))
        
        # handle all other types of names manually
        else:
            if org_fn in manual_corrects:
                new_fname = str(manual_corrects[org_fn][0])
                new_lname = str(manual_corrects[org_fn][1])
            else:
                new_fname = input('new first name: ')
                new_lname = input('new last name: ') 
                manual_corrects[org_fn] = [new_fname, new_lname]

            row['to whom due | first name'] = new_fname.capitalize()
            row['to whom due | last name'] = new_lname.capitalize()
        
        # record change 
        add_changes(row['to whom due | title'], row['to whom due | title'], org_fn, org_ln, 
                row['to whom due | first name'], row['to whom due | last name'], 3, row['org_file'], row['org_index'])
            
        print('+------------------------------+')
    # might be a corporation or many names; manually fix
    elif ' and ' in org_fn_l and len(org_fn_l.split()) > 7:
        print('original name= ' + org_fn_l)
         # check if name has already been manually cleaned
        if org_fn in manual_corrects:
            new_fname = str(manual_corrects[org_fn][0])
            new_lname = str(manual_corrects[org_fn][1])
        else:
            new_fname = input('new first name: ')
            new_lname = input('new last name: ') 
            manual_corrects[org_fn] = [new_fname, new_lname]

        row['to whom due | first name'] = new_fname.capitalize()
        row['to whom due | last name'] = new_lname.capitalize()
        
        # record change 
        add_changes(row['to whom due | title'], row['to whom due | title'], org_fn, org_ln, 
                row['to whom due | first name'], row['to whom due | last name'], 3, row['org_file'], row['org_index'])

        print('new first name col=' + listToString(row['to whom due | first name']))
        print('new last name col=' + listToString(row['to whom due | last name']))

        print('+------------------------------+')
    
    # capitalize the names properly 
    row['to whom due | first name'] = row['to whom due | first name']
    row['to whom due | last name'] = row['to whom due | last name']
        
    return row

agg_debt = agg_debt.apply(lambda row: handle_two_name(row), axis=1)

# save manual corrections 
manual_corrects_df = pd.DataFrame.from_dict(manual_corrects, orient='index') 
manual_corrects_df.columns = ['new first name', 'new last name']
manual_corrects_df.to_csv('data/manual_corrections.csv')

# if there are debt entries with multiple individuals, split them into their own rows
agg_debt = agg_debt.explode('to whom due | first name')
agg_debt = agg_debt.explode('to whom due | last name')
# reindex
agg_debt['index'] = agg_debt.index
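
The main name shapes handled in the cell above can be sketched as a function that returns complete (first, last) pairs. This is a simplified illustration, not the notebook's code: `split_two_names` is a hypothetical helper, it covers only three of the shapes, and it returns `None` for everything else where the notebook falls back to manual input.

```python
def split_two_names(text):
    """Split 'a and b' into a list of (first, last) pairs; None for shapes not covered here."""
    parts = text.lower().split(' and ')
    if len(parts) != 2:
        return None
    p1 = [w.title() for w in parts[0].split()]
    p2 = [w.title() for w in parts[1].split()]
    if len(p1) == 1 and len(p2) == 1:
        # Two last names only.
        return [('', p1[0]), ('', p2[0])]
    if len(p1) == 1 and len(p2) == 2:
        # Shared family name, e.g. "peter and isaac wikoff".
        return [(p1[0], p2[1]), (p2[0], p2[1])]
    if len(p1) == 2 and len(p2) == 2:
        # Two full names, e.g. "john doe and james hill".
        return [(p1[0], p1[1]), (p2[0], p2[1])]
    return None


print(split_two_names('peter and isaac wikoff'))
# [('Peter', 'Wikoff'), ('Isaac', 'Wikoff')]
print(split_two_names('john doe and james hill'))
# [('John', 'Doe'), ('James', 'Hill')]
```

One reason to keep each person's first and last name together: when the first name and last name columns both hold lists, exploding them one after the other yields every combination of first and last names, not one row per person. A single column of pairs can be exploded once and then split into two columns.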

In [None]:
#Names that are entirely in the first or last name column

def correct_full_names_in_column(row):
    if row["organization?"] == True: return row #ignore organizations
    fname = str(row["to whom due | first name"])
    lname = str(row["to whom due | last name"])
    name = None
    # treat an empty string or a stringified NaN as blank (exact match, so surnames such as "Buchanan" are not caught)
    lname_blank = lname.strip().lower() in ("", "nan")
    fname_blank = fname.strip().lower() in ("", "nan")
    if lname_blank and len(fname.split()) >= 2:
        name = HumanName(fname)
    if fname_blank and len(lname.split()) >= 2:
        name = HumanName(lname)
    if name is None:
        return row
    else:
        #save_manual_correction(row["to whom due | title"], fname, lname, row["to whom due | title"], name.first, name.last, 9, row["org_file"], row["org_index"], is_manual=False)
        # argument order: original first name, original last name, new first name, new last name
        add_changes(row["to whom due | title"], row["to whom due | title"], fname, lname, name.first, name.last, 9, row["org_file"], row["org_index"])
        row["to whom due | first name"] = name.first
        row["to whom due | last name"] = name.last
        return row

agg_debt = agg_debt.apply(lambda row: correct_full_names_in_column(row), axis=1)

In [None]:
#Filling in blank columns

def handle_blank_name_cols(row):
    fname = str(row["to whom due | first name"])
    lname = str(row["to whom due | last name"])
    if fname != "" and lname != "": return row # if neither is blank, return the row now
    if fname == "": fname = "UNDEFINED" # if there is no first name, make it undefined
    if lname == "": lname = "UNDEFINED" # if there is no last name, make it undefined
    #save_manual_correction(row["to whom due | title"], row["to whom due | first name"], row["to whom due | last name"], row["to whom due | title"], fname, lname, 7, row["org_file"], row["org_index"], is_manual=False)
    # argument order: original first name, original last name, new first name, new last name
    add_changes(row["to whom due | title"], row["to whom due | title"], row["to whom due | first name"], row["to whom due | last name"], fname, lname, 7, row["org_file"], row["org_index"])
    row["to whom due | first name"] = fname
    row["to whom due | last name"] = lname
    return row

agg_debt = agg_debt.apply(lambda row: handle_blank_name_cols(row), axis=1)

In [None]:
#Deceased individuals

agg_debt["deceased?"] = False

def check_deceased(row):
    fname = str(row["to whom due | first name"])
    lname = str(row["to whom due | last name"])
    markers = ("dead", "decease", "passed", "dec'd", "dec.", "decd", "deceasd")
    # a word is a deceased marker if it contains any of the substrings above (case-insensitive);
    # each marker needs its own `in` test, since `"a" or "b" in word` is always truthy
    is_marker = lambda word: any(m in word.lower() for m in markers)
    if not any(is_marker(w) for w in (fname + " " + lname).split()):
        return row
    # drop the marker words and keep the rest of the name
    new_fname = " ".join(w for w in fname.split() if not is_marker(w))
    new_lname = " ".join(w for w in lname.split() if not is_marker(w))
    row["deceased?"] = True
    add_changes(row["to whom due | title"], row["to whom due | title"], fname, lname, new_fname, new_lname, 12, row["org_file"], row["org_index"])
    #save_manual_correction(row["to whom due | title"], row["to whom due | first name"], row["to whom due | last name"], row["to whom due | title"], new_fname, new_lname, 12, row["org_file"], row["org_index"], is_manual=False)
    row["to whom due | first name"] = new_fname
    row["to whom due | last name"] = new_lname
    return row

agg_debt = agg_debt.apply(lambda row: check_deceased(row), axis=1)
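
One pitfall is worth illustrating when a word is tested against several markers: chaining string literals with `or` does not test each of them. The sketch below is illustrative only; `MARKERS` and `is_deceased_marker` are hypothetical names, and the marker list is taken from the cell above with the leading spaces dropped.

```python
MARKERS = ("dead", "decease", "passed", "dec'd", "dec.", "decd", "deceasd")


def is_deceased_marker(word):
    """True if the word contains any marker substring (case-insensitive)."""
    return any(marker in word.lower() for marker in MARKERS)


word = "John"
# Python reads this as: "dead" or ("decease" in word). A non-empty string
# literal is truthy, so the whole expression is truthy for every word.
always_true = bool("dead" or "decease" in word)

print(always_true)                  # True
print(is_deceased_marker("John"))   # False
print(is_deceased_marker("Dec'd"))  # True
```

Using `any(...)` over the marker tuple keeps the list easy to extend and tests every marker explicitly.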

In [None]:
#Abbreviations
abbreviations = {
    'And':'Andrew', 'Ant':'Anthony', 'Bart':'Bartholomew', 'Cha':'Charles', 'Dor':'Dorothy', 'Dot':'Dorothy', 'Doth':'Dorothy',
    'Edw':'Edward', 'Eliz':'Elizabeth', 'Geo':'George', 'H':'Henry', 'Herb':'Herbert', 'Ja':'James', 'Jn':'John', 'Marg':'Margaret', 
    'Mich':'Michael', 'Pat': 'Patrick', 'Rich':'Richard', 'Tho':'Thomas', 'W':'William', 'Will\'m':'William'
}

def handle_abbreviations(row):
    fn = str(row['to whom due | first name'])
    if fn in abbreviations:
        row['to whom due | first name'] = abbreviations[fn]
        # record changes
        add_changes(row['to whom due | title'], row['to whom due | title'], fn, 
                    row['to whom due | last name'], row['to whom due | first name'], 
                    row['to whom due | last name'], 5, row['org_file'], row['org_index'])
    
    return row

agg_debt = agg_debt.apply(lambda row: handle_abbreviations(row), axis=1)
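
The lookup above only matches a first name that is exactly equal to a dictionary key. A minimal sketch of a slightly more tolerant lookup follows, assuming that entries such as "Geo." with a trailing period or stray whitespace occur in the data, which the notebook does not confirm. `expand_abbreviation` is a hypothetical helper and `ABBREVIATIONS` is a small subset of the dictionary above.

```python
# Subset of the notebook's abbreviation dictionary, for illustration only.
ABBREVIATIONS = {'Geo': 'George', 'Tho': 'Thomas', 'Jn': 'John', "Will'm": 'William'}


def expand_abbreviation(first_name):
    """Expand a known abbreviation; return the input unchanged if it is not one."""
    # Assumption: surrounding whitespace and a trailing period carry no meaning.
    key = first_name.strip().rstrip('.')
    return ABBREVIATIONS.get(key, first_name)


print(expand_abbreviation('Geo.'))    # George
print(expand_abbreviation(' Tho '))   # Thomas
print(expand_abbreviation('George'))  # George
```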

In [None]:
#Loop that covers Objective 1 (aggregate data files)
def element_to_int(ele): # handles all kinds of NaNs (returns 0 for NaNs)
    # NaN never compares equal to anything, including itself, so test with pd.isna instead of ==
    if pd.isna(ele) or str(ele) == "nan": return 0
    return round(np.float64(ele))

def get_dollar(row): #gets the dollar from a row by checking both dollar columns
    dollar = 0
    ninety = 0
    #if dollar (90th) is a decimal, then split it
    if '.' in str(element_to_int(row[11])): # "amount | dollar"
        split = str(element_to_int(row[11])).split(".")
        dollar, ninety = element_to_int(split[0]), element_to_int(split[1])
    elif str(row[11]) == "": # "amount | dollar"
        if '.' in str(element_to_int(row[24])): # "amount | specie"
            split = str(element_to_int(row[24])).split() # "amount | specie"
            dollar, ninety = element_to_int(split[0]), element_to_int(split[1])
    else:
        dollar = element_to_int(row[11]) # "amount | dollar"
        ninety = element_to_int(row[12]) # "amount | 90th"
    return float(str(dollar) + "." + str(ninety))

def new_tup(old_row, new_dol, new_ninety, new_title): # returns a new tuple, specifically for totaled debt amounts (since you can't assign new values in tuples)
    return (old_row[0], old_row[1], old_row[2], old_row[3], old_row[4], old_row[5], old_row[6],
            new_title, old_row[8], old_row[9], old_row[10], new_dol, new_ninety, old_row[13],
            old_row[14], old_row[15], old_row[16], old_row[17], old_row[18], old_row[19],
            old_row[20], old_row[21], old_row[22], old_row[23], old_row[24], old_row[25],
            old_row[26], old_row[27], old_row[28], old_row[29], old_row[30], old_row[31])

agg_df = pd.DataFrame(columns=agg_debt.columns)
last_f, last_l, last_t = "", "", ""
last_row = None
in_run = False #True while the current row repeats the previous row's name
#save the sum of money
current_sum = 0

def write_last(): #writes the finished run (or single entry) that ends at last_row to agg_df
    if last_row is None: #nothing has been read yet
        return
    if in_run: #end of consecutive same-name entries: write one row holding the total
        print(f"{last_row[5]} {last_row[6]} is consecutively owed {current_sum}")
        split = str(round(current_sum, 2)).split(".") #round first so float noise does not leak into the 90th column
        agg_df.loc[len(agg_df.index)] = new_tup(last_row, int(split[0]), int(split[1]), last_t)
    else: #one unique entry: write it unchanged
        agg_df.loc[len(agg_df.index)] = last_row

for row in agg_debt.itertuples(name=None, index=False): #main processing function
    fname, lname = str(row[5]).strip(), str(row[6]).strip()
    title = "" if str(row[7]).strip().lower() == "nan" else str(row[7]).strip()
    if last_row is not None and fname == last_f and lname == last_l: #If the next name is the same as the last one, add onto the amount
        if not in_run: #first repeat: the total starts with the previous row's amount
            current_sum = get_dollar(last_row)
            in_run = True
        dol = get_dollar(row)
        print(f"adding {dol} to {fname} {lname}'s total")
        current_sum += dol
        last_t = title if title != "" else last_t
    else: #If the next name is not the same as the last one, write out what came before it
        write_last()
        current_sum, in_run, last_t = 0, False, title
    last_f, last_l = fname, lname
    last_row = row
write_last() #the last run (or entry) is still pending when the loop ends
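
The totals in the cell above go through a float and back to a string, which can blur the dollar and 90th parts. Below is a minimal sketch of one alternative, not the notebook's method: keep every amount as a whole number of ninetieths and only convert back at the end. It assumes that 90 ninetieths make one dollar and that each row can be reduced to a `(first_name, last_name, dollars, ninetieths)` tuple. `total_consecutive` and `NINETIETHS_PER_DOLLAR` are hypothetical names introduced here for illustration.

```python
from itertools import groupby

NINETIETHS_PER_DOLLAR = 90  # assumption: the "90th" column counts ninetieths of a dollar

def total_consecutive(rows):
    """Merge consecutive rows that share a name and total their amounts.

    rows: iterable of (first_name, last_name, dollars, ninetieths) tuples
          (a hypothetical, simplified layout of the agg_debt rows).
    Returns one (first_name, last_name, dollars, ninetieths) tuple per run.
    """
    merged = []
    # groupby only groups adjacent items, which matches "consecutive" entries
    for (first, last), run in groupby(rows, key=lambda r: (r[0].strip(), r[1].strip())):
        # integer arithmetic: no floating-point rounding can creep in
        total = sum(d * NINETIETHS_PER_DOLLAR + n for _, _, d, n in run)
        dollars, ninetieths = divmod(total, NINETIETHS_PER_DOLLAR)
        merged.append((first, last, dollars, ninetieths))
    return merged

example = [
    ("John", "Smith", 10, 45),
    ("John", "Smith", 5, 50),
    ("Ann", "Lee", 3, 0),
    ("John", "Smith", 1, 10),
]
print(total_consecutive(example))
```

Because `groupby` only merges adjacent rows, the second run of John Smith entries stays separate from the first, which mirrors the "consecutive names" rule used in this cell.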

In [None]:
#Ancestry

# import selenium (browser automation), fuzzy / phonetic string matching, and parallelization libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.expected_conditions import element_to_be_clickable, presence_of_element_located
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from phonetics import metaphone
from rapidfuzz import fuzz
from joblib import Parallel, delayed, cpu_count
from itertools import zip_longest
import time 
import getpass

# options
options = Options()
options.add_argument('--headless')
options.add_argument("--window-size=1000,1000")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('--no-sandbox')  

# install driver - uncomment when necessary
'''
driver = webdriver.Chrome(service=Service(ChromeDriverManager(version='114.0.5735.16').install()), options=options)
wait = WebDriverWait(driver, 30)
'''

# voter records and censuses available for every state 
records = {
    'nh':['https://www.ancestrylibrary.com/search/collections/5058/'],
    'nj':['https://www.ancestrylibrary.com/search/collections/2234/', 
          'https://www.ancestrylibrary.com/search/collections/3562/'],
    'ny':['https://www.ancestrylibrary.com/search/collections/5058/'],
    'ma':['https://www.ancestrylibrary.com/search/collections/5058/'], 
    'ct':['https://www.ancestrylibrary.com/search/collections/5058/'], 
    'va':['https://www.ancestrylibrary.com/search/collections/2234/', 
         'https://www.ancestrylibrary.com/search/collections/3578/'], 
    'pa':['https://www.ancestrylibrary.com/search/collections/2702/',
         'https://www.ancestrylibrary.com/search/collections/2234/',
         'https://www.ancestrylibrary.com/search/collections/3570/'],
    'md':['https://www.ancestrylibrary.com/search/collections/3552/'],
    'nc':['https://www.ancestrylibrary.com/search/collections/3005/', 
         'https://www.ancestrylibrary.com/search/collections/2234/'],
    'ga':['https://www.ancestrylibrary.com/search/collections/2234/'],
    'ri':['https://www.ancestrylibrary.com/search/collections/3571/']
}

# ancestry has unique urls for each state
residence_urls = {
    'nh':'_new+hampshire-usa_32',
    'nj':'_new+jersey-usa_33', 
    'ny':'_new+york-usa_35',
    'ma':'_massachusetts-usa_24',
    'ct':'_connecticut-usa_9',
    'va':'_virginia-usa_49', 
    'pa':'_pennsylvania-usa_41',
    'md':'_maryland-usa_23',
    'nc':'_north+carolina-usa_36',
    'ga':'_georgia-usa_13',
    'ri':'_rhode+island-usa_42'
}

fixes = {} #record necessary name changes here
checked1 = [] #multiple debt entries for the same person: don't search these names again when comparing
checked0 = [] #multiple debt entries for the same person: don't search these names again
rerun_rows = [] #ancestry crashed trying to find these names

# remove 'cs' (congress) and 'f' (foreign officers); these are not states, but specific regiments / types of officers
# also remove 'de', which has no collections listed in 'records'
agg_debt_copy = agg_debt[(agg_debt['state'] != 'cs') & (agg_debt['state'] != 'f') & (agg_debt['state'] != 'de')]

# split dataframe based on state; makes searching faster
agg_debt_sp = agg_debt_copy.groupby('state')
agg_debts_st = [agg_debt_sp.get_group(x) for x in agg_debt_sp.groups]


for x in agg_debt_sp.groups:
    print(x)
    
# add an empty entry for each state to the 'fixes' dict
for st in agg_debt_sp.groups:
    fixes[st] = {}
    
netid_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[1]/input'
password_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[2]/input'
login_btn0_xpath = '/html/body/main/div/div/div/a'
login_btn1_xpath = '/html/body/div[1]/div[2]/section/div[1]/div/form/fieldset/div[3]/button'

# ask for password and username 
username = input('username: ')
password = getpass.getpass(prompt='password: ')
driver_objs = {}
# create a new driver object for each state
for st in agg_debt_sp.groups:
    driver_objs[st] = [webdriver.Chrome(service=Service(ChromeDriverManager(version='114.0.5735.16').install()), options=options)]
# create a new wait object for each state
for st in agg_debt_sp.groups:
    webdriver_obj = driver_objs[st][0]
    driver_objs[st].append(WebDriverWait(webdriver_obj, 30))
# for each driver obj: log in through Emory's library to reach its Ancestry subscription
for st in agg_debt_sp.groups:
    webdriver_obj = driver_objs[st][0]
    wait_obj = driver_objs[st][1]
    
    # go to emory's library 
    webdriver_obj.get('https://guides.libraries.emory.edu/ALE')
    wait_obj.until(element_to_be_clickable((By.XPATH, login_btn0_xpath))).click()
    
    # input login information and click 'login'
    netid_input = wait_obj.until(element_to_be_clickable((By.XPATH, netid_xpath)))
    netid_input.click()
    netid_input.send_keys(username)
    pass_input = wait_obj.until(element_to_be_clickable((By.XPATH, password_xpath)))
    pass_input.click()
    pass_input.send_keys(password) 
    wait_obj.until(element_to_be_clickable((By.XPATH, login_btn1_xpath))).click()
    
    webdriver_obj.get('https://www.ancestrylibrary.com/search/collections/5058/')
    
    print(webdriver_obj.current_url)

def ancestry_cleaning(agg_debt_st, state):
    # retrieve selenium chromedrivers associated with that state
    st_driver = driver_objs[state][0]
    wait_driver = driver_objs[state][1]
    # run ancestry search on agg debt file
    # agg_debt_st.swifter.apply(lambda row0: compare_strings(agg_debt_st, row0['to whom due | first name'], row0['to whom due | last name'], state, row0), axis=1) # using apply
    agg_debt_clean = compare_strings_vect(agg_debt_st['to whom due | first name'].astype(str), agg_debt_st['to whom due | last name'].astype(str), state, st_driver, wait_driver) # using vectorization
    return agg_debt_st

# loop through the state agg_debt one more time; compare row0 (original row) with all the other rows (row1)
def compare_strings(fn0, ln0, state, st_driver, wait_driver):
    # make sure we haven't checked this name before (handles people who share the same fn & ln & live in same state) 
    # name0 = row0['to whom due | first name'] + ' ' + row0['to whom due | last name'] # uncomment when using apply
    agg_debt_st = agg_debt_sp.get_group(state)
    
    if (fn0, ln0, state) not in checked0:
        # compare both strings 
        # agg_debt_st.swifter.apply(lambda row1: fuzzy_comparison(fn0, ln0, row1['to whom due | first name'], row1['to whom due | last name'], state, row0, row1), axis=1) # using apply
        fuzzy_comparison_vect(fn0, ln0, agg_debt_st['to whom due | first name'].astype(str), agg_debt_st['to whom due | last name'].astype(str), state, st_driver, wait_driver) # using vectorization
        checked0.append((fn0, ln0, state))

# compare two strings using fuzzy string matching 
def fuzzy_comparison(fn0, ln0, fn1, ln1, state, st_driver, wait_driver):
    if (fn1, ln1, state) not in checked1:
        
        name0 = fn0 + ' ' + ln0
        name1 = fn1 + ' ' + ln1

        # use phonetic similarity (compares similar sounding names)
        meta0 = metaphone(name0.lower()) 
        meta1 = metaphone(name1.lower())
        phonetic_score = fuzz.ratio(meta0, meta1)

        # use fuzzy string similarity (compares similar spellings between names)
        fuzz_score = fuzz.ratio(name0, name1) 

        # check if phonetic score and fuzzy string score both meet threshold, both names are not the same  
        if phonetic_score > 90 and fuzz_score > 90 and name0 != name1:
            search_ancestry(fn0, ln0, fn1, ln1, name0, name1, state, st_driver, wait_driver) 
            checked1.append((fn1, ln1, state)) # record that we have checked this name

# look up both names in ancestry's database
def search_ancestry(fn0, ln0, fn1, ln1, name0, name1, state, driver, wait):
    # loop through state urls 
    for url in records[state]:        
        try:
            # search person-0
            url0 = url + '?name=' + fn0 + '_' + ln0 + '&name_x=ps&residence=1780' + residence_urls[state] + '&residence_x=10-0-0_1-0'
            driver.get(url0) 
                        
            # results were found for person0
            try:
                # use xpath to find result text
                # result0 = wait.until(presence_of_element_located((By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/table/tbody/tr[2]/td[2]/span'))).text
                # use class_name to find result text
                result0 = wait.until(presence_of_element_located((By.CLASS_NAME, 'srchHit'))).text
            # no results were found; keep entries separate  
            except Exception: # e.g. the wait timed out; a bare except would also swallow KeyboardInterrupt
                result0 = ''
            
            # search person-1
            url1 = url + '?name=' + fn1 + '_' + ln1 + '&name_x=ps&residence=1780' + residence_urls[state] + '&residence_x=10-0-0_1-0'
            driver.get(url1)
                        
            # results were found for person1
            try: 
                # use xpath to find result text
                # result1 = wait.until(presence_of_element_located((By.XPATH, '/html/body/div[3]/div/div/div/section[1]/div[1]/table/tbody/tr[2]/td[2]/span'))).text
                # use class_name to find result text
                result1 = wait.until(presence_of_element_located((By.CLASS_NAME, 'srchHit'))).text
            # no results were found; keep entries separate
            except Exception: # e.g. the wait timed out; a bare except would also swallow KeyboardInterrupt
                result1 = ''
                
            '''
            compare results:
            if both results are empty, do not add to fixes dict 
            if both results are different, do not add to fixes dict
            if both results are the same, add to fixes dict
                find correct name
                if name0 = result0 and result1 : {name1 : name0}
                if name1 = result1 and result0 : {name0 : name1} 
            '''
            
            if result0 == result1 and result0 != '' and result1 != '':
                if name0 == result0 and name0 == result1: # name0 must be the correct version of the name 
                    fixes[state][(fn1, ln1, name1)] = (fn0, ln0, name0) # convert name1 to name0  
                    # record change
                    '''
                    add_changes(row1['to whom due | title'], row1['to whom due | title'], fn1, ln1, 
                       fn0, ln0, 6, row1['org_file'], row1['org_index'])
                    '''

                elif name1 == result0 and name1 == result1: # name1 must be the correct version of the name 
                    fixes[state][(fn0, ln0, name0)] = (fn1, ln1, name1) # convert name0 to name1
                    # record change
                    '''
                    add_changes(row0['to whom due | title'], row0['to whom due | title'], fn0, ln0, 
                       fn1, ln1, 6, row0['org_file'], row0['org_index'])
                    '''
            
            print('---------------------------+')
            print('Summary')
            print('driver=')
            print(driver)
            print('name0=' + str(name0))
            print('name1=' + str(name1))
            print('fn0=' + str(fn0))
            print('ln0=' + str(ln0))
            print('fn1=' + str(fn1))
            print('ln1=' + str(ln1))
            print('url-0=' + str(url0))
            print('url-1=' + str(url1))
            print('result0=' + str(result0))
            print('result1=' + str(result1))
            print('state=' + str(state))
            print('fixes length=' + str(len(fixes[state])))
            print('rerun rows length=' + str(len(rerun_rows)))
            print('---------------------------+')
        
        # there was an error trying to access ancestry's records
        except Exception as e:
            print('---------------------------+')
            print('Error')
            print(e)
            print('name0=' + str(name0))
            print('name1=' + str(name1))
            print('---------------------------+')
            rerun_rows.append([fn0, ln0, fn1, ln1, name0, name1, state]) 

# record how long it takes to run ancestry search; useful information to see effectiveness of different methods 
start = time.time()

# vectorize our functions 
compare_strings_vect = np.vectorize(compare_strings)
fuzzy_comparison_vect = np.vectorize(fuzzy_comparison)

# initialize a parallelization job; the idea is to have one worker thread handle one state's agg debt file

ancestry_calls = [delayed(ancestry_cleaning)(agg_debt_sp.get_group(st), st) for st in agg_debt_sp.groups]
results = Parallel(n_jobs=-1, backend="threading")(ancestry_calls) 

# split nj's agg_debt file to test parallelization
# agg_debt_nj = agg_debt_sp.get_group('nj')
# agg_debt_nj_sp = np.split(agg_debt_nj, 7)

# initialize parallelization (nj)
# ancestry_calls = [delayed(ancestry_cleaning)(df, 'nj') for df in agg_debt_nj_sp]
# results = Parallel(n_jobs=-1, backend="threading")(ancestry_calls) 

# try without parallelization - try out on [nj] only
# agg_debt_nj = ancestry_cleaning(agg_debt_sp.get_group('nj'), 'nj')

end = time.time()
print(end - start)
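
The comparison rule described in `search_ancestry` can be pulled out of the browser code so that it can be checked without Selenium. The sketch below is one possible way to do that, under stated assumptions: `build_search_url` and `pick_canonical` are hypothetical helper names, the query-string layout is copied from the cell above, and percent-encoding the names with `urllib.parse.quote` is an added assumption about what the search page accepts.

```python
from urllib.parse import quote

def build_search_url(base_url, first, last, residence_suffix, year=1780):
    """Build a collection search URL in the same layout used by search_ancestry."""
    # quote() percent-encodes spaces and punctuation that could otherwise break the query string
    name = quote(first.strip()) + '_' + quote(last.strip())
    return (base_url + '?name=' + name + '&name_x=ps&residence=' + str(year)
            + residence_suffix + '&residence_x=10-0-0_1-0')

def pick_canonical(name0, name1, result0, result1):
    """Return (name_to_replace, correct_name), or None when no fix should be recorded."""
    if not result0 or result0 != result1:
        return None  # no hits, or the two searches disagree: keep the entries separate
    if name0 == result0:
        return (name1, name0)  # name0 matches the record, so convert name1 to name0
    if name1 == result0:
        return (name0, name1)  # name1 matches the record, so convert name0 to name1
    return None  # both searches agree on a third spelling: leave it for manual review

url = build_search_url('https://www.ancestrylibrary.com/search/collections/5058/',
                       'John', 'Van Dyke', '_new+york-usa_35')
print(url)
print(pick_canonical('Jon Smith', 'John Smith', 'John Smith', 'John Smith'))
```

Keeping the decision in a pure function means the threshold and matching rules can be tested on a handful of known name pairs before any browser session is opened.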