# Cleaning Names

This notebook cleans the names of individuals in the pre-1790 debt certificate records. The problems it aims to fix are listed [here](https://docs.google.com/document/d/1pcSQfWNll6K9tl-_rB4lztN0TsZsclU9vOnbyQob-Zs/edit).

## Merge into One Dataframe

Combine all the individual state files into one CSV file for cleaning. 

There are four categories of debt files:
1. liquidated debt certificates
2. loan office certificates
3. marine liquidated debt certificates
4. Pierce certificates
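
As a rough sketch of the merge step, the per-state tables can be stacked with `pd.concat` once their column names agree. The two small frames below are stand-ins rather than the real data (the `de` row is a made-up placeholder), and the output path in the final comment is hypothetical.

```python
import pandas as pd

# Stand-in tables; the real ones are read from the Excel files further down.
ny = pd.DataFrame({"to whom due | first name": ["John"],
                   "to whom due | last name": ["Newkuk"]})
de = pd.DataFrame({"to whom due | last name": ["Placeholder"],  # made-up row
                   "to whom due | first name": ["Example"]})

frames = {"ny": ny, "de": de}

# Tag each table with its state, then stack them; pd.concat aligns columns by name,
# so a different column order in one file does not matter.
combined = pd.concat(
    [df.assign(state=state) for state, df in frames.items()],
    ignore_index=True,
)

# combined.to_csv("combined_debt.csv", index=False)  # hypothetical output path
```

Because `pd.concat` aligns on column names, any column that is spelled differently in one state's file ends up as a separate, mostly empty column, which is why the titles are normalised first.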

In [36]:
# import all the necessary packages
import pandas as pd 
import numpy as np
import nltk
import re
from fuzzywuzzy import fuzz
from nameparser import HumanName

In [37]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to C:\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to C:\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [38]:
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

In [69]:
def clean_table(table, drp_cols, man_fix=None):
    # man_fix: optional dict of manual column renames; defaults to no renames
    table.drop(columns=drp_cols, inplace=True)
    table.columns = table.columns.to_flat_index() # two-row header -> (top, bottom) tuples
    table.rename(columns=lambda x: x[0].lower() + ' | ' + x[1].lower(), inplace=True) # lowercase column titles
    table.rename(columns={'state | ' : 'state'}, inplace=True) # 'state' has no second header row
    table.rename(columns=man_fix or {}, inplace=True)
    return table
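
To make the renaming in `clean_table` easier to follow, here is a small sketch on a toy table with a two-row header. The header titles and the single row are illustrative stand-ins, not a copy of any spreadsheet.

```python
import pandas as pd

# Toy two-row header, similar in shape to the Excel sheets (illustrative only).
cols = pd.MultiIndex.from_tuples([
    ("To Whom Due", "First Name"),
    ("To Whom Due", "Last Name"),
    ("Amount", "Dollars"),
])
table = pd.DataFrame([["John", "Newkuk", 2.0]], columns=cols)
table["state"] = "ny"  # on MultiIndex columns this becomes ("state", "")

# Flatten the header to tuples, then join each tuple into one lowercase title.
table.columns = table.columns.to_flat_index()
table = table.rename(columns=lambda x: x[0].lower() + " | " + x[1].lower())
table = table.rename(columns={"state | ": "state"})

flat_titles = list(table.columns)
print(flat_titles)
```

The `state` column needs its own rename because it has an empty second header level, which leaves a dangling `' | '` after the join.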

In [71]:
# handle the liquidated debt certificates first for each file and merge into 1 dataframe
ct_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_CT.xlsx", header=[10,11])
de_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_DE.xlsx", header=[9,10])
ma_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_MA.xlsx", header=[10,11])
nh_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_NH.xlsx", header=[10,11])
nj_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_NJ.xlsx", header=[9,10])
ny_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_NY.xlsx", header=[10,11])
pa_stelle_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_PA_stelle.xlsx", header=[10,11])
pa_story_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_PA_story.xlsx", header=[10,11])
ri_debt = pd.read_excel("../../data_raw/pre1790/liquidated_debt_certificates_RI.xlsx", header=[10,11])

# add a state column to each dataframe
ct_debt['state'] = 'ct'
de_debt['state'] = 'de'
ma_debt['state'] = 'ma'
nh_debt['state'] = 'nh'
nj_debt['state'] = 'nj'
ny_debt['state'] = 'ny'
pa_stelle_debt['state'] = 'pa'
pa_story_debt['state'] = 'pa'
ri_debt['state'] = 'ri'

# clean New York's column titles first, then Connecticut's and Delaware's
ny_drp_cols = ['Page', 'JPEG number', 'Number', 'Letter', 'Line Strike Through?']
ny_debt = clean_table(ny_debt, ny_drp_cols, {})

ct_drp_cols = ['Register Page', 'JPEG number', 'Number', 'Letter', 'Line Strike Thorugh?']
ct_debt = clean_table(ct_debt, ct_drp_cols, {})

de_drp_cols = ['Unnamed: 0_level_0', 'Unnamed: 1_level_0', 'Unnamed: 2_level_0', ('Amount', 'Strike Through'),
            ('Amount', 'Note')]
de_debt = clean_table(de_debt, de_drp_cols, {})


In [72]:
print(de_debt.columns)

Index(['date of the certificates | date', 'date of the certificates | month',
       'date of the certificates | year', 'to whom due | title',
       'to whom due | first name', 'to whom due | last name',
       'time when the debt became due | date',
       'time when the debt became due | month',
       'time when the debt became due | year', 'amount | dollars',
       'amount | 90th', 'state'],
      dtype='object')


In [73]:
print(ny_debt.dtypes)

date of the certificate | month          float64
date of the certificate | day            float64
date of the certificate | year           float64
to whom due | first name                  object
to whom due | last name                   object
to whom due | title                       object
to whom due | first name.1                object
to whom due | last name.1                 object
to whom due | title.1                    float64
time when the debt became due | month    float64
time when the debt became due | day      float64
time when the debt became due | year      object
amount | dollars                         float64
amount | 90th                             object
state                                     object
dtype: object


## Company Names

Look up company names and record their owners. Companies appear in several forms.

The simplest case looks like 'James Vernon & Co.': if '& co' appears anywhere in the first name column, the entry is most likely a company, and the text before '& co' is the owner's name. Entries ending in '& others' or '& several others' are treated the same way.
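
A minimal sketch of this rule as a standalone helper is shown here. The names `COMPANY_SUFFIX` and `strip_company_suffix` are hypothetical, and the sketch assumes the marker sits at the end of the cell, which is narrower than the "anywhere in the string" check described above.

```python
import re

# Matches a trailing "& co", "& co.", "& others" or "& several others", ignoring case.
COMPANY_SUFFIX = re.compile(r"\s*&\s*(?:co\.?|several others|others)\s*$", re.IGNORECASE)

def strip_company_suffix(name):
    """Return (owner_name, is_company) for a raw first-name cell."""
    cleaned, n_subs = COMPANY_SUFFIX.subn("", str(name))
    return cleaned.strip(), n_subs > 0

print(strip_company_suffix("James Vernon & Co."))
print(strip_company_suffix("John Newkuk"))
```

Returning the flag alongside the cleaned name keeps the information that the debt was originally held under a company.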

In [33]:
# manual corrections for entries that the automatic rule would parse incorrectly
changes = {
    'Henry Mc Clellen & Henry & co' : 'Henry Mc Clellen & Co'
}

In [34]:
def handle_comp_name(row):
    fname = row['to whom due | first name']

    # apply manual corrections before looking for company markers
    if fname in changes:
        print(fname)
        fname = changes[fname]

    fname_c = str(fname).lower()
    markers = ['& several others', '& others', '& co']
    if any(marker in fname_c for marker in markers):
        # remove every marker so that only the owner's name is parsed
        for marker in markers:
            fname_c = fname_c.replace(marker, '')
        name = HumanName(fname_c)
        row['to whom due | first name'] = name.first
        row['to whom due | last name'] = name.last
        row['under company'] = True # note that the original debt entry was held by a company
        print(row)

    return row

name_cols = ['to whom due | first name', 'to whom due | last name', 'under company']
ny_debt['under company'] = np.nan
ny_debt[name_cols] = ny_debt[name_cols].apply(handle_comp_name, axis=1)

to whom due | first name     henry
to whom due | last name     wisner
under company                 True
Name: 491, dtype: object
to whom due | first name     henry
to whom due | last name     wisner
under company                 True
Name: 492, dtype: object
to whom due | first name     henry
to whom due | last name     wisner
under company                 True
Name: 493, dtype: object
to whom due | first name     henry
to whom due | last name     wisner
under company                 True
Name: 494, dtype: object
to whom due | first name     henry
to whom due | last name     wisner
under company                 True
Name: 495, dtype: object
to whom due | first name     henry
to whom due | last name     wisner
under company                 True
Name: 496, dtype: object
to whom due | first name     henry
to whom due | last name     wisner
under company                 True
Name: 497, dtype: object
to whom due | first name     henry
to whom due | last name     wisner
under company       

## Cleaning Entries with Two Names

Some debt entries have two names in a single cell, for example NY_2422: "Messes Williamson & Beckman". The plan is to split the two names across the first name and last name columns, joining the two people's values with " | ".
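
One detail worth illustrating is how the cell is split. Splitting on the bare substring `and` also cuts surnames that contain those letters, so the sketch below splits only on a standalone `&` or `and` surrounded by whitespace. The helper name `split_names` is hypothetical.

```python
import re

# Split only on a standalone "&" or "and" surrounded by whitespace,
# so the letters "and" inside a surname are left alone.
SEPARATOR = re.compile(r"\s+(?:&|and)\s+")

def split_names(cell):
    """Return the lowercased parts of a cell that holds several names."""
    parts = SEPARATOR.split(cell.strip().lower())
    return [part.strip() for part in parts if part.strip()]

print(re.split("&|and", "van zandt & kittletas"))   # ['van z', 't ', ' kittletas']
print(split_names("van zandt & kittletas"))         # ['van zandt', 'kittletas']
print(split_names("Messes Williamson & Beckman"))   # ['messes williamson', 'beckman']
```

Each part can then be passed to `HumanName` separately, as the cell below does.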

In [20]:
changes = {
    'van zandt & kittletas' : ['', 'van zandt | kittletas'],
    'trustees of & davids church':['trustees of & davids church', '']
}

In [21]:
def handle_two_name(row):
    name = str(row['to whom due | first name']).lower()
    if (' & ' in name) or (' and ' in name):
        # split on a standalone '&' or 'and' so that surnames containing
        # the letters 'and' (e.g. 'zandt') are not cut apart
        parts = re.split(r'\s+(?:&|and)\s+', name)
        human_name_1 = HumanName(parts[0].strip())
        human_name_2 = HumanName(parts[1].strip())

        if name not in changes:
            if human_name_1.first != '' and human_name_2.first != '':
                row['to whom due | first name'] = human_name_1.first + " | " + human_name_2.first
            else:
                row['to whom due | first name'] = human_name_1.first + human_name_2.first

            if human_name_1.last != '' and human_name_2.last != '':
                row['to whom due | last name'] = human_name_1.last + " | " + human_name_2.last
            else:
                row['to whom due | last name'] = human_name_1.last + human_name_2.last
        else:
            # manual override: changes[name] is [first name, last name]
            row['to whom due | first name'] = changes[name][0]
            row['to whom due | last name'] = changes[name][1]

        row['multiple persons'] = True # flag this row only, not the whole column

        print("old: " + name)
        print("new fn: " + row['to whom due | first name'])
        print("new ln: " + row['to whom due | last name'] +"\n")

    return row

ny_debt['multiple persons'] = np.nan
ny_debt = ny_debt.apply(handle_two_name, axis=1) # keep the modified rows
ny_debt

old: harsen & ham
new fn: harsen | ham
new ln: 

old: harsen & ham
new fn: harsen | ham
new ln: 

old: messes smith & van buren
new fn: messes | van
new ln: smith | buren

old: messes bogart & van beuren
new fn: messes | van
new ln: bogart | beuren

old: messes smith & van buren
new fn: messes | van
new ln: smith | buren

old: messes smith & van buren
new fn: messes | van
new ln: smith | buren

old: messes williamson & beckman
new fn: messes | beckman
new ln: williamson

old: robinson & hale
new fn: robinson | hale
new ln: 

old: melgret & george ox
new fn: melgret | george
new ln: ox

old: robinson & hale
new fn: robinson | hale
new ln: 

old: robinson & hale
new fn: robinson | hale
new ln: 

old: robinson & hale
new fn: robinson | hale
new ln: 

old: robinson & hale
new fn: robinson | hale
new ln: 

old: bagart & dawrv
new fn: bagart | dawrv
new ln: 

old: quackenbush & dowe
new fn: quackenbush | dowe
new ln: 

old: whitbeck & fonda
new fn: whitbeck | fonda
new ln: 

old: van zandt &

Unnamed: 0,date of the certificate | month,date of the certificate | day,date of the certificate | year,to whom due | first name,to whom due | last name,to whom due | title,to whom due | first name.1,to whom due | last name.1,to whom due | title.1,time when the debt became due | month,time when the debt became due | day,time when the debt became due | year,amount | dollars,amount | 90th,state,under company,multiple persons
0,2.0,20.0,1784.0,John,Newkuk,,,,,4.0,20.0,1780,2.000000e+00,,ny,,
1,2.0,20.0,1784.0,Mathias,Teller,,,,,4.0,11.0,1780,8.000000e+00,87,ny,,
2,2.0,20.0,1784.0,Seth,Marvin,,,,,4.0,14.0,1780,1.800000e+01,,ny,,
3,2.0,20.0,1784.0,James,Gallaway,,,,,5.0,31.0,1780,2.800000e+01,80,ny,,
4,2.0,20.0,1784.0,Israel,Rogers,,,,,5.0,17.0,1780,1.500000e+01,,ny,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7306,3.0,31.0,1787.0,John,Myers,,,,,1.0,1.0,1777,2.190000e+02,33,ny,,
7307,3.0,31.0,1787.0,Luther,Halsey,,,,,1.0,1.0,1777,3.900000e+01,52,ny,,
7308,,,,,,,,,,,,,,,ny,,
7309,,,,,,,,,,,,,1.232992e+06,,ny,,


## Handle Abbreviations of a Name

Some entries record a first name as a handwritten abbreviation (for example, 'Geo' for George). To fix these, keep a dictionary that maps each abbreviation to the full name, check whether each entry's first name is in the dictionary, and replace it if so.
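
The lookup can be sketched as a small standalone function. It uses a three-entry subset of the dictionary defined in the next cell; the function name `expand_first_name` is hypothetical, and stripping a trailing period is an assumption about how abbreviations may have been transcribed.

```python
# Subset of the notebook's abbreviation dictionary, repeated here so the sketch runs on its own.
abbreviations = {"Geo": "George", "Tho": "Thomas", "W": "William"}

def expand_first_name(name):
    """Return the full first name if `name` is a known abbreviation, else return it unchanged."""
    if not isinstance(name, str):
        return name  # leave missing values (NaN/None) untouched
    key = name.strip().rstrip(".")  # assumption: a trailing period may follow the abbreviation
    return abbreviations.get(key, name)

print(expand_first_name("Geo"))   # George
print(expand_first_name("John"))  # John
```

On the dataframe, a function like this could be applied to a single column with `Series.map`, which avoids iterating over whole rows.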

In [22]:
abbreviations = {
    'And':'Andrew', 'Ant':'Anthony', 'Bart':'Bartholomew', 'Cha':'Charles', 'Dor':'Dorothy', 'Dot':'Dorothy', 'Doth':'Dorothy',
    'Edw':'Edward', 'Eliz':'Elizabeth', 'Geo':'George', 'H':'Henry', 'Herb':'Herbert', 'Marg':'Margaret', 'Mich':'Michael', 
    'Pat': 'Patrick', 'Rich':'Richard', 'Tho':'Thomas', 'W':'William'
}

In [64]:
def handle_abbreviations(row):
    fname = str(row['to whom due | first name'])
    if fname in abbreviations:
        print(row.name) # row.name is the row's index label
        print(fname)
        row['to whom due | first name'] = abbreviations[fname] # expand the abbreviation

    return row

ny_debt = ny_debt.apply(handle_abbreviations, axis=1)
ny_debt

Unnamed: 0,date of the certificate | month,date of the certificate | day,date of the certificate | year,to whom due | first name,to whom due | last name,to whom due | title,to whom due | first name.1,to whom due | last name.1,to whom due | title.1,time when the debt became due | month,time when the debt became due | day,time when the debt became due | year,amount | dollars,amount | 90th,state
0,2.0,20.0,1784.0,John,Newkuk,,,,,4.0,20.0,1780,2.000000e+00,,ny
1,2.0,20.0,1784.0,Mathias,Teller,,,,,4.0,11.0,1780,8.000000e+00,87,ny
2,2.0,20.0,1784.0,Seth,Marvin,,,,,4.0,14.0,1780,1.800000e+01,,ny
3,2.0,20.0,1784.0,James,Gallaway,,,,,5.0,31.0,1780,2.800000e+01,80,ny
4,2.0,20.0,1784.0,Israel,Rogers,,,,,5.0,17.0,1780,1.500000e+01,,ny
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7306,3.0,31.0,1787.0,John,Myers,,,,,1.0,1.0,1777,2.190000e+02,33,ny
7307,3.0,31.0,1787.0,Luther,Halsey,,,,,1.0,1.0,1777,3.900000e+01,52,ny
7308,,,,,,,,,,,,,,,ny
7309,,,,,,,,,,,,,1.232992e+06,,ny
