# Proposal - MLND Capstone Project
### Chad Acklin
***


## Data Acquisition
As noted in the project proposal, privacy concerns rule out real-world personally identifiable datasets, so this project uses synthetically generated data. The initial dataset was generated with the data generator published in the [Freely Extensible Biomedical Record Linkage](http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/node70.html) (FEBRL) package. The FEBRL tool generates a dataset of demographic data and introduces noise that mimics OCR errors, typographic errors, and phonetic misspellings.

A dataset of 30,000 originals and 15,000 matches was generated, stored as a CSV file, and loaded below. Originals have an id of the form `rec-X-org` and matches have an id of the form `rec-X-dup-X`.

In [1]:
import pandas as pd
import numpy as np
import jellyfish
import Levenshtein
import matplotlib.pyplot as plt
%matplotlib inline
from time import time

In [2]:
col = ['id', 'nationality', 'gender', 'age', 'dob', 'title',
       'first_name', 'last_name', 'state', 'city', 'zip',
       'street_number', 'address_1', 'address_2', 'phone']
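# names=col relabels the columns; the file's own header line is therefore read as a
# data row (row 0 in the output below) and drops out later when rows are filtered on OrgDup
df = pd.read_csv('FEBRL_sample_data.csv', encoding = 'latin1', names = col)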
df.head()

| | id | nationality | gender | age | dob | title | first_name | last_name | state | city | zip | street_number | address_1 | address_2 | phone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rec_id | culture | sex | age | date_of_birth | title | given_name | surname | state | suburb | postcode | street_number | address_1 | address_2 | phone_number |
| 1 | rec-0-dup-0 | | f | 28 | 19901218 | | caitlin | chappel | | pres | 4504 | | g'lbbes street | | 087631 3909 |
| 2 | rec-0-dup-1 | | f | 2 | 19901218 | | cait in | chap pl | | preston | 4504 | | gibbess reet | | 087631 909 |
| 3 | rec-0-org | | f | 28 | 19901218 | | caitlin | chappel | | preston | 4504 | | gibbes street | | 08 76313909 |
| 4 | rec-1-org | ind | f | | 19630324 | | oscar | murjani | vic | bright | 2060 | 14 | aspinall street | | 618 53569883 |



After loading the dataset, several additional features are calculated, including phonetic representations of first name, last name, and city. These were created using the metaphone algorithm available in the jellyfish package.

The data is then split into two dataframes, one for originals and one for duplicates.

In [4]:
df[['prefix','trueRecordID', 'OrgDup', 'seqNum']] = df.id.str.split('-', expand=True)
df.drop('prefix', axis=1, inplace=True)
df.fillna ('', inplace=True)
df['last_name_met'] = df['last_name'].str.replace(' ', '').apply(lambda x: jellyfish.metaphone(x))
df['first_name_met'] = df['first_name'].str.replace(' ', '').apply(lambda x: jellyfish.metaphone(x))
df['city_met'] = df['city'].str.replace(' ', '').apply(lambda x: jellyfish.metaphone(x))
df['phone'] = df['phone'].str.replace(' ', '')
df['predicateLastName'] = df['last_name'].str[:2]
df['predicateFirstName'] = df['first_name'].str[:2]


dfDup = df[df.OrgDup == 'dup']
dfOrg = df[df.OrgDup == 'org']
dfDup.columns = [str(col) + '_match' for col in dfDup.columns]
dfOrg.columns = [str(col) + '_org' for col in dfOrg.columns]
print('Original Recordset:', dfOrg.shape)
print('Recordset of matches:', dfDup.shape)

Original Recordset: (30000, 23)
Recordset of matches: (15000, 23)
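
The ID handling in the cell above can be illustrated without pandas. The following is a minimal sketch; the names `RecordId` and `parse_record_id` are hypothetical and do not appear in the notebook. It shows that an original ID splits into three parts while a duplicate ID splits into four, and that two records describe the same entity when the middle component agrees.

```python
from typing import NamedTuple, Optional


class RecordId(NamedTuple):
    """Parts of a FEBRL-style id such as 'rec-0-dup-1' (hypothetical helper type)."""
    true_record_id: str
    org_dup: str
    seq_num: Optional[str]


def parse_record_id(rec_id: str) -> RecordId:
    # 'rec-0-dup-1' -> ['rec', '0', 'dup', '1']; 'rec-0-org' -> ['rec', '0', 'org']
    parts = rec_id.split('-')
    # parts[0] is the constant 'rec' prefix and is discarded
    seq_num = parts[3] if len(parts) > 3 else None
    return RecordId(parts[1], parts[2], seq_num)


org = parse_record_id('rec-0-org')
dup = parse_record_id('rec-0-dup-1')
other = parse_record_id('rec-11006-dup-0')

print(org)
print(dup)
# Same underlying entity when the true record id agrees
print(org.true_record_id == dup.true_record_id)    # True
print(org.true_record_id == other.true_record_id)  # False
```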


## Blocking

The demographic fields are used to "block" potential matches between the two datasets. Some trial and error was needed to balance reducing the number of candidate pairs against retaining all true matching pairs. If every original were compared with every potential match, the resulting dataset would have 450M rows (30,000 x 15,000). Blocking limits the number of comparisons required.

After blocking, the paired dataset contains slightly more than 4.18M rows and misses only 11 of the 15,000 true matches. This is an acceptable trade-off for this task, and the blocked pairs become the basis for building the model.
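
The idea behind blocking can be sketched on a handful of records in plain Python. This is a toy illustration under stated assumptions, not the notebook's implementation: the helper `block_on` is hypothetical, only two blocking keys are used, and the record `rec-1-dup-0` is invented for the example (the other values come from the sample rows shown earlier). Each key produces its own set of candidate pairs, and the union of those sets plays the role of concatenating the merged frames and dropping duplicate pairs.

```python
from collections import defaultdict

originals = [
    {'id': 'rec-0-org', 'zip': '4504', 'last_name': 'chappel'},
    {'id': 'rec-1-org', 'zip': '2060', 'last_name': 'murjani'},
]
duplicates = [
    {'id': 'rec-0-dup-0', 'zip': '4504', 'last_name': 'chappel'},
    {'id': 'rec-0-dup-1', 'zip': '4504', 'last_name': 'chap pl'},
    # Invented record: zip has a typo, surname is intact
    {'id': 'rec-1-dup-0', 'zip': '2006', 'last_name': 'murjani'},
]


def block_on(key):
    """Return the (original id, duplicate id) pairs that agree exactly on `key`."""
    index = defaultdict(list)
    for dup in duplicates:
        index[dup[key]].append(dup['id'])
    return {(org['id'], dup_id)
            for org in originals
            for dup_id in index.get(org[key], [])}


# Union of the per-key candidate sets; a set removes repeated pairs automatically
candidate_pairs = block_on('zip') | block_on('last_name')

print(sorted(candidate_pairs))
print(len(candidate_pairs), 'candidate pairs instead of',
      len(originals) * len(duplicates))
```

Blocking on zip alone would miss the invented `rec-1-dup-0`, and blocking on surname alone would miss `rec-0-dup-1`; combining several keys is what keeps the number of lost matches low.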

In [5]:
dfZipMatches = pd.merge(dfOrg, dfDup, how='inner', left_on='zip_org', right_on='zip_match')
dfSurnameMatches = pd.merge(dfOrg, dfDup, how='inner', left_on='last_name_met_org', right_on='last_name_met_match')
#dfFirstMatches = pd.merge(dfOrg, dfDup, how='inner', left_on='first_name_met_org', right_on='first_name_met_match')
#dfFirstLastMatch = pd.merge(dfOrg, dfDup, how='inner', left_on='last_name_met_org', right_on='first_name_met_match')
dfCityMatches = pd.merge(dfOrg, dfDup, how='inner', left_on='city_met_org', right_on='city_met_match')
#dfStateMatches = pd.merge(dfOrg, dfDup, how='left', left_on='state_org', right_on='zip_match')
dfPhoneMatches = pd.merge(dfOrg, dfDup, how='inner', left_on='phone_org', right_on='phone_match')
dfPredicateMatches = pd.merge(dfOrg, dfDup, how='inner'
                                , left_on=['predicateFirstName_org', 'predicateLastName_org']
                                , right_on=['predicateFirstName_match', 'predicateLastName_match'])

In [6]:
matchFrames = [ dfZipMatches
              , dfSurnameMatches
              #, dfFirstMatches 
              #, dfFirstLastMatch
              , dfCityMatches
              #, dfStateMatches
              , dfPhoneMatches
              , dfPredicateMatches
              ]

for m in matchFrames:
    print( m.shape)

(421896, 46)
(689347, 46)
(174402, 46)
(1196104, 46)
(1785915, 46)


In [12]:
dfBlockedMatches = pd.concat(matchFrames)

In [13]:
## Clean up a little and drop duplicates

dfBlockedMatches.fillna('0', inplace=True)
dfBlockedMatches.drop_duplicates(subset=['id_org', 'id_match'], inplace=True)
dfBlockedMatches.shape

(4183821, 46)

In [14]:
dfBlockedMatches['MATCH'] = np.where(dfBlockedMatches[
        'trueRecordID_org']==dfBlockedMatches['trueRecordID_match'], 1, 0)

In [15]:
## How many of the 15,000 matches have we included in the dataset?
sum(dfBlockedMatches.MATCH)

14989
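
The counts reported above can be turned into a few summary figures. This is a small arithmetic sketch; "reduction ratio" and "pair completeness" are common record-linkage terms that the notebook itself does not use, and the variable names are chosen here for illustration. It assumes each of the 15,000 duplicates has exactly one true original.

```python
n_originals = 30_000
n_duplicates = 15_000
n_blocked_pairs = 4_183_821      # rows in dfBlockedMatches after deduplication
n_true_matches_kept = 14_989     # sum of the MATCH column

full_cross = n_originals * n_duplicates  # every original against every duplicate

# Share of the full comparison space that blocking removed
reduction_ratio = 1 - n_blocked_pairs / full_cross
# Share of true matching pairs that survived blocking
pair_completeness = n_true_matches_kept / n_duplicates
# Share of blocked pairs that are true matches (class balance of the target)
positive_rate = n_true_matches_kept / n_blocked_pairs

print(f'full cross product: {full_cross:,}')
print(f'matches lost:       {n_duplicates - n_true_matches_kept}')
print(f'reduction ratio:    {reduction_ratio:.4f}')
print(f'pair completeness:  {pair_completeness:.4f}')
print(f'positive rate:      {positive_rate:.4%}')
```

The last figure is worth keeping in mind for modeling: well under 1% of the blocked pairs are true matches, so the target variable is heavily imbalanced.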

## Feature Engineering

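Next, features are engineered from the edit distance between corresponding fields of the original and the candidate match. This step uses the Damerau-Levenshtein edit distance, which counts the insertions, deletions, substitutions, and adjacent transpositions needed to turn one string into the other. Each distance is then divided by 10 and subtracted from 1, so identical strings score 1.0 and larger distances score lower; distances greater than 10 produce negative values.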
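
To make the distance concrete, here is a pure-Python sketch of the optimal string alignment distance, a restricted form of Damerau-Levenshtein. It is not the jellyfish implementation used in the notebook, and the function name `osa_distance` is hypothetical; the two can disagree on inputs where a transposed pair is edited again, but they agree on the simple examples below, which are taken from the sample rows.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i characters
    for j in range(cols):
        d[0][j] = j          # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition, e.g. 'ab' -> 'ba'
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance('gibbes street', "g'lbbes street"))  # 2
print(osa_distance('preston', 'pres'))                  # 3
print(osa_distance('caitlin', 'cait in'))               # 1
print(osa_distance('ab', 'ba'))                         # 1
```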

In [16]:
dfBlockedMatches['phone_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['phone_org'], x['phone_match']), axis=1)
dfBlockedMatches['last_name_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['last_name_org'], x['last_name_match']), axis=1)
dfBlockedMatches['first_name_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['first_name_org'], x['first_name_match']), axis=1)
dfBlockedMatches['city_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['city_org'], x['city_match']), axis=1)
dfBlockedMatches['state_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['state_org'], x['state_match']), axis=1)
dfBlockedMatches['address_1_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['address_1_org'], x['address_1_match']), axis=1)
dfBlockedMatches['address_1_2_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['address_1_org'], x['address_2_match']), axis=1)
dfBlockedMatches['dob_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['dob_org'], x['dob_match']), axis=1)
dfBlockedMatches['zip_dlev'] = dfBlockedMatches.apply(lambda x: jellyfish.damerau_levenshtein_distance
                                                        (x['zip_org'], x['zip_match']), axis=1)

In [17]:
def scale_invert(x):
    scaled = 1-(x/10)
    return scaled
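
A quick check of how this scaling behaves may be useful. The sketch below restates the function with an explicit `max_distance` parameter (an assumption added for illustration; the notebook hard-codes 10) and adds a hypothetical clipped variant, `scale_invert_clipped`, that is not used in the notebook and is shown only for comparison.

```python
def scale_invert(distance: int, max_distance: int = 10) -> float:
    """Map an edit distance to a similarity-like score; 0 edits -> 1.0."""
    return 1 - distance / max_distance


def scale_invert_clipped(distance: int, max_distance: int = 10) -> float:
    """Hypothetical alternative that keeps the score inside [0, 1]."""
    return max(0.0, scale_invert(distance, max_distance))


for distance in (0, 2, 10, 13):
    print(distance,
          round(scale_invert(distance), 2),
          round(scale_invert_clipped(distance), 2))
```

Without clipping, any pair of strings more than 10 edits apart scores below zero, which is where negative feature values such as -0.3 (a distance of 13) come from.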

In [18]:
dfBlockedMatches['phone_scaled'] = dfBlockedMatches['phone_dlev'].apply(scale_invert)
dfBlockedMatches['last_name_scaled'] = dfBlockedMatches['last_name_dlev'].apply(scale_invert)
dfBlockedMatches['first_name_scaled'] = dfBlockedMatches['first_name_dlev'].apply(scale_invert)
dfBlockedMatches['city_scaled'] = dfBlockedMatches['city_dlev'].apply(scale_invert)
dfBlockedMatches['state_scaled'] = dfBlockedMatches['state_dlev'].apply(scale_invert)
dfBlockedMatches['address_1_scaled'] = dfBlockedMatches['address_1_dlev'].apply(scale_invert)
dfBlockedMatches['address_1_2_scaled'] = dfBlockedMatches['address_1_2_dlev'].apply(scale_invert)
dfBlockedMatches['dob_scaled'] = dfBlockedMatches['dob_dlev'].apply(scale_invert)
dfBlockedMatches['zip_scaled'] = dfBlockedMatches['zip_dlev'].apply(scale_invert)

## Final Dataset

The final dataset is saved as a CSV file, and a 100,000-row sample is generated to attach to the proposal. The final dataset includes the original id, the id from the matching dataset, a "MATCH" indicator that will be the target variable, and the scaled features based on the string edit distances calculated above.

In [19]:
dfFeatures = dfBlockedMatches[[ 'id_org', 'id_match', 'MATCH',
       'phone_scaled', 'last_name_scaled', 'first_name_scaled', 'city_scaled',
       'state_scaled', 'address_1_scaled', 'address_1_2_scaled', 'dob_scaled',
       'zip_scaled']]
dfFeatures.to_csv('dfFeatures.csv')
dfFeatures.sample(n=100000).to_csv('dfFeatures_sample.csv')

In [20]:
dfFeatures.head()

| | id_org | id_match | MATCH | phone_scaled | last_name_scaled | first_name_scaled | city_scaled | state_scaled | address_1_scaled | address_1_2_scaled | dob_scaled | zip_scaled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rec-0-org | rec-0-dup-0 | 1 | 1.0 | 1.0 | 1.0 | 0.7 | 1.0 | 0.8 | -0.3 | 1.0 | 1.0 |
| 1 | rec-0-org | rec-0-dup-1 | 1 | 0.9 | 0.8 | 0.9 | 1.0 | 1.0 | 0.8 | -0.3 | 1.0 | 1.0 |
| 2 | rec-0-org | rec-11006-dup-0 | 0 | 0.2 | 0.3 | 0.4 | 0.0 | 1.0 | -0.1 | 0.0 | 0.5 | 1.0 |
| 3 | rec-0-org | rec-11006-dup-1 | 0 | 0.2 | 0.3 | 0.4 | 0.0 | 1.0 | 0.0 | -0.1 | 0.5 | 1.0 |
| 4 | rec-0-org | rec-12494-dup-1 | 0 | 0.1 | 0.3 | 0.4 | 0.2 | 0.9 | 0.5 | -0.3 | 0.6 | 1.0 |


In [21]:
dfFeatures.shape

(4183821, 12)