# RECORD LINKAGE

----

This notebook provides an introduction to record linkage using Python. Upon completion, you will be able to apply record linkage techniques using the `recordlinkage` package to combine data from different sources in Python. It leads you through all the steps necessary for a successful record linkage, starting with data preparation, including pre-processing, cleaning and standardization of data.

## Table of Contents
- [The Principles of Record Linkage](#The-Principles-of-Record-Linkage)
- [Linking Patents to Grants](#Linking-Patents-to-Grants)
- [Import of Packages](#Import-of-Packages)
- [Loading Grant and Patent Data](#Loading-Grant-and-Patent-Data)
- [The Importance of Pre-Processing](#The-Importance-of-Pre-Processing)
- [Record Linkage](#Record-Linkage)
- [References and Further Readings](#References-and-Further-Readings)

## The Principles of Record Linkage
The goal of record linkage is to determine if pairs of records describe the same entity. For instance, this is important for removing duplicates from a data source or joining two separate data sources together. In various fields, record linkage also goes by the terms data matching, merge/purge, duplicate detection, de-duping, reference matching, entity resolution, disambiguation, and co-reference/anaphora.

There are several approaches to record linkage, including exact matching, rule-based linking and probabilistic linking.

- An example of **exact matching** is joining records based on a direct identifier. This is what we have already done in SQL by joining tables in the last lecture and lab.
- **Rule-based matching** involves applying a cascading set of rules that reflect the domain knowledge of the records being linked.
- In **probabilistic record linkage**, linkage weights are estimated to calculate the probability of a certain match.
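To make the difference between the first two approaches concrete, here is a minimal sketch in plain Python. The record fields (`person_id`, `first`, `last`, `state`) and the rules themselves are hypothetical and chosen only for illustration; real rules would come from your knowledge of the data.

```python
def exact_match(rec_a, rec_b):
    """Exact matching: link only when a shared identifier is present and identical."""
    return rec_a["person_id"] is not None and rec_a["person_id"] == rec_b["person_id"]


def rule_based_match(rec_a, rec_b):
    """Rule-based matching: try a cascade of rules, from strict to more lenient."""
    # Rule 1: the direct identifier agrees
    if exact_match(rec_a, rec_b):
        return True
    # Rule 2: same last name (ignoring case), same state, same first initial
    return (
        rec_a["last"].lower() == rec_b["last"].lower()
        and rec_a["state"] == rec_b["state"]
        and rec_a["first"][:1].lower() == rec_b["first"][:1].lower()
    )


# Hypothetical records without a usable identifier
grant_pi = {"person_id": None, "first": "Robert", "last": "Miller", "state": "NY"}
inventor = {"person_id": None, "first": "R.", "last": "MILLER", "state": "NY"}
other = {"person_id": None, "first": "Bob", "last": "Miller", "state": "IL"}

print(exact_match(grant_pi, inventor))       # no identifier, so no exact match
print(rule_based_match(grant_pi, inventor))  # the fallback rule links the pair
print(rule_based_match(grant_pi, other))     # different state, so no link
```

Probabilistic linkage replaces the yes/no rules with estimated weights per field, which is closer to what we do later in this notebook.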

In practical applications you will need record linkage techniques to combine information about the same entity that is stored in different data sources. Record linkage will also help you to address the quality of different data sources. For example, if one of your databases has missing values, you might be able to fill those by finding an identical pair in a different data source. Overall, the main applications of record linkage are

   1. Merging two or more data files 
   2. Identifying the intersection of the two data sets 
   3. Updating data files (with the data row of the other data files) and imputing missing data
   4. Entity disambiguation and de-duplication

## Linking Patents to Grants

In this notebook we will link patents to data on federal grants. Both data sets are in our class folder. The federal grants contain the name of the PI and the name and location of the institution that the PI is associated with. In our patent data we have the name of the first named inventor on the patent and the name and location of the assignee. Our goal is to identify if there are specific grants related to any patents. We might be able to answer that question by linking these two data sources using the identifiers outlined above. 

## Installation of Packages

The environment has the most commonly used packages installed, so you can import them directly. Other packages might not be installed, so we need to install them before we can import them. In this notebook we will be using the record linkage package, which is not pre-installed. We can use the pip install command to install the package. On your home computer you only have to do this once. As our environment is only active for the current session, we have to do this every time we open the binder.

In [None]:
## Installation of Packages
%pip install recordlinkage 

## Import of Packages
The record linkage package provides us with tools we can use for record linkage so we don't have to start from scratch and code our own linkage algorithms. We need to import the package `recordlinkage` and all the modules we would like to use, in addition to the packages we already know.

In [None]:
# general use imports
import pandas as pd
import numpy as np
import os
import glob

# Machine learning
import sklearn

# record linkage 
import recordlinkage as rl
from recordlinkage.preprocessing import clean, phonenumbers, phonetic

print( "Imports loaded at " + str( pd.Timestamp.now() ) )

## Loading Grant and Patent Data
We first have to prepare both data sets by selecting the relevant information and combining the data for all years into one data frame. We can do this by using a loop to add the files that are located in our data folders.

In [None]:
# My data directory
data_dir = "/home/jovyan/Yandex.Disk/BigDataPubPol/data"
print( "The data directory for the class data is " + data_dir )

### Let's Start with the Patent Data

In [None]:
# We can also switch work directories using python functions (provided by os package)
# Now we are switching into the folder that has the patent data
os.chdir(data_dir + "/patents")

In [None]:
# Generate an empty dataframe that will hold all the patent data we have
patents_1018 = pd.DataFrame([])

# Now loop through each file in the folder that starts with patent
# Read that file using only the columns we need
# And append it to the dataframe that we created above
for counter, file in enumerate(glob.glob("patent*?")):
    print(counter,file)
    patent = pd.read_csv(file, usecols=['patent_number','patent_date', 
                                        'patent_firstnamed_inventor_name_first',
                                        'patent_firstnamed_inventor_name_last',
                                        'patent_firstnamed_assignee_city', 
                                        'patent_firstnamed_assignee_state',
                                        'patent_firstnamed_assignee_organization'])
    patents_1018 = pd.concat([patents_1018, patent], ignore_index=True)

In [None]:
# Lets check if this worked
patents_1018.columns

In [None]:
patents_1018.head()

### Grant Data are next

In [None]:
# Change into directory of grants data
os.chdir(data_dir + "/projects")

In [None]:
# Generate an empty dataframe that will hold all the grant data we have
grants_1018 = pd.DataFrame([])

# Now loop through each CSV file in the folder
# Read that file using only the columns we need
# And append it to the dataframe that we created above
for counter, file in enumerate(glob.glob("*.csv")):
    print(counter,file) 
    grant = pd.read_csv(file, low_memory=False, skipinitialspace=True, 
                        usecols=['PROJECT_ID', 'CONTACT_PI_PROJECT_LEADER', 'ORGANIZATION_NAME',
                                'ORGANIZATION_CITY', 'ORGANIZATION_STATE', 'PROJECT_START_DATE'])
    grants_1018 = pd.concat([grants_1018, grant], ignore_index=True)

In [None]:
# Lets check if this worked
grants_1018.columns

Now that we have one data frame with the grants info and one data frame with the patent info, we can start exploring the identifiers we are planning to use for the linkage. We have to make sure that they are as comparable as possible. This phase is called pre-processing.

## The Importance of Pre-Processing
Data pre-processing is an important step in data analysis projects in general, and in record linkage applications in particular. The goal of pre-processing is to transform messy data into a dataset that can be used in a project workflow.

Linking records from different data sources comes with different challenges that need to be addressed by the analyst. The analyst must determine whether or not two entities (individuals, businesses, geographical units) on two different files are the same. This determination is not always easy. In most cases there is no common, uniquely identifying characteristic for an entity. For example, is Bob Miller from New York the same person as Bob Miller from Chicago in a given dataset? This determination has to be made carefully because the consequences of wrong linkages may be substantial (for example, is person X the same person as the person X on the list of identified terrorists?). Pre-processing can help to make better informed decisions.

Pre-processing can be difficult because there are a lot of things to keep in mind. For example, data input errors, such as typos, misspellings, truncation, abbreviations, and missing values need to be corrected. Literature shows that preprocessing can improve matches. In some situations, 90% of the improvement in matching efficiency may be due to preprocessing. The most common reason why matching projects fail is lack of time and resources for data cleaning. 

In the following we will walk you through some pre-processing steps. These include, but are not limited to, removing spaces, parsing fields, and standardizing strings.

### Parsing String Variables

By default, the `split` method returns a list of strings obtained by splitting the original string on whitespace; you can also pass a different separator, such as a comma. The record linkage package comes with a built-in cleaning function we can also use. In addition, we can extract information from strings, for example by using regex search commands.
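As a small illustration of how `split` behaves, here is a sketch on made-up name strings (the sample values are hypothetical and not taken from the class data).

```python
# Hypothetical sample strings, for illustration only
full_name = "Bob  James   Miller"
author = "Miller, Bob James"

# With no argument, split() breaks on any run of whitespace
parts = full_name.split()

# With an explicit separator, split() breaks only on that separator
last_first = author.split(",")

# The piece after the comma still carries its leading space
first_names = last_first[1].strip()

print(parts)
print(last_first)
print(first_names)
```

Note that splitting on a comma keeps the surrounding whitespace, which is why a `strip()` usually follows.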

### Regular Expressions - regex
When defining a regular expression search pattern, it is a good idea to start out by writing down, explicitly, in plain English, what you are trying to search for and exactly how you identify when you've found a match.
For example, if we look at an author field formatted as "<last_name> , <first_name> <middle_name>", in plain English, this is how I would explain where to find the last name: "starting from the beginning of the line, take all the characters until you see a comma."

We can build a regular expression that captures this idea from the following components:
- ^ Matches beginning of the line
- . Matches any character
- \+ A modifier that means "match one or more of the preceding expression"

In a regular expression, there are special reserved characters and character classes like those in the list above. Anything that is not a special character or class is just looked for explicitly (for example, a comma is not a special character in regular expressions, so if it is in a regular expression pattern, the regular expression processor will just be looking for a comma in the string, at that point in the pattern).

Note: if you want to actually look for one of these reserved characters, it must be escaped, so that, for example, the expression looks for a literal period, rather than the special regular expression meaning of a period. To escape a reserved character in a regular expression, precede it with a backslash ( `\.` ).

Putting the components above together results in the regular expression: `^.+,`

We start at the beginning of the line ( "^" ), matching any characters ( ".+" ) until we come to the literal character of a comma ( "," ).

In Python, to use a regular expression like this to search for matches in a given string, we use the built-in "re" module ( https://docs.python.org/2/library/re.html ), specifically the "re.search()" function. To use "re.search()", pass it first the regular expression you want to use to search, enclosed in quotation marks, and then the string you want to search within. 
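The following sketch applies the pattern `^.+,` with `re.search()` to a made-up author string. The helper name `extract_last_name` and the sample values are assumptions for illustration only.

```python
import re


def extract_last_name(author):
    """Return the text before the comma, or None if the pattern does not match."""
    match = re.search(r"^.+,", author)
    if match is None:
        return None
    # group(0) is the whole matched text, including the comma itself
    matched_text = match.group(0)
    # Drop the trailing comma and any surrounding whitespace
    return matched_text.rstrip(",").strip()


print(extract_last_name("Miller , Bob James"))
print(extract_last_name("no comma in this string"))
```

Keep in mind that `.+` is greedy: if the string contains several commas, the match runs up to the last one. A character class such as `[^,]+` (anything except a comma) is one way to stop at the first comma instead.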

#### REGEX CHEATSHEET


    - abc...     Letters
    - 123...     Digits
    - \d         Any Digit
    - \D         Any non-Digit Character
    - .          Any Character
    - \.         Period
    - [abc]      Only a, b or c
    - [^abc]     Not a, b or c
    - [a-z]      Characters a to z
    - [0-9]      Numbers 0 to 9
    - \w         Any alphanumeric character
    - \W         Any non-alphanumeric character
    - {m}        m Repetitions
    - {m,n}      m to n repetitions
    - *          Zero or more repetitions
    - +          One or more repetitions
    - ?          Optional Character
    - \s         any Whitespace
    - \S         any non-Whitespace character
    - ^...$      Starts & Ends
    - (...)      Capture Group
    - (a(bc))    Capture sub-Group
    - (.*)       Capture All
    - (abc|def)  Capture abc or def
     
#### EXAMPLES
    - \d\d\D will match 22X, 23G, 56H, etc...
    - \w will match any single character in 0-9, a-z, A-Z or the underscore
    - \w{1,3} will match a run of one to three alphanumeric characters
    - (spell|spells) will match spell or spells
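A quick way to check patterns like these is to try them with `re.fullmatch()`, which only succeeds if the whole string fits the pattern. The test strings below are made up for illustration.

```python
import re

checks = {
    # Two digits followed by one non-digit character
    "two digits then a non-digit": bool(re.fullmatch(r"\d\d\D", "22X")),
    # Between one and three word characters
    "one to three word characters": bool(re.fullmatch(r"\w{1,3}", "ab1")),
    # Four word characters are one too many for {1,3}
    "four word characters": bool(re.fullmatch(r"\w{1,3}", "abcd")),
    # Alternation inside a capture group
    "spell or spells": bool(re.fullmatch(r"(spell|spells)", "spells")),
    # A character class lists the allowed characters without separators
    "characters from [abc]": re.findall(r"[abc]", "a,b,d"),
}

for description, result in checks.items():
    print(description, "->", result)
```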

#### Clean Patent Data
Now we will clean and preprocess the patent data. We want to check and standardize the names, make sure the location information is valid, look into standardizing organization names, and generate a year variable.

Let's start with the names and first look at how they are recorded. We have first and last name in two different variables.

In [None]:
patents_1018['patent_firstnamed_inventor_name_first'].unique().tolist()[50:100]

Some entries have only a first name, some have a first and middle name or initial, and some contain hyphens. So we need to create one variable, `name_first`, that only contains the first name. 

In [None]:
patents_1018['patent_firstnamed_inventor_name_last'].unique().tolist()[50:100]

The last name column looks better. We just need to make sure that all letters are lowercase.

The record linkage package comes with a built-in cleaning function we can use. The `clean()` function removes special characters such as `-`, `.`, `/`, `\`, `:` and brackets of all types, and also lowercases by default.
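To give a feel for what this kind of cleaning does, here is a rough approximation written with the standard `re` module. It is an illustrative sketch only: the function name `rough_clean` is hypothetical, and the exact rules and defaults of `recordlinkage.preprocessing.clean()` may differ, so check the package documentation for details.

```python
import re


def rough_clean(text):
    """Approximate a typical name-cleaning pipeline (not the recordlinkage implementation)."""
    text = text.lower()
    # Remove anything inside round, square or curly brackets
    text = re.sub(r"\[.*?\]|\(.*?\)|\{.*?\}", "", text)
    # Drop characters that are not letters, digits, spaces, hyphens or underscores
    text = re.sub(r"[^ \-_a-z0-9]+", "", text)
    # Treat hyphens and underscores as word separators
    text = re.sub(r"[-_]", " ", text)
    # Collapse repeated whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(rough_clean("O'Neil-Smith (Jr.)"))
print(rough_clean("  Mary  J. "))
```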

In [None]:
# remove special characters in names and make them lowercase
patents_1018['name_last']=(clean(patents_1018['patent_firstnamed_inventor_name_last'], 
                                 lowercase=True, remove_brackets=True))
patents_1018['name_first']=(clean(patents_1018['patent_firstnamed_inventor_name_first'], 
                                  lowercase=True, remove_brackets=True))

In [None]:
# Compare the original names with the cleaned ones
patents_1018[['patent_firstnamed_inventor_name_last','patent_firstnamed_inventor_name_first', 
              'name_last','name_first']].head(10)

In [None]:
# Only keep the first name by splitting the name string
# And grabbing the first element (at position 0)
patents_1018['name_first'] = patents_1018.name_first.str.split(' ').str.get(0)

In [None]:
patents_1018['name_middle'] = patents_1018.name_first.str.split(' ')

In [None]:
patent = 'patent_{}.csv'
patent.format('2014')

In [None]:
patents_1018.head(20)

In [None]:
patents_1018[['patent_firstnamed_inventor_name_last','patent_firstnamed_inventor_name_first', 
              'name_last','name_first']].head(10)

Now let's look at the location data. We have state and city. A first quality check is to see if all the entries are valid entries.

In [None]:
# Lets look at the unique values of the state variable
patents_1018['patent_firstnamed_assignee_state'].unique()

It seems that states are recorded as uppercase abbreviations. We also have missing values, and some abbreviations that might not be mainland US. We can count the entries to check.

In [None]:
# We can use the nunique() function to count the number of unique values
(print("We have " + str(patents_1018['patent_firstnamed_assignee_state'].nunique()) + 
       " States in our data when we use nunique()"))

# We can also use a different way to get to this information by getting the length 
# (number of elements) of our selection
(print("We have " + str(len(patents_1018['patent_firstnamed_assignee_state'].unique())) + 
       " States in our data when we use len()"))

Why do we have a difference here? 
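One likely explanation is the treatment of missing values: `nunique()` ignores them by default, while `unique()` returns the missing value as one of its entries. The toy Series below (made-up values, not the class data) shows the effect.

```python
import pandas as pd

# Hypothetical state column with one missing value
states = pd.Series(["CA", "NY", None, "CA"])

# nunique() drops missing values unless told otherwise
n_without_missing = states.nunique()

# unique() keeps the missing value, so the array is one element longer
n_with_missing = len(states.unique())

# dropna=False makes nunique() count the missing value as well
n_counting_missing = states.nunique(dropna=False)

print(n_without_missing, n_with_missing, n_counting_missing)
```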

Our data includes all US territories and DC. We can get a list of abbreviations of these states and check our entries against this list. But this looks good, we don't have to clean much here.

In [None]:
# using an extended list of 57 entries: the 50 states, the District of Columbia, and the major US territories
listUsStates=['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'GU', 'HI', 'IA', 
              'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 
              'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', 'SC', 
              'SD', 'TN', 'TX', 'UM', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY']
len(listUsStates)

In [None]:
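# We can select the rows of our patents data that have a valid state information
# (.copy() gives us an independent data frame, so adding columns later does not trigger a SettingWithCopyWarning)
patents_1018_US = patents_1018.loc[patents_1018['patent_firstnamed_assignee_state'].isin(listUsStates)].copy()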

# Comparing the counts before and after we can see that we lost some rows 
print(len(patents_1018['patent_firstnamed_assignee_state']))
print(len(patents_1018_US['patent_firstnamed_assignee_state']))

# Probably the nans
patents_1018_US['patent_firstnamed_assignee_state'].unique()

Now let's look at the city information and see what the quality looks like. How would we do this?

Next is the Organization name. The names need to be standardized as well. 

In [None]:
patents_1018_US['patent_firstnamed_assignee_organization'].unique().tolist()[150:300]

In [None]:
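# Let's remove the legal form from the organization name
# (dots are escaped so they match a literal period; regex=True is required in recent pandas versions)
patents_1018_US['org_name'] = (patents_1018_US['patent_firstnamed_assignee_organization'].
                               str.replace(r'Inc\.|Inc|INC\.|INC|LLC|LP|Company|Corporation|Ltd\.|N\.A\.', '', regex=True))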

In [None]:
# Lets also clean the string
patents_1018_US['org_name']=clean(patents_1018_US['org_name'], lowercase=True, remove_brackets=True)

In [None]:
patents_1018_US[['org_name','patent_firstnamed_assignee_organization']].head(20)
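A simple alternation like the one above also removes these letter sequences when they appear inside longer words (for example the "LP" in "ALPHA"). One possible refinement, sketched here with the standard `re` module, is to anchor the legal forms at word boundaries. The helper name `strip_legal_form`, the list of legal forms and the sample names are assumptions for illustration.

```python
import re

# Legal forms as whole words only, with an optional trailing period
LEGAL_FORMS = r"\b(?:inc|llc|lp|company|corporation|ltd|n\.a)\b\.?"


def strip_legal_form(name):
    """Remove common legal-form suffixes without touching longer words."""
    cleaned = re.sub(LEGAL_FORMS, "", name, flags=re.IGNORECASE)
    # Remove commas and spaces left dangling at the end
    cleaned = re.sub(r"[,\s]+$", "", cleaned)
    # Collapse any double spaces created by the removal
    return re.sub(r"\s+", " ", cleaned).strip()


for raw in ["Alpha Labs, Inc.", "Alpha Helpline LLC", "Incyte Corporation", "Bank of Examples, N.A."]:
    print(raw, "->", strip_legal_form(raw))
```

The word boundary `\b` is what keeps "Helpline" and "Incyte" intact even though they contain "lp" and "inc".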

As a last step we want to keep only the variables needed and drop missing values and duplicates to improve linkage quality.

In [None]:
patents_to_link = patents_1018_US[['name_first', 'name_last', 'org_name', 
                                   'patent_firstnamed_assignee_state', 
                                   'patent_firstnamed_assignee_city', 'patent_number']].head(20)

In [None]:
# Now we can remove all the duplicates if there are any
patents_to_link = patents_to_link.drop_duplicates()

In [None]:
# Now drop missing values
patents_to_link = patents_to_link.dropna()

In [None]:
# Rename the patent dataset columns
patents_to_link = patents_to_link.rename(columns={'patent_firstnamed_assignee_city':'city'})
patents_to_link = patents_to_link.rename(columns={'patent_firstnamed_assignee_state':'state'})

In [None]:
patents_to_link.head()

Now we are done with the initial data prep work for the patent file. Please keep in mind that we just provided some examples to demonstrate the process. You can add as many further steps to it as necessary. 

#### Clean Grant Data
Now we will clean and preprocess the grants data. We want to make them comparable to the patent data. So we need a field for first name, last name, organization name, city, state, and the grant number.

In [None]:
grants_1018.head(10)

We can see that the names are in one field. There are missings in the City and State. The column names are upper case. It looks like the first entry in the name field is the last name and the one after the comma the first name.

In [None]:
# Some names have titles in them
(grants_1018[grants_1018.CONTACT_PI_PROJECT_LEADER.str.count(',')>1]).head(10)

In [None]:
# To clean these more complicated name problems we can use regex
# The pattern is kept on one line because a regex string cannot be broken with the \ character
title_pattern = r'(?<=[\s,]).?(PH\.?D\.?|MA|MD|MBA|SCD|MSCE|MS|CDFM-A|CCC-SLP|FACS|PHARMD|M\.?P\.?H|MSC|PH\.D\. DABT)(?=,|$)'
has_title = grants_1018.CONTACT_PI_PROJECT_LEADER.str.count(',') > 1
grants_1018.loc[has_title, 'CONTACT_PI_PROJECT_LEADER'] = (
    grants_1018.loc[has_title, 'CONTACT_PI_PROJECT_LEADER'].str.replace(title_pattern, '', regex=True))

# the replace statement leaves behind lots of commas
# this will replace multiple commas with only one
grants_1018['CONTACT_PI_PROJECT_LEADER'] = grants_1018.CONTACT_PI_PROJECT_LEADER.str.replace(',{2,}', ',', regex=True)

# this will remove a comma if it's at the very end of the field (e.g. "Doe, Joe,")
grants_1018['CONTACT_PI_PROJECT_LEADER'] = grants_1018.CONTACT_PI_PROJECT_LEADER.str.replace(',$', '', regex=True)

The regular expression used above is:

```
(?<=[\s,]).?(PH\.?D\.?|MA|MD|MBA|SCD|MSCE|MS|CDFM-A|CCC-SLP|FACS|PHARMD|M\.?P\.?H|MSC|PH\.D\. DABT)(?=,|$)
```

where the first part...
```
(?<=[\s,])
```
...means that all my substrings of interest must be preceded either with a space or a comma


then...
```
.?
```
... optionally matches one character, typically the space between the preceding comma and the title

then...
```
(PH\.?D\.?|MA|MD|MBA|SCD|MSCE|MS|CDFM-A|CCC-SLP|FACS|PHARMD|M\.?P\.?H|MSC|PH\.D\. DABT)
```
... matches all the substrings I'm interested to remove. Here _PH\.?D\.?_ is matching PHD and PH.D and PH.D. , etc. Same with M\.?P\.?H. 

finally...
```
(?=,|$)
```
... means that the substring of interest must be followed by either a comma or the end of the string.

The first and last parts prevent the pattern from erasing parts of names, such as **MA**RIA.
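To see the pattern in action outside of pandas, the sketch below applies the same three replacement steps with the standard `re` module to a few made-up PI names. The helper name `strip_titles` and the sample names are hypothetical.

```python
import re

TITLE_PATTERN = (
    r"(?<=[\s,]).?(PH\.?D\.?|MA|MD|MBA|SCD|MSCE|MS|CDFM-A|CCC-SLP|FACS|PHARMD"
    r"|M\.?P\.?H|MSC|PH\.D\. DABT)(?=,|$)"
)


def strip_titles(name):
    """Remove academic titles, then tidy up the commas that are left behind."""
    name = re.sub(TITLE_PATTERN, "", name)  # drop the titles
    name = re.sub(r",{2,}", ",", name)      # collapse repeated commas
    return re.sub(r",$", "", name)          # drop a trailing comma


for raw in ["DOE, JOHN, PHD", "LEE, ANN, MD, PHD", "SMITH, MARIA"]:
    print(raw, "->", strip_titles(raw))
```

The last example stays unchanged: the "MA" in "MARIA" is followed by another letter, so the lookahead `(?=,|$)` rejects it.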

In [None]:
# Now we can split the PI name in two columns (n=1 splits only at the first comma)
grants_1018['name_first'] = grants_1018.CONTACT_PI_PROJECT_LEADER.str.split(',', n=1).str.get(1)
grants_1018['name_last'] = grants_1018.CONTACT_PI_PROJECT_LEADER.str.split(',', n=1).str.get(0)

In [None]:
# Remove whitespaces
grants_1018['name_first']=grants_1018.name_first.str.strip()
grants_1018['name_last']=grants_1018.name_last.str.strip()

In [None]:
# Lets check what we have
grants_1018[['name_first', 'name_last']].head(20)

In [None]:
# We still have the middle initial in the firstname
grants_1018['name_first'] = grants_1018.name_first.str.split(' ').str.get(0)

In [None]:
# remove special characters in names and make them lowercase
grants_1018['name_last']=(clean(grants_1018['name_last'], 
                                 lowercase=True, remove_brackets=True))
grants_1018['name_first']=(clean(grants_1018['name_first'], 
                                  lowercase=True, remove_brackets=True))

Now let's look at the state to make sure all entries are valid.

In [None]:
# We can select the rows of our grants data that have a valid state information
grants_1018_US = grants_1018.loc[grants_1018['ORGANIZATION_STATE'].isin(listUsStates)]

**You can try here cleaning the organization information**

We've done some basic pre-processing of the data, using some of the very useful functions in `recordlinkage.preprocessing`. Now, let's move on to the actual record linkage portion. Though we can dive right in with comparing two names and checking if they match, this process can actually have a lot of nuance to it. For example, you should consider how long this process will take if you have extremely large datasets, with millions and millions of rows to check against millions and millions of rows. In addition, you should consider how strict you want your matching to be. For example, you want to make sure you catch any typos or common misspellings, but want to avoid relaxing the match condition to the point that anything will match with anything.

In [None]:
# Make all columns lowercase
grants_1018_US.columns = grants_1018_US.columns.str.lower()

In [None]:
grants_1018_US.head()

In [None]:
# keep vars needed
grants_to_link = (grants_1018_US[['name_first', 'name_last', 'organization_name', 'organization_city',
                                  'organization_state', 'project_id']])

In [None]:
# get final data for linkage
# Now we can remove all the duplicates if there are any
grants_to_link = grants_to_link.drop_duplicates()

# Now drop missing values
grants_to_link = grants_to_link.dropna()

# Rename the grant dataset columns
grants_to_link = grants_to_link.rename(columns={'organization_city':'city'})
grants_to_link = grants_to_link.rename(columns={'organization_state':'state'})
grants_to_link = grants_to_link.rename(columns={'organization_name':'org_name'})

In [None]:
grants_to_link.head()

In [None]:
patents_to_link.head()

In [None]:
# Drop duplicates
patents_to_link=patents_to_link.drop_duplicates(['name_first','name_last', 'state'], keep= 'last').reset_index()
grants_to_link=grants_to_link.drop_duplicates(['name_first','name_last', 'state'], keep= 'last').reset_index()

## Record Linkage
The record linkage package is a powerful tool for linking records within a dataset or across multiple datasets. It comes with different built-in distance metrics and comparison functions, and it also allows you to create your own. In general, record linkage is divided into several steps. We've already done the pre-processing.

Sometimes, words or names are recorded differently because they are written down as they sound. This can result in failed matches, because the same institution or individual will technically have different written names, even though the names would sound identical when pronounced out loud. To avoid these issues, we will add one more thing: a phonetic code. A well-known example is Soundex, a phonetic algorithm for indexing names by sound, as pronounced in English.

The `phonetic()` function is used to convert strings into their corresponding phonetic codes. This is particularly useful when comparing names where different possible spellings make it difficult to find exact matches (e.g. Jillian and Gillian).

Let's add a column called `phonetic_first` and `phonetic_last` to our existing data, which will contain the result of applying a `phonetic` function to the person's name (the phonetic transcription of the name). We are using a method called NYSIIS - the New York State Identification and Intelligence System phonetic code. 
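To make the idea of a phonetic code concrete, here is a sketch of the classic Soundex scheme in plain Python. Note that this is not the NYSIIS method used in this notebook and not the implementation inside `recordlinkage`; it only illustrates how differently spelled names can end up with the same code. The function name `soundex` and the sample names are chosen for illustration.

```python
# Map consonants to Soundex digits; vowels, h, w and y get no digit
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return a four-character Soundex code (first letter plus three digits)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    # Pad with zeros and cut to four characters
    return (result + "000")[:4]


for name in ["Robert", "Rupert", "Smith", "Smyth"]:
    print(name, "->", soundex(name))
```

Robert and Rupert share one code, as do Smith and Smyth, so a comparison on the phonetic code would treat each pair as agreeing.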

In [None]:
# Generate phonetic codes (NYSIIS)
grants_to_link["phonetic_first"] = phonetic(grants_to_link["name_first"], method="nysiis")
grants_to_link["phonetic_last"] = phonetic(grants_to_link["name_last"], method="nysiis")

patents_to_link["phonetic_first"] = phonetic(patents_to_link["name_first"], method="nysiis")
patents_to_link["phonetic_last"] = phonetic(patents_to_link["name_last"], method="nysiis")

### Indexing

Indexing allows you to create candidate links, which basically means identifying pairs of data rows which might refer to the same real world entity. This is also called the comparison space (matrix). There are different ways to index data. The easiest is to create a full index and consider every pair a candidate. This is also the least efficient method, because we will be comparing every row of one dataset with every row of the other dataset.

If we had 10,000 records in data frame A and 100,000 records in data frame B, we would have 1,000,000,000 candidate links. You can see that comparing over a full index is getting inefficient when working with big data.

In [None]:
# Let's generate a full index first (comparison table of all possible linkage combinations)
#indexer = rl.Index()
#indexer.full()
#pairs = indexer.index(grants_to_link, patents_to_link)
# Returns a pandas MultiIndex object
## How many records do we have?
#print(len(grants_to_link), len(patents_to_link))

We can do better if we use our knowledge about the data to eliminate bad links from the start. This can be done through blocking. The recordlinkage package gives you multiple options for this. For example, you can block on variables, which means only pairs that are exactly equal on the specified values will be kept. You can also use a neighbourhood index, in which the rows in your dataframe are ranked by some value and only rows that are close by are linked.
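The effect of blocking on the number of candidate pairs can be sketched in plain Python. The record ids and states below are made up, and `block_pairs` is a hypothetical helper, not part of `recordlinkage`.

```python
from collections import defaultdict
from itertools import product

# Hypothetical records: (record id, state)
grants = [("g1", "CA"), ("g2", "NY"), ("g3", "CA")]
patents = [("p1", "CA"), ("p2", "TX"), ("p3", "NY"), ("p4", "CA")]

# Full index: every grant is paired with every patent
full_pairs = [(g_id, p_id) for (g_id, _), (p_id, _) in product(grants, patents)]


def block_pairs(left, right):
    """Keep only pairs whose blocking key (here: the state) is exactly equal."""
    ids_by_key = defaultdict(list)
    for record_id, key in right:
        ids_by_key[key].append(record_id)
    return [(left_id, right_id)
            for left_id, key in left
            for right_id in ids_by_key.get(key, [])]


blocked_pairs = block_pairs(grants, patents)

print(len(full_pairs), "candidate pairs with a full index")
print(len(blocked_pairs), "candidate pairs after blocking on state")
```

The price of blocking is that true matches with a wrong or missing blocking value are never compared, so the blocking variable should be one you trust.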

In [None]:
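# Try and see how this changes when you block on more or fewer variables
# (recent recordlinkage releases use rl.Index().block(); older releases used rl.BlockIndex)
indexerBL = rl.Index()
indexerBL.block('state')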
pairs = indexerBL.index(grants_to_link, patents_to_link)
# Returns a pandas MultiIndex object
print(len(pairs))

### Record Comparison

After you have created a set of candidate links, you're ready to begin comparing the records associated with each candidate link. In recordlinkage you must create a Compare object before performing any comparisons between records. This object collects the comparisons you want to run and contains the methods for defining them; the candidate links and both dataframes are passed in later, when the comparison is computed. The code block below initializes the comparison object.

In [None]:
# Initiate compare object
# The MultiIndex of candidate links and the two datasets are passed in later, when we call compute()
compare_cl = rl.Compare()

Currently there are five specific comparison methods within recordlinkage: Compare.exact(), Compare.string(), Compare.numeric(), Compare.geo(), and Compare.date(). The Compare.exact() method is simple: if two values are an exact match a comparison score of 1 is returned, otherwise 0 is returned. The Compare.string() method is a bit more complicated and generates a score based on well-known string-comparison algorithms (for example Levenshtein or Jaro-Winkler).

For this example, the Jaro-Winkler distance is used (specifically developed with record linkage applications in mind): words with more characters in common have a higher Jaro-Winkler value than those with fewer characters in common. The output value is normalized to fall between 0 (completely dissimilar strings) and 1 (exact match on strings). (Information about other string-comparison methods is included in the References section below.)

As you remember, we already did an exact matching on `state`, when we did the blocking above and created the candidate links.

We need to specify the respective name columns in both datasets, the method, and the threshold. In this case, for all pairs of strings with a Jaro-Winkler similarity of at least 0.70, a 1 will be returned, and otherwise 0.
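For readers who want to see what is behind the score, here is a plain-Python sketch of the Jaro-Winkler similarity and of turning it into a 1/0 agreement flag. It is an illustrative implementation, not the one `recordlinkage` uses internally, so values may differ slightly in edge cases. The function names, the prefix weight of 0.1 and the rule that scores at or above the threshold count as agreement are assumptions stated here for the sketch.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only if they are within this distance
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions
    seq1 = [c for c, used in zip(s1, used1) if used]
    seq2 = [c for c, used in zip(s2, used2) if used]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity with a bonus for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def agree(s1, s2, threshold=0.70):
    """Return 1 if the similarity reaches the threshold, else 0."""
    return 1 if jaro_winkler(s1, s2) >= threshold else 0


print(round(jaro_winkler("martha", "marhta"), 3))  # about 0.961
print(agree("jon", "john"), agree("abc", "xyz"))
```

A threshold of 0.70 is fairly lenient; raising it makes the comparison stricter at the cost of missing more misspelled names.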

In [None]:
compare_cl.string('name_first', 'name_first', method='jarowinkler', threshold=0.70, label='name_first')
compare_cl.string('name_last', 'name_last', method='jarowinkler', threshold=0.70, label='name_last')

compare_cl.string('phonetic_first', 'phonetic_first', method='jarowinkler', threshold=0.70, label='phonetic_first')
compare_cl.string('phonetic_last', 'phonetic_last', method='jarowinkler', threshold=0.70, label='phonetic_last')

compare_cl.exact('state', 'state', label='state')

The comparing of record pairs starts when the `compute` method is called. 

In [None]:
## The comparing of record pairs starts when the compute method is called. 
## All attribute comparisons are stored in a DataFrame with horizontally the features and vertically the record pairs.

features = compare_cl.compute(pairs, grants_to_link, patents_to_link)
features.head()

In [None]:
# Here you can see the matches of the first name only
features[features['name_first'] == 1]

### Classification

Let's check how many record pairs agree on each possible number of comparison attributes.

In [None]:
## Simple Classification: Check for how many attributes records are identical by summing the comparison results.
features.sum(axis=1).value_counts().sort_index(ascending=False)

In [None]:
matches = features[features.sum(axis=1) > 3]
print(len(matches))
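The rule used here is a simple count: a pair is classified as a match when more than three of the five comparisons agree. The same logic can be sketched without pandas on a few made-up comparison results (the pair labels and values below are hypothetical).

```python
# Hypothetical comparison results: one entry per candidate pair,
# with a 1/0 agreement flag for each compared attribute
toy_features = {
    ("grant_0", "patent_3"): {"name_first": 1, "name_last": 1, "phonetic_first": 1,
                              "phonetic_last": 1, "state": 1},
    ("grant_0", "patent_7"): {"name_first": 0, "name_last": 1, "phonetic_first": 0,
                              "phonetic_last": 1, "state": 1},
    ("grant_2", "patent_5"): {"name_first": 1, "name_last": 1, "phonetic_first": 1,
                              "phonetic_last": 0, "state": 1},
}


def classify(features, min_agreements=4):
    """Keep the pairs whose number of agreeing attributes reaches the cutoff."""
    return [pair for pair, flags in features.items()
            if sum(flags.values()) >= min_agreements]


toy_matches = classify(toy_features)
print(toy_matches)
```

Because every attribute counts equally, this is a crude classifier; probabilistic approaches instead weight each attribute by how informative it is.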

Now that we have the list of matches we can fuse our datasets, because in the end we want to have a combined dataset. We do this by pulling the matched rows from each table and combining them.

In [None]:
matches.head()

Now let's merge these matches back to the original dataframes. Our `matches` dataframe has a MultiIndex: two indices to the left which correspond to the `grants` table and the `patents` table respectively. We will pull all corresponding rows from both tables separately.

In [None]:
# For grants (the first data used)
grants_results = []  # Create an empty list

for match in matches.index:  # For every pair in matches (index)
    df = pd.DataFrame(grants_to_link.loc[[match[0]]])  # Get the location in the original table, convert to dataframe
    grants_results.append(df)  # Append to a list
    
grants_concat = pd.concat(grants_results)  # Concatenate list of frames into one

In [None]:
# For patents (the second data used)
patents_results = []  # Create an empty list

for i in matches.index:  # For every pair in matches (index)
    df = pd.DataFrame(patents_to_link.loc[[i[1]]])  # Get the location in the original table, convert to dataframe
    patents_results.append(df)  # Append to a list

patents_concat = pd.concat(patents_results)  # Concatenate into one dataframe

Now we need to combine two tables on the index - notice that our tables right now have indices from the original tables. We can reset the index using `.reset_index()`.

In [None]:
# reset index
grants_concat = grants_concat.reset_index()
patents_concat = patents_concat.reset_index()

In [None]:
grants_concat.head()

In [None]:
patents_concat.head()

Now we concatenate these two tables using `pd.concat()`.

In [None]:
matched = pd.concat([grants_concat,patents_concat],axis=1)  # Specify axis=1 to concatenate horizontally

In [None]:
# And this is our result
matched.head()

## References and Further Readings


### Parsing

* Python online documentation: https://docs.python.org/2/library/string.html#deprecated-string-functions
* Python 2.7 Tutorial(Splitting and Joining Strings): http://www.pitt.edu/~naraehan/python2/split_join.html

### Regular Expression

* Python documentation: https://docs.python.org/2/library/re.html#regular-expression-syntax
* Online regular expression tester (good for learning): http://regex101.com/

### String Comparators

* GitHub page of jellyfish: https://github.com/jamesturk/jellyfish
* Different distances that measure the differences between strings:
    - Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance
    - Damerau–Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    - Jaro–Winkler distance: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
    - Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance
    - Match rating approach: https://en.wikipedia.org/wiki/Match_rating_approach



