## Fuzzy Matching and Fuzzy Pandas
      
[Max Harlow](https://twitter.com/maxharlow), a journalist at the Financial Times, wrote this library `csvmatch`, and he's been adding new algorithms to facilitate fuzzy matching across datasets. He's used it for a bunch of stories, including:
- https://www.theguardian.com/uk-news/2014/jul/09/offshore-tax-dealings-celebrities-sportsmen-leaked-jersey-files
- https://www.theguardian.com/politics/2014/jul/08/offshore-secrets-wealthy-political-donors

Similar techniques have also been used in other stories like:
- https://www.globalwitness.org/en/campaigns/oil-gas-and-mining/myanmarjade
- https://www.irinnews.org/investigation/2016/09/02/exclusive-un-paying-blacklisted-diamond-company-central-african-republic

But, wait, first: 

### What is Fuzzy Matching? 

Automating the look-up for names in documents is [inherently imprecise](https://www.elastic.co/blog/found-fuzzy-search). The computer can't _know_ that different representations of the same _thing_ refer to the same _thing_. For example: 
- _Apple Inc._; _Apple Computer Company_; _Apple Computer, Inc._; and _Apple_ all refer to the fruit company. 
- _Samuel Langhorne Clemens_, _Samuel L. Clemens_, _Samuel Clemens_ and _Mark Twain_ all refer to the same person. 
- _Robert Ford_, _Rob Ford_, and _Robert Frod_ refer to the same person **probably**. 

When you're working with unstructured data, you can't take anything for granted. Least of all, you can't assume that:
- documents will have correct spellings
- first, last, and middle names will exist in all documents
- the abbreviated/shortened names of people won't make an appearance (e.g. Jon instead of Jonathan, Tom instead of Thomas, Phil instead of Philip, etc.) 

So, when you're living in an uncertain world, you try to make things slightly more _certain_ with **Fuzzy Matching**. You might not hit 100 percent, but at least you'll hit more than what you would without fuzzy matching. 

There are multiple algorithms that try to minimise the uncertainty and enable fuzzy matching. The library we are going to look at today incorporates a bunch of these, instead of just doing one thing. 

This notebook's predominantly based on an [awesome NICAR 2019 presentation](https://docs.google.com/presentation/d/1djKgqFbkYDM8fdczFhnEJLwapzmt4RLuEjXkJZpKves/) where Max Harlow (the aforementioned news app developer at the Financial Times) demonstrated [csvmatch](https://github.com/maxharlow/csvmatch). And, then, Soma basically created a library to make it work with Pandas. 

Worth remembering that there are no shortcuts in life, and few panaceas. Depending on the project you're working on, you might be more inclined to use one algorithm or the other. Or, you know, try a few of them and see what happens. And, also, remember: all computational tools you use need to go hand-in-hand with traditional reporting. People share names, there's more than one John Smith, etc. 


In [1]:
# Make sure you `pip install fuzzy_pandas` first. 

import pandas as pd
import fuzzy_pandas as fpd

### A Toy Example

We'll be working with two toy datasets first, just to get going and get an idea as to what's possible. The names of the files are not terribly imaginative: `data1.csv` and `data2.csv`. And, they both contain structured data: names, code names and locations of characters from John le Carré's spy thriller: Tinker Tailor Soldier Spy. 

Right, let's have a look. 


In [2]:
df1 = pd.read_csv("sources/data1.csv")
df2 = pd.read_csv("sources/data2.csv")

In [3]:
df1

Unnamed: 0,name,location,codename
0,George Smiley,London,Beggerman
1,Percy Alleline,London,Tinker
2,Roy Bland,London,Soldier
3,Toby Esterhase,Vienna,Poorman
4,Peter Guillam,Brixton,none
5,Bill Haydon,London,Tailor
6,Oliver Lacon,London,none
7,Jim Prideaux,Slovakia,none
8,Connie Sachs,Oxford,none


In [4]:
df2

Unnamed: 0,Person Name,Location
0,Maria Andreyevna Ostrakova,Russia
1,Otto Leipzig,Estonia
2,George SMILEY,London
3,Peter Guillam,Brixton
4,Konny Saks,Oxford
5,Saul Enderby,London
6,Sam Collins,Vietnam
7,Tony Esterhase,Vienna
8,Claus Kretzschmar,Hamburg


### Exact matches

We start with doing "exact matches", i.e. both tables should have the exact same name. Capitalisation matters, accents matter. With this function, for example:
- John le Carre will not match with John le Carré
- George SMILEY will not match with George Smiley
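To make those two bullets concrete, here's a tiny sketch that uses nothing but Python's `==`, which is all an exact match amounts to. The third pair is only there to show what a passing comparison looks like; none of this is `fuzzy_pandas` code.

```python
# Exact matching boils down to string equality: every character must be identical.
pairs = [
    ("John le Carre", "John le Carré"),  # differs by an accent
    ("George SMILEY", "George Smiley"),  # differs by capitalisation
    ("Jim Prideaux", "Jim Prideaux"),    # identical
]

exact = [left == right for left, right in pairs]

for (left, right), same in zip(pairs, exact):
    print(f"{left!r} vs {right!r}: {'match' if same else 'no match'}")
```

Only the last pair comes back as a match.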

Based on what you see in the data frames above, how many matches do you expect? 

In [5]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name')

Unnamed: 0,name,location,codename,Person Name,Location
0,Peter Guillam,Brixton,none,Peter Guillam,Brixton


Right, so, we only find one match as expected. But, are there any other matches that a _smarter_ algorithm could find? Let's try something called **Levenshtein**, a nifty, simple algorithm that's pretty common. It's the basis for a bunch of spellcheck algorithms, amongst other things. It counts the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one input into the other, and if that _distance_ is small enough, it treats the two words as the same. 

For example, in the above two data frames, you have Toby Esterhase and Tony Esterhase, which means the Levenshtein distance is 1 (The 'b' v. 'n' in To(b,n)y.). 
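If you want to see the counting for yourself, here's a minimal pure-Python sketch of the textbook Levenshtein distance (insertions, deletions and substitutions each cost 1). The function name `levenshtein` is made up for this illustration; it is not the code `csvmatch` or `fuzzy_pandas` runs.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; we only ever need the last two rows.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Toby Esterhase", "Tony Esterhase"))  # 1
print(levenshtein("Robert Ford", "Robert Frod"))        # 2
```

Note that the swapped letters in "Frod" cost two edits here, because plain Levenshtein has no notion of a transposition.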

In [6]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method='levenshtein')

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Toby Esterhase,Vienna,Poorman,Tony Esterhase,Vienna
2,Peter Guillam,Brixton,none,Peter Guillam,Brixton


The other thing you'll notice above is that, by default, the _Levenshtein_ algorithm doesn't care about case. 

However, are we still missing potential matches? 

When we work with any algorithms, we need a confidence threshold that we decide on. By default, the `csvmatch` algorithm has a `threshold` of 0.6, i.e. only if the algorithm returns a match score greater than or equal to 0.6 will it return a match. 

The score, in this case, is calculated using the below formula: 

> `1 - (distance / max(len(value1), len(value2)))`

We can be slightly more lenient with the threshold (lowering it lets weaker matches through), and we get a Brand New Result in our output. 
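Here's a rough sketch of that score, reading the denominator as the length of the longer string (an assumption on my part, though it agrees with the `Évry`/`Every` arithmetic further down). `levenshtein` and `score` are illustrative names, and the strings are compared as-is, with no clean-up of case or word order.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def score(a: str, b: str) -> float:
    """1.0 means identical; the score falls as the edit distance grows."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are trivially identical
    return 1 - levenshtein(a, b) / longest


pairs = [("Toby Esterhase", "Tony Esterhase"), ("Connie Sachs", "Konny Saks")]
for left, right in pairs:
    s = score(left, right)
    print(f"{left} / {right}: score={s:.3f} "
          f"passes 0.6? {s >= 0.6} passes 0.55? {s >= 0.55}")
```

Toby/Tony scores about 0.93 and clears either threshold. Connie Sachs/Konny Saks needs five edits over twelve characters, so it scores about 0.58: under 0.6, but over 0.55.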

In [7]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method='levenshtein', threshold=0.55)

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Toby Esterhase,Vienna,Poorman,Tony Esterhase,Vienna
2,Peter Guillam,Brixton,none,Peter Guillam,Brixton
3,Connie Sachs,Oxford,none,Konny Saks,Oxford


This is _cool_. By lowering the threshold, we found another match: Connie Sachs and Konny Saks, two spellings that sound the same but sit a few edits apart. **But, what could be cooler?**

In [8]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method='metaphone')

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Peter Guillam,Brixton,none,Peter Guillam,Brixton
2,Connie Sachs,Oxford,none,Konny Saks,Oxford


The **metaphone** algorithm does phonetic matching, and gives you results based on that. 

Note: In theory, the documentation says that you can combine a couple of these algorithms if you're so inclined. But, it looks like when you combine two algorithms, it doesn't _quite_ work. ¯\_(ツ)_/¯

In [9]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method=['levenshtein', 'metaphone'])

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Toby Esterhase,Vienna,Poorman,Tony Esterhase,Vienna
2,Peter Guillam,Brixton,none,Peter Guillam,Brixton


In [10]:
## swap the methods around and then look at the results, too.
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method=['metaphone','levenshtein'])

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Peter Guillam,Brixton,none,Peter Guillam,Brixton
2,Connie Sachs,Oxford,none,Konny Saks,Oxford


What do you think is happening here? 

This is important—you're often going to be using tools built by other folks, but where there's code, there are bugs. You should make sure that you play with the tool a bit to make sure it's doing _exactly_ what you think it's doing. And, if it's not, you know where it falls short. 

### Less Fictional Datasets

We are going to be using the same datasets Max Harlow used for this exercise. As he explains in his presentation [here](https://docs.google.com/presentation/d/1djKgqFbkYDM8fdczFhnEJLwapzmt4RLuEjXkJZpKves/edit#slide=id.g3512a0ce6b_1_22), there are a bunch of files: 
- a list of world billionaires published by Bloomberg
- a similar list published by Forbes
- a list also published by Forbes that only includes Chinese individuals
- a list published by the CIA of chiefs of state and cabinet members of foreign governments
- a list of all the people that attended the World Economic Forum conference in Davos this year
- a list of all the people and companies that have been sanctioned by the United Nations


In [11]:
## Read in the two billionaire lists (Forbes + Bloomberg)

forbes_df = pd.read_csv("sources/forbes-billionaires.csv")
bloom_df = pd.read_csv("sources/bloomberg-billionaires.csv")

Can you find out how many billionaires appear in both lists (exact matching)? 

In [12]:
forbes_df.sample(30)

Unnamed: 0,name,lastName,uri,imageUri,worthChange,source,industry,gender,country,timestamp,realTimeWorth,realTimeRank,realTimePosition,squareImage
635,Liu Yonghao,Liu,liu-yonghao,liu-yonghao,12.684,agribusiness,Service,M,China,1547574901333,4511.094,409.0,409.0,//specials-images.forbesimg.com/imageserve/5bc...
1929,Djoko Susanto,Susanto,djoko-susanto,djoko-susanto,-13.901,supermarkets,Fashion & Retail,M,Indonesia,1547574901334,1400.306,1574.0,1574.0,//specials-images.forbesimg.com/imageserve/5a1...
1042,Marc Rowan,Rowan,marc-rowan,marc-rowan,33.56,private equity,Finance and Investments,M,United States,1547574901333,2992.427,727.0,727.0,//specials-images.forbesimg.com/imageserve/55f...
1388,Abhay Firodia,Firodia,abhay-firodia,abhay-firodia,19.893,automobiles,Automotive,M,India,1547574901334,1944.777,1181.0,1181.0,//specials-images.forbesimg.com/imageserve/58c...
348,Pallonji Mistry,Mistry,pallonji-mistry,pallonji-mistry,334.23,construction,Construction & Engineering,M,Ireland,1547575201866,14204.367,80.0,80.0,//specials-images.forbesimg.com/imageserve/222...
1991,Archie Hwang,Hwang,archie-hwang,no-pic,0.0,semiconductors,Technology,M,Taiwan,1547574901334,1334.262,1631.0,1631.0,//specials-images.forbesimg.com/imageserve/5be...
873,Wang Yusuo,Wang,wang-yusuo,wang-yusuo,1.71,natural gas distribution,Energy,M,China,1547574901333,4204.322,454.0,454.0,//specials-images.forbesimg.com/imageserve/5bc...
1692,Jiang Xuefei,Jiang,jiang-xuefei,no-pic,1.539,printed circuit boards,Technology,M,China,1547574901334,1154.201,1825.0,1825.0,
405,Graeme Hart,Hart,graeme-hart,graeme-hart,0.0,investments,Finance and Investments,M,New Zealand,1547575201867,8729.149,159.0,159.0,//specials-images.forbesimg.com/imageserve/5a7...
2002,Lev Leviev,Leviev,lev-leviev,lev-leviev,2.043,diamonds,Metals & Mining,M,Israel,1547574901334,1023.188,1973.0,1973.0,//specials-images.forbesimg.com/imageserve/5a7...


In [13]:
bloom_df.sample(30)

Unnamed: 0,Rank,Name,Total_net_worth,Country,Industry
366,367,Gwendolyn Sontheim Meyer,$4.58B,United States,Commodities
264,265,Natie Kirsh,$5.91B,South Africa,Food & Beverage
260,261,Tsai Eng-Meng,$5.98B,Taiwan,Food & Beverage
223,224,Alexey Kuzmichev,$6.50B,Russian Federation,Diversified
393,394,Anders Holch Povlsen,$4.38B,Denmark,Retail
137,138,Lei Jun,$9.77B,China,Technology
166,167,Ma Jianrong,$8.22B,China,Consumer
160,161,Pam Mars-Wright,$8.37B,United States,Food & Beverage
411,412,Micky Jagtiani,$4.25B,India,Retail
370,371,Herbert Johnson III,$4.54B,United States,Consumer


In [14]:
billionaires_df = fpd.fuzzy_merge(forbes_df, bloom_df, left_on='name', right_on='Name')
billionaires_df.head()

Unnamed: 0,name,lastName,uri,imageUri,worthChange,source,industry,gender,country,timestamp,realTimeWorth,realTimeRank,realTimePosition,squareImage,Rank,Name,Total_net_worth,Country,Industry
0,Alexander Otto,Otto,alexander-otto,no-pic,2.12,real estate,Real Estate,M,Germany,1547575201867,10821.927,126.0,126.0,//specials-images.forbesimg.com/imageserve/5a7...,323,Alexander Otto,$4.94B,Germany,Real Estate
1,Ben Ashkenazy,Ashkenazy,ben-ashkenazy,no-pic,0.0,real estate,Real Estate,M,United States,1547574901333,4000.0,499.0,499.0,//specials-images.forbesimg.com/imageserve/59e...,447,Ben Ashkenazy,$4.05B,United States,Real Estate
2,Giovanni Ferrero,Ferrero,giovanni-ferrero,no-pic,0.0,"Nutella, chocolates",Food and Beverage,M,Italy,1547575201866,22673.165,38.0,38.0,//specials-images.forbesimg.com/imageserve/5b1...,33,Giovanni Ferrero,$22.6B,Italy,Food & Beverage
3,Henry Cheng,Cheng,henry-cheng-1,no-pic,3.542,property,Diversified,M,Hong Kong,1547574901334,1334.282,1630.0,1630.0,//specials-images.forbesimg.com/imageserve/5a7...,79,Henry Cheng,$14.1B,Hong Kong,Retail
4,Henry Laufer,Laufer,henry-laufer,no-pic,0.0,hedge funds,Finance and Investments,M,United States,1547574901333,2000.0,1141.0,1142.0,,463,Henry Laufer,$3.95B,United States,Finance


Now, can you find the ones where the ranks aren't the same across the two datasets? What about the ones that are the same?

In [15]:
billionaires_df[billionaires_df.realTimeRank == billionaires_df.Rank].shape

(12, 19)

### Fuzzy matching with non-fictional data

In the above couple of cells, we've conducted "exact matching", i.e. the equivalent of you running a `Cmd+F`/`Ctrl+F` on your text editor. But, this is _almost_ worse as it's case sensitive, i.e. "Tom" and "tom" are treated differently. 

We've gone through some of this already, but what are the things we can ignore when it comes to name-matching? Harlow, in his presentation, identified:
- case
- title (Mr., Mrs., etc.)
- non-latin characters (é, å, ß, etc.)
- the order of the names
- non-alphanumerics (e.g. hyphenated names)

Now, you don't _have to_ ignore _anything_, but sometimes, it might make your life far easier. Other times, you'll end up with false positives and whatnot. 

The library `csvmatch`, and by extension `fuzzy_pandas`, supports a bunch of the above parameters, which you can just pass into the function. Passing in a bunch of these parameters would allow you to go from `Orbán, Viktor` to `Viktor Orban`, which is quite useful. (Again, the example's from Harlow's slides.)
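To get a feel for what that kind of clean-up involves, here's a standard-library sketch that lowercases, strips accents, drops punctuation and titles, and sorts the words. It's an illustration only: `normalise` and the `TITLES` set are hypothetical, and the libraries' own implementations may well differ.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of titles to drop.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}


def normalise(name: str) -> str:
    """Reduce a name to a lowercase, accent-free, order-independent key."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and keep only runs of letters and digits (this drops commas, dots, hyphens).
    words = re.findall(r"[a-z0-9]+", stripped.lower())
    # Remove titles and make word order irrelevant.
    return " ".join(sorted(word for word in words if word not in TITLES))


print(normalise("Orbán, Viktor"))     # orban viktor
print(normalise("Mr. Viktor ORBAN"))  # orban viktor
```

Once both sides are reduced to the same key, an ordinary exact match does the rest.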

For this bit, we'll move on to two of the other datasets: `cia-world-leaders.csv` and `davos-attendees-2019.csv`. As always, read in the data and figure out which columns the exact match should run on. 

In [16]:
cia_world_leaders = pd.read_csv('sources/cia-world-leaders.csv')
davos_attendees = pd.read_csv('sources/davos-attendees-2019.csv')
print(f"Our CIA World Leaders df has these columns: {cia_world_leaders.columns} \
      \n The Davos attendees have these: {davos_attendees.columns}")

Our CIA World Leaders df has these columns: Index(['country', 'role', 'name'], dtype='object')       
 The Davos attendees have these: Index(['full_name', 'position_short_name', 'org_name', 'org_country'], dtype='object')


In [17]:
cia_world_leaders.sort_values('name').head(20)

Unnamed: 0,country,role,name
384,Bangladesh,Min. of Foreign Affairs,A. H. Mahmood ALI
499,Belize,"Governor, Central Bank",A. Joy GRANT
4721,Somalia,Min. of Public Works & Reconstruction,"ABAS Abdullahi Sheikh ""Siraji"""
4451,Saudi Arabia,Min. of Interior,ABD AL-AZIZ bin Saud bin Nayif bin Abd al-Aziz...
2625,Jordan,King,ABDALLAH II
4193,Qatar,Deputy Amir,ABDALLAH bin Hamad Al Thani
4213,Qatar,Prime Min.,ABDALLAH bin Nasir bin Khalifa Al Thani
4205,Qatar,Min. of Interior,ABDALLAH bin Nasir bin Khalifa Al Thani
4194,Qatar,"Governor, Qatar Central Bank",ABDALLAH bin Saud Al Thani
5482,United Arab Emirates,Min. of Foreign Affairs and International Coop...,ABDALLAH bin Zayid Al Nuhayyan


In [18]:
davos_attendees.sort_values('full_name').head(20)

Unnamed: 0,full_name,position_short_name,org_name,org_country
1864,Aaron Karczmer,"Executive Vice-President; Chief Risk, Complian...",PayPal,USA
2532,Aaron Motsoaledi,Minister of Health of South Africa,Ministry of Health of South Africa,South Africa
422,Aarthi Subramanian,"Executive Director, Board of Directors",Tata Consultancy Services,India
2074,Abdelkader Messahel,Minister of Foreign Affairs of Algeria,Ministry of Foreign Affairs of Algeria,Algeria
897,Abdulaziz Al Judaimi,"Senior Vice-President, Downstream",Saudi Aramco,Saudi Arabia
888,Abdulaziz Al Subeaei,Chairman,Jabal Omar Development Company,Saudi Arabia
35,Abdulaziz Al-Helaissi,Group Chief Executive Officer (GIB),Gulf International Bank BSC,Bahrain
903,Abdulaziz Al-Jarbou,Chairman of the Board,Saudi Basic Industries Corporation,Saudi Arabia
2641,Abdulla Al Basti,"General Secretary, Executive Council of Dubai,...",Executive Council of Dubai,United Arab Emirates
809,Abdulla Al Khalifa,Chief Executive Officer,Qatar National Bank Q.P.S.C.,Qatar


In [19]:
# Let's start by seeing what an exact match would look like. 
davos_df = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', right_on='full_name')
davos_df.shape

(0, 7)

Apparently, there are no matches. What do you reckon? What _should_ the overlap between CIA's list of world leaders and Davos attendees be? 

Let's try some _fuzzy matching_, first by simply ignoring case. 

In [20]:
davos_df = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', right_on='full_name', ignore_case=True)

In [21]:
print(davos_df.shape)
davos_df

(119, 7)


Unnamed: 0,country,role,name,full_name,position_short_name,org_name,org_country
0,Algeria,Min. of Foreign Affairs & International Cooper...,Abdelkader MESSAHEL,Abdelkader Messahel,Minister of Foreign Affairs of Algeria,Ministry of Foreign Affairs of Algeria,Algeria
1,Argentina,Min. of Production & Work,Dante SICA,Dante Sica,Minister of Industry and Labour of Argentina,Ministry of Industry and Labour of Argentina,Argentina
2,Argentina,"Pres., Central Bank",Guido SANDLERIS,Guido Sandleris,Governor of the Central Bank of Argentina,Central Bank of Argentina,Argentina
3,Armenia,Prime Min.,Nikol PASHINYAN,Nikol Pashinyan,Prime Minister of the Republic of Armenia,Office of the Prime Minister of the Republic o...,Armenia
4,Australia,Min. for Defense Industry,Steven CIOBO,Steven Ciobo,Minister of Defence Industry of Australia,Department of Defence of Australia,Australia
5,Australia,"Min. for Trade, Investment, & Tourism",Simon BIRMINGHAM,Simon Birmingham,"Minister for Trade, Tourism and Investment of ...",Department of Foreign Affairs and Trade of Aus...,Australia
6,Austria,Chancellor,Sebastian KURZ,Sebastian Kurz,Federal Chancellor of Austria,Office of the Federal Chancellor of Austria,Austria
7,Austria,"Min. for Europe, Integration, & Foreign Affairs",Karin KNEISSL,Karin Kneissl,"Federal Minister for Europe, Integration and F...","Federal Ministry for Europe, Integration and F...",Austria
8,Azerbaijan,Pres.,Ilham ALIYEV,Ilham Aliyev,President of the Republic of Azerbaijan,Administration of the President of the Republi...,Azerbaijan
9,Belgium,Dep. Prime Min.,Alexander DE CROO,Alexander De Croo,Deputy Prime Minister and Minister of Finance ...,"Ministry of Foreign Affairs, Foreign Trade and...",Belgium


OK, we have more matches, but this is also pretty boring. There's nothing super-smart about ignoring case to get matches. Your word processors have been doing that for _decades_. 

But, now, let's start adding some of our other parameters discussed above, and see what happens.

In [22]:
davos_df = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', right_on='full_name', 
                           ignore_case=True, 
                           ignore_order_words=True,
                           ignore_nonalpha=True,
                           ignore_titles=True
                          )

In [23]:
print(davos_df.shape)
davos_df

(121, 7)


Unnamed: 0,country,role,name,full_name,position_short_name,org_name,org_country
0,Algeria,Min. of Foreign Affairs & International Cooper...,Abdelkader MESSAHEL,Abdelkader Messahel,Minister of Foreign Affairs of Algeria,Ministry of Foreign Affairs of Algeria,Algeria
1,Argentina,Min. of Production & Work,Dante SICA,Dante Sica,Minister of Industry and Labour of Argentina,Ministry of Industry and Labour of Argentina,Argentina
2,Argentina,"Pres., Central Bank",Guido SANDLERIS,Guido Sandleris,Governor of the Central Bank of Argentina,Central Bank of Argentina,Argentina
3,Armenia,Prime Min.,Nikol PASHINYAN,Nikol Pashinyan,Prime Minister of the Republic of Armenia,Office of the Prime Minister of the Republic o...,Armenia
4,Australia,Min. for Defense Industry,Steven CIOBO,Steven Ciobo,Minister of Defence Industry of Australia,Department of Defence of Australia,Australia
5,Australia,"Min. for Trade, Investment, & Tourism",Simon BIRMINGHAM,Simon Birmingham,"Minister for Trade, Tourism and Investment of ...",Department of Foreign Affairs and Trade of Aus...,Australia
6,Austria,Chancellor,Sebastian KURZ,Sebastian Kurz,Federal Chancellor of Austria,Office of the Federal Chancellor of Austria,Austria
7,Austria,"Min. for Europe, Integration, & Foreign Affairs",Karin KNEISSL,Karin Kneissl,"Federal Minister for Europe, Integration and F...","Federal Ministry for Europe, Integration and F...",Austria
8,Azerbaijan,Pres.,Ilham ALIYEV,Ilham Aliyev,President of the Republic of Azerbaijan,Administration of the President of the Republi...,Azerbaijan
9,Belgium,Dep. Prime Min.,Alexander DE CROO,Alexander De Croo,Deputy Prime Minister and Minister of Finance ...,"Ministry of Foreign Affairs, Foreign Trade and...",Belgium


Right, 2 more results. Baby steps, but at least steps in the right direction. Now, let's put one of the more _intelligent_ algorithms in place. This one is named after a Russian mathematician: Levenshtein. 

All *Levenshtein* does is count how many single-character edits it takes to turn one input into the other. For example:

In [24]:
# Import from the package's public namespace rather than its private `_jellyfish` module.
from jellyfish import damerau_levenshtein_distance

damerau_levenshtein_distance("Évry", "Every")

2

Let's quickly use the algorithm directly to see the output. The above cell imports something called `jellyfish`, which is another package that `csvmatch` uses. Typically, you wouldn't call the function directly (you could if you wanted to), but this is just to give you an idea of how the algorithm works. 

Right, now let's use this with our above data, and see if we have better luck. 

Remember: the default threshold is 0.6, and the score is calculated by: `1 - (distance / max(len(value1), len(value2)))`. In the case of `Évry` and `Every` above, our calculation would be:

`1 - (2/5)` = `3/5` = `0.6`

So, in this case, the two would lead to a fuzzy match.
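One detail worth knowing: the function imported above is the _Damerau_ flavour of Levenshtein, which also counts swapping two adjacent characters as a single edit. Below is a pure-Python sketch of the simpler "optimal string alignment" variant of that idea, so you can reproduce the arithmetic without `jellyfish`. The names `osa_distance` and `score` are made up here, and this is not `jellyfish`'s own implementation.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where insert, delete, substitute and
    adjacent-swap each cost 1 (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete everything in a
    for j in range(cols):
        d[0][j] = j  # insert everything in b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Two adjacent characters swapped: count it as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def score(a: str, b: str) -> float:
    """Distance scaled by the length of the longer string."""
    return 1 - osa_distance(a, b) / max(len(a), len(b), 1)


print(osa_distance("Évry", "Every"))              # 2
print(round(score("Évry", "Every"), 3))           # 0.6
print(osa_distance("Robert Ford", "Robert Frod")) # 1
```

The `Frod`/`Ford` typo costs one edit here, where plain Levenshtein would charge two.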

In [31]:
davos_df = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', right_on='full_name', 
                           ignore_case=True, 
                           ignore_order_words=True,
                           ignore_nonalpha=True,
                           ignore_titles=True,
                           method='levenshtein'
                          )

In [32]:
davos_df.shape

(1419, 7)

**WAIT, WHAT?!** Have we just gone from 121 matches to 1419? Is that overly optimistic? 

In [27]:
davos_df.sample(50)

Unnamed: 0,country,role,name,full_name,position_short_name,org_name,org_country
1358,United Kingdom,"Sec. of State for Environment, Food, & Rural A...",Michael GOVE,Bill Michael,"Senior Partner, United Kingdom; Chairman",KPMG,United Kingdom
1253,Sudan,Attorney Gen.,Omer AHMED Mohamed,Ahmed Shide Mohamed,Minister of Finance of Ethiopia,Ministry of Finance of Ethiopia,Ethiopia
68,Australia,Dep. Prime Min.,Michael MCCORMACK,Michael Corbat,"Chief Executive Officer, Citigroup",Citi,USA
1391,Vietnam,Pres.,Nguyen Phu TRONG,Nguyen Xuan Phuc,Prime Minister of Viet Nam,Office of the Government of Viet Nam,Viet Nam
692,"Korea, North","Member, SPA Presidium",JON Yong Nam,Song Juntao,"Global Head, International Business Affairs an...",Alibaba Group,People's Republic of China
638,Japan,Min. in Charge of Regional Revitalization,Satsuki KATAYAMA,Satsuki Katayama,Minister of State for Regional Revitalization ...,Cabinet Office of Japan,Japan
305,China,Min. of Agriculture,HAN Changfu,Tan Chengxu,"Mayor of Dalian, People's Republic of China",Dalian Municipal Government,People's Republic of China
311,China,Min. of Human Resources & Social Security,ZHANG Jinan,Zhang Ying,Co-Founder and Director,Ticket Youth Association,People's Republic of China
874,Marshall Islands,Min. of Justice,Thomas HEINE,Thomas Rabe,Chief Executive Officer and Chairman of the Ex...,Bertelsmann SE & Co. KGaA,Germany
1024,Nigeria,Min. of Environment,Amina MOHAMMED,Kamal Bin Ahmed Mohammed,Minister of Transportation and Telecommunicati...,Bahrain Economic Development Board,Bahrain


Why, yes. Yes, it is. This is why you _always_ confirm what an algorithm does. Right, maybe let's bump up our threshold and see what happens. 
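To see why 0.6 lets so much through, here's a sketch that scores a few name pairs taken from this notebook's outputs at both thresholds. It assumes the names are simply lowercased before comparison and scored with the same length-normalised formula as before; the helper names are illustrative, and the library's actual preprocessing (word reordering, title stripping) is not reproduced.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def score(a: str, b: str) -> float:
    """Similarity in [0, 1], scaled by the longer string's length."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)


def kept(pairs, threshold):
    """Pairs whose lowercased names score at or above the threshold."""
    return [(a, b) for a, b in pairs if score(a.lower(), b.lower()) >= threshold]


pairs = [
    ("HAN Changfu", "Tan Chengxu"),  # different people
    ("ZHANG Jinan", "Zhang Ying"),   # different people
    ("Zied LAADHARI", "Zied Ladhari"),                        # one letter apart
    ("Ignatius Dorell LEIKING", "Ignatius Darell Leiking"),   # one letter apart
]

for threshold in (0.6, 0.8):
    print(threshold, kept(pairs, threshold))
```

Short names are the problem: three edits out of eleven characters still scores about 0.73, so the first two pairs pass at 0.6 but are dropped at 0.8, while the one-letter variants survive both.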

In [28]:
davos_df = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', right_on='full_name', 
                           ignore_case=True, 
                           ignore_order_words=True,
                           ignore_nonalpha=True,
                           ignore_titles=True,
                           method='levenshtein',
                           threshold=0.8
                          )

In [29]:
davos_df.shape

(182, 7)

OK, that seems more reasonable. Let's sanity check.

In [30]:
davos_df.sample(50)

Unnamed: 0,country,role,name,full_name,position_short_name,org_name,org_country
27,Canada,Min. of Foreign Affairs,Chrystia FREELAND,Chrystia Freeland,Minister of Foreign Affairs of Canada,Global Affairs Canada,Canada
104,Luxembourg,Prime Min.,Xavier BETTEL,Xavier Bettel,Prime Minister and Minister for Communications...,Office of the Prime Minister of Luxembourg,Luxembourg
32,Costa Rica,Pres.,Carlos ALVARADO Quesada,Carlos Alvarado Quesada,President of Costa Rica,Office of the President of Costa Rica,Costa Rica
159,Tunisia,"Min. of Development, Investment & Internationa...",Zied LAADHARI,Zied Ladhari,"Minister of Development, Investment and Intern...","Ministry of Development, Investment and Intern...",Tunisia
28,China,Chief Executive,Carrie LAM,Carrie Lam,"Chief Executive of Hong Kong SAR, China",Hong Kong SAR Government,"Hong Kong SAR, China"
97,Liechtenstein,Min. of General Govt. Affairs & Finance,Adrian HASLER,Adrian Hasler,Prime Minister of Liechtenstein,Government of the Principality of Liechtenstein,Liechtenstein
163,Uganda,Pres.,Yoweri Kaguta MUSEVENI,Yoweri Kaguta Museveni,President of Uganda,Office of the President Uganda,Uganda
147,South Africa,Min. of Economic Development,Ebrahim PATEL,Ebrahim Patel,Minister of Economic Development of South Africa,Ministry of Economic Development of South Africa,South Africa
155,Switzerland,Federal Chancellor,Walter THURNHERR,Walter Thurnherr,Federal Chancellor of the Swiss Confederation,Federal Chancellery of Switzerland,Switzerland
107,Malaysia,Min. of Intl. Trade & Industry,Ignatius Dorell LEIKING,Ignatius Darell Leiking,Minister of International Trade and Industry o...,Ministry of International Trade and Industry o...,Malaysia


Next question, one for you guys to do:

Which names from the CIA world leaders list are also on the Forbes billionaires list?

Who can get the best result?

In [33]:
ciabill_df = fpd.fuzzy_merge(cia_world_leaders, forbes_df, left_on='name', right_on='name', 
                           ignore_case=True, 
                           ignore_order_words=True,
                           ignore_nonalpha=True,
                           ignore_titles=True,
                           method='levenshtein',
                           threshold=0.8
                          )

In [34]:
ciabill_df.shape

(12, 17)

So in what scenarios do you reckon Levenshtein will perform badly? 

Next up: **metaphone**. 

Metaphone's great for names which sound similar, which wouldn't be caught by Levenshtein. It's especially handy when you're working with transcript data. But, it too comes with its pitfalls. 

Let's look at an example and then discuss what the possible pitfalls could be. 

Which names from the CIA world leaders list are also on the United Nations sanctions list?

In [35]:
sanctions_df = pd.read_csv("sources/un-sanctions.csv")

In [37]:
ciasanct_df = fpd.fuzzy_merge(cia_world_leaders, sanctions_df, left_on='name', right_on='name', 
                           method='metaphone'
                          )

In [38]:
ciasanct_df.shape

(18, 22)

In [None]:
## How does this compare to our other algorithms (ignore case)? 


In [None]:
# ciasanct2_df.shape

In [None]:
## How does this compare to our other algorithms (levenshtein)? 
ciasanct3_df = fpd.fuzzy_merge(cia_world_leaders, sanctions_df, left_on='name', right_on='name', 
                           ignore_case=True, 
                           ignore_order_words=True,
                           ignore_nonalpha=True,
                           ignore_titles=True,
                           method='levenshtein',
                           threshold=0.8
                          )

In [None]:
ciasanct3_df.shape

Finally, we get to **Bilenko**. And, yes, this uses machine learning, where you train the matcher on your own data by labelling example pairs. So, now you have human *smarts* being involved in the process of matching up names across documents. 

Let's look at an example: 

Which names from the CIA world leaders list are also on the United Nations sanctions list?

In [None]:
ciasanct4_df = fpd.fuzzy_merge(cia_world_leaders, sanctions_df, left_on='name', right_on='name', 
                           method='bilenko'
                          )


Answer questions as follows:
 y - yes
 n - no
 s - skip
 f - finished

name: OMAR bin Sultan al-Ulama

name: OMAR HAMMAMI

Do these records refer to the same thing? [y/n/s/f] 

n



name: Imran KHAN

name: PEJMAN INDUSTRIAL SERVICES CORPORATION

Do these records refer to the same thing? [y/n/s/f] 

n



name: Milena HARITO

name: OCEAN MARITIME MANAGEMENT COMPANY, LIMITED (OMM)

Do these records refer to the same thing? [y/n/s/f] 

n



name: AMIR Aman

name: AMIN MUHAMMAD UL HAQ

Do these records refer to the same thing? [y/n/s/f] 

n



name: Paul BIYOGHE MBA

name: CENTRAL MILITARY COMMISSION OF THE WORKERS’ PARTY OF KOREA (CMC)

Do these records refer to the same thing? [y/n/s/f] 

n



name: Muhammad bin Hamad bin Sayf al-RUMHI

name: Muhammad Bahrum Naim

Do these records refer to the same thing? [y/n/s/f] 

n



name: Cecilia PEREZ

name: RICARDO PEREZ AYERAS

Do these records refer to the same thing? [y/n/s/f] 

n



name: Yeafesh OSMAN

name: STATE ENTERPRISE FOR MARKETING EQUIPMENT AND MAINTENANCE

Do these records refer to the same thing? [y/n/s/f] 

n



name: MUHAMMAD BIN RASHID Al Maktum

name: MOHAMMAD ESLAMI

Do these records refer to the same thing? [y/n/s/f] 

n



name: Peter MUNYA

name: STATE BATTERY MANUFACTURING ESTABLISHMENT

Do these records refer to the same thing? [y/n/s/f] 

n



name: CHOE Pu Il

name: CHOE CHUN YONG

Do these records refer to the same thing? [y/n/s/f] 

n



name: WEI Fenghe

name: WEIHAI WORLD-SHIPPING FREIGHT

Do these records refer to the same thing? [y/n/s/f] 

n



name: Ravi YERRIGADOO

name: RAFIDAIN STATE ORGANIZATION FOR IRRIGATION PROJECTS

Do these records refer to the same thing? [y/n/s/f] 

n



name: Anda CAKSA

name: STATE ORGANIZATION FOR AGRICULTURAL MECHANIZATION AND AGRICULTURAL SUPPLIES

Do these records refer to the same thing? [y/n/s/f] 

n



name: Ebrima MBALLOW

name: STATE ESTABLISHMENT FOR AGRICULTURAL MARKETING

Do these records refer to the same thing? [y/n/s/f] 

n



name: RI Su Yong

name: RI SU YONG

Do these records refer to the same thing? [y/n/s/f] 

y



name: Ivo VALENTE

name: NATIONAL ENTERPRISE FOR EQUIPMENT MARKETING AND MAINTENANCE

Do these records refer to the same thing? [y/n/s/f] 

n



name: MOHAMMED VI

name: MOHAMMED AL GHABRA

Do these records refer to the same thing? [y/n/s/f] 

n



name: Erna SOLBERG

name: STATE ENTERPRISE FOR LEATHER INDUSTRIES

Do these records refer to the same thing? [y/n/s/f] 

n



name: Muhammad al-AMARI

name: Muhammad Bahrum Naim

Do these records refer to the same thing? [y/n/s/f] 

n



name: CHOE Sung Ho

name: CHOE HWI

Do these records refer to the same thing? [y/n/s/f] 

n



name: ZHONG Shan

name: SHEN ZHONG INTERNATIONAL SHIPPING

Do these records refer to the same thing? [y/n/s/f] 

y



name: CHOE Yong Rim

name: CHOE HWI

Do these records refer to the same thing? [y/n/s/f] 

n



name: Talal ARSLAN

name: NATIONAL CENTRE FOR ENGINEERING AND ARCHITECTURAL CONSULTANCY

Do these records refer to the same thing? [y/n/s/f] 

n



name: ABDULLAH bin Muhammad Belhaf al-Nuaymi

name: ABDULLAH ANSHORI

Do these records refer to the same thing? [y/n/s/f] 

n



name: KANG Kyung-wha

name: KANG RYONG

Do these records refer to the same thing? [y/n/s/f] 

n



name: SHEN Jong-chin

name: SHEN ZHONG INTERNATIONAL SHIPPING

Do these records refer to the same thing? [y/n/s/f] 

n



name: Steadroy BENJAMIN

name: VOCATIONAL TRAINING CENTRE FOR ENGINEERING AND METALLIC INDUSTRIES

Do these records refer to the same thing? [y/n/s/f] 

n



name: MIAO Wei

name: M23

Do these records refer to the same thing? [y/n/s/f] 

n



name: KANG Myong Chol

name: KANG RYONG

Do these records refer to the same thing? [y/n/s/f] 

n



name: Henry PUNA

name: TARKHAN TAYUMURAZOVICH BATIRASHVILI

Do these records refer to the same thing? [y/n/s/f] 

n



name: JANG Chol

name: JANG BOM SU

Do these records refer to the same thing? [y/n/s/f] 

n



name: Bruno LE MAIRE

name: TOUS POUR LA PAIX ET LE DEVELOPPEMENT (NGO)

Do these records refer to the same thing? [y/n/s/f] 

n



name: Borut PAHOR

name: STATE CONTRACTING BUILDINGS COMPANY

Do these records refer to the same thing? [y/n/s/f] 

n



name: ALI Abdu

name: ALI MUSA AL-SHAWAKH

Do these records refer to the same thing? [y/n/s/f] 

n



name: Muhammad al-AHMED

name: Muhammad Bahrum Naim

Do these records refer to the same thing? [y/n/s/f] 

n



name: Arben ADEMI

name: SECOND ACADEMY OF NATURAL SCIENCES

Do these records refer to the same thing? [y/n/s/f] 

n



name: HAN ZAW

name: HAN YU-RO

Do these records refer to the same thing? [y/n/s/f] 

n



name: Muhammad bin Mazyad al-TUWAYJRI

name: Muhammad Bahrum Naim

Do these records refer to the same thing? [y/n/s/f] 

n



name: CHANG Po-ya

name: CHANG AN SHIPPING & TECHNOLOGY

Do these records refer to the same thing? [y/n/s/f] 

nn


Do these records refer to the same thing? [y/n/s/f] 

n



name: WANG Menghui

name: SINGWANG ECONOMICS AND TRADING GENERAL CORPORATION

Do these records refer to the same thing? [y/n/s/f] 

n



name: SALIM al-Abdallah al-Jabir al-Sabah

name: SALIM KONY

Do these records refer to the same thing? [y/n/s/f] 

n



name: Omar HILALE

name: MUSA HILAL ABDALLA

Do these records refer to the same thing? [y/n/s/f] 

n



name: Denis NAUGHTEN

name: STATE ESTABLISHMENT FOR SLAUGHTERING HOUSES

Do these records refer to the same thing? [y/n/s/f] 

n



name: Piyush GOYAL

name: AL-HARAMAIN & AL MASJED AL-AQSA CHARITY FOUNDATION

Do these records refer to the same thing? [y/n/s/f] 

n



name: Colin JORDAN

name: KEVIN JORDAN AXEL

Do these records refer to the same thing? [y/n/s/f] 

n



name: Ahmad bin Abdallah al-MAHMUD

name: Ahmad Iman Ali

Do these records refer to the same thing? [y/n/s/f] 

n



name: Ahmad bin Muhammad bin Salim al-FUTAISI

name: Ahmad Iman Ali

Do these records refer to the same thing? [y/n/s/f] 

n



name: MOHAMMED VI

name: ABOU MOHAMED AL ADNANI

Do these records refer to the same thing? [y/n/s/f] 

n



name: Pierre RAFFOUL

name: فيض

Do these records refer to the same thing? [y/n/s/f] 

n



name: Aysha MOHAMMED

name: TARAD MOHAMMAD ALJARBA

Do these records refer to the same thing? [y/n/s/f] 

n



name: Darmin NASUTION

name: STATE ENTERPRISE FOR TEXTILE AND SPINNING PRODUCTS IMPORTING AND DISTRIBUTION

Do these records refer to the same thing? [y/n/s/f] 

n



name: Ian DOUGLAS

name: IRUTA DOUGLAS MPAMO

Do these records refer to the same thing? [y/n/s/f] 

n



name: Dharmendra PRADHAN

name: CHEMICAL, PETROCHEMICAL, MECHANICAL AND METALURICAL TRAINING CENTRE

Do these records refer to the same thing? [y/n/s/f] 

n



name: CHOE Sung Ho

name: CHOE SONG IL

Do these records refer to the same thing? [y/n/s/f] 

n



name: MUHAMMAD BIN SALMAN bin Abd al-Aziz Al Saud

name: MAALIM SALMAN

Do these records refer to the same thing? [y/n/s/f] 

n



name: Muhammad al-JABRI

name: Muhammad Bahrum Naim

Do these records refer to the same thing? [y/n/s/f] 

n



name: MUJAHID bin Yusof

name: MUJAHIDIN INDONESIAN TIMUR (MIT)

Do these records refer to the same thing? [y/n/s/f] 

n



name: Muhammad HASHIM

name: Muhammad Bahrum Naim

Do these records refer to the same thing? [y/n/s/f] 

n



name: CHO Myoung-gyon

name: CHO IL U

Do these records refer to the same thing? [y/n/s/f] 

n



name: AHMAD bin Muhammad bin Hamad bin Abdallah Al Khalifa

name: AHMAD VAHID DASTJERDI

Do these records refer to the same thing? [y/n/s/f] 

n



name: CHEA CHANTO

name: CHEMICAL, PETROCHEMICAL, MECHANICAL AND METALURICAL TRAINING CENTRE

Do these records refer to the same thing? [y/n/s/f] 

In [None]:
ciasanct4_df.shape