# Problem Statement
Sometimes we have to combine multiple datasets whose *only common field is a free-text string*, and that string holds mostly similar but occasionally different representations of the same entities, such as customer names or company names. Examples of these slightly different representations include "IBM" vs "I.B.M", and "Microsoft" vs "Microsotf".

If your systems rely on the strings being identical, such records will fail to match.
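
To make the problem concrete, here is a minimal sketch that uses only Python's standard library. It relies on difflib as a stand-in for a fuzzy scorer (it is not the fuzzywuzzy API), and the helper name similarity() is hypothetical. It shows that exact comparison rejects both example pairs, while a similarity ratio still rates them as close.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = [("IBM", "I.B.M"), ("Microsoft", "Microsotf")]
for left, right in pairs:
    # Exact equality is False for both pairs, yet the ratio stays fairly high.
    print(f"{left!r} vs {right!r}: equal={left == right}, "
          f"ratio={similarity(left, right):.2f}")
```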

Here, we showcase some sample code that uses Python's fuzzywuzzy library to handle this thorny problem, which comes up frequently when cleaning datasets.

# Purpose
- To illustrate fuzzy string matching with Python's fuzzywuzzy library, as a way to match slightly different representations of the same entities.

In [30]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from fuzzywuzzy import fuzz, process

# pandas options
pd.set_option('display.max_columns', None)  # Shows all columns in DataFrames. See http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_rows', None) # Shows all rows in DataFrames.
pd.set_option('display.width', 5000)
pd.set_option('display.multi_sparse', False)  #  Display every cell (for multi-level index).
pd.set_option('display.max_colwidth', None)  # Display full contents of each column (None replaces the deprecated -1).


# Examining the data
Here we have 2 datasets, df_db1 and df_db2. df_db1 contains the company name (co_name_1) and symbol, while df_db2 contains the company name (co_name_2) and stock price. We would like to join both datasets to get the company name, symbol and stock price in 1 dataset. However, comparing the 2 name fields (co_name_1, co_name_2) shows that they hold slightly different representations of the same company name. For example, "CapitaLand Mall Trust" vs "Capita Mall Trust".
This makes it difficult to combine both datasets with an exact-match join.

In [31]:
df_db1 = pd.read_excel('Example - Sample Company Name Data - Illustrate fuzzywuzzy.xlsx', sheet_name='db1')
df_db2 = pd.read_excel('Example - Sample Company Name Data - Illustrate fuzzywuzzy.xlsx', sheet_name='db2')

print(df_db1)
print(df_db2)

    symbol                                           co_name_1
0  C38U.SI  CapitaLand Mall Trust                             
1  O39.SI   Oversea-Chinese Banking Corporation Limited (OCBC)
2  S58.SI   SATS Ltd.                                         
3  G13.SI   Genting Singapore Limited                         
4  C52.SI   ComfortDelGro Corporation Limited                 
5  Z74.SI   Singapore Telecommunications Limited              
6  D01.SI   Dairy Farm International Holdings Limited         
7  C31.SI   CapitaLand Limited                                
                   co_name_2  stock_price
0  Capita Mall Trust          2.36       
1  OCBC                       11.04      
2  SATS Ltd                   5.05       
3  Gneting Singapore Limited  1.01       
4  ComfortDelGro              2.43       
5  SingTel                    2.94       
6  Dairy Farm                 7.76       
7  CapLand                    3.43       
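
Since the Excel workbook may not be at hand, the two tables printed above can be rebuilt inline. The sketch below copies the values from that output and, as an illustration, shows how many rows an exact-match join on the name columns returns. The name df_exact is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Values copied from the two tables printed above.
df_db1 = pd.DataFrame({
    "symbol": ["C38U.SI", "O39.SI", "S58.SI", "G13.SI",
               "C52.SI", "Z74.SI", "D01.SI", "C31.SI"],
    "co_name_1": [
        "CapitaLand Mall Trust",
        "Oversea-Chinese Banking Corporation Limited (OCBC)",
        "SATS Ltd.",
        "Genting Singapore Limited",
        "ComfortDelGro Corporation Limited",
        "Singapore Telecommunications Limited",
        "Dairy Farm International Holdings Limited",
        "CapitaLand Limited",
    ],
})
df_db2 = pd.DataFrame({
    "co_name_2": ["Capita Mall Trust", "OCBC", "SATS Ltd",
                  "Gneting Singapore Limited", "ComfortDelGro",
                  "SingTel", "Dairy Farm", "CapLand"],
    "stock_price": [2.36, 11.04, 5.05, 1.01, 2.43, 2.94, 7.76, 3.43],
})

# An exact-match inner join keeps only rows whose names are identical strings.
df_exact = pd.merge(df_db1, df_db2, how="inner",
                    left_on="co_name_1", right_on="co_name_2")
print(len(df_exact))
```

No pair of names is character-for-character identical, so the exact join comes back empty, which is why a fuzzy "bridge" column is needed.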


# Creating a wrapper around process.extractOne()
The function get_best_match_string() wraps fuzzywuzzy's process.extractOne(), to make it easier to use with .apply() on a DataFrame column.

The general idea is to build a unique list of options ("l_options") from the 2nd dataset ("df_db2"). Then, for each company name in the first dataset ("df_db1"), we obtain the closest-matching string from l_options.
We also cater for the situation where none of the strings in l_options is a good match. This is controlled by the score_cutoff parameter: when no option reaches the cutoff, process.extractOne() returns None and the wrapper converts that to NaN.

In [32]:
l_options = df_db2['co_name_2'].unique()  # Get list of allowed target values.

def get_best_match_string(str_query, l_options, scorer=fuzz.token_set_ratio, score_cutoff=75):
    """ Uses fuzzywuzzy's process.extractOne to return best matching string. 
   
    'scorer' is set to fuzz.token_set_ratio algorithm, instead of the default fuzz.WRatio scorer.
    
    'score_cutoff' is set at a higher bar, so that the match must be reasonably probable, otherwise system returns a NaN.
    This is to catch situations whereby the query string is obviously not in the right table, and we don't wish to return capricious results.
    """
    tup_ret = process.extractOne(str_query, l_options, scorer=scorer, score_cutoff=score_cutoff)
    if tup_ret is not None:
        return tup_ret[0]
    else:
        return np.nan

df_db1['co_name_2_matched'] = df_db1['co_name_1'].apply(lambda x: get_best_match_string(x, l_options))
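
The role of score_cutoff can be illustrated without fuzzywuzzy. The sketch below is a stand-in built on several assumptions: best_match_or_none() is a hypothetical helper, and it scores with difflib's plain ratio rather than fuzz.token_set_ratio, so its scores differ from those in this notebook. It only mirrors the control flow: pick the top-scoring option, and return nothing if that score falls below the cutoff.

```python
from difflib import SequenceMatcher


def best_match_or_none(query, options, score_cutoff=75):
    """Return the highest-scoring option, or None if it scores below score_cutoff."""
    def score(option):
        # Scale difflib's 0..1 ratio to 0..100, the range fuzzywuzzy scores use.
        return 100 * SequenceMatcher(None, query.lower(), option.lower()).ratio()

    best = max(options, key=score, default=None)
    if best is None or score(best) < score_cutoff:
        return None
    return best


options = ["Capita Mall Trust", "OCBC", "SATS Ltd", "Gneting Singapore Limited",
           "ComfortDelGro", "SingTel", "Dairy Farm", "CapLand"]

print(best_match_or_none("SATS Ltd.", options))     # a close match clears the cutoff
print(best_match_or_none("XYZ Holdings", options))  # hypothetical name; nothing is close
```

Returning None (or NaN, as the wrapper above does) keeps a weak best guess from being silently treated as a match.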

# Examining the best-matched results
Below, we see the best matches, taken from df_db2.co_name_2. Recall that we're doing this to form a "bridge" from df_db1 to df_db2. 

Generally, most names were matched correctly. We see that the algorithm works for various cases, whether the names were missing a character ("SATS Ltd." vs "SATS Ltd"), or a long string was compared to a short string ("Oversea-Chinese Banking Corporation Limited (OCBC)" vs "OCBC").

What about the failed matches? "CapitaLand Limited" vs "CapLand" resulted in a NaN because the match score did not clear the hurdle we set (score_cutoff=75). In other words, this parameter tells the system to return an effective "NOT FOUND" when it is not confident.
"Singapore Telecommunications Limited" was wrongly matched with "Gneting Singapore Limited" (the intended match was "SingTel"), most likely because the two names share the words "Singapore" and "Limited", which token-based scoring rewards. Each dataset is unique, but with careful observation you can apply mitigation techniques, such as creating an auxiliary column with stop-words/keywords removed, to improve the algorithm's matching accuracy.
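
One way to build such an auxiliary column is sketched below. The stop-word list and the helper name strip_stop_words() are assumptions made for illustration; they are not part of fuzzywuzzy or of the original notebook, and the list would need tuning for real data.

```python
import re

# Hypothetical stop-word list; adjust it to the generic words in your own data.
STOP_WORDS = {"limited", "ltd", "corporation", "corp", "holdings", "international"}


def strip_stop_words(name: str) -> str:
    """Lowercase a company name and drop generic corporate words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


for name in ["Singapore Telecommunications Limited",
             "Gneting Singapore Limited",
             "SATS Ltd."]:
    print(f"{name!r} -> {strip_stop_words(name)!r}")
```

In the notebook, a function like this could be applied to both co_name_1 and co_name_2, with the fuzzy match run on the cleaned columns. The two names above still share "Singapore" after cleaning, so whether this is enough depends on the scorer and cutoff chosen.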

In [33]:
df_db1

   symbol   co_name_1                                           co_name_2_matched
0  C38U.SI  CapitaLand Mall Trust                               Capita Mall Trust
1  O39.SI   Oversea-Chinese Banking Corporation Limited (OCBC)  OCBC
2  S58.SI   SATS Ltd.                                           SATS Ltd
3  G13.SI   Genting Singapore Limited                           Gneting Singapore Limited
4  C52.SI   ComfortDelGro Corporation Limited                   ComfortDelGro
5  Z74.SI   Singapore Telecommunications Limited                Gneting Singapore Limited
6  D01.SI   Dairy Farm International Holdings Limited           Dairy Farm
7  C31.SI   CapitaLand Limited                                  NaN


# Combining both datasets
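Finally, we merge both datasets using the "bridge" (co_name_2_matched) that we've created. This brings fields from both datasets into one combined dataset. Rows whose bridge value is NaN keep NaN in the df_db2 columns, and a wrong bridge value carries the wrong stock price with it (see row 5 in the output).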

In [34]:
df_merge = pd.merge(df_db1, df_db2, how='left', left_on=['co_name_2_matched'], right_on=['co_name_2'])
df_merge.drop(labels=['co_name_2_matched'], axis=1, inplace=True)
df_merge

   symbol   co_name_1                                           co_name_2                  stock_price
0  C38U.SI  CapitaLand Mall Trust                               Capita Mall Trust          2.36
1  O39.SI   Oversea-Chinese Banking Corporation Limited (OCBC)  OCBC                       11.04
2  S58.SI   SATS Ltd.                                           SATS Ltd                   5.05
3  G13.SI   Genting Singapore Limited                           Gneting Singapore Limited  1.01
4  C52.SI   ComfortDelGro Corporation Limited                   ComfortDelGro              2.43
5  Z74.SI   Singapore Telecommunications Limited                Gneting Singapore Limited  1.01
6  D01.SI   Dairy Farm International Holdings Limited           Dairy Farm                 7.76
7  C31.SI   CapitaLand Limited                                  NaN                        NaN
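
After a left merge like this, it can help to list the rows that found no partner. The sketch below uses pandas' indicator option on small stand-in frames built from three rows of this notebook's data. The names df_left, df_right, df_check and unmatched are hypothetical.

```python
import numpy as np
import pandas as pd

# Small stand-in frames using three rows from the tables in this notebook.
df_left = pd.DataFrame({
    "symbol": ["S58.SI", "D01.SI", "C31.SI"],
    "co_name_2_matched": ["SATS Ltd", "Dairy Farm", np.nan],  # NaN = no match found
})
df_right = pd.DataFrame({
    "co_name_2": ["SATS Ltd", "Dairy Farm", "CapLand"],
    "stock_price": [5.05, 7.76, 3.43],
})

# indicator=True adds a '_merge' column recording where each row came from.
df_check = pd.merge(df_left, df_right, how="left",
                    left_on="co_name_2_matched", right_on="co_name_2",
                    indicator=True)

# Rows flagged 'left_only' had no counterpart in df_right.
unmatched = df_check.loc[df_check["_merge"] == "left_only", "symbol"].tolist()
print(unmatched)
```

This check only surfaces rows with no match. A confident but wrong match still merges cleanly, so it has to be caught by inspecting the matched names.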


# Troubleshooting the mismatches
Running process.extract() on the failed or wrong matches returns the score for every candidate (100 is a perfect match), which gives a sense of how close each one is. By adjusting the scorer and/or the score_cutoff parameter accordingly, we can fine-tune the matching to get as many correct matches as possible.

In [35]:
# TROUBLESHOOT MATCHING #
process.extract('CapitaLand Limited', l_options, limit=20, scorer=fuzz.token_set_ratio)

[('Gneting Singapore Limited', 56),
 ('CapLand', 56),
 ('Capita Mall Trust', 51),
 ('SingTel', 32),
 ('SATS Ltd', 31),
 ('Dairy Farm', 29),
 ('ComfortDelGro', 26),
 ('OCBC', 9)]
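
To see why two quite different candidates can tie, it helps to look at how token-set scoring works in principle. The sketch below is a simplified re-implementation of that idea on top of difflib. It is not fuzzywuzzy's code, the names _ratio() and token_set_score() are hypothetical, and its scores may differ from the library's (for example when a different string-distance backend is installed).

```python
import re
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer 0..100 (0 if either is empty)."""
    if not a or not b:
        return 0
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_score(s1: str, s2: str) -> int:
    """Compare the shared tokens against each side's full, sorted token set."""
    tokens1 = set(re.findall(r"[a-z0-9]+", s1.lower()))
    tokens2 = set(re.findall(r"[a-z0-9]+", s2.lower()))
    shared = " ".join(sorted(tokens1 & tokens2))
    combined1 = (shared + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined2 = (shared + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    # The best of the three pairwise comparisons becomes the score.
    return max(_ratio(shared, combined1),
               _ratio(shared, combined2),
               _ratio(combined1, combined2))


query = "CapitaLand Limited"
for option in ["Gneting Singapore Limited", "CapLand"]:
    print(option, token_set_score(query, option))
```

In this sketch, "Gneting Singapore Limited" earns its score purely from the shared token "limited", while "CapLand" earns it from character overlap with "capitaland". That is one reason a generic word such as "Limited" can put an unrelated name level with the intended one.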

# References
See the links below for further reading:
- https://www.datacamp.com/community/tutorials/fuzzy-string-python
- https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/fuzz.py