In [1]:
#import library for string matching.
from fuzzywuzzy import fuzz



String matching here relies mostly on Levenshtein distance and returns a score from 0 to 100. If you want to match at a large scale, install python-Levenshtein so the comparisons run faster, as noted in the warning message.
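
To make the idea of Levenshtein distance concrete, here is a minimal pure-Python sketch of the edit distance itself. The function name `levenshtein` is an illustrative assumption and this is not the library's implementation; the library reports a normalized 0 to 100 similarity rather than this raw count of edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


# One missing letter, so the distance is a single edit.
print(levenshtein("Synchrony Financial", "Synchrony Financal"))  # 1
```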

Option 1: Direct string match. Best used when comparing only one or two words and the difference is likely a typo.

In [6]:
#slight typo
print("Synchrony Financial vs Synchrony Financal =>",fuzz.ratio("Synchrony Financial", "Synchrony Financal"))
print("Synchrony Financial vs Synchrony =>",fuzz.ratio("Synchrony Financial", "Synchrony"))
print("123 ABC Street Apt 3 vs 123 abc st #3 =>",fuzz.ratio("123 ABC Street Apt 3".lower(), "123 ABC St #3".lower()))
print("Synchrony Financial vs Chase Bank =>",fuzz.ratio("Synchrony Financial", "Chase Bank"))
print("123 ABC Street Apt 3 vs 123 =>",fuzz.ratio("123 ABC Street Apt 3", "123"))

Synchrony Financial vs Synchrony Financal => 97
Synchrony Financial vs Synchrony => 64
123 ABC Street Apt 3 vs 123 abc st #3 => 73
Synchrony Financial vs Chase Bank => 28
123 ABC Street Apt 3 vs 123 => 26


Option 2: Partial string match. Handles inconsistent substrings better, which helps find similarities when the strings have different lengths. It scores the best-matching substring it can find. Note that this can give unwanted results in some cases, as the examples below show: "123" alone scores 100 against the full address.

In [5]:
print("Synchrony Financial vs Synchrony Financal =>",fuzz.partial_ratio("Synchrony Financial", "Synchrony Financal"))
print("Synchrony Financial vs Synchrony =>",fuzz.partial_ratio("Synchrony Financial", "Synchrony"))
# fuzz.partial_ratio compares the raw strings, so unlike Option 1 this address pair is not lowercased first
print("123 ABC Street Apt 3 vs 123 abc st #3 =>",fuzz.partial_ratio("123 ABC Street Apt 3", "123 abc st #3"))
print("Synchrony Financial vs Chase Bank =>",fuzz.partial_ratio("Synchrony Financial", "Chase Bank"))
print("123 ABC Street Apt 3 vs 123 =>",fuzz.partial_ratio("123 ABC Street Apt 3", "123"))

Synchrony Financial vs Synchrony Financal => 94
Synchrony Financial vs Synchrony => 100
123 ABC Street Apt 3 vs 123 abc st #3 => 46
Synchrony Financial vs Chase Bank => 30
123 ABC Street Apt 3 vs 123 => 100
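
The "best matching pattern" idea can be sketched with the standard library alone. The code below is a simplified, brute-force stand-in, not fuzzywuzzy's own algorithm: it slides the shorter string across the longer one and keeps the best `difflib.SequenceMatcher` ratio. The name `partial_score` is hypothetical.

```python
from difflib import SequenceMatcher


def partial_score(a: str, b: str) -> int:
    """Best 0-100 similarity between the shorter string and any
    equal-length window of the longer string."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    best = 0.0
    # Try every window of the longer string that is as long as the shorter one.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


# An exact substring always finds a perfect window, which is why a short
# string such as "123" can score 100 against a much longer address.
print(partial_score("Synchrony Financial", "Synchrony"))  # 100
print(partial_score("123 ABC Street Apt 3", "123"))       # 100
```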


Option 3: Token Sort. Splits each string into word tokens, sorts them, and joins them back into a string for a more holistic match. Usually good for phrases where most words should be the same or share similar prefixes.

In [8]:
print("Synchrony Financial vs Synchrony Financal =>",fuzz.token_sort_ratio("Synchrony Financial", "Synchrony Financal"))
print("Synchrony Financial vs Synchrony =>",fuzz.token_sort_ratio("Synchrony Financial", "Synchrony"))
print("123 ABC Street Apt 3 vs 123 abc st #3 =>",fuzz.token_sort_ratio("123 ABC Street Apt 3", "123 abc st #3"))
print("Synchrony Financial vs Chase Bank =>",fuzz.token_sort_ratio("Synchrony Financial", "Chase Bank"))
print("123 ABC Street Apt 3 vs 123 =>",fuzz.token_sort_ratio("123 ABC Street Apt 3", "123"))

Synchrony Financial vs Synchrony Financal => 97
Synchrony Financial vs Synchrony => 64
123 ABC Street Apt 3 vs 123 abc st #3 => 75
Synchrony Financial vs Chase Bank => 34
123 ABC Street Apt 3 vs 123 => 26
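
The tokenize, sort, and rejoin steps can be sketched as follows. This is a rough approximation under stated assumptions, not the library's code: the helper names `sorted_tokens` and `token_sort_score` are hypothetical, the cleanup keeps ASCII letters and digits only, and `difflib.SequenceMatcher` stands in for the scorer.

```python
import re
from difflib import SequenceMatcher


def sorted_tokens(text: str) -> str:
    """Lowercase, drop punctuation, then sort the words alphabetically."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(sorted(tokens))


def token_sort_score(a: str, b: str) -> int:
    """0-100 similarity of the two strings after their words are sorted."""
    matcher = SequenceMatcher(None, sorted_tokens(a), sorted_tokens(b))
    return round(100 * matcher.ratio())


print(sorted_tokens("123 abc st #3"))  # 123 3 abc st
# Word order and case no longer matter once the tokens are sorted.
print(token_sort_score("Synchrony Financial", "financial SYNCHRONY"))  # 100
```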


Option 4: Token Set. Similar to Token Sort, but splits the tokens into two parts: the intersection and the remainder. Good when you expect a fair amount of overlap between phrases plus some extra words that may still be related but whose characters aren't exactly the same.

In [12]:
print("Synchrony Financial vs Synchrony Financal =>",fuzz.token_set_ratio("Synchrony Financial", "Synchrony Financal"))
print("Synchrony Financial vs Synchrony =>",fuzz.token_set_ratio("Synchrony Financial", "Synchrony"))
print("123 ABC Street Apt 3 vs 123 abc st #3 =>",fuzz.token_set_ratio("123 ABC Street Apt 3", "123 abc st #3"))
print("Synchrony Financial vs Chase Bank =>",fuzz.token_set_ratio("Synchrony Financial", "Chase Bank"))
print("123 ABC Street Apt 3 vs 123 =>",fuzz.token_set_ratio("123 ABC Street Apt 3", "123"))

Synchrony Financial vs Synchrony Financal => 97
Synchrony Financial vs Synchrony => 100
123 ABC Street Apt 3 vs 123 abc st #3 => 86
Synchrony Financial vs Chase Bank => 34
123 ABC Street Apt 3 vs 123 => 100
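
The intersection-and-remainder split can be sketched like this. It is a simplified illustration under assumptions, not the library's implementation: the helper names are hypothetical, the tokenizer keeps ASCII letters and digits only, and `difflib.SequenceMatcher` stands in for the scorer.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> set:
    """Lowercase, drop punctuation, and return the set of unique words."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def score(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_score(a: str, b: str) -> int:
    """Compare the shared words against each string's shared-plus-extra words."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    common = " ".join(sorted(tokens_a & tokens_b))   # intersection
    only_a = " ".join(sorted(tokens_a - tokens_b))   # remainder of a
    only_b = " ".join(sorted(tokens_b - tokens_a))   # remainder of b
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    # Keep the most favourable of the three pairwise comparisons.
    return max(
        score(common, combined_a),
        score(common, combined_b),
        score(combined_a, combined_b),
    )


# Every word of the shorter string is shared, so the intersection matches it exactly.
print(token_set_score("Synchrony Financial", "Synchrony"))  # 100
```

Taking the maximum is what lets a string made up entirely of shared words score 100, which is useful for overlap-heavy phrases but can overstate the match for very short inputs.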


There is no perfect method for every scenario. Based on your data and the types of differences you expect to find, choose the appropriate method or methods.

Cleaning the data before matching can also help. For example, convert everything to the same case and standardize abbreviations (Lane => ln, apartment => apt, road => rd, etc.) so the strings share a common pattern for comparison.
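
A cleaning step along these lines might look like the sketch below. The `ABBREVIATIONS` table and the `normalize` helper are hypothetical; the "street" entry is an added assumption beyond the examples given, and a real dataset would likely need a longer table.

```python
import re

# Hypothetical abbreviation table; extend it to suit your own data.
ABBREVIATIONS = {
    "lane": "ln",
    "apartment": "apt",
    "road": "rd",
    "street": "st",  # assumption: not one of the examples in the text
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and replace known words with abbreviations."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(normalize("12 Maple Lane, Apartment 4"))  # 12 maple ln apt 4
print(normalize("123 ABC Street Apt 3"))        # 123 abc st apt 3
```

Running both sides of a comparison through the same function before scoring means differences in case or abbreviation no longer lower the score.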