#### Fuzzywuzzy is used for String Matching.
It uses the _Levenshtein distance_ to measure the difference between two sequences.
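For reference, the classic Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal sketch in plain Python. The function name `levenshtein` is hypothetical and is not part of fuzzywuzzy.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions each cost 1."""
    # previous[j] is the distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("ab", "ac"))           # one substitution
print(levenshtein("kitten", "sitting"))  # two substitutions and one insertion
```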

In [2]:
from fuzzywuzzy import fuzz

#### Simple Ratio

It measures the Levenshtein ratio, reported as an integer between 0 and 100, i.e.

$$
\text{similarity} = 1 - \frac{\text{lev}(a, b)}{\text{len}(a) + \text{len}(b)}
$$

***When calculating the ratio, an insertion or a deletion adds 1 to the distance, while a substitution adds 2 (it counts as one deletion plus one insertion).***
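The formula above can be sketched without the library. The helper names `indel_distance` and `simple_ratio` are hypothetical. The sketch assumes the distance is built from insertions and deletions only, so a substitution counts as a deletion plus an insertion, which is what the worked examples in this section use. It illustrates the formula and is not fuzzywuzzy's actual implementation.

```python
def indel_distance(a: str, b: str) -> int:
    """Edit distance that allows only insertions and deletions."""
    # lcs[j] = length of the longest common subsequence of the
    # part of a handled so far and b[:j]
    lcs = [0] * (len(b) + 1)
    for char_a in a:
        previous_diagonal = 0
        for j, char_b in enumerate(b, start=1):
            previous_row_value = lcs[j]
            if char_a == char_b:
                lcs[j] = previous_diagonal + 1
            else:
                lcs[j] = max(lcs[j], lcs[j - 1])
            previous_diagonal = previous_row_value
    # Every character outside the common subsequence is deleted or inserted.
    return len(a) + len(b) - 2 * lcs[-1]


def simple_ratio(a: str, b: str) -> int:
    """similarity = 1 - distance / (len(a) + len(b)), scaled to 0-100."""
    total = len(a) + len(b)
    if total == 0:
        return 0  # assumption for this sketch: nothing to compare
    return round(100 * (1 - indel_distance(a, b) / total))


print(simple_ratio("ab", "ac"))
print(simple_ratio("Apple iPhone 14 Pro Max", "iPhone 14"))
```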

In [None]:
string1 = "ab"
string2 = "ac"

# Substitution b -> c: counts as 2 (one deletion + one insertion)
print(fuzz.ratio(string1, string2))

(1 - ((1+1) / (2+2)))*100

50


50.0

In [11]:
string1 = "ab"
string2 = "a"

# Deletion b: 1
print(fuzz.ratio(string1, string2))

(1 - ((1) / (1+2)))*100

67


66.66666666666667

In [12]:
string1 = "Hello"
string2 = "Hello Worlds"

# Addition " Worlds" -> 7
print(fuzz.ratio(string1, string2))

(1 - ((7) / (5+12)))*100

59


58.82352941176471

In [15]:
string1 = "Apple iPhone 14 Pro Max"
string2 = "iPhone 14"

# Addition "Apple " (6) + " Pro Max" (8) -> 14
print(fuzz.ratio(string1, string2))

(1 - ((14) / (9+23)))*100

56


56.25

#### Partial Ratio

It is a string similarity metric that measures how well a **shorter string** matches a **substring of a longer string**.

#### How it Works

String 1: Apple iPhone 14 Pro Max

String 2: iPhone 14

Step 1: Identify the shorter and longer string
- Shorter = "iPhone 14"
- Longer = "Apple iPhone 14 Pro Max"

Step 2: Slide the Shorter String over the Substrings of the Longer One
- It compares "iPhone 14" with every substring of "Apple iPhone 14 Pro Max" that is roughly the same length, and calculates a similarity ratio for each (using the Levenshtein distance).

Step 3: Return the Highest Similarity Score.
- Here the algorithm finds the substring "iPhone 14" inside the longer string, which is an exact match, so the score is 100.
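The three steps can be sketched as a fixed-width sliding window. This is a simplification under stated assumptions: `partial_ratio_sketch` is a hypothetical name, only windows of exactly the shorter string's length are tried, and `difflib.SequenceMatcher` from the standard library stands in for the ratio. The library's own window selection may differ, so scores can deviate on harder inputs.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    # Step 1: identify the shorter and the longer string.
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0  # assumption for this sketch: nothing to match

    # Step 2: compare the shorter string with every window of the same length.
    window = len(shorter)
    scores = [
        SequenceMatcher(None, shorter, longer[start:start + window]).ratio()
        for start in range(len(longer) - window + 1)
    ]

    # Step 3: return the best score on a 0-100 scale.
    return round(100 * max(scores))


print(partial_ratio_sketch("Apple iPhone 14 Pro Max", "iPhone 14"))
```

A fixed window keeps the sketch short. It cannot reward a match that needs a slightly longer or shorter substring, which is why the steps above say "roughly the same length".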

In [18]:
def all_substrings(text):
    """Return every contiguous, non-empty substring of text."""
    substrings = []

    # Generate all contiguous substrings
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            substrings.append(text[start:end])

    return substrings


In [None]:
string1 = "Apple iPhone 14 Pro Max"
string2 = "iPhone 14"

print(f"Partial Ratio between {string1} and {string2} is {fuzz.partial_ratio(string1, string2)}")


# Step 1:
    # Shorter: iPhone 14
    # Longer: Apple iPhone 14 Pro Max

# Step 2 & 3: Compare the shorter string with every substring of the longer one, and return the maximum Levenshtein ratio
substrings = all_substrings(string1)
substring_ratios = [fuzz.ratio(string2, sub_str) for sub_str in substrings]
max_ratio = max(substring_ratios)

print(f"Maximum Levenshtein Ratio from the substring: {substrings[substring_ratios.index(max_ratio)]} is: {max_ratio}")

Partial Ratio between Apple iPhone 14 Pro Max and iPhone 14 is 100
Maximum Levenshtein Ratio from the substring: iPhone 14 is: 100


#### Token Sort Ratio

Step 1: Tokenise Both Strings into Words

Step 2: Sort the Tokens Alphabetically

Step 3: Join the Sorted Tokens back into a String

Step 4: Compute the Levenshtein Ratio between the Two Sorted Strings

##### When to Use
- Word order doesn’t matter.
- Strings may contain the same words in a different order.
- You want to match phrases, product names, or keywords flexibly.
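The four steps can be wrapped in one function. In this sketch `token_sort_ratio_sketch` is a hypothetical name, `difflib.SequenceMatcher` stands in for the Levenshtein ratio, and lowercasing is an assumption added so that case differences do not affect the sort order.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    def sort_tokens(text: str) -> str:
        # Steps 1-3: split on whitespace, sort alphabetically, join back.
        return " ".join(sorted(text.lower().split()))

    # Step 4: ratio of the two sorted strings on a 0-100 scale.
    ratio = SequenceMatcher(None, sort_tokens(s1), sort_tokens(s2)).ratio()
    return round(100 * ratio)


print(token_sort_ratio_sketch("apple banana", "banana apple"))
print(token_sort_ratio_sketch("Apple iPhone 14 Pro Max", "iPhone 14"))
```

Sorting removes word order as a source of difference, but extra words still lower the score, as the second call shows.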

In [40]:
string1 = "Apple iPhone 14 Pro Max"
string2 = "iPhone 14"

sorted_string1 = " ".join(sorted(string1.split(" ")))
sorted_string2 = " ".join(sorted(string2.split(" ")))

# After sorting, the strings differ by "Apple Max Pro " -> 14 characters
print(f"Levenshtein Ratio between {sorted_string1} and {sorted_string2} is {fuzz.ratio(sorted_string1, sorted_string2)}")
print(f"Token Sort Ratio between {string1} and {string2} is {fuzz.token_sort_ratio(string1, string2)}")

Levenshtein Ratio between 14 Apple Max Pro iPhone and 14 iPhone is 56
Token Sort Ratio between Apple iPhone 14 Pro Max and iPhone 14 is 56


In [None]:
string1 = "apple banana"
string2 = "banana apple"

sorted_string1 = " ".join(sorted(string1.split(" ")))
sorted_string2 = " ".join(sorted(string2.split(" ")))

print(f"Levenshtein Ratio between {sorted_string1} and {sorted_string2} is {fuzz.ratio(sorted_string1, sorted_string2)}")
print(f"Token Sort Ratio between {string1} and {string2} is {fuzz.token_sort_ratio(string1, string2)}")

Levenshtein Ratio between apple banana and apple banana is 100
Token Sort Ratio between apple banana and banana apple is 100


#### Token Set Ratio

Step 1: Tokenise Each String into a Set of Words

Step 2: Calculate Three Sets
- Intersection
- S1 difference S2 (S1 - S2)
- S2 difference S1 (S2 - S1)

Step 3: Sort Each Set and Join It into a String

Step 4: Create the Full Strings
- Full String 1: Intersection + S1 Difference
- Full String 2: Intersection + S2 Difference

Step 5: Calculate the Ratio between
- Intersection, Full String 1
- Intersection, Full String 2
- Full String 1, Full String 2

Step 6: Return the Max Score
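Steps 1 to 6 can be collected into a single reusable function. This is a sketch under stated assumptions: `ratio` and `token_set_ratio_sketch` are hypothetical names, and `difflib.SequenceMatcher` stands in for `fuzz.ratio`, so results may differ from the library on some inputs.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    # Stand-in for fuzz.ratio on a 0-100 scale.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    # Step 1: tokenise each string into a set of words.
    tokens1, tokens2 = set(s1.split()), set(s2.split())

    # Steps 2-3: build the three sets, sort each one and join it into a string.
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))

    # Step 4: full strings; strip() removes the stray space left by an empty part.
    full1 = f"{common} {only1}".strip()
    full2 = f"{common} {only2}".strip()

    # Steps 5-6: compare the three pairs and keep the best score.
    return max(ratio(common, full1), ratio(common, full2), ratio(full1, full2))


print(token_set_ratio_sketch(
    "mariners vs angels",
    "los angeles angels of anaheim at seattle mariners",
))
```

Because the shared tokens appear at the start of both full strings, a large overlap dominates the score even when one string carries many extra words.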

In [64]:
string1 = "apple apple banana"
string2 = "apple banana"

# Step 1
token1 = string1.split(" ")
token2 = string2.split(" ")

# Step 2
intersection = set(token1).intersection(set(token2))
diff1to2 = set(token1).difference(set(token2))
diff2to1= set(token2).difference(set(token1))


# Step 3
sorted_intersection = " ".join(sorted(intersection))
sorted_1to2 = " ".join(sorted(diff1to2))
sorted_2to1 = " ".join(sorted(diff2to1))

# Step 4
combined_1to2 = sorted_intersection + " " + sorted_1to2
combined_2to1 = sorted_intersection + " " + sorted_2to1

# strip
sorted_intersection = sorted_intersection.strip()
combined_1to2 = combined_1to2.strip()
combined_2to1 = combined_2to1.strip()

# Step 5
print(f"Levenshtein Ratio between {sorted_intersection} and {combined_1to2} is {fuzz.ratio(sorted_intersection, combined_1to2)}")
print(f"Levenshtein Ratio between {sorted_intersection} and {combined_2to1} is {fuzz.ratio(sorted_intersection, combined_2to1)}")
print(f"Levenshtein Ratio between {combined_1to2} and {combined_2to1} is {fuzz.ratio(combined_1to2, combined_2to1)}")



print(f"Token Set Ratio between {string1} and {string2} is {fuzz.token_set_ratio(string1, string2)}")

Levenshtein Ratio between apple banana and apple banana is 100
Levenshtein Ratio between apple banana and apple banana is 100
Levenshtein Ratio between apple banana and apple banana is 100
Token Set Ratio between apple apple banana and apple banana is 100


In [65]:
string1 = "mariners vs angels"
string2 = "los angeles angels of anaheim at seattle mariners"

# Step 1
token1 = string1.split(" ")
token2 = string2.split(" ")

# Step 2
intersection = set(token1).intersection(set(token2))
diff1to2 = set(token1).difference(set(token2))
diff2to1= set(token2).difference(set(token1))


# Step 3
sorted_intersection = " ".join(sorted(intersection))
sorted_1to2 = " ".join(sorted(diff1to2))
sorted_2to1 = " ".join(sorted(diff2to1))

# Step 4
combined_1to2 = sorted_intersection + " " + sorted_1to2
combined_2to1 = sorted_intersection + " " + sorted_2to1

# strip
sorted_intersection = sorted_intersection.strip()
combined_1to2 = combined_1to2.strip()
combined_2to1 = combined_2to1.strip()

# Step 5
print(f"Levenshtein Ratio between {sorted_intersection} and {combined_1to2} is {fuzz.ratio(sorted_intersection, combined_1to2)}")
print(f"Levenshtein Ratio between {sorted_intersection} and {combined_2to1} is {fuzz.ratio(sorted_intersection, combined_2to1)}")
print(f"Levenshtein Ratio between {combined_1to2} and {combined_2to1} is {fuzz.ratio(combined_1to2, combined_2to1)}")



print(f"Token Set Ratio between {string1} and {string2} is {fuzz.token_set_ratio(string1, string2)}")

Levenshtein Ratio between angels mariners and angels mariners vs is 91
Levenshtein Ratio between angels mariners and angels mariners anaheim angeles at los of seattle is 47
Levenshtein Ratio between angels mariners vs and angels mariners anaheim angeles at los of seattle is 51
Token Set Ratio between mariners vs angels and los angeles angels of anaheim at seattle mariners is 91
