# Fuzzy Text Matching Example

This notebook demonstrates how to use the FuzzyPredictor to match text with fuzzy string-matching algorithms.

In [1]:
########################################################################
## FOR NOTEBOOKS ONLY: ADD THE PROJECT ROOT TO THE PYTHON PATH
########################################################################
import os
import sys

sys.path.insert(
    0, os.path.abspath(os.path.join(os.getcwd(), '..'))
)

## Imports

In [2]:
import numpy as np
import pandas as pd

from sklearn.metrics import (
    f1_score, precision_score, recall_score, accuracy_score
)

from sklearn.model_selection import GridSearchCV

from text_prediction.predictors.distance.fuzzy import FuzzyPredictor
from text_prediction.predictors.distance import ALGORITHMS, METHODS

## 1. Load and Prepare Data

For this example, we'll use a sample dataset of product names and their variations.


In [3]:
df = pd.read_csv("data/product_descriptions.csv")
df.head()

|  | X | y_true |
|---|---|---|
| 0 | iphone 13 pro max | iPhone 13 Pro Max |
| 1 | iphone 13 promax | iPhone 13 Pro Max |
| 2 | iphone 13 pro | iPhone 13 Pro |
| 3 | iphone 13 | iPhone 13 |
| 4 | samsung galaxy s21 | Samsung Galaxy S21 |


## 2. Basic Text Search

In this example, we have a large set of text data and simply want to search it for instances of a certain query. For example, in a list of reported survey responses, we want to look for instances of "iPhone 13 Pro Max".

Create a basic FuzzyPredictor that will use the levenshtein algorithm and the distance method.

In [4]:
predictor = FuzzyPredictor(algorithm="levenshtein", method="distance")

To simulate this type of search, we fit the predictor only on X, which stands in for the survey responses.

In [5]:
predictor.fit_transform(df['X'].tolist())

['iphone 13 pro max',
 'iphone 13 promax',
 'iphone 13 pro',
 'iphone 13',
 'samsung galaxy s21',
 'samsung galaxy s21 ultra',
 'samsung s21',
 'samsung s21 ultra',
 'macbook pro m1',
 'macbook pro m1 max',
 'macbook m1',
 'macbook m1 pro',
 'airpods pro 2nd gen',
 'airpods pro 2',
 'airpods pro second generation',
 'airpods 2nd gen',
 'airpods 3rd gen',
 'airpods 3',
 'galaxy buds pro',
 'samsung buds pro',
 'galaxy buds 2',
 'samsung buds 2',
 'tide original detergent',
 'tide detergent original',
 'tide pods original',
 'tide pods spring meadow',
 'tide spring meadow detergent',
 'bounty paper towels',
 'bounty select a size',
 'bounty select size',
 'charmin ultra soft',
 'charmin toilet paper ultra soft',
 'charmin ultra strong',
 'clorox original',
 'clorox bleach',
 'clorox disinfecting wipes',
 'lysol disinfectant spray',
 'lysol spray',
 'lysol wipes',
 'dawn dish soap original',
 'dawn original',
 'dawn ultra dish soap',
 'cheerios original',
 'plain cheerios',
 'honey nut ch

Now we search the survey responses for the product "iPhone 13 Pro Max". The results show about four likely instances, but only two of them contain the full product name. In a production setting, you would clean the text data more thoroughly before searching.

The predictor performs a fuzzy match between the query and each product description by calculating the Levenshtein distance between the two strings. The results are sorted by distance in ascending order, so the closest matches come first.

In [6]:
predictor.predict_proba(X="iPhone 13 Pro Max")

[[('iphone 13 pro max', np.uint32(0)),
  ('iphone 13 promax', np.uint32(1)),
  ('iphone 13 pro', np.uint32(4)),
  ('iphone 13', np.uint32(8)),
  ('airpods pro 2', np.uint32(11))]]

This returns several possible matches, each with a score. With this configuration the score is the Levenshtein distance between the query and the product description, so lower is better. We can set the limit to keep, say, only the top 2 matches.

In [7]:
predictor.limit = 2
predictor.predict_proba(X="iPhone 13 Pro Max")

[[('iphone 13 pro max', np.uint32(0)), ('iphone 13 promax', np.uint32(1))]]
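The distances reported above can be reproduced with a small standalone sketch. It is not the FuzzyPredictor implementation: `levenshtein` is a textbook dynamic-programming version, `rank_candidates` is a hypothetical helper, and lowercasing the query is an assumption suggested by the zero distance between "iPhone 13 Pro Max" and its lowercase form in the output.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


def rank_candidates(query, candidates, limit=None):
    """Hypothetical helper: score each candidate, sort ascending, keep `limit` items."""
    # Assumption: case is ignored before comparing.
    scored = [(c, levenshtein(query.lower(), c.lower())) for c in candidates]
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]


responses = ["iphone 13", "iphone 13 pro", "iphone 13 promax", "iphone 13 pro max"]
top_two = rank_candidates("iPhone 13 Pro Max", responses, limit=2)
print(top_two)  # [('iphone 13 pro max', 0), ('iphone 13 promax', 1)]
```

Passing `limit=None` keeps every candidate, which mirrors the idea of an optional cap on the number of matches.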

## 3. Basic Classification Example

Let's assume you're trying to label a batch of survey responses instead of searching for a particular query. For example, say we want to standardize the responses using the training data provided. You can use the FuzzyPredictor to predict the label for each survey response.

This resembles a supervised learning problem where we have a set of features (X) and a set of labels (y). We can fit the FuzzyPredictor on the training data and then use it to predict the label for each survey response.

In [8]:
predictor.fit_transform(X=df['X'].tolist(), y=df['y_true'].tolist())

['iphone 13 pro max',
 'iphone 13 promax',
 'iphone 13 pro',
 'iphone 13',
 'samsung galaxy s21',
 'samsung galaxy s21 ultra',
 'samsung s21',
 'samsung s21 ultra',
 'macbook pro m1',
 'macbook pro m1 max',
 'macbook m1',
 'macbook m1 pro',
 'airpods pro 2nd gen',
 'airpods pro 2',
 'airpods pro second generation',
 'airpods 2nd gen',
 'airpods 3rd gen',
 'airpods 3',
 'galaxy buds pro',
 'samsung buds pro',
 'galaxy buds 2',
 'samsung buds 2',
 'tide original detergent',
 'tide detergent original',
 'tide pods original',
 'tide pods spring meadow',
 'tide spring meadow detergent',
 'bounty paper towels',
 'bounty select a size',
 'bounty select size',
 'charmin ultra soft',
 'charmin toilet paper ultra soft',
 'charmin ultra strong',
 'clorox original',
 'clorox bleach',
 'clorox disinfecting wipes',
 'lysol disinfectant spray',
 'lysol spray',
 'lysol wipes',
 'dawn dish soap original',
 'dawn original',
 'dawn ultra dish soap',
 'cheerios original',
 'plain cheerios',
 'honey nut ch

In [9]:
# Notice that 'iphone 12' is not in the labeled data, so the predictor
# cannot return an exact label; it falls back to the closest one ('iPhone 13').
predictor.predict(X=['iphone 13 pro max', 'iphone 13 pro', 'iphone 12'])

[np.str_('iPhone 13 Pro Max'), np.str_('iPhone 13 Pro'), np.str_('iPhone 13')]
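One way to picture this behaviour is a nearest-neighbour lookup: each query receives the label of the closest training string. The sketch below is an illustration under that assumption, not the library's code; `predict_labels` is a hypothetical helper and the training pairs are the first five rows of the dataset shown earlier.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (char_a != char_b),
            ))
        previous = current
    return previous[-1]


def predict_labels(queries, train_x, train_y):
    """Hypothetical 1-nearest-neighbour rule: return the label of the closest string."""
    predictions = []
    for query in queries:
        # Index of the training string with the smallest distance to the query.
        best = min(range(len(train_x)), key=lambda i: levenshtein(query, train_x[i]))
        predictions.append(train_y[best])
    return predictions


train_x = ["iphone 13 pro max", "iphone 13 promax", "iphone 13 pro",
           "iphone 13", "samsung galaxy s21"]
train_y = ["iPhone 13 Pro Max", "iPhone 13 Pro Max", "iPhone 13 Pro",
           "iPhone 13", "Samsung Galaxy S21"]

predicted = predict_labels(["iphone 13 pro max", "iphone 13 pro", "iphone 12"],
                           train_x, train_y)
print(predicted)  # ['iPhone 13 Pro Max', 'iPhone 13 Pro', 'iPhone 13']
```

The unseen "iphone 12" is one substitution away from "iphone 13", so it inherits that label. A distance threshold would be needed to reject such out-of-vocabulary queries instead.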

## 4. Basic Clustering Example

Let's assume you're trying to cluster a batch of survey responses. You could use the predictor to cluster the survey responses.

This resembles an unsupervised learning problem where we have a set of features (X) and we want to cluster the data into different groups.

In [10]:
predictor.fit_transform(X=df['X'].tolist())

['iphone 13 pro max',
 'iphone 13 promax',
 'iphone 13 pro',
 'iphone 13',
 'samsung galaxy s21',
 'samsung galaxy s21 ultra',
 'samsung s21',
 'samsung s21 ultra',
 'macbook pro m1',
 'macbook pro m1 max',
 'macbook m1',
 'macbook m1 pro',
 'airpods pro 2nd gen',
 'airpods pro 2',
 'airpods pro second generation',
 'airpods 2nd gen',
 'airpods 3rd gen',
 'airpods 3',
 'galaxy buds pro',
 'samsung buds pro',
 'galaxy buds 2',
 'samsung buds 2',
 'tide original detergent',
 'tide detergent original',
 'tide pods original',
 'tide pods spring meadow',
 'tide spring meadow detergent',
 'bounty paper towels',
 'bounty select a size',
 'bounty select size',
 'charmin ultra soft',
 'charmin toilet paper ultra soft',
 'charmin ultra strong',
 'clorox original',
 'clorox bleach',
 'clorox disinfecting wipes',
 'lysol disinfectant spray',
 'lysol spray',
 'lysol wipes',
 'dawn dish soap original',
 'dawn original',
 'dawn ultra dish soap',
 'cheerios original',
 'plain cheerios',
 'honey nut ch

Let's take the first survey response and see which cluster it belongs to by looking at the closest-scoring values, which will always include the response itself. A score cutoff could be applied to simulate cluster assignment.

In [11]:
predictor.predict_proba(X=df['X'].tolist()[0])

[[('iphone 13 pro max', np.uint32(0)), ('iphone 13 promax', np.uint32(1))]]
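The score cutoff mentioned above could look roughly like the following sketch. The `neighbours_within` helper and the cutoff of 2 edits are illustrative assumptions, not part of FuzzyPredictor.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (char_a != char_b),
            ))
        previous = current
    return previous[-1]


def neighbours_within(item, items, max_distance):
    """Hypothetical cutoff rule: keep items at most `max_distance` edits from `item`."""
    return [other for other in items if levenshtein(item, other) <= max_distance]


responses = ["iphone 13 pro max", "iphone 13 promax", "iphone 13 pro",
             "samsung s21", "samsung s21 ultra"]

# Assumed cutoff: two edits or fewer count as the same cluster.
cluster = neighbours_within(responses[0], responses, max_distance=2)
print(cluster)  # ['iphone 13 pro max', 'iphone 13 promax']
```

A stricter or looser cutoff changes cluster membership: with `max_distance=4`, "iphone 13 pro" would join the same group.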

## 5. Validate Performance

Let's validate the performance of the predictor as a classifier (supervised learning). We can create several versions of the predictor with different parameters and compare their performance.

### Manual Grid Search

Since the predictor is not a true learner, it doesn't make sense to use a grid search with cross validation. Instead, we can manually create a list of predictors with different parameters and validate their performance.

In [12]:
all_predictors = []

for algorithm in ALGORITHMS.keys():
    for method in METHODS:
        all_predictors.append(FuzzyPredictor(
            algorithm=algorithm, method=method
        ))

print(len(all_predictors))

24
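Each predictor pairs an algorithm with one of four score flavours that appear in this notebook's results: `distance`, `similarity`, `normalized_distance`, and `normalized_similarity`. The sketch below shows one common way these quantities relate. It assumes the maximum possible distance is known (for unit-cost Levenshtein, the length of the longer string); the library may define its bounds differently, and `method_scores` is a hypothetical helper.

```python
def method_scores(distance: int, maximum: int) -> dict:
    """Derive the four score flavours from a raw distance and its assumed upper bound."""
    normalized_distance = distance / maximum if maximum else 0.0
    return {
        "distance": distance,                       # lower is better
        "similarity": maximum - distance,           # higher is better
        "normalized_distance": normalized_distance,           # in [0, 1]
        "normalized_similarity": 1.0 - normalized_distance,   # in [0, 1]
    }


# 'iphone 13 pro' vs 'iphone 13 pro max': Levenshtein distance 4,
# and the longer string has 17 characters (the assumed upper bound).
scores = method_scores(distance=4, maximum=17)
print(scores["distance"], scores["similarity"])  # 4 13
```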


In [13]:
# Create empty list to store results.
results = []

for predictor in all_predictors:

    # Fit the predictor on the unique canonical product names.
    predictor.fit_transform(df['y_true'].unique().tolist())

    # Predict a canonical name for each raw product description.
    predictions = predictor.predict(X=df['X'].tolist())
    
    # Store results in a dictionary.
    result = {
        'Algorithm': predictor.algorithm,
        'Method': predictor.method,
        'F1 Score': f1_score(df['y_true'], predictions, average='weighted'),
        'Precision': precision_score(
            df['y_true'], predictions, average='weighted', zero_division=0
        ),
        'Recall': recall_score(df['y_true'], predictions, average='weighted'),
        'Accuracy': accuracy_score(df['y_true'], predictions)
    }
    
    # Append to results list.
    results.append(result)

# Convert to DataFrame.
results_df = pd.DataFrame(results)

# Round numeric columns to 4 decimal places.
numeric_columns = ['F1 Score', 'Precision', 'Recall', 'Accuracy']
results_df[numeric_columns] = results_df[numeric_columns].round(4)

# Sort the results by F1 Score in descending order.
results_df.sort_values(by='F1 Score', ascending=False, inplace=True)

# Display the results.
results_df

|  | Algorithm | Method | F1 Score | Precision | Recall | Accuracy |
|---|---|---|---|---|---|---|
| 12 | jaro | distance | 0.8965 | 0.9316 | 0.903 | 0.903 |
| 19 | jaro_winkler | normalized_similarity | 0.8965 | 0.9316 | 0.903 | 0.903 |
| 18 | jaro_winkler | similarity | 0.8965 | 0.9316 | 0.903 | 0.903 |
| 17 | jaro_winkler | normalized_distance | 0.8965 | 0.9316 | 0.903 | 0.903 |
| 16 | jaro_winkler | distance | 0.8965 | 0.9316 | 0.903 | 0.903 |
| 15 | jaro | normalized_similarity | 0.8965 | 0.9316 | 0.903 | 0.903 |
| 14 | jaro | similarity | 0.8965 | 0.9316 | 0.903 | 0.903 |
| 13 | jaro | normalized_distance | 0.8965 | 0.9316 | 0.903 | 0.903 |
| 9 | indel | normalized_distance | 0.8619 | 0.8769 | 0.8806 | 0.8806 |
| 11 | indel | normalized_similarity | 0.852 | 0.8657 | 0.8731 | 0.8731 |
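The metrics above use `average='weighted'`, meaning each class's score is weighted by how often that class occurs in the true labels. The pure-Python sketch below re-derives these quantities on a tiny made-up example (the labels are illustrative, not taken from the dataset) and shows why the Recall and Accuracy columns coincide: support-weighted recall reduces algebraically to the fraction of correct predictions.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1, plus plain accuracy."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        # A label that is never predicted gets precision 0 (like zero_division=0).
        label_precision = true_pos / predicted if predicted else 0.0
        label_recall = true_pos / count
        denom = label_precision + label_recall
        label_f1 = 2 * label_precision * label_recall / denom if denom else 0.0
        weight = count / total  # share of true samples carrying this label
        precision += weight * label_precision
        recall += weight * label_recall
        f1 += weight * label_f1
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Illustrative labels only.
y_true = ["iPhone 13", "iPhone 13", "iPhone 13 Pro", "Samsung Galaxy S21"]
y_pred = ["iPhone 13", "iPhone 13 Pro", "iPhone 13 Pro", "Samsung Galaxy S21"]

scores = weighted_scores(y_true, y_pred)
print(scores["recall"], scores["accuracy"])  # 0.75 0.75
```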


### Model Validation with Grid Search

That said, the predictor follows the sklearn estimator API, so it can be plugged into an existing grid search pipeline without causing errors.

One workaround is to pass a single split as the CV value, with identical indexes for the train and test sets. Every parameter combination is then fitted and scored on the same full dataset, so no actual cross-validation takes place.

In [14]:
grid_search = GridSearchCV(
    FuzzyPredictor(),
    {
        'algorithm': list(ALGORITHMS.keys()),
        'method': METHODS
    },
    cv=[(np.arange(len(df)), np.arange(len(df)))],
    scoring='f1_weighted'
)

In [15]:
# Fit the grid search.
grid_search.fit(df['X'].tolist(), df['y_true'].tolist())

# Print results.
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Get detailed results in a DataFrame.
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values('rank_test_score')
display(results[['params', 'mean_test_score', 'std_test_score']])

Best parameters: {'algorithm': 'jaro', 'method': 'distance'}
Best score: 0.8865671641791045


|  | params | mean_test_score | std_test_score |
|---|---|---|---|
| 15 | {'algorithm': 'jaro', 'method': 'normalized_si... | 0.886567 | 0.0 |
| 12 | {'algorithm': 'jaro', 'method': 'distance'} | 0.886567 | 0.0 |
| 13 | {'algorithm': 'jaro', 'method': 'normalized_di... | 0.886567 | 0.0 |
| 14 | {'algorithm': 'jaro', 'method': 'similarity'} | 0.886567 | 0.0 |
| 19 | {'algorithm': 'jaro_winkler', 'method': 'norma... | 0.886567 | 0.0 |
| 18 | {'algorithm': 'jaro_winkler', 'method': 'simil... | 0.886567 | 0.0 |
| 17 | {'algorithm': 'jaro_winkler', 'method': 'norma... | 0.886567 | 0.0 |
| 16 | {'algorithm': 'jaro_winkler', 'method': 'dista... | 0.886567 | 0.0 |
| 10 | {'algorithm': 'indel', 'method': 'similarity'} | 0.879104 | 0.0 |
| 9 | {'algorithm': 'indel', 'method': 'normalized_d... | 0.86194 | 0.0 |
