# Bag-of-Words Text Matching Example

This notebook demonstrates how to use the BOWPredictor for bag-of-words text matching.

In [1]:
########################################################################
## FOR NOTEBOOKS ONLY: ADD THE PROJECT ROOT TO THE PYTHON PATH
########################################################################
import os
import sys

sys.path.insert(
    0, os.path.abspath(os.path.join(os.getcwd(), '..'))
)

## Imports

In [2]:
import numpy as np
import pandas as pd

from sklearn.metrics import (
    f1_score, precision_score, recall_score, accuracy_score
)

from sklearn.model_selection import GridSearchCV

from text_prediction.predictors.vectorized.bow import BOWPredictor

## 1. Load and Prepare Data

For this example, we'll use a sample dataset of product names and their variations.


In [3]:
df = pd.read_csv("data/product_descriptions.csv")
df.head()

|   | X | y_true |
|---|---|---|
| 0 | iphone 13 pro max | iPhone 13 Pro Max |
| 1 | iphone 13 promax | iPhone 13 Pro Max |
| 2 | iphone 13 pro | iPhone 13 Pro |
| 3 | iphone 13 | iPhone 13 |
| 4 | samsung galaxy s21 | Samsung Galaxy S21 |


## 2. Basic Text Search

In this example, we have a large set of text data and we simply want to search for instances of a certain query. For example, in a list of reported survey responses, we want to look for instances of "iPhone 13 Pro Max".

Let's start by creating a basic BOWPredictor with a char_wb analyzer that splits the text into 3-grams. This is a character n-gram approach, where n is the number of characters in each gram (here, 3).

In [4]:
predictor = BOWPredictor(analyzer="char_wb", ngram_range=(3, 3))
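To make the 3-gram split concrete, here is a minimal sketch using scikit-learn's `CountVectorizer` directly. It assumes that BOWPredictor forwards `analyzer` and `ngram_range` to a vectorizer that behaves the same way; the integer sparse-matrix output in this notebook is consistent with that, but it is an assumption rather than something confirmed here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: BOWPredictor's analyzer/ngram_range behave like CountVectorizer's.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))

# build_analyzer() returns the function that lowercases the text and
# splits it into character n-grams.
analyzer = vectorizer.build_analyzer()

grams = analyzer("iPhone 13")
print(grams)
# [' ip', 'iph', 'pho', 'hon', 'one', 'ne ', ' 13', '13 ']
```

With `char_wb`, each word is padded with a space on both sides and n-grams never span two words, which is why `' 13'` and `'13 '` appear but nothing like `'e 1'` does.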

To simulate this type of search, we will fit the predictor only on X, which stands in for the survey responses.

In [5]:
predictor.fit_transform(X=df['X'].tolist())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 2133 stored elements and shape (134, 729)>

Now, we will search our survey responses for the product "iPhone 13 Pro Max". We can see that there are about 4 likely instances, but only two of them contain the product name. In a production setting, you would clean up the text data more thoroughly before searching.

In [6]:
predictor.predict_proba(X="iPhone 13 Pro Max")

[[('iphone 13 pro max', np.float64(0.9999999999999999)),
  ('iphone 13 pro', np.float64(0.8864052604279182)),
  ('iphone 13 promax', np.float64(0.8571428571428571)),
  ('iphone 13', np.float64(0.7559289460184545)),
  ('macbook pro m1 max', np.float64(0.45374260648651504)),
  ('macbook pro m1', np.float64(0.3086066999241839)),
  ('macbook m1 pro', np.float64(0.3086066999241839)),
  ('airpods pro 2', np.float64(0.24174688920761409)),
  ('galaxy buds pro', np.float64(0.22237479499833038)),
  ('samsung buds pro', np.float64(0.2142857142857143)),
  ('airpods pro 2nd gen', np.float64(0.2004459314343183)),
  ('airpods pro second generation', np.float64(0.15724272550828777)),
  ('honey nut cheerios', np.float64(0.1336306209562122)),
  ('progresso chicken noodle', np.float64(0.11396057645963795)),
  ('coke zero', np.float64(0.0944911182523068)),
  ('macbook m1', np.float64(0.0890870806374748)),
  ('pepsi zero sugar', np.float64(0.07142857142857144))]]

This returns a number of possible matches, with a score for each match. The score is the cosine similarity between the query and each fitted text. We can restrict the output to the top 4 matches by setting the limit.

In [7]:
predictor.limit = 4
predictor.predict_proba(X="iPhone 13 Pro Max")

[[('iphone 13 pro max', np.float64(0.9999999999999999)),
  ('iphone 13 pro', np.float64(0.8864052604279182)),
  ('iphone 13 promax', np.float64(0.8571428571428571)),
  ('iphone 13', np.float64(0.7559289460184545))]]
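The cosine scores reported above can be reproduced outside the predictor. The sketch below is not BOWPredictor's implementation; it assumes the scores come from raw character 3-gram counts, and uses scikit-learn's `CountVectorizer` and `cosine_similarity` on a four-row stand-in corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Small stand-in corpus taken from the top matches shown above.
corpus = ["iphone 13 pro max", "iphone 13 pro", "iphone 13 promax", "iphone 13"]

# Count character 3-grams within word boundaries.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
corpus_counts = vectorizer.fit_transform(corpus)
query_counts = vectorizer.transform(["iPhone 13 Pro Max"])

# Cosine similarity = dot(q, d) / (||q|| * ||d||) for each corpus row d.
scores = cosine_similarity(query_counts, corpus_counts)[0]

for text, score in zip(corpus, scores):
    print(f"{text!r}: {score:.4f}")
# 'iphone 13 pro max': 1.0000
# 'iphone 13 pro': 0.8864
# 'iphone 13 promax': 0.8571
# 'iphone 13': 0.7559
```

As a check by hand: the query has 14 distinct 3-grams and "iphone 13 pro" has 11, all shared, so the score is 11 / sqrt(14 * 11) ≈ 0.8864, which matches the value returned by `predict_proba`.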

## 3. Basic Classification Example

Let's assume you're trying to label a batch of survey responses instead of searching for a query. For example, say we want to standardize the responses using the training data provided. You could use the BOWPredictor to predict the label for each survey response.

This resembles a supervised learning problem where we have a set of features (X) and a set of labels (y). We can fit the BOWPredictor on the training data and then use it to predict the label for each survey response.

In [8]:
predictor.fit_transform(X=df['X'].tolist(), y=df['y_true'].tolist())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 2159 stored elements and shape (101, 785)>

In [9]:
# Notice that 'iphone 12' is not in the labeled data, so it cannot be
# predicted correctly; it is mapped to the closest known label instead.
predictor.predict(X=['iphone 13 pro max', 'iphone 13 pro', 'iphone 12'])

[np.str_('iPhone 13 Pro Max'), np.str_('iPhone 13 Pro'), np.str_('iPhone 13')]
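One plausible reading of the output above is that each query receives the label of its most similar training text. The following sketch illustrates that idea with scikit-learn; it is an assumption about the behavior, not BOWPredictor's actual code, and the five training rows are copied from the `df.head()` output.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Training texts and labels (first five rows of the example data).
train_texts = [
    "iphone 13 pro max", "iphone 13 promax", "iphone 13 pro",
    "iphone 13", "samsung galaxy s21",
]
train_labels = [
    "iPhone 13 Pro Max", "iPhone 13 Pro Max", "iPhone 13 Pro",
    "iPhone 13", "Samsung Galaxy S21",
]

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
train_counts = vectorizer.fit_transform(train_texts)

queries = ["iphone 13 pro max", "iphone 13 pro", "iphone 12"]
similarity = cosine_similarity(vectorizer.transform(queries), train_counts)

# Assumed rule: take the label of the highest-scoring training text.
nearest = similarity.argmax(axis=1)
predicted = [train_labels[i] for i in nearest]
print(predicted)
# ['iPhone 13 Pro Max', 'iPhone 13 Pro', 'iPhone 13']
```

Under this rule, "iphone 12" lands on "iPhone 13" because only its "iphone" 3-grams exist in the fitted vocabulary, and "iphone 13" is the shortest training text that shares them.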

## 4. Basic Clustering Example

Let's assume you're trying to cluster a batch of survey responses. You could use the BOWPredictor to cluster the survey responses.

This resembles an unsupervised learning problem where we have a set of features (X) and we want to cluster the data into different groups. We can fit the BOWPredictor on the training data and then use it to cluster the survey responses.

In [10]:
predictor.fit_transform(X=df['X'].tolist())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 2133 stored elements and shape (134, 729)>

Let's take the first survey response and see which cluster it belongs to by checking which fitted texts score highest against it; the response itself will always be among them. A score cutoff could be applied to simulate cluster assignment.

In [11]:
predictor.predict_proba(X=df['X'].tolist()[0])

[[('iphone 13 pro max', np.float64(0.9999999999999999)),
  ('iphone 13 pro', np.float64(0.8864052604279182)),
  ('iphone 13 promax', np.float64(0.8571428571428571)),
  ('iphone 13', np.float64(0.7559289460184545))]]
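A score cutoff like the one mentioned above could be sketched as follows. The cutoff value of 0.8 is hypothetical and chosen only for illustration, and the similarity matrix is computed with scikit-learn rather than with BOWPredictor.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

responses = [
    "iphone 13 pro max", "iphone 13 promax", "iphone 13 pro",
    "iphone 13", "samsung galaxy s21",
]

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
counts = vectorizer.fit_transform(responses)

# Pairwise cosine similarity between all responses.
similarity = cosine_similarity(counts)

# Hypothetical cutoff: anything scoring at least this high against the
# first response is treated as a member of its cluster.
cutoff = 0.8
members = [
    text for text, score in zip(responses, similarity[0]) if score >= cutoff
]
print(members)
# ['iphone 13 pro max', 'iphone 13 promax', 'iphone 13 pro']
```

Here "iphone 13" (score of about 0.756) falls just below the cutoff, which shows how sensitive the resulting groups are to the chosen threshold.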

## 5. Validate Performance

Let's validate the performance of the BOWPredictor as a classifier (supervised learning). We can create several versions of the BOWPredictor with different parameters and compare their performance.

### Manual Grid Search

Since the BOW predictor is not a true learner, it doesn't make sense to use a grid search with cross-validation. Instead, we can manually create a list of BOWPredictor instances with different parameters and validate their performance.

In [12]:
# Define the parameter combinations to compare.
param_grid = [
    {"analyzer": "word", "ngram_range": (1, 1)},
    {"analyzer": "word", "ngram_range": (1, 2)},
    {"analyzer": "char_wb", "ngram_range": (1, 2)},
    {"analyzer": "char_wb", "ngram_range": (1, 3)},
    {"analyzer": "char_wb", "ngram_range": (2, 3)},
    {"analyzer": "char_wb", "ngram_range": (2, 4)},
]

# Create one BOWPredictor per parameter combination.
all_predictors = [BOWPredictor(**params) for params in param_grid]
print(len(all_predictors))

6


In [13]:
# Create empty list to store results.
results = []

for predictor in all_predictors:

    # Fit the predictor on the product descriptions.
    predictor.fit_transform(X=df['X'].tolist(), y=df['y_true'].tolist())
    
    # Predict the product descriptions.
    predictions = predictor.predict(X=df['X'].tolist())
    
    # Store results in a dictionary.
    result = {
        'Analyzer': predictor.analyzer,
        'ngrams': predictor.ngram_range,
        'F1 Score': f1_score(df['y_true'], predictions, average='weighted'),
        'Precision': precision_score(df['y_true'], predictions, average='weighted', zero_division=0),
        'Recall': recall_score(df['y_true'], predictions, average='weighted'),
        'Accuracy': accuracy_score(df['y_true'], predictions)
    }
    
    # Append to results list.
    results.append(result)

# Convert to DataFrame.
results_df = pd.DataFrame(results)

# Round numeric columns to 4 decimal places.
numeric_columns = ['F1 Score', 'Precision', 'Recall', 'Accuracy']
results_df[numeric_columns] = results_df[numeric_columns].round(4)

# Sort the results by F1 Score in descending order.
results_df.sort_values(by='F1 Score', ascending=False, inplace=True)

# Display the results.
results_df

|   | Analyzer | ngrams | F1 Score | Precision | Recall | Accuracy |
|---|---|---|---|---|---|---|
| 3 | char_wb | (1, 3) | 0.9488 | 0.959 | 0.9552 | 0.9552 |
| 4 | char_wb | (2, 3) | 0.9423 | 0.9577 | 0.9478 | 0.9478 |
| 5 | char_wb | (2, 4) | 0.9323 | 0.9465 | 0.9403 | 0.9403 |
| 2 | char_wb | (1, 2) | 0.9229 | 0.9347 | 0.9328 | 0.9328 |
| 1 | word | (1, 2) | 0.8868 | 0.9005 | 0.903 | 0.903 |
| 0 | word | (1, 1) | 0.8739 | 0.8818 | 0.8955 | 0.8955 |


### Model Validation with Grid Search

However, the predictor follows the sklearn estimator API, so it can in principle be plugged into an existing grid search pipeline without causing errors.

One workaround is to define the CV as a single split in which the train and test indexes are identical and both cover the full dataset. Every parameter combination is then fit and scored on the same data, which mirrors the manual comparison above.

In [14]:
grid_search = GridSearchCV(
    BOWPredictor(),
    {
        'analyzer': ['char_wb'],
        'ngram_range': [
            (1, 2), (1, 3), (2, 3), (2, 4)
        ]
    },
    cv=[(np.arange(len(df)), np.arange(len(df)))],
    scoring='f1_weighted',
)

In [15]:
# Fit the grid search.
grid_search.fit(df['X'].tolist(), df['y_true'].tolist())

# Print results.
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Get detailed results in a DataFrame.
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values('rank_test_score')
display(results[['params', 'mean_test_score', 'std_test_score']])

Best parameters: {'analyzer': 'char_wb', 'ngram_range': (1, 3)}
Best score: 0.9487562189054726


|   | params | mean_test_score | std_test_score |
|---|---|---|---|
| 1 | {'analyzer': 'char_wb', 'ngram_range': (1, 3)} | 0.948756 | 0.0 |
| 2 | {'analyzer': 'char_wb', 'ngram_range': (2, 3)} | 0.942289 | 0.0 |
| 3 | {'analyzer': 'char_wb', 'ngram_range': (2, 4)} | 0.932338 | 0.0 |
| 0 | {'analyzer': 'char_wb', 'ngram_range': (1, 2)} | 0.922921 | 0.0 |
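The single-split `cv` argument used in the grid search above can be checked in isolation. The sketch below swaps in scikit-learn's `KNeighborsClassifier` and a toy numeric dataset (both stand-ins, not part of this notebook's data) to show that passing one `(train, test)` pair with identical indexes yields exactly one split, and therefore a standard deviation of zero.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in data: three well-separated classes.
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])

# One split whose train and test indexes both cover every row.
all_rows = np.arange(len(X))
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3]},
    cv=[(all_rows, all_rows)],
    scoring="accuracy",
)
search.fit(X, y)

print("Number of splits:", search.n_splits_)
print("Std of test scores:", search.cv_results_["std_test_score"])
```

Because there is only one split, `std_test_score` carries no information, and the reported score is a training-set score rather than an estimate of performance on unseen data.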
