# Product Classification with Kesler's Construction and OneVsRest 

## Overview

Retailers struggle with product catalog hygiene. If products lack correct metadata, customers will have great difficulty surfacing them on an ecommerce website, and internal associates will have trouble finding them for customers in brick-and-mortar stores. Keeping a product catalog clean is hard for several reasons. First, some catalogs hold thousands if not millions of products, so maintaining data quality is a large endeavor. Second, product metadata is typed in manually, either internally or by external third parties (e.g. a customer posting their products in an ecommerce marketplace), and these customers are often not well versed in entering the correct or best metadata. Third, data becomes stale, and it is hard to keep it fresh for every product. These issues, and many others, make data quality a large problem.


This notebook attempts to solve one major issue in product catalog hygiene: populating product categories automatically. The approach is a machine learning technique called Kesler's Construction and is inspired by this [blog post by Shopify](https://shopify.engineering/categorizing-products-at-scale). It has some advantages over traditional multi-class classification methods when the number of categories is large. An obvious one is simplicity: you train and maintain a single model, as opposed to the many models required by One-vs-All (One-vs-Rest) methods. This also leads to lower computational complexity when serving the model at scale (thousands of categories). The disadvantage is that it requires a large amount of memory, so we had to deploy a large machine to run this notebook (X GB of RAM).
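
To make the construction concrete, here is a minimal sketch (not the notebook's code). A feature vector of length d is copied into the block that belongs to a candidate category inside an otherwise zero vector of length d × K, and a single binary classifier then learns whether that (product, category) pairing is correct. The helper name `kesler_vector` and the toy sizes are assumptions for illustration.

```python
import numpy as np

def kesler_vector(x, category_index, n_categories):
    """Place x in the block for category_index; all other blocks stay zero."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    out = np.zeros(d * n_categories)
    start = category_index * d
    out[start:start + d] = x
    return out

x = [0.5, -1.0, 2.0]                                    # toy 3-dimensional "embedding"
v = kesler_vector(x, category_index=1, n_categories=4)
print(v.shape)  # (12,)
print(v)        # the three values occupy positions 3..5
```

Pairing the vector for the true category with label 1 and vectors for wrong categories with label 0 gives the binary training set.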

## Getting Started

### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).

In [None]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Define Google Cloud project information (Colab only)

In [None]:
if "google.colab" in sys.modules:
    # Define project information
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
    LOCATION = "us-central1"  # @param {type:"string"}

    # Initialize Vertex AI
    import vertexai

    vertexai.init(project=PROJECT_ID, location=LOCATION)

### Install and Import Packages

In [26]:
!pip install spacy -q


In [27]:
!pip install spacy-cleaner -q


In [28]:
!python -m spacy download en_core_web_sm -q

[0m[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [29]:
!python -m spacy validate

[2K[38;5;2m✔ Loaded compatibility table[0m
[1m
[38;5;4mℹ spaCy installation: /opt/conda/lib/python3.10/site-packages/spacy[0m

NAME             SPACY            VERSION                            
en_core_web_sm   >=3.7.2,<3.8.0   [38;5;2m3.7.1[0m   [38;5;2m✔[0m



In [10]:
### Import libraries
from sklearn.feature_extraction.text import HashingVectorizer
from vertexai.preview.language_models import TextEmbeddingModel
from vertexai.preview.language_models import TextGenerationModel
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
import random
import joblib
from tqdm.notebook import tqdm
tqdm.pandas()

### Define Global Parameters

In [31]:
CATEGORY_COL="c0_name"

## Data Preparation

### Download Mercari Data from BigQuery

The Mercari data is not public. This notebook can currently be run only in the project shown.

In [2]:
%%bigquery df_test

SELECT *, CONCAT('Name: \n ', name, ' \n ', 
                 "Description: \n ", description, ' \n ',
                 "Labels: \n ", TO_JSON_STRING(vision_api_labels)
                ) as attr FROM solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings_golden_test  
WHERE rand() < 1.0


In [3]:
%%bigquery df_train
SELECT *, CONCAT('Name: \n ', name, ' \n ', 
                 "Description: \n ", description, ' \n ',
                 "Labels: \n ", TO_JSON_STRING(vision_api_labels)
                ) as attr 
FROM solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings 
WHERE name NOT IN (
    SELECT name FROM solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings_golden_test
)
AND rand() < 1.0


In [6]:
df_train.head()

Unnamed: 0,id,name,description,brand_name,item_condition_name,c0_name,c1_name,c2_name,url,created,image_uri,vision_api_labels,attributes,scores,text_embedding,image_embedding,attr
0,m60221283928,Rae Dunn So Sweet Snowglobe,Rae Dunn 2 birds with a white birdhouse heart ...,Rae Dunn,New,Home,Home decor,Home decor accents,https://www.mercari.com/us/item/m60221283928,2023-03-05 18:14:25+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Drinkwar...","[Drinkware, Wood, Font, Table, Crown, Servewar...","[0.81246752, 0.75604391, 0.75142217, 0.7148835...","[0.0220884383, -0.0347933248, -0.00396528747, ...","[0.0160564333, -0.0159334242, -0.020018056, -0...",Name: \n Rae Dunn So Sweet Snowglobe \n Descri...
1,m23006575222,"Lazart - Elk, Bear, Buck - 36"" Metal Decorativ...","Lazart\nMade in Gainesville, Texas\n\n36"" Deco...",,New,Home,Home decor,Home decor accents,https://www.mercari.com/us/item/m23006575222,2023-02-05 14:40:57+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Rectangl...","[Rectangle, Gesture, Art, Font, Natural materi...","[0.86215866, 0.85014093, 0.83846128, 0.7386470...","[-0.0473593324, -0.0428869165, -0.0119342841, ...","[0.00635956787, 0.0201334748, -0.0440138876, -...","Name: \n Lazart - Elk, Bear, Buck - 36"" Metal ..."
2,m10973355710,Kids gloves,Winter gloves (not waterproof) never worn- wer...,Children's Place,New,Sports & outdoors,Apparel,Boys,https://www.mercari.com/us/item/m10973355710,2023-02-07 03:26:42+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Hand"",""m...","[Hand, Green, Gesture, Finger, Material proper...","[0.95952165, 0.89952242, 0.85260427, 0.8305912...","[-0.0141191799, -0.00774308294, -0.0104338424,...","[0.00855759811, 0.00632876763, -0.0186565928, ...",Name: \n Kids gloves \n Description: \n Winter...
3,m58932165593,Tigers Eye Stone Guitar Pick 3mm 905,Please note: does not come with small clear di...,Handmade,New,Handmade,Music,Other,https://www.mercari.com/us/item/m58932165593,2023-03-07 19:57:26+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Drinkwar...","[Drinkware, Artifact, Wood, Vase, Art, Tints a...","[0.91066086, 0.82712269, 0.81000954, 0.7914801...","[-0.00973090157, 0.000623123778, 0.0210940894,...","[0.0167370159, 0.0705789328, 0.00185096264, 0....",Name: \n Tigers Eye Stone Guitar Pick 3mm 905 ...
4,m55977728915,Robyn Brooks Stingray Bracelet Pink Gold Plate...,"Robyn Brooks Stingray Bracelet \nColors: Pink,...",,Good,Women,Jewelry,Bracelets,https://www.mercari.com/us/item/m55977728915,2023-04-02 03:45:24+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Body jew...","[Body jewelry, Amber, Natural material, Materi...","[0.91135097, 0.85079229, 0.84506172, 0.8010526...","[-0.0367169827, -0.00913381111, 0.0485672466, ...","[-0.0193907265, 0.0517394468, 0.0349789225, -0...",Name: \n Robyn Brooks Stingray Bracelet Pink G...


In [5]:
print("Training Data size: " + str(len(df_train)))
print("Testing Data size: " + str(len(df_test)))

Training Data size: 13266
Testing Data size: 141


In [36]:
%%bigquery category_df
select distinct(c0_name) from `solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings`


In [37]:
category_df.head()

Unnamed: 0,c0_name
0,Home
1,Books
2,Toys & Collectibles
3,Men
4,Women


In [38]:
# create dictionary mapping each category name to its label id

categories = {}
for index, row in category_df.iterrows():
    categories[row[CATEGORY_COL]] = index

ncategories = len(categories)

### NLP clean-up code

In [40]:
import spacy
import spacy_cleaner
from spacy_cleaner.processing import removers, replacers, mutators

MODEL = spacy.load("en_core_web_sm")

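# Cleaning pipeline built with spacy_cleaner (defined here; parse_nlp below uses MODEL directly)
PROC_PIPELINE = spacy_cleaner.Cleaner(
    MODEL,
    replacers.replace_punctuation_token,
    mutators.mutate_lemma_token,
    removers.remove_stopword_token,
)


def parse_nlp(description) -> str:
    """Lowercase the text and return its unique, alphabetic, non-stopword lemmas in order of first appearance."""
    doc = MODEL(description.lower())
    lemmas = []
    for token in doc:
        if token.lemma_ not in lemmas and not token.is_stop and token.is_alpha:
            lemmas.append(token.lemma_)

    return " ".join(lemmas)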

DESCRIPTION = "Solemio Sleeveless Solid Men's Reversible Sweatshirt Price: Rs. 1,261 Fitz Blue Sweat Shirt is a must-have for any wardrobe. Ensemble with trendy embroidery on the chest and full zipped closure down front. It is marked with ribbed waistband, cuffs and two pouch pockets on the front. Classic color scheme allows you to wear this over a wide range of separates. Relaxed fit for free body movement. Brand: Fitz Color: Red Style Statement: Designed to be worn as casual as well as leisure wear. Team it with a denims and sneakers to look uber cool. Material: Polyester Cotton Wash Care: Do not bleach and tumble dry. Use gentle machine wash and gentle warm Iron. About the brand: FITZ is the company’s flagship sportswear and active wear brand with an Indian soul and international outlook. It is an evolved range of sports and activity gears dedicated to sports enthusiasts. The brand is poised to become an economic force when it comes to young people who are aware of style trends, sportswear designs reflected the spirited, celebrity-conscious sensibilities of the decade. Disclaimer: Product color may slightly vary due to photographic lighting sources or your monitor settings. Size varies from brand to brand. Kindly go through the size chart for more clarity. Fitz Blue Sweat Shirt is a must-have for any wardrobe. Ensemble with trendy embroidery on the chest and full zipped closure down front. It is marked with ribbed waistband, cuffs and two pouch pockets on the front. Classic color scheme allows you to wear this over a wide range of separates. Relaxed fit for free body movement. Brand: Fitz Color: Red Style Statement: Designed to be worn as casual as well as leisure wear. Team it with a denims and sneakers to look uber cool. Material: Polyester Cotton Wash Care: Do not bleach and tumble dry. Use gentle machine wash and gentle warm Iron. About the brand: FITZ is the company’s flagship sportswear and active wear brand with an Indian soul and international outlook. It is an evolved range of sports and activity gears dedicated to sports enthusiasts. The brand is poised to become an economic force when it comes to young people who are aware of style trends, sportswear designs reflected the spirited, celebrity-conscious sensibilities of the decade. Disclaimer: Product color may slightly vary due to photographic lighting sources or your monitor settings. Size varies from brand to brand. Kindly go through the size chart for more clarity."
print(len(DESCRIPTION))

print(parse_nlp(DESCRIPTION))
print(len(parse_nlp(DESCRIPTION)))

2459
solemio sleeveless solid man reversible sweatshirt price rs fitz blue sweat shirt wardrobe ensemble trendy embroidery chest zipped closure mark ribbed waistband cuff pouch pocket classic color scheme allow wear wide range separate relaxed fit free body movement brand red style statement design casual leisure team denim sneaker look uber cool material polyester cotton wash care bleach tumble dry use gentle machine warm iron company flagship sportswear active indian soul international outlook evolved sport activity gear dedicate enthusiast poise economic force come young people aware trend reflect spirited celebrity conscious sensibility decade disclaimer product slightly vary photographic lighting source monitor setting size kindly chart clarity
754


### Build Data Format for Kesler's Construction

In [None]:
model = TextEmbeddingModel.from_pretrained("textembedding-gecko")

nfeatures=768

# Convert data to Kesler's Construction:
# the embedding of `description` goes into the block for category_i,
# and every other block is zero.
def convert_to_vector(description, category_i):
    x_vector = model.get_embeddings([description])[0].values
    y_position = category_i
    xy_vector = [0]*(y_position)*nfeatures + x_vector + [0]*(ncategories-(y_position+1))*nfeatures
    if len(xy_vector) != nfeatures*ncategories:
        print("Error processing one vector: ")
        print(len(xy_vector))
        print(len(x_vector))
        return None
    return xy_vector

X_train = []
y_train = []

count = 0

for index, row in df_train.iterrows():

    count += 1
    if count % 1000 == 0:
        print("iter " + str(count))

    # add positive example (label 1): cleaned description in the true category's block
    category = categories[row[CATEGORY_COL]]
    new_row = convert_to_vector(parse_nlp(row["description"]), category)
    if new_row is not None:
        X_train.append(new_row)
        y_train.append(1)

    # add two negative examples (label 0): the description in a randomly chosen wrong block.
    # Note that these embed the raw description, while the positive example embeds the parse_nlp output.
    for _ in range(2):
        random_number = category
        while random_number == category:
            random_number = random.randint(0, ncategories-1)
        new_row = convert_to_vector(row["description"], random_number)
        if new_row is not None:
            X_train.append(new_row)
            y_train.append(0)
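
The rejection loop above can draw the same wrong category twice for one product. If distinct negatives are wanted, one option is to sample without replacement from the categories other than the true one. This is a minimal sketch under that assumption; `sample_negatives` is a hypothetical helper and not part of the notebook.

```python
import random

def sample_negatives(positive, n_categories, k=2, rng=random):
    """Return k distinct category ids, none equal to `positive`."""
    candidates = [c for c in range(n_categories) if c != positive]
    if k > len(candidates):
        raise ValueError("not enough categories to draw k distinct negatives")
    return rng.sample(candidates, k)

rng = random.Random(0)  # seeded only to make the example repeatable
negatives = sample_negatives(positive=3, n_categories=10, k=2, rng=rng)
print(negatives)
```

Whether duplicate negatives matter in practice depends on the number of categories; with only a handful of classes they are fairly likely.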


## Model Development

### Train Model

In [223]:
#clf = MultinomialNB()
clf = LogisticRegression(max_iter=100000)
#clf = SGDClassifier(loss="log_loss", penalty="l2", alpha=0.0001, max_iter=3000, tol=None, shuffle=True, verbose=0, learning_rate='adaptive', eta0=0.01, early_stopping=False)

In [224]:
# If training a new model from scratch:
clf.fit(X_train, y_train)

# Else, load from previously saved file
#model = joblib.load('categorization_kesler.joblib')

In [225]:
# Save the model
joblib.dump(clf, 'categorization_kesler.joblib')

['categorization_kesler.joblib']

### Model Prediction and Evaluation

For each test sample we are going to find the category that provides the highest probability from the model. After running through each sample, we are going to run an evaluation of the accuracy.

In [226]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

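def predict(line):
    # Score the sample against every category and keep the most probable one.
    # Note: convert_to_vector embeds its first argument as text, and here it receives
    # the stored text_embedding column rather than the parse_nlp output used in training.
    picked_cat = -1
    max_prob = 0
    for key in categories:
        pred = clf.predict_proba([convert_to_vector(line["text_embedding"], categories[key])])[0][1]
        if pred > max_prob:
            max_prob = pred
            picked_cat = key
    return picked_cat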

#model accuracy
df_test["predicted_category"] = df_test.progress_apply(predict, axis=1)

y_true = df_test[CATEGORY_COL].tolist()
y_pred = df_test["predicted_category"].tolist()

print(f"Accuracy: {accuracy_score(y_true, y_pred)}")
print(f"f1: {f1_score(y_true, y_pred,average='weighted')}")


Accuracy: 0.15602836879432624
f1: 0.10885054632275974
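
The per-category loop in `predict` calls `predict_proba` once for each category. The same scoring can be sketched as one batched call by stacking all K candidate vectors into a single matrix. The code below is an illustration under assumptions: `StubClassifier` is a hypothetical stand-in so the block runs without a trained model, and `category_names` is assumed to be a list ordered by label id.

```python
import numpy as np

def kesler_matrix(x, n_categories):
    """Row c holds x in block c and zeros elsewhere: shape (K, d * K)."""
    x = np.asarray(x, dtype=float)
    # The Kronecker product of the K x K identity with x produces exactly this layout.
    return np.kron(np.eye(n_categories), x)

def predict_category(clf, x, category_names):
    """Score every candidate category in one call and return the best name."""
    candidates = kesler_matrix(x, len(category_names))
    positive_prob = clf.predict_proba(candidates)[:, 1]
    return category_names[int(np.argmax(positive_prob))]

class StubClassifier:
    """Hypothetical stand-in exposing a scikit-learn-style predict_proba."""
    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def predict_proba(self, X):
        p = 1.0 / (1.0 + np.exp(-(X @ self.weights)))
        return np.column_stack([1.0 - p, p])

names = ["Home", "Books", "Women"]
# Weights for d=2, K=3: the second block ("Books") responds most strongly.
stub = StubClassifier([0.0, 0.0, 1.0, 1.0, -1.0, -1.0])
print(predict_category(stub, [0.5, 0.5], names))  # Books
```

Any fitted classifier with a `predict_proba` method could take the place of the stub.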


In [227]:
import pandas as pd
pd.crosstab(df_test[CATEGORY_COL], df_test['predicted_category'])

predicted_category,Toys & Collectibles,Women
c0_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Beauty,2,2
Books,1,0
Electronics,11,0
Handmade,0,1
Home,8,4
Kids,8,2
Men,11,3
Office,1,0
Other,1,0
Toys & Collectibles,14,5


### Quick Discussion on Kesler's Algorithm

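Ultimately, this algorithm is not learning the category representations well enough: the table above shows predictions in only two categories, Toys & Collectibles and Women. I suspect the dimensionality is a cause. Kesler's Construction multiplies the 768 embedding dimensions by the number of categories, which likely increases the amount of data needed for training. This dataset of 13K products is not unusual for a retailer but might be too small when dimensions are so large. Two details of the code may also contribute: the positive training examples embed the cleaned description while the negative examples embed the raw one, and prediction passes the stored `text_embedding` column to a function that embeds its input as text.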
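
As a rough back-of-the-envelope check on the dimensionality concern, the sketch below counts the features and the dense storage a Kesler training matrix would need. The product count, the three examples per product, and the 768-dimensional embedding come from the notebook; the category count of 10 is a placeholder assumption, so substitute `ncategories` from your own run.

```python
def kesler_footprint(n_products, examples_per_product, embedding_dim,
                     n_categories, bytes_per_value=8):
    """Return (rows, features, dense size in GiB) for a Kesler training matrix."""
    rows = n_products * examples_per_product
    features = embedding_dim * n_categories
    gib = rows * features * bytes_per_value / 2**30
    return rows, features, gib

rows, features, gib = kesler_footprint(
    n_products=13266,        # training set size printed earlier
    examples_per_product=3,  # one positive plus two negatives
    embedding_dim=768,
    n_categories=10,         # assumption for illustration only
)
print(rows, features, round(gib, 2))
```

Only one block per row is nonzero, so a sparse representation would need to store 768 values per row rather than 768 × K.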

## Test Against One-Vs-Rest Algorithm

In [43]:
model = TextEmbeddingModel.from_pretrained("textembedding-gecko")

# Redefine convert_to_vector for One-vs-Rest: return the plain embedding,
# with no per-category block expansion.
def convert_to_vector(description):
    x_vector = model.get_embeddings([description])[0].values
    return x_vector


X_train = []
y_train_ = []


for index, row in df_train.iterrows():

    new_row = convert_to_vector(row["attr"])
    X_train.append(new_row)
    y_train_.append(row["c1_name"])

In [44]:
len(X_train)

13266

In [45]:
from sklearn.multiclass import OneVsRestClassifier

base_lr = LogisticRegression()
clf2 = OneVsRestClassifier(base_lr)

clf2.fit(X_train, y_train_)

In [46]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

def predict(line):
    # clf2.predict returns an array with one element per input row; unwrap the single label
    pred = clf2.predict([convert_to_vector(line["attr"])])
    return pred[0]

#model accuracy
df_test["predicted_category"] = df_test.apply(predict, axis=1)

y_true = df_test["c1_name"].tolist()
y_pred = df_test["predicted_category"].tolist()

print(type(y_true))

print(f"Accuracy: {accuracy_score(y_true, y_pred)}")
print(f"f1: {f1_score(y_true, y_pred,average='weighted')}")

<class 'list'>
Accuracy: 0.7163120567375887
f1: 0.670334236291683


### Discussion

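The One-Vs-Rest algorithm performs a lot better than the Kesler's Construction approach. This might be due to the lower dimensionality requirements. Note that the two runs are not directly comparable: the One-Vs-Rest model embeds the `attr` text and predicts `c1_name`, whereas the Kesler model was trained on the description and predicts `c0_name`.

The 0.72 accuracy (0.67 weighted F1) is extremely promising for the c1_name level, which has 130 possible categories.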
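
Accuracy and weighted F1 measure different things, which is why the two reported numbers differ. The sketch below is a minimal pure-Python illustration with made-up labels (not taken from the test set): a model that always predicts the majority class keeps a fair accuracy, but its weighted F1 drops because the classes it never predicts score zero.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        # F1 = 2*TP / (true count + predicted count); zero when the class is never hit.
        f1 = 2 * tp / (n_true + n_pred) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)

# Made-up labels for illustration only
y_true = ["Apparel", "Apparel", "Apparel", "Apparel", "Jewelry", "Music"]
y_pred = ["Apparel"] * 6

print(round(accuracy(y_true, y_pred), 3))     # 0.667
print(round(weighted_f1(y_true, y_pred), 3))  # 0.533
```

With many small categories, as at the c1_name level, it is worth reporting both numbers.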