# Product Classification with Embedding Approaches

## Overview

This notebook continues the work in the categorization-kesler-onevsrest notebook. The one-vs-rest algorithm performed well there, so it is used here to compare three embedding inputs:
* text embedding
* image embedding
* text + image embedding

## Getting Started

### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using Vertex AI Workbench.

In [16]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Define Google Cloud project information (Colab only)

In [17]:
if "google.colab" in sys.modules:
    # Define project information
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
    LOCATION = "us-central1"  # @param {type:"string"}

    # Initialize Vertex AI
    import vertexai

    vertexai.init(project=PROJECT_ID, location=LOCATION)

### Install and Import Packages

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
import joblib
import numpy as np

## Data Preparation

In [19]:
%%bigquery df_test
SELECT *, CONCAT('Name: \n ', name, ' \n ', 
                 "Description: \n ", description, ' \n ',
                 "Labels: \n ", TO_JSON_STRING(vision_api_labels)
                ) as attr 
FROM solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings 
WHERE id IN (
    SELECT id FROM solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings_golden_test
    WHERE manual_validation = 1
)
AND rand() < 1.0

In [20]:
len(df_test)

116

In [21]:
df_test.head(3)

Unnamed: 0,id,name,description,brand_name,item_condition_name,c0_name,c1_name,c2_name,url,created,image_uri,vision_api_labels,attributes,scores,text_embedding,image_embedding,attr
0,m14193490298,Wooden Magnet/ Dot Art Acrylic Paint,Handfree painted mandalas in a thin wooden dis...,Handmade,New,Home,Artwork,Paintings,https://www.mercari.com/us/item/m14193490298,2023-02-06 04:48:31+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Book"",""m...","[Book, Art, Publication, Creative arts, Materi...","[0.87862146, 0.83368653, 0.81959659, 0.8156618...","[0.00826694351, 0.00405315682, 0.0570168681, -...","[0.0284959301, 0.0240893103, 0.017579874, -0.0...",Name: \n Wooden Magnet/ Dot Art Acrylic Paint ...
1,m74667116621,POSTER PRINT: FUNHOUSE,ALL POSTER PRINTS ARE 11 X 17 INCHES (( GREAT ...,,New,Home,Artwork,Posters,https://www.mercari.com/us/item/m74667116621,2023-01-25 07:01:14+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Poster"",...","[Poster, Publication, Font, Art, Book cover, B...","[0.86772853, 0.83406168, 0.81918359, 0.6839227...","[-0.0161228031, -0.0427463576, -0.0123792794, ...","[-0.0586416498, 0.0408442616, -0.012638161, -0...",Name: \n POSTER PRINT: FUNHOUSE \n Description...
2,m21554068673,Athleta Elation Purple Velvet High Rise Tight ...,Athleta Elation Purple Blue Velvet Tight Leggi...,Athleta,Like new,Women,Athletic apparel,Athletic Leggings,https://www.mercari.com/us/item/m21554068673,2023-03-27 17:33:50+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Arm"",""mi...","[Arm, Shoulder, yoga pant, Leg, Active pants, ...","[0.94758928, 0.93993074, 0.93505448, 0.9231663...","[0.0108732795, -0.0630845651, 0.036399167, -0....","[0.0493304357, -0.0114150364, 0.0233411063, 0....",Name: \n Athleta Elation Purple Velvet High Ri...


In [22]:
%%bigquery df_train
SELECT *, CONCAT('Name: \n ', name, ' \n ', 
                 "Description: \n ", description, ' \n ',
                 "Labels: \n ", TO_JSON_STRING(vision_api_labels)
                ) as attr 
FROM solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings 
WHERE id NOT IN (
    SELECT id FROM solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings_golden_test
    WHERE manual_validation = 1
)
AND rand() < 1.0

In [23]:
len(df_train)

13334

In [24]:
df_train.head(3)

Unnamed: 0,id,name,description,brand_name,item_condition_name,c0_name,c1_name,c2_name,url,created,image_uri,vision_api_labels,attributes,scores,text_embedding,image_embedding,attr
0,m99111169561,Yuji Itadori Funko FYE Exclusive,Yuji Itadori Funko FYE Exclusive,Funko,New,Toys & Collectibles,Collectibles & Hobbies,Bobbleheads,https://www.mercari.com/us/item/m99111169561,2023-01-08 18:54:11+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Toy"",""mi...","[Toy, Cartoon, Font, Fictional character, Doll...","[0.9178822, 0.90057594, 0.79287237, 0.68645483...","[-0.00359338149, -0.0278339051, 0.000319481536...","[-0.0249901433, 0.00174266251, -0.00305463048,...",Name: \n Yuji Itadori Funko FYE Exclusive \n D...
1,m45103180434,NWT Ann Taylor size xsp dress,Top to bottom-34”\nArmpit to armpit-15.5”\n#14,Ann Taylor,New,Women,Dresses,"Above knee, mini",https://www.mercari.com/us/item/m45103180434,2023-01-03 03:37:19+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""White"",""...","[White, Black, Textile, Sleeve, Grey, Pattern,...","[0.92187405, 0.89741337, 0.87661022, 0.8724504...","[0.0405451693, -0.049278032, -0.00647369912, -...","[0.0268896949, 0.0488708913, 0.0202537514, -0....",Name: \n NWT Ann Taylor size xsp dress \n Desc...
2,m64663927933,Old Navy Hoodie,Light pink size small old navy hoodie \nNever ...,Old Navy,Like new,Women,Sweaters,Hooded,https://www.mercari.com/us/item/m64663927933,2023-03-13 20:30:01+00:00,gs://genai-product-catalog/mercari_images_13K/...,"{""label_annotations"":[{""description"":""Green"",""...","[Green, Textile, Sleeve, Collar, Grey, Dress s...","[0.91382557, 0.87614417, 0.87245047, 0.8409993...","[-0.0330498442, -0.0371705927, -0.00108811446,...","[0.000557975844, 0.0179716274, 0.00222408748, ...",Name: \n Old Navy Hoodie \n Description: \n Li...


In [25]:
%%bigquery category_df
select c0_name, c1_name, c2_name from `solutions-2023-mar-107.mercari.13K_synthetic_attributes_embeddings`
group by c0_name, c1_name, c2_name

In [26]:
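def get_category_name(row):
    """Build the full category path "c0>c1>c2" used as the classification label."""
    # Items without a third-level category fall back to the leaf name "Other"
    if row["c2_name"] is None:
        cat_name = row["c0_name"]+">"+row["c1_name"]+">"+"Other"
    else:
        cat_name = row["c0_name"]+">"+row["c1_name"]+">"+row["c2_name"]
    return cat_name

# Apply row by row (axis=1) to create the label column on both splits
df_train["category"] = df_train.apply(get_category_name, axis=1)
df_test["category"] = df_test.apply(get_category_name, axis=1)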
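The label construction above can be illustrated without a DataFrame. The following is a minimal sketch, not the notebook's code: `build_category_path` and its `sep` and `fallback` parameters are hypothetical names, and the NaN check is an added assumption for the case where a missing `c2_name` arrives as a float NaN instead of `None`.

```python
import math


def build_category_path(c0, c1, c2, sep=">", fallback="Other"):
    """Join three category levels into one label, e.g. "Home>Artwork>Paintings"."""

    def is_missing(value):
        # Treat both None and float NaN as a missing leaf category (assumption).
        return value is None or (isinstance(value, float) and math.isnan(value))

    leaf = fallback if is_missing(c2) else c2
    return sep.join([c0, c1, leaf])


full_path = build_category_path("Home", "Artwork", "Paintings")
missing_leaf = build_category_path("Home", "Artwork", None)
nan_leaf = build_category_path("Home", "Artwork", float("nan"))
print(full_path)
print(missing_leaf)
print(nan_leaf)
```

Both missing-value cases map to the same `Home>Artwork>Other` label, so they end up in the same class during training.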


## Data Modeling

In [27]:
base_lr = LogisticRegression()
model_combemb = OneVsRestClassifier(base_lr)
model_textemb = OneVsRestClassifier(base_lr)
model_imageemb = OneVsRestClassifier(base_lr)
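The three classifiers share one `base_lr` instance. That is safe because `OneVsRestClassifier` fits a clone of the base estimator for each class and leaves the original unfitted. The sketch below illustrates this on synthetic data; the 8-dimensional vectors, the cluster layout, and `max_iter=1000` are assumptions made for the example and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
labels = ["Home>Artwork>Paintings", "Home>Artwork>Posters", "Women>Sweaters>Hooded"]

# Synthetic stand-in for embeddings: one well-separated cluster per category.
X = np.vstack(
    [rng.normal(loc=5.0 * np.eye(8)[i], scale=0.5, size=(30, 8)) for i in range(3)]
)
y = [label for label in labels for _ in range(30)]

base_lr = LogisticRegression(max_iter=1000)

# Two wrappers around the same base estimator, trained on different feature sets.
clf_a = OneVsRestClassifier(base_lr).fit(X, y)
clf_b = OneVsRestClassifier(base_lr).fit(X[:, :4], y)

print(len(clf_a.estimators_))        # one binary classifier per category
print(hasattr(base_lr, "coef_"))     # the shared base estimator itself is never fitted
print(clf_a.estimators_[0].coef_.shape, clf_b.estimators_[0].coef_.shape)
```

Each wrapper holds its own fitted binary classifiers, so training one model does not overwrite another.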

In [28]:
# text embedding
model_textemb.fit(df_train["text_embedding"].tolist(), df_train["category"].tolist())
joblib.dump(model_textemb, 'model_textemb.pkl')

['model_textemb.pkl']

In [None]:
# image embedding model
model_imageemb.fit(df_train["image_embedding"].tolist(), df_train["category"].tolist())
joblib.dump(model_imageemb, 'model_imageemb.pkl')

['model_imageemb.pkl']

In [None]:
# image + text embedding
df_train["comb_embedding"] = df_train.apply(lambda x: x["text_embedding"].tolist()+x["image_embedding"].tolist(), axis=1)
model_combemb.fit(df_train["comb_embedding"].tolist(), df_train["category"].tolist())
joblib.dump(model_combemb, 'model_combemb.pkl')

['model_combemb.pkl']
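The combined feature vector is a plain concatenation of the text and image embeddings, so its length is the sum of the two. A minimal sketch with toy vectors follows. The helper names are hypothetical, and the optional L2 normalization is an assumption that the notebook does not apply; it is one way to keep either modality from dominating when the two embeddings have different scales.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0 else [v / norm for v in vec]


def combine_embeddings(text_emb, image_emb, normalize=False):
    """Concatenate two embeddings, optionally normalizing each one first."""
    if normalize:
        text_emb, image_emb = l2_normalize(text_emb), l2_normalize(image_emb)
    return list(text_emb) + list(image_emb)


# Toy vectors standing in for real embeddings.
text_emb = [3.0, 4.0]
image_emb = [0.0, 2.0, 0.0]

combined = combine_embeddings(text_emb, image_emb)
combined_unit = combine_embeddings(text_emb, image_emb, normalize=True)
print(len(combined), combined)
print(combined_unit)
```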

In [35]:
# if model training has already been run and saved
#model_textemb = joblib.load('model_textemb.pkl')
#model_imageemb = joblib.load('model_imageemb.pkl')
#model_combemb = joblib.load('model_combemb.pkl')
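Instead of commenting the load calls in and out, the choice between loading and training can be made at run time. The following is a sketch under assumptions: `load_or_train` and `train_stub` are hypothetical helpers, and a small dictionary stands in for a fitted model so that the example runs quickly.

```python
import os
import tempfile

import joblib


def load_or_train(path, train_fn):
    """Return the model stored at `path`, or train and save one if it is missing."""
    if os.path.exists(path):
        return joblib.load(path)
    model = train_fn()
    joblib.dump(model, path)
    return model


calls = []


def train_stub():
    # Stand-in for an expensive call such as model.fit(...).
    calls.append(1)
    return {"weights": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_stub.pkl")
    first = load_or_train(path, train_stub)   # trains and saves
    second = load_or_train(path, train_stub)  # loads from disk

print(first == second, len(calls))
```

In the notebook, `train_fn` would wrap the corresponding `fit` call, for example a function that fits and returns `model_textemb`.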

## Data Evaluation

We use a hierarchical metric that takes all three levels of the product category hierarchy into account. Each level that matches the true category contributes 1/3 to an item's score: a prediction with only the top level correct scores 1/3, one with the top two levels correct scores 2/3, and one with all three levels correct scores 1. The reported value is the mean score over all items.
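Written as a formula, the score computed by `get_hierarchical_score` is as follows. The notation is introduced here for illustration: $\hat{y}_\ell$ and $y_\ell$ are the predicted and true category names at level $\ell$, and $N$ is the number of items.

$$
s(\hat{y}, y) = \frac{1}{3} \sum_{\ell=1}^{3} \mathbb{1}\left[\hat{y}_\ell = y_\ell\right],
\qquad
S = \frac{1}{N} \sum_{i=1}^{N} s\left(\hat{y}^{(i)}, y^{(i)}\right)
$$

For example, predicting `Home>Artwork>Posters` for an item whose true category is `Home>Artwork>Paintings` matches the first two levels but not the third, so $s = \frac{1}{3} + \frac{1}{3} + 0 = \frac{2}{3}$.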

In [None]:
# hierarchical classification

def get_hierarchical_score(y_pred, y_true):
    """Mean hierarchical score: each matching category level adds 1/3 to an item's score."""

    if len(y_pred) != len(y_true):
        print("error: y_pred and y_true should have the same length")
        return None

    def get_score(first_cat, second_cat):
        # Categories are "c0>c1>c2" paths; compare them level by level
        first_ls = first_cat.split(">")
        second_ls = second_cat.split(">")
        if len(first_ls) != 3 or len(second_ls) != 3:
            print("error: category does not have 3 levels")
            return 0

        score = 0

        if first_ls[0] == second_ls[0]:
            score += 1/3
        if first_ls[1] == second_ls[1]:
            score += 1/3
        if first_ls[2] == second_ls[2]:
            score += 1/3

        return score

    scores = [get_score(pred, true) for pred, true in zip(y_pred, y_true)]

    return np.mean(scores)

df_test["comb_embedding"] = df_test.apply(lambda x: x["text_embedding"].tolist()+x["image_embedding"].tolist(), axis=1)

# quick check on the first five test items
x = df_test.iloc[0:5].copy()  # copy so that adding a column does not trigger SettingWithCopyWarning
x["pred"] = model_combemb.predict(x["comb_embedding"].tolist())
get_hierarchical_score(x["pred"].tolist(), x["category"].tolist())


0.5333333333333333
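Because every level carries the same weight, the hierarchical score equals the mean of the three per-level accuracies. Breaking it down this way shows which level of the hierarchy costs the most. The sketch below uses a hypothetical helper, `level_accuracies`, and three toy items with category names taken from the sample rows above; it is not part of the original evaluation.

```python
def level_accuracies(y_pred, y_true, sep=">", depth=3):
    """Return the fraction of items predicted correctly at each category level."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true should have the same length")
    if not y_true:
        raise ValueError("at least one item is required")

    hits = [0] * depth
    for pred, true in zip(y_pred, y_true):
        pred_levels, true_levels = pred.split(sep), true.split(sep)
        # Malformed paths count as wrong at every level.
        if len(pred_levels) != depth or len(true_levels) != depth:
            continue
        for level in range(depth):
            hits[level] += pred_levels[level] == true_levels[level]
    return [h / len(y_true) for h in hits]


toy_true = ["Home>Artwork>Paintings", "Home>Artwork>Posters", "Women>Dresses>Above knee, mini"]
toy_pred = ["Home>Artwork>Paintings", "Home>Artwork>Paintings", "Women>Sweaters>Hooded"]

per_level = level_accuracies(toy_pred, toy_true)
hierarchical = sum(per_level) / len(per_level)
print(per_level)
print(hierarchical)
```

Here the per-item scores are 1, 2/3, and 1/3, whose mean of 2/3 matches the mean of the per-level accuracies (1, 2/3, 1/3).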

### Combined Text + Image Embedding

In [None]:
df_test["predicted_category"] = model_combemb.predict(df_test["comb_embedding"].tolist())

y_true = df_test["category"].tolist()
y_pred = df_test["predicted_category"].tolist()

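# Note: scikit-learn metrics are defined as metric(y_true, y_pred). Accuracy is symmetric,
# but weighted F1 weights each class by its support in the first argument, so the f1 values
# reported in this notebook are weighted by predicted-label counts, not true-label counts.
print("Hierarchical Score: " + str(get_hierarchical_score(y_pred, y_true)))
print("Accuracy: " + str(accuracy_score(y_pred, y_true)))
print("f1: " + str(f1_score(y_pred, y_true,average='weighted')))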

Hierarchical Score: 0.7586206896551724
Accuracy: 0.5689655172413793
f1: 0.6171358335151438


### Image Embedding

In [None]:
df_test["predicted_category"] = model_imageemb.predict(df_test["image_embedding"].tolist())

y_true = df_test["category"].tolist()
y_pred = df_test["predicted_category"].tolist()

print("Hierarchical Score: " + str(get_hierarchical_score(y_pred, y_true)))
print("Accuracy: " + str(accuracy_score(y_pred, y_true)))
print("f1: " + str(f1_score(y_pred, y_true,average='weighted')))

Hierarchical Score: 0.7241379310344828
Accuracy: 0.5258620689655172
f1: 0.5772132082476911


### Text Embedding

In [None]:
df_test["predicted_category"] = model_textemb.predict(df_test["text_embedding"].tolist())

y_true = df_test["category"].tolist()
y_pred = df_test["predicted_category"].tolist()

print("Hierarchical Score: " + str(get_hierarchical_score(y_pred, y_true)))
print("Accuracy: " + str(accuracy_score(y_pred, y_true)))
print("f1: " + str(f1_score(y_pred, y_true,average='weighted')))

Hierarchical Score: 0.6494252873563218
Accuracy: 0.4482758620689655
f1: 0.5222719808926706


### Discussion

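On the 116-item golden test set, the combined text + image embedding scores highest on every metric (hierarchical score 0.759, accuracy 0.569, weighted F1 0.617), ahead of the image embedding alone (0.724, 0.526, 0.577) and the text embedding alone (0.649, 0.448, 0.522). Concatenating the two embeddings gives the classifier a larger feature space, and it outperforms either modality on its own.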