## Wayfair Product Search

Search relevance – the relationship between users’ queries and the products returned in search results – is one of the most important performance indicators for ecommerce storefronts. However, the sheer volume of the data makes evaluating and improving search relevance a difficult proposition.

The [Wayfair Annotation Dataset (WANDS)](https://www.aboutwayfair.com/careers/tech-blog/wayfair-releases-wands-the-largest-and-richest-publicly-available-dataset-for-e-commerce-product-search-relevance) dataset includes details such as product title, product description, primary classes, product category hierarchy, various product attributes such as size and color, average customer ratings, and review numbers. It also contains the richest descriptions of the products and queries in the English language.

In this task, we will use NLP building a search engine to match user's queries and the products.

Author: Hao Xing. This code is private. Please do not distribute.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
#clone the git repo that contains the WAND data
!git clone https://github.com/wayfair/WANDS.git

Cloning into 'WANDS'...
remote: Enumerating objects: 40, done.[K
remote: Counting objects: 100% (40/40), done.[K
remote: Compressing objects: 100% (37/37), done.[K
remote: Total 40 (delta 7), reused 23 (delta 3), pack-reused 0[K
Receiving objects: 100% (40/40), 33.32 MiB | 10.23 MiB/s, done.
Resolving deltas: 100% (7/7), done.


In [3]:
#define functions for search using Tf-IDF
def calculate_tfidf(dataframe):
    """
    Calculate the TF-IDF for combined product name and description.

    """
    # Combine product name and description to vectorize

    combined_text = dataframe['product_name'] + ' ' + dataframe['product_description']
    vectorizer = TfidfVectorizer()

    # convert combined_text to list of unicode strings
    tfidf_matrix = vectorizer.fit_transform(combined_text.values.astype('U'))
    return vectorizer, tfidf_matrix

In [4]:
#a function for getting top N matched products for a given query based on TF-IDF similarity
def get_top_products(vectorizer, tfidf_matrix, query, top_n=10):

    query_vector = vectorizer.transform([query]) # change a query to its TF-IDF representation
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    top_product_indices = cosine_similarities.argsort()[-top_n:][::-1]
    return top_product_indices

[Mean Average Precision (MAP) at K](https://www.evidentlyai.com/ranking-metrics/mean-average-precision-map) is one of the metrics that helps evaluate the quality of ranking and recommender systems. It measures both the relevance of suggested items and how good the system is at placing more relevant items at the top.

In [5]:
#define functions for evaluating retrieval performance using MAP@K
def map_at_k(true_ids, predicted_ids, k=10):

    #if either list is empty, return 0
    if not len(true_ids) or not len(predicted_ids):
        return 0.0

    score = 0.0
    num_hits = 0.0

    for i, p_id in enumerate(predicted_ids[:k]):
        if p_id in true_ids and p_id not in predicted_ids[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

    return score / min(len(true_ids), k)

[Normalized Discounted Cumulative Gain (NDCG)](https://www.evidentlyai.com/ranking-metrics/ndcg-metric) is a metric that evaluates the quality of recommendation and information retrieval systems. We assume that there is some ideal ranking with all the items sorted in decreasing order of relevance. NDCG helps measure how close the algorithm output comes to this perfect order.

Please note that we use a modified NDCG@K metric, instead of the traditional NDCG@K metric, to better refect the quality of match.

Suppose that we assign a relevance score 2 to each exact match, relevance score 1 to each partial match, and relevance score 0 to each irrelevant item. For a predicted top 5 relevance score list [2, 0, 1, 1, 0], the traditional IDCG should be calculated based on the perfect ranking order [2, 1, 1, 0, 0] among the given predicted list. However, if the query has 30 exact match products, the predicted top 5 list [2, 0, 1, 1, 0] only captures 1 exact match which performs very poor. So we adjust the calculation of IDCG by using the modified perfect ranking among the ground truth. Since the query has 30 exact matches, the modified top 5 perfect ranking list is [2, 2, 2, 2, 2]. It increases the value of IDCG and decreases the value of NDCG, allowing the modified NDCG to give penalty to the prediction that do not capture many exact matches.

In addition, considering the following two cases:
- Case A: there are 200 exact matches, and we get a predicted relevance score [2,2,1,0,0,0], which only predicted 2 exact matches correctly;
- Case B: there are only 2 exact matches, and we get a predicted relevance score [2,2,1,0,0,0], which predicted all 2 exact matches correctly;

By common sense, case B should get better score than case A. The traditional NDCG score will give the same score to both cases, while our modified NDCG score will give high score to case B than case A.


In [6]:
# Calculate the Normalized Discounted Cumulative Gain (NDCG) at K (NDCG@K)

def ndcg_at_k(
        true_ids_exact: list[str],
        true_ids_partial: list[str],
        predicted_ids: list[str],
        k: int =10
    ) -> float:


    len_exact, len_partial, len_predicted = len(true_ids_exact), len(true_ids_partial), len(predicted_ids)

    #if length of true_ids (exact+partial) or length of predicted_ids less than k, replace k as the minimum of these numbers.
    k = min(len_exact+len_partial, len_predicted, k)

    if k == 0:
        print("Please check the length of true_ids_exact, true_ids_partial or predicted_ids")
        return 0

    predicted_ids_k = predicted_ids[:k]

    # creat predicted relevance score.
    predicted_rel_score_k = []
    for id in predicted_ids_k:
        if id in true_ids_exact:
            predicted_rel_score_k.append(2)  # Give a relevance score 2 if the predicted id is a exact match.
        elif id in true_ids_partial:
            predicted_rel_score_k.append(1)  # Give a relevance score 1 if the predicted id is a partial match.
        else:
            predicted_rel_score_k.append(0) # Give a relevance score 0 if the predicted id is a irrelevant.

    # creat true relevance score.
    # Note: please note that the true relevance score is created using true_ids_exact and true_ids_partial, instead of using predicted_ids.
    #       the reason is considering the following two cases:
    #       case A: there is 200 true_ids_exact, and I get a predicted relevance score [2,2,1,0,0,0], which only predicted 2 exact match correctly;
    #       case B: there is only 2 true_ids_exact, and I get a predicted relevance score [2,2,1,0,0,0], which predicted all 2 exact match correctly;
    #       case B should get better NDCG score than case A.
    true_rel_score = [2 for i in true_ids_exact] + [1 for i in true_ids_partial]

    # select the top k elements to consider.
    true_rel_score_k = true_rel_score[:k]

    # calculate array of 1/log2(i+2)
    arange = np.arange(k, dtype=np.float32)
    denom = np.log2(arange + 2.)
    gains = 1. / denom

    # calculate dcg using predicted_rel_score_k
    dcg = (np.array(predicted_rel_score_k)* gains).sum()

    # calculate idcg using true_rel_score_k
    idcg = (np.array(true_rel_score_k)* gains).sum()

    # calculate ndcg
    ndcg = dcg/idcg

    return ndcg



## Data Preparation

In [7]:
# get search queries
query_df = pd.read_csv("WANDS/dataset/query.csv", sep='\t')

In [8]:
print(query_df.shape)
query_df.head(10)

(480, 3)


Unnamed: 0,query_id,query,query_class
0,0,salon chair,Massage Chairs
1,1,smart coffee table,Coffee & Cocktail Tables
2,2,dinosaur,Kids Wall Décor
3,3,turquoise pillows,Accent Pillows
4,4,chair and a half recliner,Recliners
5,5,sofa with ottoman,Sectionals
6,6,acrylic clear chair,Dining Chairs
7,7,driftwood mirror,Wall & Accent Mirrors
8,8,home sweet home sign,Wall Décor
9,9,coffee table fire pit,Outdoor Fireplaces


In [9]:
# get products
product_df = pd.read_csv("WANDS/dataset/product.csv", sep='\t')

In [10]:
print(product_df.shape)
product_df.head()

(42994, 9)


Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count
0,0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0
1,1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0
2,2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0
3,3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0
4,4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0


In [11]:
# get manually labeled groundtruth lables
label_df = pd.read_csv("WANDS/dataset/label.csv", sep='\t')

In [12]:
print(label_df.shape)
label_df.head(10)

(233448, 4)


Unnamed: 0,id,query_id,product_id,label
0,0,0,25434,Exact
1,1,0,12088,Irrelevant
2,2,0,42931,Exact
3,3,0,2636,Exact
4,4,0,42923,Exact
5,5,0,41156,Exact
6,6,0,5938,Irrelevant
7,7,0,5937,Irrelevant
8,8,0,37072,Irrelevant
9,9,0,37071,Irrelevant


In [13]:
#group the labels for each query to use when identifying exact matches
grouped_label_df = label_df.groupby('query_id')

## Product search using TF-IDF

In [14]:
# Calculate TF-IDF
vectorizer, tfidf_matrix = calculate_tfidf(product_df)

In [15]:
#check if the search results make sense

def get_top_product_ids_for_query(query):
    top_product_indices = get_top_products(vectorizer, tfidf_matrix, query, top_n=10)
    top_product_ids = product_df.iloc[top_product_indices]['product_id'].tolist()
    return top_product_ids

#define the test query
query = "sofa"

#obtain top product IDs
top_product_ids = get_top_product_ids_for_query(query)

print(f"Top products for '{query}':")
for product_id in top_product_ids:
    product = product_df.loc[product_df['product_id'] == product_id]
    print(product_id, product['product_name'].values[0])

Top products for 'sofa':
33424 malta teak patio sofa with cushions
42042 alexii patio sofa with cushions
19818 ingulu 76 '' wide reversible modular sofa & chaise
19817 mcgray 83.07 '' wide left hand facing modular sofa & chaise
37935 finesse 84 '' square arm sofa
30011 ayotunde patio sofa with cushions
31365 drekjuan patio sofa with cushions
38543 sofa bed with ottoman
29667 canyon 136 '' genuine leather pillow top arm sofa
42060 stines 131 '' wide reversible modular sofa & chaise


#### Calculate the MAP@10 score for the entire query set.

In [16]:
#implementing a function to retrieve exact match product IDs for a query_id
def get_exact_matches_for_query(query_id):
    query_group = grouped_label_df.get_group(query_id)
    exact_matches = query_group.loc[query_group['label'] == 'Exact']['product_id'].values
    return exact_matches

In [17]:
#applying the function to obtain top product IDs and adding top K product IDs to the dataframe
query_df['top_product_ids'] = query_df['query'].apply(get_top_product_ids_for_query)

#adding the list of exact match product_IDs from labels_df
query_df['relevant_ids'] = query_df['query_id'].apply(get_exact_matches_for_query)

#now assign the map@k score
query_df['map@k'] = query_df.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)

In [18]:
# calculate the MAP across the entire query set
query_df.loc[:, 'map@k'].mean()

0.29320741016313934

#### Calculate the NDCG@10 score for the entire query set.

In [19]:
#implementing a function to retrieve EXACT match product IDs for a query_id
def get_exact_matches_for_query(query_id):
    query_group = grouped_label_df.get_group(query_id)
    exact_matches = query_group.loc[query_group['label'] == 'Exact']['product_id'].values
    return exact_matches

#adding the list of EXACT match product_IDs from labels_df
query_df['relevant_ids_exact'] = query_df['query_id'].apply(get_exact_matches_for_query)


#implementing a function to retrieve PARTIAL match product IDs for a query_id
def get_partial_matches_for_query(query_id):
    query_group = grouped_label_df.get_group(query_id)
    exact_matches = query_group.loc[query_group['label'] == 'Partial']['product_id'].values
    return exact_matches

#adding the list of PARTIAL match product_IDs from labels_df
query_df['relevant_ids_partial'] = query_df['query_id'].apply(get_partial_matches_for_query)


In [20]:
# now assign the ndcg@k score
query_df['ndcg@k'] = query_df.apply(lambda x: ndcg_at_k(x['relevant_ids_exact'], x['relevant_ids_partial'], x['top_product_ids'], k=10), axis=1)


Please check the length of true_ids_exact, true_ids_partial or predicted_ids


In [21]:
# calculate the NDCG score across the entire query set
ndcg_mean = query_df.loc[:, 'ndcg@k'].mean()
print("The NDCG@10 score across the entire query set is: ", f"{ndcg_mean: .4f}")

The NDCG@10 score across the entire query set is:   0.6170


## Use Semantic Search to Improve MAP@10 Score

#### We [semantic search](https://medium.com/@naveenjothi040/semantic-search-with-llms-3661fd2a9331) using pre-trained LLMs. The [sentence transformer all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) is chosen here as it is not too big and could be run in CPU for embedding within about 10 minutes.
#### We semantic search "query" within "product_name" and "product_description" separately, and then applied a weight 0.7 to the cos similarity score with  "product_name" and a weight 0.3 to the cos similarity score with "product_description". (The weights are obtained after some experiments.)
#### The MAP@10 score improves to 0.35.

In [22]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/163.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━[0m [32m153.6/163.3 kB[0m [31m4.7 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m163.3/163.3 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.7/23.7 MB[0m [31m66.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [23]:
from sentence_transformers import SentenceTransformer, util
import pickle
import torch
from sklearn.metrics.pairwise import cosine_similarity

In [24]:
# Reload the datasets for analysis.

# get products
product_df = pd.read_csv("WANDS/dataset/product.csv", sep='\t')

# get search queries
query_df = pd.read_csv("WANDS/dataset/query.csv", sep='\t')

# get manually labeled groundtruth lables
label_df = pd.read_csv("WANDS/dataset/label.csv", sep='\t')

#group the labels for each query to use when identifying exact matches
grouped_label_df = label_df.groupby('query_id')

print(product_df.shape, query_df.shape, label_df.shape)

(42994, 9) (480, 3) (233448, 4)


In [46]:
# The sentence transformer all-MiniLM-L6-v2 is chosen here as it is not too big and could be run in CPU for embedding in about 10 minutes.

#model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

model = tuned_model

In [47]:
product_df['product_name'] =  product_df['product_name'].apply(lambda x: str(x))
product_df['product_description'] =  product_df['product_description'].apply(lambda x: str(x))


#### Build embeddings for "product_name" and "product_description"
- The next 2 cells will take in total 13 minutes to run using CPU. After running once, the embeddings are saved as pickle files, and you don't have to run them again.

In [48]:
## Build embedding for "product_name" -- take 2 min to run in CPU for model all-MiniLM-L6-v2.

prodname_corpus_embeddings = model.encode(product_df['product_name'].to_list(), show_progress_bar=True)

with open('prodname_corpus_embeddings_all-MiniLM-L6-v2.pkl', "wb") as fOut:
    pickle.dump(prodname_corpus_embeddings, fOut)

Batches:   0%|          | 0/1344 [00:00<?, ?it/s]

In [49]:
## Build embedding for "product_description" -- take 11 min to run in CPU for model all-MiniLM-L6-v2.

proddescription_corpus_embeddings = model.encode(product_df['product_description'].to_list(), show_progress_bar=True)

with open('proddescription_corpus_embeddings_all-MiniLM-L6-v2.pkl', "wb") as fOut:
    pickle.dump(proddescription_corpus_embeddings, fOut)

Batches:   0%|          | 0/1344 [00:00<?, ?it/s]

#### Load saved embeddings

In [29]:
# load embeddings from pickle files saved before, so that the embedding steps don't need to be run repeatedly.

# load product_name embeddings
with open('prodname_corpus_embeddings_all-MiniLM-L6-v2.pkl', "rb") as file_in:
    prodname_embeddings = pickle.load(file_in)

# load product_description embeddings
with open('proddescription_corpus_embeddings_all-MiniLM-L6-v2.pkl', "rb") as file_in:
    proddescription_embeddings = pickle.load(file_in)

print(prodname_embeddings.shape, proddescription_embeddings.shape)

(42994, 384) (42994, 384)


#### Get the top products using semantic search

In [30]:
def semantic_search_get_top_products(query, top_n=10):

    query_vector = model.encode(query).reshape(1, -1)

    name_sim = cosine_similarity(query_vector, prodname_embeddings).flatten()
    des_sim = cosine_similarity(query_vector, proddescription_embeddings).flatten()

    # We weight 0.7 to cosine-similarity between “query” and “product_name”, and assigned weight 0.3 to cosine-similarity between “query” and “product_description”.
    # The weights could be further optimized in future.
    sim = 0.7 * name_sim + 0.3 * des_sim
    top_product_indices = sim.argsort()[-top_n:][::-1]

    return top_product_indices

In [31]:
out = semantic_search_get_top_products(query='sofa', top_n=10)
out

array([20392, 21389, 15793, 17856, 31606, 31121, 20279, 38543, 31563,
       26318])

In [32]:
#implementing a function to retrieve top K product IDs for a query

def semantic_search_get_top_product_ids_for_query(query):
    top_product_indices = semantic_search_get_top_products(query, top_n=10)
    top_product_ids = product_df.iloc[top_product_indices]['product_id'].tolist()
    return top_product_ids

#group the labels for each query to use when identifying exact matches
grouped_label_df = label_df.groupby('query_id')

def get_exact_matches_for_query(query_id):
    query_group = grouped_label_df.get_group(query_id)
    exact_matches = query_group.loc[query_group['label'] == 'Exact']['product_id'].values
    return exact_matches

In [33]:
#applying the function to obtain top product IDs and adding top K product IDs to the dataframe
#this takes around a minute using GPU
query_df['top_product_ids'] = query_df['query'].apply(semantic_search_get_top_product_ids_for_query)

In [34]:
#implementing a function to retrieve exact match product IDs for a query_id
def get_exact_matches_for_query(query_id):
    query_group = grouped_label_df.get_group(query_id)
    exact_matches = query_group.loc[query_group['label'] == 'Exact']['product_id'].values
    return exact_matches

#adding the list of exact match product_IDs from labels_df
query_df['relevant_ids'] = query_df['query_id'].apply(get_exact_matches_for_query)

#now assign the map@k score
query_df['map@k'] = query_df.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)


In [35]:
# calculate the MAP across the entire query set
mapk_change1_result = query_df.loc[:, 'map@k'].mean()
print("The new MAP@10 score using sentence transformer semantic search is: ", f"{mapk_change1_result: .4f}")

The new MAP@10 score using sentence transformer semantic search is:   0.3514


## Fine Tune LLM to imporve performance


#### We experiment tuning pre-trained sentence LLM model "all-MiniLM-L6-v2" to part of queries to examine whether the matching performance could be improved.

#### The analysis is performed as follows:


1.   Query set is split into training and test query sets. Test query set is 40% of the entire query set.
2.   Train label set is split out of the entire label set. The train label set contains all products associated to all train queries. In this way, the model tuning process does not see the test queries at all. This maps to the practical situation where test queries are completely new queries. The train label set contains the queries and the associated product names.
3.   An evaluation label set is split out from the train label set. (10% of the train label set.) This is used as the valuation set for the tuning process.
4.   The sentence LLM model is tuned using the train label set and the evaluation label set.

#### Because the test queries are blinded from the training process. Performance comparison within the test query set is a fair comparsion between the pre-trained model and the tuned model.

#### It turns out that the tuning process over-fit the training queries. Performance of the tuned model is worse than the pre-trained model.

#### MAP@10 for the test queries:


*   pre-trained model: 0.32
*   tuned model: 0.28

#### NDCG@10 for the test queries:



*   pre-trained model: 0.70
*   tuned model: 0.65

In reality, big product platforms, such as Amazon, have seen many repeated queries. Therefore, queries appeared in the training set can come up again later. To evaluate the model performance in this situation, I compared the model performance in the entire query set.

#### MAP@10 for the entire query set:


*   pre-trained model: 0.35
*   tuned model: 0.38

#### NDCG@10 for the entire query set:



*   pre-trained model: 0.71
*   tuned model: 0.70

The performance improved due to the improved performance in the training set.

#### MAP@10 for the training set:


*   pre-trained model: 0.37
*   tuned model: 0.44

#### NDCG@10 for the training set:



*  pre-trained model: 0.72
*  tuned model: 0.74











In [36]:
from sentence_transformers import util, InputExample, losses, evaluation
import torch
import math
from torch.utils.data import DataLoader
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

In [37]:
# Reload the datasets for analysis.

# get products
product_df = pd.read_csv("WANDS/dataset/product.csv", sep='\t')

# get search queries
query_df = pd.read_csv("WANDS/dataset/query.csv", sep='\t')

# get manually labeled groundtruth lables
label_df = pd.read_csv("WANDS/dataset/label.csv", sep='\t')

#group the labels for each query to use when identifying exact matches
grouped_label_df = label_df.groupby('query_id')

print(product_df.shape, query_df.shape, label_df.shape)

(42994, 9) (480, 3) (233448, 4)


In [38]:
# split the query set into train and test sets

from sklearn.model_selection import train_test_split
train_query, test_query = train_test_split(query_df, test_size=0.4, random_state=42)
print(train_query.shape, test_query.shape)


(288, 3) (192, 3)


In [39]:
dict_map = {'Exact':1.0, 'Partial':0.5, 'Irrelevant':0.0}
label_df['score'] = label_df['label'].map(dict_map)

# train_label_df contains all queries whose id are in the train query set
# train_label_df also contains all corresponding products of each train query
train_query_id = train_query["query_id"].unique()
train_label_df = label_df[label_df["query_id"].isin(train_query_id)]
print(train_label_df.shape)

(132268, 5)


In [40]:
# add query text and product names
dict_id = dict(zip(product_df['product_id'], product_df['product_name']))
dict_desc = dict(zip(product_df['product_id'], product_df['product_description']))
dict_queryid = dict(zip(train_query['query_id'], train_query['query']))

train_label_df['query_name'] = train_label_df['query_id'].map(dict_queryid)
train_label_df['product_name'] = train_label_df['product_id'].map(dict_id)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_label_df['query_name'] = train_label_df['query_id'].map(dict_queryid)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_label_df['product_name'] = train_label_df['product_id'].map(dict_id)


In [41]:
# construct train and evaluation sets for fine tune

train_sample, eval_sample = train_test_split(train_label_df, test_size=0.1, random_state=42)
print(train_sample.shape, eval_sample.shape)

(119041, 7) (13227, 7)


In [42]:
def create_input(doc1, doc2, score):
  return InputExample(texts=[doc1, doc2], label=score)

# construct input for fine tune
inputs = train_sample.apply(
  lambda s: create_input(s['query_name'], s['product_name'], s['score']), axis=1
  ).to_list()

# construct evaluation set for fine tune
evals = eval_sample.apply(
  lambda s: create_input(s['query_name'], s['product_name'], s['score']), axis=1
)


In [43]:
# define instructions for feeding inputs to model

input_dataloader = DataLoader(inputs, shuffle=True, batch_size=16) # feed 16 records at a time to the model

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(evals, name="eval_sample")

# define loss metric to optimize for

tuned_model = SentenceTransformer('all-MiniLM-L6-v2')

loss = losses.CosineSimilarityLoss(tuned_model)

num_epochs = 1

warmup_steps = math.ceil(len(input_dataloader) * num_epochs * 0.1)

### Please note the following fine tune codes took long time to run in CPU. Better to run it in GPU. (It takes 5 mins under GPU.)

In [44]:
# tune the model on the input data

tuned_model.fit(
  train_objectives=[(input_dataloader, loss)],
  evaluator=evaluator,
  epochs=num_epochs,
  evaluation_steps=1000,
  warmup_steps=warmup_steps
  )

tuned_model.save("tuned_model_1")

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/7441 [00:00<?, ?it/s]

### Evaluation the performance of the tuned model

#### To obtain the results for pre_trained model, set model = SentenceTransformer('all-MiniLM-L6-v2')
#### To obtain the results for tuned model, set model = tuned.model to calculate new embedding.

In [50]:
#applying the function to obtain top product IDs and adding top K product IDs to the dataframe
test_query['top_product_ids'] = test_query['query'].apply(semantic_search_get_top_product_ids_for_query)

#adding the list of exact match product_IDs from labels_df
test_query['relevant_ids'] = test_query['query_id'].apply(get_exact_matches_for_query)

#now assign the map@k score
test_query['map@k'] = test_query.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)

# calculate the MAP across the test query set
test_query.loc[:, 'map@k'].mean()

0.27599729938271605

#### map@k for the test query set
- pre-trained model: 0.323
- tuned model: 0.280  
#### Both pre-trained model and tuned model have not seen the queries in the test set. This result shows that tuned model over-fit the training queries.

In [51]:
#applying the function to obtain top product IDs and adding top K product IDs to the dataframe
query_df['top_product_ids'] = query_df['query'].apply(semantic_search_get_top_product_ids_for_query)

#adding the list of exact match product_IDs from labels_df
query_df['relevant_ids'] = query_df['query_id'].apply(get_exact_matches_for_query)

#now assign the map@k score
query_df['map@k'] = query_df.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)

# calculate the MAP across the entire query set
# this corresponds to the case where trained queries are submitted by other customers again
query_df.loc[:, 'map@k'].mean()

0.332091416813639

#### map@k for the entire query set

- pre-trained model: 0.351
- tuned model: 0.379

#### tuned model performance is improved, because in sample performance is improved by tuning the model

In [52]:
#applying the function to obtain top product IDs and adding top K product IDs to the dataframe
train_query['top_product_ids'] = train_query['query'].apply(semantic_search_get_top_product_ids_for_query)

#adding the list of exact match product_IDs from labels_df
train_query['relevant_ids'] = train_query['query_id'].apply(get_exact_matches_for_query)

#now assign the map@k score
train_query['map@k'] = train_query.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)

# calculate the MAP for the training query set
train_query.loc[:, 'map@k'].mean()

0.369487495100921

#### map@k for the training query set

- pre-trained model: 0.370
- tuned model: 0.443

In [53]:
# calculate NDCG@K

test_query['relevant_ids_partial'] = test_query['query_id'].apply(get_partial_matches_for_query)
query_df['relevant_ids_partial'] = query_df['query_id'].apply(get_partial_matches_for_query)
train_query['relevant_ids_partial'] = train_query['query_id'].apply(get_partial_matches_for_query)

test_query['ndcg@k'] = test_query.apply(lambda x: ndcg_at_k(x['relevant_ids'], x['relevant_ids_partial'], x['top_product_ids'], k=10), axis=1)
query_df['ndcg@k'] = query_df.apply(lambda x: ndcg_at_k(x['relevant_ids'], x['relevant_ids_partial'], x['top_product_ids'], k=10), axis=1)
train_query['ndcg@k'] = train_query.apply(lambda x: ndcg_at_k(x['relevant_ids'], x['relevant_ids_partial'], x['top_product_ids'], k=10), axis=1)

print(test_query['ndcg@k'].mean())
print(query_df['ndcg@k'].mean())
print(train_query['ndcg@k'].mean())

Please check the length of true_ids_exact, true_ids_partial or predicted_ids
Please check the length of true_ids_exact, true_ids_partial or predicted_ids
0.640291826710032
0.6617566037566125
0.6760664551209994


### NDCG@K results:
#### Test query set
-  pre-trained model: 0.702
-  tuned model: 0.648

#### Entire query set
- pre-trained model: 0.715
- tuned model: 0.704

#### Training query set
- pre-trained model: 0.724
- tuned model: 0.736