# Overview of MT Pipeline

### Step 1: Query Expansion (code in queryexpansion.py)
We first use DeepSeek v3, accessed through OpenRouter, to expand each user query. This is implemented as a function in queryexpansion.py and is demonstrated here:

In [57]:
from openai import OpenAI
from dotenv import load_dotenv
import os
import json
import re

#load API key
load_dotenv(dotenv_path='../.env')
api_key = os.getenv('deepseek_API_KEY')

#set up connection
client = OpenAI(api_key=api_key, base_url="https://openrouter.ai/api/v1")


We define a function to expand our query:

In [58]:
def get_expanded_queries(user_query):
    prompt=f'''You are an expert search query optimizer. Your task is to expand the following e-commerce search query to improve retrieval of relevant products. Generate a list of semantically related terms, synonyms, and common user variations while preserving the original intent.

**Rules:**
1. Prioritize **contextual relevance** (e.g., "running shoes" → "jogging sneakers").
2. Include **common misspellings** (e.g., "earbuds" → "airbuds").
3. Add **technical/layman variants** (e.g., "4K TV" → "ultra HD television").
4. For non-English queries, provide **translations/transliterations** if applicable (e.g., "スマホ" → "smartphone").
5. Output in JSON format for easy parsing.

**Input Query:** "{user_query}"

**Output Format:**  
{{
  "original_query": "...",
  "expanded_terms": [
    {{"term": "...", "type": "synonym"}},
    {{"term": "...", "type": "misspelling"}},
    {{"term": "...", "type": "technical"}}
  ]
}}

**Example Output for "wireless headphones":**
{{
  "original_query": "wireless headphones",
  "expanded_terms": [
    {{"term": "Bluetooth headphones", "type": "synonym"}},
    {{"term": "cordless earphones", "type": "synonym"}},
    {{"term": "wireless headsets", "type": "synonym"}},
    {{"term": "airbuds", "type": "misspelling"}},
    {{"term": "noise-cancelling headphones", "type": "technical"}}
  ]
}}

**Now process this query:** "{user_query}"'''
    response = client.chat.completions.create(
        model="deepseek/deepseek-chat-v3-0324:free",
        messages=[
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
    )
    print(response) #debug
    expanded_queries_raw = response.choices[0].message.content
    if not expanded_queries_raw or expanded_queries_raw.strip() == "":
        raise ValueError("API returned an empty response")
    # keep the regex match in its own variable so the raw text survives when nothing matches
    match = re.search(r'```json\n({.*?})\n```', expanded_queries_raw, re.DOTALL)
    if match:
        expanded_queries_raw = match.group(1)
    else:
        expanded_queries_raw = expanded_queries_raw.strip()  # fallback to raw response

    #print(expanded_queries_raw)
    expanded_queries = json.loads(expanded_queries_raw)
    return expanded_queries

This function returns an expanded version of the user's original query that accounts for misspellings, vague wording, and similar variations.
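
The fence-stripping step inside `get_expanded_queries` can be exercised without calling the API. The following is a minimal sketch, not the code in queryexpansion.py: the helper name `extract_json_payload` and the sample replies are hypothetical, and it assumes the model returns either a Markdown-fenced JSON object or bare JSON.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this snippet readable

# Matches a fenced block (optionally tagged "json") and captures the JSON object inside it.
_FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, re.DOTALL)


def extract_json_payload(raw_text):
    """Parse a model reply that is either fenced JSON or bare JSON (hypothetical helper)."""
    if not raw_text or not raw_text.strip():
        raise ValueError("API returned an empty response")
    match = _FENCED_JSON.search(raw_text)
    payload = match.group(1) if match else raw_text.strip()
    return json.loads(payload)


# Hypothetical replies in the two shapes this sketch handles
fenced_reply = FENCE + 'json\n{"original_query": "running shoos", "expanded_terms": []}\n' + FENCE
bare_reply = '{"original_query": "running shoos", "expanded_terms": []}'

print(extract_json_payload(fenced_reply)["original_query"])
print(extract_json_payload(bare_reply)["original_query"])
```

Keeping the match object separate from the raw text means the fallback branch always has a string to parse.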

In [59]:
query="running shoos"
#demo with misspelling
expanded_queries=get_expanded_queries(query)

expanded_queries

ChatCompletion(id='gen-1745148844-DAke1KpC3PwiP7GkudfC', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='```json\n{\n  "original_query": "running shoos",\n  "expanded_terms": [\n    {"term": "running shoes", "type": "misspelling"},\n    {"term": "jogging shoes", "type": "synonym"},\n    {"term": "sneakers", "type": "synonym"},\n    {"term": "athletic shoes", "type": "synonym"},\n    {"term": "trainers", "type": "synonym"},\n    {"term": "running sneakers", "type": "synonym"},\n    {"term": "running footwear", "type": "synonym"},\n    {"term": "trail running shoes", "type": "technical"},\n    {"term": "road running shoes", "type": "technical"},\n    {"term": "performance running shoes", "type": "technical"},\n    {"term": "running shooes", "type": "misspelling"},\n    {"term": "runing shoes", "type": "misspelling"},\n    {"term": "runnin shoes", "type": "misspelling"},\n    {"term": "スポーツシューズ", "type": "translation"},\n    {"term": "c

{'original_query': 'running shoos',
 'expanded_terms': [{'term': 'running shoes', 'type': 'misspelling'},
  {'term': 'jogging shoes', 'type': 'synonym'},
  {'term': 'sneakers', 'type': 'synonym'},
  {'term': 'athletic shoes', 'type': 'synonym'},
  {'term': 'trainers', 'type': 'synonym'},
  {'term': 'running sneakers', 'type': 'synonym'},
  {'term': 'running footwear', 'type': 'synonym'},
  {'term': 'trail running shoes', 'type': 'technical'},
  {'term': 'road running shoes', 'type': 'technical'},
  {'term': 'performance running shoes', 'type': 'technical'},
  {'term': 'running shooes', 'type': 'misspelling'},
  {'term': 'runing shoes', 'type': 'misspelling'},
  {'term': 'runnin shoes', 'type': 'misspelling'},
  {'term': 'スポーツシューズ', 'type': 'translation'},
  {'term': 'correr zapatos', 'type': 'translation'}]}

We then weight these expanded queries by type, giving the highest weight to the original query.

In [60]:
#weight the different output types
def assign_weights(term_type):
    weights = {
        "synonym": 0.8,
        "misspelling": 0.3,
        "technical": 0.7,
        "translation": 0.6
    }
    return weights.get(term_type, 0.5)  #default weight

def return_weighted_dict(expanded_queries, include_translations): #option to remove translations for certain pipelines
    weighted_terms = [
    {"term": expanded_queries["original_query"], "weight": 1.0}  # Original query (highest priority)
    ]

    if include_translations:
      for item in expanded_queries["expanded_terms"]:
          weighted_terms.append({
              "term": item["term"],
              "weight": assign_weights(item["type"])
          })
    else:
       for item in expanded_queries["expanded_terms"]:
          if item["type"]!="translation":
            weighted_terms.append({
                "term": item["term"],
                "weight": assign_weights(item["type"])
            })
    return weighted_terms

In [61]:
#demo with the above expansions
weighted_queries=return_weighted_dict(expanded_queries, include_translations=False)
weighted_queries

[{'term': 'running shoos', 'weight': 1.0},
 {'term': 'running shoes', 'weight': 0.3},
 {'term': 'jogging shoes', 'weight': 0.8},
 {'term': 'sneakers', 'weight': 0.8},
 {'term': 'athletic shoes', 'weight': 0.8},
 {'term': 'trainers', 'weight': 0.8},
 {'term': 'running sneakers', 'weight': 0.8},
 {'term': 'running footwear', 'weight': 0.8},
 {'term': 'trail running shoes', 'weight': 0.7},
 {'term': 'road running shoes', 'weight': 0.7},
 {'term': 'performance running shoes', 'weight': 0.7},
 {'term': 'running shooes', 'weight': 0.3},
 {'term': 'runing shoes', 'weight': 0.3},
 {'term': 'runnin shoes', 'weight': 0.3}]

### Step 2: Fine-tuning of mBART model
We first fine-tune an mBART model on our Spanish, Italian and Chinese dataset to carry out our machine translation task. The fine-tuning code is in finetune.py, and the resulting model is saved in ./final.

### Step 3: Machine Translation of expanded queries
We then translate these queries using our fine-tuned mBART model. As in Step 1, this is implemented in translate.py but showcased here.

In [62]:
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

lang_code_map = {
    "en": "en_XX",
    "es": "es_XX",
    "it": "it_IT", 
    "cn": "zh_CN"
}

#function to load model and tokenizer
def load_model_and_tokenizer(model_path):
    """Load the model and tokenizer from the saved checkpoint"""
    model = MBartForConditionalGeneration.from_pretrained(model_path)
    tokenizer = MBart50TokenizerFast.from_pretrained(model_path)
    return model, tokenizer

#translation function.
def translate_sentence(model, tokenizer, text, src_lang, tgt_lang):
    """Translate a single sentence"""
    # Set the source language (the target language is forced via forced_bos_token_id below)
    tokenizer.src_lang = lang_code_map[src_lang]
    
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
    
    # Generate translation
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id[lang_code_map[tgt_lang]],
            max_length=64,
            num_beams=4,
            early_stopping=True,
            no_repeat_ngram_size=3,  # Prevent repeating n-grams
            repetition_penalty=2.0,   # Penalize repetition
            length_penalty=1.0,       # Balance between length and score
            temperature=0.7,          # Control randomness
            do_sample=True           # Enable sampling
        )

    # Decode the output
    translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return translation

def translate_expanded(model, tokenizer, query_list, src_lang, tgt_lang):
    for query in query_list:
        query['term']=translate_sentence(model, tokenizer, query['term'], src_lang, tgt_lang)
    return query_list


In [63]:
#demo using expanded queries
tgt_lang='cn'
model, tokenizer = load_model_and_tokenizer("./final")
weighted_queries = translate_expanded(model, tokenizer, weighted_queries, 'en', tgt_lang)

In [64]:
weighted_queries

[{'term': '跑鞋', 'weight': 1.0},
 {'term': '跑鞋', 'weight': 0.3},
 {'term': '慢跑鞋', 'weight': 0.8},
 {'term': '耐鞋', 'weight': 0.8},
 {'term': '運動鞋', 'weight': 0.8},
 {'term': '教练员', 'weight': 0.8},
 {'term': '跑鞋', 'weight': 0.8},
 {'term': '跑鞋', 'weight': 0.8},
 {'term': '路跑鞋', 'weight': 0.7},
 {'term': '路跑鞋', 'weight': 0.7},
 {'term': '性能跑鞋', 'weight': 0.7},
 {'term': '跑鞋', 'weight': 0.3},
 {'term': '耐力鞋', 'weight': 0.3},
 {'term': '蘭寧鞋', 'weight': 0.3}]
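
Several expanded terms collapse to the same translation (for example, '跑鞋' appears five times). Because the BM25 search in Step 4 adds `weight * score` once per list entry, duplicates effectively pool their weights. The sketch below tallies those effective weights from the output above; the helper name `effective_weights` is hypothetical and not part of translate.py.

```python
from collections import defaultdict


def effective_weights(query_list):
    """Sum the weights of entries that share the same term (hypothetical helper)."""
    totals = defaultdict(float)
    for entry in query_list:
        totals[entry["term"]] += entry["weight"]
    # round to hide floating-point noise from repeated addition
    return {term: round(total, 2) for term, total in totals.items()}


# Translated queries copied from the output above
translated = [
    {"term": "跑鞋", "weight": 1.0},
    {"term": "跑鞋", "weight": 0.3},
    {"term": "慢跑鞋", "weight": 0.8},
    {"term": "耐鞋", "weight": 0.8},
    {"term": "運動鞋", "weight": 0.8},
    {"term": "教练员", "weight": 0.8},
    {"term": "跑鞋", "weight": 0.8},
    {"term": "跑鞋", "weight": 0.8},
    {"term": "路跑鞋", "weight": 0.7},
    {"term": "路跑鞋", "weight": 0.7},
    {"term": "性能跑鞋", "weight": 0.7},
    {"term": "跑鞋", "weight": 0.3},
    {"term": "耐力鞋", "weight": 0.3},
    {"term": "蘭寧鞋", "weight": 0.3},
]

for term, weight in sorted(effective_weights(translated).items(), key=lambda kv: -kv[1]):
    print(term, weight)
```

Here '跑鞋' ends up with an effective weight of 3.2, well above the 1.0 assigned to the original query on its own.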

### Step 4: Hybrid Search of expanded queries


#### 4.1 Data Loading

In [65]:
#returns a dictionary of DataFrames keyed by language code
import pandas as pd

def get_data(data_paths):
    data = {} 
    for lang, path in data_paths.items():
        data[lang]=pd.read_pickle(path)
    return data

In [66]:
data_paths={'cn':'en_to_cn_embeddings.pkl', 'es':'en_to_sp_embeddings.pkl', 'it':'en_to_it_embeddings.pkl'}
data = get_data(data_paths)

#### 4.2 BM25 Search

In [67]:
from rank_bm25 import BM25Okapi
import pandas as pd
import jieba

In [68]:
#Build BM_25 corpus
def build_BM25(data):
    #cn
    entocn_chinese_titles = data['cn']['chinese translation']
    entocn_tokenized_cn = [list(jieba.cut_for_search(title.lower())) for title in entocn_chinese_titles]
    bm25_cn = BM25Okapi(entocn_tokenized_cn)

    #es
    entoes_spanish_titles = data['es']['title_spanish']
    # lowercase the titles so they match the lowercased query tokens used at search time
    entoes_tokenized_es = [title.lower().split() for title in entoes_spanish_titles]
    bm25_es = BM25Okapi(entoes_tokenized_es)

    #it
    entoit_italian_titles = data['it']['title_italian']
    entoit_tokenized_it = [title.lower().split() for title in entoit_italian_titles]
    bm25_it = BM25Okapi(entoit_tokenized_it)

    bm25_corpus={'cn':bm25_cn, 'es':bm25_es, 'it':bm25_it}


    return bm25_corpus


#Search BM25
def search_bm25_expanded(query_list, corpus, tgt_lang='cn', top_k=5):
    #init scores as zeros

    scores = [0.0] * len(corpus[tgt_lang].doc_len)

    for query_dict in query_list:
        term=query_dict['term']
        weight=query_dict['weight']
        if tgt_lang=='cn':
            tokens = list(jieba.cut_for_search(term.lower()))  # jieba returns a generator
        else:
            tokens = term.lower().split()
        term_scores = corpus[tgt_lang].get_scores(tokens)

        scores = [s + weight * ts for s, ts in zip(scores, term_scores)]

    # Get top-k ranked indices
    top_k_ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return top_k_ids, [scores[i] for i in top_k_ids]

In [69]:
bm25_corpus = build_BM25(data)

In [70]:
top_ids, top_scores = search_bm25_expanded(weighted_queries, bm25_corpus)

In [71]:
#remember our original search was 'running shoos', misspelled on purpose.

for i, score in zip(top_ids, top_scores):
    print(f"{score:.4f} | {data['cn']['title'][i]}  | {data['cn']['chinese translation'][i]}")

39.3637 | New Balance Female 90 Lightweight Running Shoes  | New Balance 女 90轻量跑鞋 慢跑鞋- WSONIBS
37.0750 | Men Running Shoes Lightweight Sneakers Magic Baby ~ Sd8035  | 慢跑鞋 男款輕量運動鞋 魔法Baby~sd8035
36.7351 | Hole Bow Lazy Running Shoes Peach 1Ce28  | 洞洞蝴蝶結懶人慢跑鞋桃色1CE28
34.7104 | Magic Baby Children Girls Running Shoes Light Sneakers ~ Sa68305  | 魔法Baby 兒童慢跑鞋 中大童輕量運動鞋~sa68305
32.9813 | Mizuno Mizuno Running Shoes Female  | MIZUNO 女 美津浓 慢跑鞋- J1GD183001


#### 4.3 Dense Search

In [72]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

In [73]:
from pinecone import Pinecone
from pinecone import ServerlessSpec
from dotenv import load_dotenv
import os

#Builds a single dense query embedding from the weighted expanded queries
def embed_expanded(query_list, model):
    query_embeddings= []
    #embed expanded queries
    for query_dict in query_list:
        embedding=model.encode(query_dict['term'],  convert_to_tensor=True).cpu().numpy() #size1024
        query_embeddings.append(embedding * query_dict["weight"])

    # Weighted sum divided by the number of terms; cosine search depends only on the direction
    query_embedding = sum(query_embeddings) / len(query_embeddings)
    return query_embedding


def init_index(pc, index_name, data, embedding_col, eng_col, tgt_col, tgt_lang):
    dimension = 1024  # dense embedding size of bge-m3

    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=dimension,
            metric="cosine",  # by cosine similarity
            spec=ServerlessSpec(
                cloud="aws",  # or "gcp"
                region="us-east-1" 
            )
        )

    index = pc.Index(index_name)

    vectors_to_upsert = []
    for idx, row in data.iterrows():
        vectors_to_upsert.append({
            "id": str(idx),  # the DataFrame index doubles as the vector ID
            "values": row[embedding_col],  # precomputed embedding of the target-language title
            "metadata": {
                "title": row[eng_col],
                "chinese_title": row[tgt_col],  # target-language title (same key name for every language)
                "embedding_type": tgt_lang  # Track which embedding was used
            }
        })

    for i in range(0, len(vectors_to_upsert), 100):
        index.upsert(vectors=vectors_to_upsert[i:i+100])

def setup_pinecone(data):
    load_dotenv(dotenv_path='../.env')
    pinecone_api_key = os.getenv('pinecone_API_KEY')
    pc = Pinecone(api_key=pinecone_api_key)

    indexes={'cn':'cn-search', 'it':'it-search', 'es':'es-search'}

    #setup cn
    init_index(pc, index_name=indexes['cn'], data=data['cn'],
     embedding_col='chinese_embedding',
     eng_col='title',
     tgt_col='chinese translation',
     tgt_lang='chinese')

    #setup it
    init_index(pc, index_name=indexes['it'], data=data['it'],
     embedding_col='italian_embedding',
     eng_col='title',
     tgt_col='title_italian',
     tgt_lang='italian')

    #setup es
    init_index(pc, index_name=indexes['es'], data=data['es'],
     embedding_col='spanish_embedding',
     eng_col='title',
     tgt_col='title_spanish',
     tgt_lang='spanish')

    return indexes

def search_pinecone(query_list, embedding_model, index_name, top_k=5):
    load_dotenv(dotenv_path='../.env')
    pinecone_api_key = os.getenv('pinecone_API_KEY')
    pc = Pinecone(api_key=pinecone_api_key)
    index = pc.Index(index_name)
    query_embedding=embed_expanded(query_list, embedding_model)
    results = index.query(
            vector=query_embedding.tolist(),
            top_k=top_k,
            include_metadata=False
        )
    id_list = []
    score_list = []
    for match in results.matches:  # avoid shadowing the built-in `dict`
        id_list.append(int(match['id']))
        score_list.append(float(match['score']))

    return id_list, score_list
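
`embed_expanded` divides the weighted sum of embeddings by the number of terms rather than by the sum of the weights. With a cosine-metric index this does not change the ranking, because cosine similarity ignores vector length. The following is a minimal pure-Python sketch of that property; the 3-dimensional vectors, the weights, and the helper names are made up for illustration (the real embeddings are 1024-dimensional).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def weighted_sum(vectors, weights):
    """Element-wise sum of weight * vector."""
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(len(vectors[0]))]


# Toy embeddings and weights (hypothetical values for illustration only)
term_vectors = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.4], [0.1, 0.9, 0.3]]
weights = [1.0, 0.8, 0.3]
document = [0.7, 0.1, 0.6]

summed = weighted_sum(term_vectors, weights)
by_count = [x / len(weights) for x in summed]   # scaling used in embed_expanded
by_weight = [x / sum(weights) for x in summed]  # a true weighted mean

# Both scalings point in the same direction, so the two similarities agree
print(cosine(by_count, document), cosine(by_weight, document))
```

The scaling would start to matter only with a length-sensitive metric such as dot product or Euclidean distance.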


In [74]:
pinecone_indices=setup_pinecone(data)

In [75]:
top_ids_pc, top_scores_pc =search_pinecone(weighted_queries, model, pinecone_indices['cn'])

In [76]:
#remember our original search was 'running shoos', misspelled on purpose.

for i, score in zip(top_ids_pc, top_scores_pc):
    print(f"{score:.4f} | {data['cn']['title'][i]}  | {data['cn']['chinese translation'][i]}")

0.7090 | Men'S Shoes Sports Shoes Shoes Running Shoes Air Cushion Shoes  | 男鞋運動鞋男休閒鞋跑步鞋氣墊鞋子
0.6922 | Running Sports Shoes Shoes Shoes Shoes  | 韓版跑步運動鞋女鞋學生單鞋女球鞋百搭休閒鞋子
0.6707 | New BALANCE 247 sports shoes running shoes black shoes Child ka247t2p no338  | New Balance 247 運動鞋 跑鞋 黑色 中童 童鞋 KA247T2P no338
0.6674 | New Balance Female 90 Lightweight Running Shoes  | New Balance 女 90轻量跑鞋 慢跑鞋- WSONIBS
0.6673 | Skechers Women When - High Running Shoes  | SKECHERS 女 Liv-High 慢跑鞋 - 99830WSL


#### 4.4 Reciprocal Rank Fusion (RRF)

In [77]:
#recap: Right now, we have BM25 results, returned as
print(top_ids, top_scores)

[902, 349, 630, 356, 156] [39.36373785264429, 37.07501466149296, 36.735098995178525, 34.71038588022547, 32.981256874273285]


In [78]:
#recap: We also have semantic results, returned as
print(top_ids_pc, top_scores_pc)

[951, 654, 782, 902, 783] [0.708953798, 0.692248642, 0.670711398, 0.66745, 0.667325675]


In [79]:
import numpy as np

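def scores_to_ranking(scores: list[float]) -> list[int]:
    """Convert float scores into int rankings (1 = best), aligned with the input order."""
    order = np.argsort(scores)[::-1]  # positions sorted from best to worst score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)  # the best score gets rank 1
    return ranks.tolist()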

def rrf(keyword_rank: int, semantic_rank: int, k: int = 60) -> float:
    """Combine keyword rank and semantic rank into a hybrid score using RRF."""
    return 1 / (k + keyword_rank) + 1 / (k + semantic_rank)


In [80]:
def hybrid_expanded_search(query_list, bm25_corpus, pinecone_indices, embedding_model, tgt_lang='cn', top_k=5):
    # pass tgt_lang through so BM25 searches the same language as the dense index
    bm25_top_ids, bm25_top_scores = search_bm25_expanded(query_list, bm25_corpus, tgt_lang=tgt_lang, top_k=top_k)
    pc_top_ids, pc_top_scores = search_pinecone(query_list, embedding_model, pinecone_indices[tgt_lang], top_k=top_k)
    bm25_ranks = scores_to_ranking(bm25_top_scores)
    pc_ranks = scores_to_ranking(pc_top_scores)

    # Create dictionaries for quick rank lookup
    bm25_rank_dict = {doc_id: rank for doc_id, rank in zip(bm25_top_ids, bm25_ranks)}
    pc_rank_dict = {doc_id: rank for doc_id, rank in zip(pc_top_ids, pc_ranks)}
    
    # Combine all unique document IDs from both methods
    all_doc_ids = list(set(bm25_top_ids) | set(pc_top_ids))
    
    # Calculate RRF scores for each document
    rrf_scores = []
    for doc_id in all_doc_ids:
        # Get ranks from each method (use a high rank if document not found)
        bm25_rank = bm25_rank_dict.get(doc_id, top_k * 2)  # Penalize missing documents
        pc_rank = pc_rank_dict.get(doc_id, top_k * 2)
        
        # Calculate combined RRF score
        score = rrf(bm25_rank, pc_rank)
        rrf_scores.append((doc_id, score))
    
    # Sort documents by RRF score (descending)
    rrf_scores.sort(key=lambda x: -x[1])
    
    # Return every fused candidate (up to 2 * top_k documents), best first
    hybrid_top_ids = [doc_id for doc_id, score in rrf_scores]
    hybrid_top_scores = [score for doc_id, score in rrf_scores]
    
    return hybrid_top_ids, hybrid_top_scores



In [81]:
hybrid_top_id, hybrid_top_scores=hybrid_expanded_search(weighted_queries, bm25_corpus, pinecone_indices, model)

In [82]:
for i, score in zip(hybrid_top_id, hybrid_top_scores):
    print(f"{score:.4f} | {data['cn']['title'][i]}  | {data['cn']['chinese translation'][i]}")

0.0320 | New Balance Female 90 Lightweight Running Shoes  | New Balance 女 90轻量跑鞋 慢跑鞋- WSONIBS
0.0307 | Men'S Shoes Sports Shoes Shoes Running Shoes Air Cushion Shoes  | 男鞋運動鞋男休閒鞋跑步鞋氣墊鞋子
0.0304 | Running Sports Shoes Shoes Shoes Shoes  | 韓版跑步運動鞋女鞋學生單鞋女球鞋百搭休閒鞋子
0.0304 | Men Running Shoes Lightweight Sneakers Magic Baby ~ Sd8035  | 慢跑鞋 男款輕量運動鞋 魔法Baby~sd8035
0.0302 | New BALANCE 247 sports shoes running shoes black shoes Child ka247t2p no338  | New Balance 247 運動鞋 跑鞋 黑色 中童 童鞋 KA247T2P no338
0.0302 | Hole Bow Lazy Running Shoes Peach 1Ce28  | 洞洞蝴蝶結懶人慢跑鞋桃色1CE28
0.0299 | Magic Baby Children Girls Running Shoes Light Sneakers ~ Sa68305  | 魔法Baby 兒童慢跑鞋 中大童輕量運動鞋~sa68305
0.0297 | Skechers Women When - High Running Shoes  | SKECHERS 女 Liv-High 慢跑鞋 - 99830WSL
0.0297 | Mizuno Mizuno Running Shoes Female  | MIZUNO 女 美津浓 慢跑鞋- J1GD183001
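
The fused scores above can be reproduced by hand from the two ranked ID lists printed in the recap cells. The sketch below is self-contained and mirrors the logic of `hybrid_expanded_search` under the same assumptions (k = 60, and a document missing from one list is assigned rank `top_k * 2`); the function name `fuse` is hypothetical.

```python
def rrf(keyword_rank, semantic_rank, k=60):
    """Reciprocal rank fusion of two ranks."""
    return 1 / (k + keyword_rank) + 1 / (k + semantic_rank)


def fuse(bm25_ids, dense_ids, top_k=5):
    """Return {doc_id: rrf_score} for every document in either ranked list."""
    bm25_rank = {doc_id: rank for rank, doc_id in enumerate(bm25_ids, start=1)}
    dense_rank = {doc_id: rank for rank, doc_id in enumerate(dense_ids, start=1)}
    missing = top_k * 2  # penalty rank for a document absent from one list
    return {
        doc_id: rrf(bm25_rank.get(doc_id, missing), dense_rank.get(doc_id, missing))
        for doc_id in set(bm25_ids) | set(dense_ids)
    }


# Ranked IDs copied from the recap cells above
bm25_ids = [902, 349, 630, 356, 156]
dense_ids = [951, 654, 782, 902, 783]

fused = fuse(bm25_ids, dense_ids)
for doc_id, score in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(doc_id, f"{score:.4f}")
```

Document 902 is the only one returned by both searches (rank 1 in BM25, rank 4 in the dense search), so it scores 1/61 + 1/64 ≈ 0.0320 and comes first. Every other document appears in only one list and is pulled down by the penalty rank of 10.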


In [83]:
from bert_score import score
import warnings

def calculate_bertscore(candidate, reference, lang = "en"):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        # Compute scores
        P, R, F1 = score(
            [candidate], 
            [reference], 
            lang=lang,
            model_type="bert-base-multilingual-cased",  # Multilingual BERT
            verbose=False  # Disable progress messages
        )
    return P.item(), R.item(), F1.item()


def get_final_output(query, hybrid_top_id, data, tgt_lang='cn'):
    final_output={}
    for ids in hybrid_top_id:
        if tgt_lang=='cn':
            txt=data[tgt_lang]['chinese translation'][ids]
        elif tgt_lang=='es':
            txt=data[tgt_lang]['title_spanish'][ids]
        elif tgt_lang=='it':
            txt=data[tgt_lang]['title_italian'][ids]

        precision, recall, f1 = calculate_bertscore(txt, query)  # calculate_bertscore returns (P, R, F1)
        final_output[txt]=f1
    return final_output



In [84]:
query="shir long sleeve"

In [85]:
final_output = get_final_output(query, hybrid_top_id, data, tgt_lang='cn')

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

In [86]:
final_output

{'New Balance 女 90轻量跑鞋 慢跑鞋- WSONIBS': 0.6317388415336609,
 '男鞋運動鞋男休閒鞋跑步鞋氣墊鞋子': 0.6080911755561829,
 '韓版跑步運動鞋女鞋學生單鞋女球鞋百搭休閒鞋子': 0.6144701242446899,
 '慢跑鞋 男款輕量運動鞋 魔法Baby~sd8035': 0.6211652159690857,
 'New Balance 247 運動鞋 跑鞋 黑色 中童 童鞋 KA247T2P no338': 0.6184868812561035,
 '洞洞蝴蝶結懶人慢跑鞋桃色1CE28': 0.6251214742660522,
 '魔法Baby 兒童慢跑鞋 中大童輕量運動鞋~sa68305': 0.630289614200592,
 'SKECHERS 女 Liv-High 慢跑鞋 - 99830WSL': 0.6048922538757324,
 'MIZUNO 女 美津浓 慢跑鞋- J1GD183001': 0.6235170960426331}

### Step 5: Evaluation Metrics

Here, we use the debug mode of the implemented search function to generate evaluation metrics for our searches.

#### Testing the final pipeline

In [87]:
from mtpipeline import init_mt_environment, mt_pipeline_search

In [88]:
mBART_model_path="./final"
data_paths={'cn':'en_to_cn_embeddings.pkl', 'es':'en_to_sp_embeddings.pkl', 'it':'en_to_it_embeddings.pkl'}
embed_model = "BAAI/bge-m3"
env_path = "../.env"

In [89]:
#run once at start of front end
mBART_model, mBART_tokenizer, data, bm25_corpus, dense_embed_model, pinecone_indices = init_mt_environment(mBART_model_path, data_paths, embed_model, env_path)

In [90]:
data['cn']['chinese translation'][0]

'OPPO A75 A75s A73 手机壳 软壳 挂绳壳 大眼兔硅胶壳'

In [91]:
query="short sleeved t-shirt"
tgt_lang = "cn" #'es' for Spanish, 'cn' for Chinese, 'it' for Italian
top_k=5

In [92]:
final_output = mt_pipeline_search(query, 
                                    env_path,
                                    mBART_model,
                                    mBART_tokenizer,
                                    data,
                                    bm25_corpus,
                                    pinecone_indices,
                                    dense_embed_model,
                                    tgt_lang, #optional
                                    top_k,) #optional

Expanding queries...
Queries Expanded
Translating Queries...
Searching...
Processing Output...



In [93]:
final_output

{'點點荷葉一字領短上衣': 0.674094021320343,
 '情侶短袖t 夏季水洗33數字短袖t': 0.6550724506378174,
 'Augelute 兒童 套裝 居家森林護肚短袖套裝 31152': 0.6553329825401306,
 '牛仔捲邊破褲 短褲': 0.6830734014511108,
 '韩制。针织洞洞感网状透气短袜': 0.6939300894737244,
 '0~2歲寶寶短袖居家套裝 魔法baby~k50475': 0.6293430328369141,
 '多彩舒適棉素面百搭大尺碼POLO短衫_薰衣紫': 0.6625270247459412,
 'LIYO理優英文字母休閒棉T恤E712003': 0.6467800736427307,
 'iFairies 中大尺碼長袖T恤上衣★ifairies【59000】【59000】': 0.6530770659446716,
 '長版口袋開襟針織外套': 0.6996235847473145}

### Metrics

To evaluate the semantic relevance of search results we used **BERTScore (F1)**, **Sentence-BERT cosine similarity**, and **METEOR**. We selected these metrics to cover more than just simple lexical overlap, aiming to capture the deeper semantic meaning and contextual alignment between the query and the retrieved results.

**BERTScore (F1)** uses contextual embeddings from a pre-trained BERT model to evaluate the similarity between a candidate sentence and query. Unlike traditional token-based metrics, BERTScore considers word usage in context, making it particularly effective at identifying semantic similarity even when different words or phrasing are used.
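
To make the matching concrete, here is a toy sketch of the greedy matching behind BERTScore: each token embedding is paired with its most similar counterpart on the other side, and precision and recall are the averages of those best matches. The 2-dimensional "embeddings" and the function name `greedy_bertscore` are made up for illustration; the real metric uses contextual embeddings from a pre-trained model and supports optional IDF weighting, which is omitted here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def greedy_bertscore(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy best-match cosine similarities."""
    # precision: how well each candidate token is covered by the reference
    precision = sum(max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs) / len(candidate_vecs)
    # recall: how well each reference token is covered by the candidate
    recall = sum(max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical token embeddings (one vector per token)
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

p, r, f1 = greedy_bertscore(candidate, reference)
print(p, r, f1)
```

In this toy case every candidate token has an exact counterpart, so precision is 1, while the unmatched reference token `[1.0, 1.0]` lowers recall and therefore F1.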

**Sentence-BERT cosine similarity** compares sentence-level embeddings in a shared vector space. It measures the overall semantic closeness of the query and result pairs, making it a strong indicator of whether two sentences convey similar meanings holistically.

**METEOR** offers a balance between precision and recall at the word level, incorporating stemming, synonym matching through WordNet, and alignment-based evaluation. It helps account for linguistic variation while still rewarding accurate matches, and has shown strong correlation with human judgments in evaluation studies.
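
METEOR first looks for exact token matches (after lowercasing) before falling back to stems and synonyms, which is why a result that shares no words with the query can score 0. The sketch below only counts exact unigram matches, so it is a rough stand-in for that first stage and not a METEOR implementation. The helper `exact_overlap` is hypothetical, and the titles are taken from the debug output later in this section.

```python
def exact_overlap(query, title):
    """Count query tokens that also appear in the title after lowercasing."""
    title_tokens = set(title.lower().split())
    return sum(1 for token in query.lower().split() if token in title_tokens)


query = "short sleeved t-shirt"
titles = [
    "Polka Dots Lotus Leaf word Short Tops",
    "LIYO-English letter casual cotton T-shirt",
    "Bamboo Cotton Spaghetti Strap Vest",
]

for title in titles:
    print(exact_overlap(query, title), title)
```

The first two titles each share one token with the query, while the third shares none, which is consistent with its METEOR score of 0.0 in the table.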


In [102]:
final_output_debug = mt_pipeline_search(query, 
                                    env_path,
                                    mBART_model,
                                    mBART_tokenizer,
                                    data,
                                    bm25_corpus,
                                    pinecone_indices,
                                    dense_embed_model,
                                    tgt_lang, #optional
                                    top_k,
                                    debug=True) #optional

Expanding queries...
Queries Expanded
Translating Queries...
Searching...
Processing Output...


In [103]:
final_output_debug

| | en | tgt |
|---|---|---|
| 0 | Couple Short-Sleeved T Summer Washed 33 Digita... | 情侶短袖t 夏季水洗33數字短袖t |
| 1 | Polka Dots Lotus Leaf word Short Tops | 點點荷葉一字領短上衣 |
| 2 | Augelute Kids Set Home Forest Belly Short Slee... | Augelute 兒童 套裝 居家森林護肚短袖套裝 31152 |
| 3 | Cowboy Curling Jeans Shorts | 牛仔捲邊破褲 短褲 |
| 4 | Korean Made. Knitted Hole Sexy Mesh Breathable... | 韩制。针织洞洞感网状透气短袜 |
| 5 | 0 ~ 2 Girls Short Sleeve Home Suit Magic Baby ... | 0~2歲寶寶短袖居家套裝 魔法baby~k50475 |
| 6 | PolarStar Women Sweat Quick Dry T-shirt Black ... | PolarStar 女 排汗快干T恤『黑』P18102 |
| 7 | LIYO-English letter casual cotton T-shirt | LIYO理優英文字母休閒棉T恤E712003 |
| 8 | Bamboo Cotton Spaghetti Strap Vest | 竹節棉細肩帶背心 |


In [104]:
from nltk.translate.meteor_score import meteor_score
#calculate METEOR (needs the NLTK WordNet data, e.g. via nltk.download('wordnet'))
final_output_debug['meteor'] = final_output_debug['en'].apply(lambda x: meteor_score([query.split()], x.split()))

#Compute sentenceBERT cosine sim
from sentence_transformers import SentenceTransformer, util
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')  # separate name so the earlier `model` is not overwritten
query_embedding = sbert_model.encode(query, convert_to_tensor=True)
en_embeddings = sbert_model.encode(final_output_debug['en'].tolist(), convert_to_tensor=True)
cosine_scores = util.cos_sim(query_embedding, en_embeddings)[0]
final_output_debug['sbert_cosine'] = cosine_scores.tolist()

#calculate bertscore (with lang="en" and no model_type, bert_score loads roberta-large)
P, R, F1 = score([query] * len(final_output_debug), final_output_debug['en'].tolist(), lang="en", verbose=False)
final_output_debug['bertscore_f1'] = F1.tolist()


Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [105]:
final_output_debug

| | en | tgt | meteor | sbert_cosine | bertscore_f1 |
|---|---|---|---|---|---|
| 0 | Couple Short-Sleeved T Summer Washed 33 Digita... | 情侶短袖t 夏季水洗33數字短袖t | 0.506757 | 0.703328 | 0.860155 |
| 1 | Polka Dots Lotus Leaf word Short Tops | 點點荷葉一字領短上衣 | 0.147059 | 0.327517 | 0.810884 |
| 2 | Augelute Kids Set Home Forest Belly Short Slee... | Augelute 兒童 套裝 居家森林護肚短袖套裝 31152 | 0.520833 | 0.504298 | 0.818984 |
| 3 | Cowboy Curling Jeans Shorts | 牛仔捲邊破褲 短褲 | 0.16129 | 0.365026 | 0.837606 |
| 4 | Korean Made. Knitted Hole Sexy Mesh Breathable... | 韩制。针织洞洞感网状透气短袜 | 0.0 | 0.276768 | 0.822082 |
| 5 | 0 ~ 2 Girls Short Sleeve Home Suit Magic Baby ... | 0~2歲寶寶短袖居家套裝 魔法baby~k50475 | 0.480769 | 0.304197 | 0.847186 |
| 6 | PolarStar Women Sweat Quick Dry T-shirt Black ... | PolarStar 女 排汗快干T恤『黑』P18102 | 0.138889 | 0.404821 | 0.830408 |
| 7 | LIYO-English letter casual cotton T-shirt | LIYO理優英文字母休閒棉T恤E712003 | 0.15625 | 0.550334 | 0.856207 |
| 8 | Bamboo Cotton Spaghetti Strap Vest | 竹節棉細肩帶背心 | 0.0 | 0.366832 | 0.833241 |


The consistently high BERTScore F1 values (0.81 to 0.86) suggest strong semantic similarity between the query and the English titles of the retrieved products, whereas the wider spread in Sentence-BERT cosine similarity (0.28 to 0.70) suggests a potential sensitivity to global semantic shifts. As METEOR relies more on token-level overlap, it scores much lower (0.00 to 0.52). However, since the results are still semantically related to the query, we consider the low METEOR scores acceptable.