# Task Overview

Verify the labelling of a small subset of product and search query data by appropriately prompting an LLM with the relevant product information and associated query.

A product-query pair is given the label "E" for situations where: “the item is relevant for the query and satisfies all the query specifications”.

Not all product-query pairs in the dataset with the label "E" actually meet this definition, e.g. a query might be for "X in size Y" and the product X associated with the query is actually of size Z.

There are two goals within this assignment:
1. identify if records labelled "E" truly do meet the definition
2. for records that do not, formulate a new query so that the label "E" would now be considered correct


## Approach Expectations

The assignment instructions are to use LLMs, either one prompted LLM call or multiple LLM calls in coordination.

The task can be completed with local LLMs on a 16GB MacBook.

The solution should be general enough to handle unseen queries.

## Extra notes:
If the query mentions something that the product information does not mention, the label "E" is still valid.

If the product information contains additional information that the query does not mention, the label can still be "E".
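
The two notes above amount to comparing only the attributes that both sides specify. A minimal sketch of that rule, assuming the query and product have already been parsed into attribute dictionaries (the function name and attribute keys are hypothetical and not part of the assignment):

```python
def attributes_agree(query_attrs: dict, product_attrs: dict) -> bool:
    """Return True when no attribute specified on BOTH sides conflicts."""
    # Attributes present on only one side are ignored, as the notes require.
    shared = query_attrs.keys() & product_attrs.keys()
    return all(
        str(query_attrs[key]).strip().lower() == str(product_attrs[key]).strip().lower()
        for key in shared
    )


query = {"type": "AA", "pack_size": "100"}

# Product has extra information (brand): still a match.
print(attributes_agree(query, {"type": "AA", "pack_size": "100", "brand": "Energizer"}))
# Product conflicts on a shared attribute (type): not a match.
print(attributes_agree(query, {"type": "AAA", "pack_size": "100"}))
# Product is silent on pack size: still a match.
print(attributes_agree(query, {"type": "AA"}))
```

In the assignment itself this comparison is delegated to the LLM; the sketch only makes the intended matching rule explicit.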


## Dataset

The sample to use for this task involves downloading the Amazon esci-data dataset and filtering down to all records with label "E" and query matching one of: 
* “aa batteries 100 pack”
* “kodak photo paper 8.5 x 11 glossy”
* “dewalt 8v max cordless screwdriver kit, gyroscopic”

Create the dataframe df_examples_products described in the Amazon [esci-data](https://github.com/amazon-science/esci-data) repository.

For each query, the dataset provides a list of up to 40 potentially relevant results, together with ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant) indicating the relevance of the product to the query. 

Restrict this dataset to only the three queries above, and the ESCI label “E”.

## Output Requirements

A table with the columns query_id, product_id, is_accurate (boolean indicating whether the original label was accurate), and improved_query (if the label was found to be inaccurate, a reformulation of the query that would make the label “E” correct).
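
As a rough illustration of the required shape, the sketch below builds two output rows and writes them as CSV using only the standard library. The helper `to_output_row` is hypothetical; the IDs and the reformulated query are taken from the multi-step run later in this notebook.

```python
import csv
import io

OUTPUT_COLUMNS = ["query_id", "product_id", "is_accurate", "improved_query"]


def to_output_row(query_id, product_id, model_label, improved_query=None):
    """Build one output row; is_accurate is coerced to a real boolean."""
    is_accurate = bool(model_label)
    return {
        "query_id": query_id,
        "product_id": product_id,
        "is_accurate": is_accurate,
        # An improved query is only meaningful when the label was inaccurate.
        "improved_query": "" if is_accurate else (improved_query or ""),
    }


rows = [
    to_output_row(6014, "B01G1RYHAO", 1),
    to_output_row(6014, "B01B8R6V2E", 0, "aaa batteries 100 pack"),
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=OUTPUT_COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```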

The submission should include:
* A link to a repository with working, runnable code. This can be a notebook
* Documentation to explain any design decisions taken
* Runnable code is a requirement of this exercise

In [1]:
import requests
from time import time

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve



# Gather the dataset

In [2]:
df_examples = pd.read_parquet('shopping_queries_dataset_examples.parquet')
df_products = pd.read_parquet('shopping_queries_dataset_products.parquet')
df_sources = pd.read_csv("shopping_queries_dataset_sources.csv")


df_examples_products = pd.merge(
    df_examples,
    df_products,
    how='left',
    left_on=['product_locale','product_id'],
    right_on=['product_locale', 'product_id']
)

print(f"full dataset size = {df_examples_products.shape[0]}\n")
print(f"Example row:\n{df_examples_products.iloc[0]}\n")

test_queries = [
    "aa batteries 100 pack",
    "kodak photo paper 8.5 x 11 glossy", 
    "dewalt 8v max cordless screwdriver kit, gyroscopic"
]


subset = (
    df_examples_products.loc[
        (df_examples_products["esci_label"] == "E") &
        (df_examples_products["query"].isin(test_queries))
    ]
)

print(f"subset data size = {subset.shape}\n")
print(f"columns: {subset.columns.tolist()}\n")

print(subset.groupby(["query", "query_id"]).size(), "\n")

print(f"example ID is unique: {subset.example_id.is_unique}")
print(f"product ID is unique: {subset.product_id.is_unique}")


# Write out to csv and hand label if the esci_label = E is accurate
# dropping the following columns: 'product_locale','esci_label', 'small_version', 'large_version', 'split',
subset[['example_id', 'query', 'product_title', 'product_description', 'product_bullet_point',
       'product_brand', 'product_color']].to_csv("sample.csv", index=False)

# Load sample.csv into a spreadsheet
# Label each record "EXACT=1" if the product-query pair exactly match and "EXACT=0" if not.
df_labelled = pd.read_csv("labelled.csv").merge(
    subset[["example_id", "query_id", "product_id"]].drop_duplicates(),
    on="example_id"
)
print(df_labelled["EXACT"].value_counts(dropna=False), "\n")

summary = (
    df_labelled.groupby(["query"]).EXACT.mean().reset_index().rename(columns={"EXACT": "% accurate"})
    .merge(df_labelled.groupby(["query"]).size().reset_index().rename(columns={0: "count"}))
)
summary

full dataset size = 2621288

Example row:
example_id                                                              0
query                                                       revent 80 cfm
query_id                                                                0
product_id                                                     B000MOO21W
product_locale                                                         us
esci_label                                                              I
small_version                                                           0
large_version                                                           1
split                                                               train
product_title           Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil...
product_description                                                  None
product_bullet_point    WhisperCeiling fans feature a totally enclosed...
product_brand                                                   Panaso

| | query | % accurate | count |
|---|---|---|---|
| 0 | aa batteries 100 pack | 0.75 | 8 |
| 1 | dewalt 8v max cordless screwdriver kit, gyrosc... | 0.5 | 6 |
| 2 | kodak photo paper 8.5 x 11 glossy | 0.8 | 10 |


# Task Part One: Develop a method to flag suspect "E" labels

Possible approaches:

## Simple encoding of query and product

Embed the query and the product information using SentenceTransformers and then calculate the similarity of the embeddings using cosine similarity. Use MiniLM as the transformer because its training objective was based on sentence similarity.

I do not expect this approach to be good enough to solve the assignment, but I will apply it first: it gives me a feel for the problem and, more importantly, provides a baseline that any more sophisticated approach should easily exceed.
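
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on toy three-dimensional vectors, which merely stand in for the real sentence embeddings; the vector values are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 1.0, 0.0]      # shares no component with the query

print(cosine_similarity(query_vec, same_direction))  # close to 1
print(cosine_similarity(query_vec, orthogonal))      # 0
```

Because the score depends only on direction, it captures overall topical closeness rather than agreement on individual specifications.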

## Single step: prompt the LLM to perform a label classification task

Treat the problem as a binary classification task and just ask the LLM: does this product exactly match the query?

I also don't expect this approach to be the solution that completes the assignment, because it doesn't provide an output that can be built upon to produce the query reformulation for the second part of the task. I will still implement it because it will help me develop a good prompt for evaluating the "E" label and give an idea of the expected improvement over simple embedding similarity.
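
A binary classifier built on free-text output needs a step that maps the reply to a label. The sketch below is one tolerant way to do that, assuming the model is asked to reply "Yes" or "No"; `parse_yes_no` is a hypothetical helper and is not the parsing used in the cells below, which compare the reply to the exact strings.

```python
import re


def parse_yes_no(response: str):
    """Map a raw LLM reply to 1 (Yes), 0 (No), or None when it is neither."""
    # Lower-case and drop everything except letters, so "Yes." and " no\n" still parse.
    cleaned = re.sub(r"[^a-z]", "", response.strip().lower())
    if cleaned == "yes":
        return 1
    if cleaned == "no":
        return 0
    return None


print(parse_yes_no("Yes"))
print(parse_yes_no(" no.\n"))
print(parse_yes_no("Yes, the product matches"))  # not a bare Yes/No
```

Returning `None` for anything else keeps malformed replies separate from genuine "No" answers, so they can be retried or inspected.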


## Multi-step prompt chained reasoning

Prompt the LLM in several steps. First have it separately analyse the product information, summarising the key attributes. Second have it analyse the query information, examining the intent and specifics of the product being searched for. Then, using the output from the previous two stages, have the LLM evaluate if the product matches the query.

This naturally leaves an option to reformulate the query if the product-query match was found to be inaccurate. We can prompt the LLM again with the analyses from the first two steps and request that it reformulates a better query.
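
The control flow of such a chain can be sketched independently of any particular model. In the sketch below the `llm` argument is any callable that takes a prompt string and returns a reply string; the prompt wording and the `fake_llm` stub are placeholders for illustration, not the prompts used later in this notebook.

```python
from typing import Callable, Optional


def verify_and_reformulate(query: str, product: str, llm: Callable[[str], str]) -> dict:
    """Run the four-step chain: analyse query, analyse product, judge, rewrite."""
    query_summary = llm(f"Summarise the intent and specifications of this query: {query}")
    product_summary = llm(f"Summarise the key attributes of this product: {product}")
    verdict = llm(
        "Answer Yes or No. Does the product match the query?\n"
        f"Query analysis: {query_summary}\nProduct analysis: {product_summary}"
    )
    is_match = verdict.strip().lower() == "yes"

    improved: Optional[str] = None
    if not is_match:
        # Only spend the extra call when the label was judged inaccurate.
        improved = llm(
            "Rewrite the query so the product satisfies it.\n"
            f"Original query: {query}\nProduct analysis: {product_summary}"
        ).strip()
    return {"is_accurate": is_match, "improved_query": improved}


def fake_llm(prompt: str) -> str:
    """Canned replies standing in for a real model call."""
    if prompt.startswith("Answer Yes or No"):
        return "No"
    if prompt.startswith("Rewrite"):
        return "aaa batteries 100 pack"
    return "summary"


result = verify_and_reformulate(
    "aa batteries 100 pack",
    "Amazon Basics 100 Pack AAA High-Performance Alkaline Batteries",
    fake_llm,
)
print(result)
```

Passing the model in as a callable makes the chain easy to exercise with a stub before pointing it at a local LLM.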



### 1. Embedding + Similarity Search

This simple approach barely improves on the accuracy of the original "E" labels. For the "dewalt 8v max cordless screwdriver kit, gyroscopic" query it raises accuracy from 0.50 to 0.83, but it leaves the kodak query unchanged at 0.80 and lowers the "aa batteries 100 pack" query from 0.75 to 0.50. The mismatches for these two queries seem to come from numeric or non-word attributes (the paper size or the battery type), which the model was not trained to understand.
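
To make the point about non-word attributes concrete, the sketch below lists query tokens that never appear in a product title. This is a plain lexical check added here for illustration, not part of the embedding baseline; the function name and token pattern are assumptions, and the titles are copied from this notebook's outputs.

```python
import re

# Lower-case alphanumeric tokens, keeping decimals such as "8.5" together.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:\.[0-9]+)?")


def missing_query_tokens(query: str, title: str) -> list:
    """Return query tokens that do not occur as whole tokens in the title."""
    title_tokens = set(TOKEN_PATTERN.findall(title.lower()))
    return [tok for tok in TOKEN_PATTERN.findall(query.lower()) if tok not in title_tokens]


battery_title = (
    "Amazon Basics 100 Pack AAA High-Performance Alkaline Batteries, "
    "10-Year Shelf Life, Easy to Open Value Pack"
)
print(missing_query_tokens("aa batteries 100 pack", battery_title))

charger_title = "DEWALT DCB095 8V MAX Battery Charger"
print(missing_query_tokens("dewalt 8v max cordless screwdriver kit, gyroscopic", charger_title))
```

An exact-token check notices that "aa" is absent from the AAA title, a one-character difference that an embedding can easily smooth over. It would, however, also flag harmless paraphrases (for example "count" versus "pack"), so it is a diagnostic rather than a replacement for the LLM-based check.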


In [3]:
# baseline heuristic

# all-MiniLM-L6-v2 is a small encoder-only Transformer
model = SentenceTransformer("all-MiniLM-L6-v2")

def score_similarity(query, product_title):
    embeddings = model.encode([query.lower(), product_title.lower()])
    return util.cos_sim(embeddings[0], embeddings[1]).item()

df_labelled['sim_score'] = df_labelled.apply(lambda row: score_similarity(row['query'], row['product_title']), axis=1)


# Evaluate the performance of the similarity scores
print(f"All queries ROC = {roc_auc_score(df_labelled['EXACT'], df_labelled['sim_score']):.2f}\n")
for query in test_queries:
    tmp = df_labelled.loc[df_labelled["query"] == query]
    print(f"query = {query}:")
    print(f"ROC = {roc_auc_score(tmp['EXACT'], tmp['sim_score']):.2f}\n")

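# Select the threshold that maximises F1
precisions, recalls, thresholds = precision_recall_curve(df_labelled["EXACT"], df_labelled["sim_score"])
# precisions/recalls have one more entry than thresholds, so drop the final (recall = 0) point;
# np.divide with `where` leaves F1 at 0 instead of NaN when precision + recall == 0
denominator = precisions[:-1] + recalls[:-1]
f1_scores = np.divide(
    2 * precisions[:-1] * recalls[:-1],
    denominator,
    out=np.zeros_like(denominator),
    where=denominator > 0,
)
best_threshold = thresholds[np.argmax(f1_scores)]
print(f"best similarity threshold: {best_threshold:.4f}")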

df_labelled["simple_embed_label"] = df_labelled["sim_score"] > best_threshold
df_labelled["simple_embed_label_accuracy"] = df_labelled["simple_embed_label"] == df_labelled["EXACT"]

summary.merge(
    df_labelled
    .groupby(["query"])
    .simple_embed_label_accuracy
    .mean()
    .reset_index()
    .rename(columns={
        "% accurate": "original % accurate",
        "simple_embed_label_accuracy": "simple label % accurate"
    })
)

All queries ROC = 0.64

query = aa batteries 100 pack:
ROC = 0.33

query = kodak photo paper 8.5 x 11 glossy:
ROC = 0.50

query = dewalt 8v max cordless screwdriver kit, gyroscopic:
ROC = 1.00

best similarity threshold: 0.5989


| | query | % accurate | count | simple label % accurate |
|---|---|---|---|---|
| 0 | aa batteries 100 pack | 0.75 | 8 | 0.5 |
| 1 | dewalt 8v max cordless screwdriver kit, gyrosc... | 0.5 | 6 | 0.833333 |
| 2 | kodak photo paper 8.5 x 11 glossy | 0.8 | 10 | 0.8 |


In [4]:
(
    df_labelled
    .loc[
        df_labelled["query"] == "dewalt 8v max cordless screwdriver kit, gyroscopic", 
        ["product_title", "EXACT", "simple_embed_label"]
    ]
)

| | product_title | EXACT | simple_embed_label |
|---|---|---|---|
| 8 | DEWALT XTREME 12V MAX Cordless Screwdriver, 1/... | 0 | True |
| 9 | ENERTWIST Cordless Screwdriver, 8V Max 10Nm El... | 0 | False |
| 10 | DEWALT DCF680N2 8V Max Gyroscopic Screwdriver ... | 1 | True |
| 11 | DEWALT DCB095 8V MAX Battery Charger | 0 | False |
| 12 | DEWALT 8V MAX Cordless Screwdriver Kit, Gyrosc... | 1 | True |
| 13 | DEWALT 8V MAX Cordless Screwdriver Kit, Gyrosc... | 1 | True |


### 2. Single prompt

To prompt LLMs I downloaded the [LM Studio application](https://lmstudio.ai/) and the "meta-llama-3.1-8b-instruct" model. I loaded this model in LM Studio, which serves it through an API on my local machine.

I'm using the "v1/chat/completions" endpoint because it accepts separate system and user messages, which lets me put the instructions in a system prompt that an instruction-tuned model should follow closely.

The result is much improved now that we are using a large instruction-tuned generative model rather than a small embedding-focused encoder model (MiniLM): only one of the 24 records is misclassified. It still wrongly considers an AAA battery to be an exact match for the "aa batteries 100 pack" query.
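
The request body for a chat-completions call is a JSON object holding a list of role-tagged messages plus sampling settings. The sketch below only builds and prints such a body without sending it. The `temperature` of 0.0 and `max_tokens` of 5 are assumptions that might suit a Yes/No task; they are not the values used in the cell that follows, and `build_chat_payload` is a hypothetical helper.

```python
import json


def build_chat_payload(system_prompt, user_prompt, model="meta-llama-3.1-8b-instruct"):
    """Assemble a chat-completions request body with one system and one user message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        # Assumed settings: a short, low-randomness reply for a Yes/No answer.
        "max_tokens": 5,
        "temperature": 0.0,
    }


payload = build_chat_payload(
    "You answer only Yes or No.",
    "Query: aa batteries 100 pack. Product title: Energizer Advanced AA Alkaline Bulk Battery - 100 Count.",
)
print(json.dumps(payload, indent=2))
```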



In [5]:
# LM Studio endpoint
ENDPOINT = "http://localhost:1234/v1/chat/completions"

HEADERS = {
    "Content-Type": "application/json"
}

def call_llm(payload):
    response = requests.post(ENDPOINT, headers=HEADERS, json=payload)
    response.raise_for_status()
    r = response.json()
    return {
        "response": r["choices"][0]["message"]["content"].strip(),
        "usage": r["usage"]
    }


def create_payload(
    messages, 
    model="meta-llama-3.1-8b-instruct", 
    max_tokens=200,
    temperature=0.5,
    top_p=0.95,
    n=1,
    stream=False):
    
    payload = {
        "messages": messages,
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "n": n,
        "stream": stream
        }
    return payload


def create_user_message(query, product_title, product_description, product_bullet_point, product_brand, product_color):
    return {
    "role": "user",
    "content": f"""Here is the product information:
    title: {product_title}
    description: {product_description}
    bullet point: {product_bullet_point}
    brand: {product_brand}
    color: {product_color}

Here is the query: "{query}"

Does the product information exactly match the information asked for in the query?
Make sure you check all the attributes
"""}


context = {
    "role": "system", 
    "content": """You are a helpful assistant that answers only "Yes" or "No".

You will be given product information and a query.

Your only job is to analyse the information and decide if the product information exactly matches the information asked for in the query.
When you analyse the information, compare the product attributes to attributes given in the query.
Make sure that identical attributes have identical values.

It is considered a match if the query contains extra information that is not included in
the product information, and vice versa, it is still considered a match if the product information
contains extra information that is not included in the query.

If any information relating to the same attribute appears to be different you must respond "No".

You must only respond with "Yes" or "No", no other words or characters.
"""
}

results = []
for index, record in df_labelled.iterrows():
    
    query = record["query"]
    product_title = record["product_title"]
    product_description = record["product_description"]
    product_bullet_point = record["product_bullet_point"]
    product_brand = record["product_brand"]
    product_color = record["product_color"]

    truth_label = record["EXACT"]

    user_message = create_user_message(
        query, 
        product_title, 
        product_description, 
        product_bullet_point, 
        product_brand, 
        product_color)
    
    payload = create_payload(messages=[context, user_message])
    result = call_llm(payload)
    results.append(result)
    
    # Print the model's response
    model_label_int = None
    model_label = result["response"]
    print(f"result = {model_label}, truth = {truth_label}")
    if model_label not in ["Yes", "No"]:
        print("response failed")
    elif model_label == "Yes":
        model_label_int = 1
    elif model_label == "No":
        model_label_int = 0

    if truth_label != model_label_int:
        print(f"WARNING! accuracy error:")
        print(f"query was {query}")
        print(f"product title: {product_title}\n")


result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 1
result = No, truth = 0
result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 0
query was aa batteries 100 pack
product title: Amazon Basics 100 Pack AAA High-Performance Alkaline Batteries, 10-Year Shelf Life, Easy to Open Value Pack

result = No, truth = 0
result = No, truth = 0
result = Yes, truth = 1
result = No, truth = 0
result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 1
result = No, truth = 0
result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 1
result = No, truth = 0
result = Yes, truth = 1
result = Yes, truth = 1


#### Test how well method generalises

* Pull another subset of queries labelled "E" and apply the best method
* Evaluate performance: if it is similar to the proof-of-concept (PoC) subset, move on; if it is significantly worse, attempt to tweak the PoC

The method struggles when the query text differs significantly from the product information text even though the product can be considered an exact match to the query, e.g. the "ankle socks for women size 8" query is not considered to match "Saucony Women's Performance Heel Tab Athletic Socks (8 & 16, Grey Assorted (8 Pairs), Shoe Size: 5-10". It also struggles to tell the difference between a "no show" sock and an "ankle" sock.

I expect that the multi-step approach I will try next may help resolve these kinds of problems.
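
Comparing runs is easier with a few summary numbers than by reading the printed result lines. A minimal sketch, assuming the hand labels and model labels are available as lists of 0/1 values; the two short lists are illustrative and `classification_summary` is a hypothetical helper.

```python
def classification_summary(truth, predicted):
    """Accuracy, precision and recall for binary 0/1 labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against empty denominators when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


truth = [1, 0, 0, 1, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]
print(classification_summary(truth, predicted))
```

Here a false positive is an inaccurate "E" label that the model accepted, and a false negative is a correct "E" label that the model rejected.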



In [6]:
# Want to avoid overly general queries like "laptop" so filter for queries that contain more than 5 words
df_examples_products["long query"] = df_examples_products["query"].apply(lambda x: len(x.split()) > 5)

# order by more frequently appearing queries in the dataset
# hoping this means the query corresponds to a common user need
query_counts = (
    df_examples_products
    .loc[
        (df_examples_products["esci_label"] == "E") &
        (df_examples_products["long query"] == True)
    ]
    .groupby("query")
    .size()
    .sort_values(ascending=False)
)

# eyeball and select some that might be trickier
validation_queries = [
    "yoga outfits for women 2 piece set high waist", 
    "ankle socks for women size 8", 
    "500 piece jigsaw puzzles for adults",
]

validation_df = (
    df_examples_products
    .loc[
        (df_examples_products["esci_label"] == "E") &
        (df_examples_products["query"].isin(validation_queries))
    ]
).sample(25)

validation_df[['example_id', 'query', 'product_title', 'product_description', 'product_bullet_point',
       'product_brand', 'product_color']].to_csv("validation_sample.csv", index=False)

# handlabelled validation set: I did modify some queries so that the label "E" became wrong
validation_df_labelled = pd.read_csv("validation_sample_labelled.csv")
validation_df_labelled.head()

Unnamed: 0,example_id,query,product_title,product_description,product_bullet_point,product_brand,product_color,EXACT
0,114383,500 piece jigsaw puzzles for adults,Springbok's 500 Piece Jigsaw Puzzle The Dog Pa...,,500 PIECE PUZZLE FOR ADULTS - Featuring a fini...,Springbok,Multi,1
1,114374,1000 piece jigsaw puzzles for adults,Springbok 500 Piece Jigsaw Puzzle Simpler Time...,,500 PIECE PUZZLE FOR ADULTS - Featuring a fini...,Springbok,Multi,0
2,2242454,yoga outfits for women 2 piece set low waist,HYZ Women's Workout 2 Piece Outfits High Waist...,<b>HYZ Women's Seamless High Waist Workout Leg...,GREAT FABRIC: Workout sets for women 2 piece m...,HYZ,Grey,0
3,205041,ankle socks for women size 8,Saucony Women's Performance Heel Tab Athletic ...,,Weave Type: Knit,Saucony,Grey Assorted (8 Pairs),1
4,2242452,yoga outfits for women 2 piece set high waist,Women’s Two Piece Outfits Yoga Pants Set Seaml...,,✅ 【SUPER VALUE 2 PIECE WORKOUT OUTFITS】1 short...,PINKSAVIOR,Light Blue Yoga Set,1


In [7]:
validation_results = []
for index, record in validation_df_labelled.iterrows():
    
    query = record["query"]
    product_title = record["product_title"]
    product_description = record["product_description"]
    product_bullet_point = record["product_bullet_point"]
    product_brand = record["product_brand"]
    product_color = record["product_color"]

    truth_label = record["EXACT"]

    user_message = create_user_message(
        query, 
        product_title, 
        product_description, 
        product_bullet_point, 
        product_brand, 
        product_color)
    
    payload = create_payload(messages=[context, user_message])
    result = call_llm(payload)
    validation_results.append(result)
    
    # Print the model's response
    model_label_int = None
    model_label = result["response"]
    print(f"result = {model_label}, truth = {truth_label}")
    if model_label not in ["Yes", "No"]:
        print("response failed")
    elif model_label == "Yes":
        model_label_int = 1
    elif model_label == "No":
        model_label_int = 0

    if truth_label != model_label_int:
        print(f"WARNING! accuracy error:")
        print(f"query was {query}")
        print(f"product title: {product_title}\n")



result = Yes, truth = 1
result = No, truth = 0
result = No, truth = 0
result = No, truth = 1
query was ankle socks for women size 8
product title: Saucony Women's Performance Heel Tab Athletic Socks (8 & 16, Grey Assorted (8 Pairs), Shoe Size: 5-10

result = Yes, truth = 1
result = Yes, truth = 0
query was ankle socks for women size 8
product title: Hanes Women's 10-Pair Value Pack No Show Socks

result = Yes, truth = 1
result = Yes, truth = 1
result = No, truth = 0
result = No, truth = 0
result = No, truth = 0
result = No, truth = 0
result = No, truth = 0
result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 1
result = No, truth = 1
query was ankle socks for women size 8
product title: Hanes Ultimate Women's 6-Pack Ankle Socks, White, 5-9

result = Yes, truth = 1
result = No, truth = 0
result = Yes, truth = 1
result = No, truth = 0
result = No, truth = 0
result = Yes, truth = 1
result = Yes, truth = 1
result = Yes, truth = 1


### 3. Multi-step

Instead of attempting to evaluate the product-query match accuracy and reformulate the query in a single shot it makes more sense to complete the task using multiple steps of reasoning.

In the first couple of steps I want the LLM to broadly understand the content and what is important about the information in the query and the information about the product separately and provide a summary for each.

In the next step I want the LLM to evaluate if the product is a match for the query based upon the summaries it has created. This step provides a simple Yes/No output and if the match is found to be accurate then the process can end here.

In the final step, if the match was found to be inaccurate, the product and query summaries generated in the first two steps can be reused to produce a reformulation of the original query that would now be considered a match to the product.

This approach proves to be a bit worse at evaluating the accuracy of the "E" label: it wrongly evaluates 4 of the 24 records, compared with 1 of 24 for the single prompt. The summarisation of the query and product could introduce some hallucination or omit relevant information, making the match evaluation step more likely to fail. Further refinement of the prompts for the first two steps could reduce these errors, and a larger, higher-performance model could improve the LLM responses and so the accuracy of the evaluation step.

To reformulate the query I instructed the LLM to keep as close to the original query as possible, so that it just "corrects" the specifications that don't match the product information. Generally this approach performed acceptably, but it trusts that the LLM understands the importance and semantics of each term in the query. This relies on the quality of the query and product summarisation steps, which can themselves contain inaccuracies. Again, further refinement of the prompts and/or a more powerful LLM could go a long way towards solving the errors.
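
The reformulated queries in the output table come back wrapped in quotes (for example 'AAA batteries 100 pack'). A small post-processing step could tidy them before they are written out. The helper below is a hypothetical sketch and is not applied anywhere in this notebook.

```python
def clean_reformulated_query(raw: str) -> str:
    """Strip wrapping quotes and collapse whitespace in an LLM-written query."""
    cleaned = raw.strip()
    # Remove matching quote characters from both ends, repeating for nested layers.
    while len(cleaned) >= 2 and cleaned[0] == cleaned[-1] and cleaned[0] in "'\"":
        cleaned = cleaned[1:-1].strip()
    # Collapse internal runs of whitespace (including newlines) to single spaces.
    return " ".join(cleaned.split())


print(clean_reformulated_query("'AAA batteries 100 pack'"))
print(clean_reformulated_query('  "kodak photo paper\n8.5 x 11 matte" '))
```

Quotes that appear only inside the query, or on one end only, are left untouched.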



In [8]:
context = {
    "role": "system", 
    "content": """
You are a meticulous and detail-oriented assistant specializing in semantic search and query refinement for product searches
made on Amazon.

Do not explain your answers unless explicitly asked.
Do not provide notes to explain your responses.
Follow all formatting instructions exactly.
"""}


def agentic_query_reformulation(original_query, matched_product, true_label, verbose=False):
    
    print(f"\nOriginal Query: {original_query}")
    print(f"Matched Product: {matched_product['product_title']}")

    # Analyse the query
    prompt1 = f"""Analyze the content in the following search query: '{original_query}'. 
Identify the main intent, key entities, and any potential ambiguities or areas for clarification.
Understand the specifications in the query semantically.
"""
    messages = [context, {"role": "user", "content": prompt1}]
    payload = create_payload(messages)
    query_analysis = call_llm(payload)["response"]
    if verbose:
        print("\nQUERY ANALYSIS")
        print(query_analysis)

    # Analyse the product
    prompt2 = f"""Summarize the main specifications of the following product: '{matched_product.to_dict()}'.
Understand what the key entities of the product are.
Make sure you differentiate between important entities and minor product details.
Understand the product specifications semantically.
"""
    messages = [context, {"role": "user", "content": prompt2}]
    payload = create_payload(messages)
    product_analysis = call_llm(payload)["response"]
    if verbose:
        print("\nPRODUCT ANALYSIS")
        print(product_analysis)

    # Determine if product is an exact match to the query
    prompt3 = f"""You will determine if a product is an accurate match to a query using
the following output from an analysis of the query: '{query_analysis}', and using
the following output from an analysis of product information: '{product_analysis}'
    
Do the main specifications of the product information correspond to the specifications included in the query?

How to handle ambiguities: if the product information does not mention a specification
contained in the query it can still be considered to match, also if the product information has
additional information or additional products beyond what are requested in the query the product can still be considered a match.

If the product is an accurate match to a query respond "Yes".
If the product is not an accurate match to the query respond "No".

You must only respond with "Yes" or "No", no other words or characters.
Do not return anything other than "Yes" or "No".
"""
    messages = [context, {"role": "user", "content": prompt3}]
    payload = create_payload(messages)
    match_analysis = call_llm(payload)["response"]

    if match_analysis not in ["Yes", "No"]:
        print("response failed: did not return Yes or No")
        print(f"response was: {match_analysis}")
        print("labelling query as inaccurate to continue ... ")
        model_label = 0
    elif match_analysis == "Yes":
        model_label = 1
    elif match_analysis == "No":
        model_label = 0

    if model_label != true_label:
        print("LABEL VERIFICATION WRONG! (continuing anyway)")

    reformulated_query = None
    if model_label == 0:
        print("Match not found to be accurate: refining query ...")
        prompt4 = f"""You are tasked with rewriting a query so a product now satisfies all of the specifications given in the query.
        
The following is the original query you are to rewrite: '{original_query}'
Use the following query analysis '{query_analysis}' and product analysis '{product_analysis}' to help you rewrite the query.

Do not add new specifications to the original query, instead edit terms within the original query 
so that there are no conflicts with product specifications you found in the product analysis.

Only provide the rewritten query in your response. Do not provide any other information.
Respond only with the rewritten query.
"""
        messages = [{"role": "user", "content": prompt4}]
        payload = create_payload(messages)
        reformulated_query = call_llm(payload)["response"]
        
    return {
        "query_analysis": query_analysis,
        "product_analysis": product_analysis,
        "model_label": model_label,
        "reformulated_query": reformulated_query
    }


start_time = time()
output = []
for index, record in df_labelled.iterrows():
    
    original_query = record["query"]
    matched_product = record[['product_title', 'product_description', 'product_bullet_point', 'product_brand', 'product_color']]
    true_label = record["EXACT"]
    
    prompt_start_time = time()
    results = agentic_query_reformulation(original_query, matched_product, true_label, verbose=True)
    time_taken = time() - prompt_start_time
    print(f"Analysis took {time_taken:.2f} secs")

    if results["model_label"] == true_label:
        print("Match analysis was a success!")
    else:
        print("Match analysis failed!")
        if results["model_label"] == 0:
            print("Analysis should have found the match to be accurate")
        elif results["model_label"] == 1:
            print("Analysis should have found the match to be inaccurate")
            
    if results["reformulated_query"]:
        print(f"Analysis found label to be inaccurate (model_label = {results['model_label']}), so reformulating new query:")
        print(f"new query: {results['reformulated_query']}")
    else:
        print("Analysis found label to be accurate, not reformulating query :+1:")
    
    output_row = {
        "query_id": record["query_id"],
        "product_id": record["product_id"],
        "is_accurate": results["model_label"],
        "improved_query": results["reformulated_query"],
        # extra columns
        "truth_label": record["EXACT"],
        "original_query": original_query,
        "product_title": record["product_title"],
        "query_analysis": results["query_analysis"],
        "product_analysis": results["product_analysis"],
    }
    output.append(output_row)
time_taken = (time() - start_time)/60.
print(f"Total time taken: {time_taken:.2f} mins")
output_df = pd.DataFrame(output)
output_df.to_csv("output_df.csv", index=False)
output_df




Original Query: aa batteries 100 pack
Matched Product: Energizer Advanced AA Alkaline Bulk Battery - 100 Count

QUERY ANALYSIS
**Search Query Analysis**

* **Main Intent:** The user is looking to purchase a large quantity of AA batteries (at least 100 units).
* **Key Entities:**
	+ **Product Type:** AA batteries
	+ **Quantity:** 100 pack
* **Potential Ambiguities or Areas for Clarification:**
	+ Battery type (e.g., alkaline, nickel-cadmium, lithium)
	+ Brand or manufacturer preference
	+ Price range or budget constraint
	+ Additional features (e.g., long-lasting, leak-proof)
* **Semantic Specifications:**
	+ The user is likely looking for a pack of 100 AA batteries that can be used as a consumable item.
	+ The quantity is the primary concern, and the user may prioritize price over other factors.

**Refined Query Suggestions**

* "AA alkaline batteries 100 count"
* "Long-lasting AA batteries 100 pack"
* "Cheap AA batteries 100 pack"

PRODUCT ANALYSIS
**Product Specifications:**

1. **B

Unnamed: 0,query_id,product_id,is_accurate,improved_query,truth_label,original_query,product_title,query_analysis,product_analysis
0,6014,B01G1RYHAO,1,,1,aa batteries 100 pack,Energizer Advanced AA Alkaline Bulk Battery - ...,**Search Query Analysis**\n\n* **Main Intent:*...,**Product Specifications:**\n\n1. **Battery Ty...
1,6014,B07FP5DNBG,1,,1,aa batteries 100 pack,"IMPECCA AA Batteries, All Purpose Alkaline Bat...",**Main Intent:** \n- The user is looking to pu...,**Product Specifications:**\n\n1. **Battery Ty...
2,6014,B07F7RH8D4,1,,1,aa batteries 100 pack,Allmax AA Maximum Power Alkaline Batteries (10...,**Search Query Analysis**\n\n* **Main Intent:*...,**Product Title:** Allmax AA Maximum Power Alk...
3,6014,B01B8R6PF2,1,,1,aa batteries 100 pack,Amazon Basics 100 Pack AA High-Performance Alk...,**Search Query Analysis**\n\n* **Main Intent:*...,**Product Title:** Amazon Basics 100 Pack AA H...
4,6014,B00LHSAARW,1,,0,aa batteries 100 pack,"Rayovac AA Alkaline Double A Batteries, 60 Count",**Main Intent:** \n- The user is looking to pu...,**Product Specifications:**\n\n1. **Battery Ty...
5,6014,B00KMDL8U6,1,,1,aa batteries 100 pack,Energizer AA Max Alkaline E91 Batteries Made i...,**Main Intent:** \n- The user is looking to pu...,**Main Specifications:**\n\n1. **Battery Type*...
6,6014,B004SCA15K,1,,1,aa batteries 100 pack,"ACDelco 100-Count AA Batteries, Maximum Power ...",**Main Intent:** \n- The user is looking to pu...,**Product Specifications:**\n\n* **Battery Typ...
7,6014,B01B8R6V2E,0,'AAA batteries 100 pack',0,aa batteries 100 pack,Amazon Basics 100 Pack AAA High-Performance Al...,**Main Intent:** \n- The primary intention of ...,**Product Specifications:**\n\n1. **Battery Ty...
8,32814,B07TWK2S22,0,"'Dewalt 12v max cordless screwdriver kit, gyro...",0,"dewalt 8v max cordless screwdriver kit, gyrosc...","DEWALT XTREME 12V MAX Cordless Screwdriver, 1/...",**Main Intent:** \n- The user is looking to pu...,**Product Specifications:**\n\n1. **Power Sour...
9,32814,B0812ZHY5N,1,,0,"dewalt 8v max cordless screwdriver kit, gyrosc...","ENERTWIST Cordless Screwdriver, 8V Max 10Nm El...",**Main Intent:** \n- Purchase a Dewalt product...,**Key Entities:**\n\n1. **Product Name**: ENER...


In [9]:
output_df.is_accurate.value_counts(dropna=False)

is_accurate
1    21
0     3
Name: count, dtype: int64

In [10]:
# Wrong label evaluations: rows where the LLM verdict disagrees with truth_label
mismatches = output_df.loc[output_df.is_accurate != output_df.truth_label]
print(f"{mismatches.shape[0]} labels wrongly evaluated")
mismatches

4 labels wrongly evaluated


Unnamed: 0,query_id,product_id,is_accurate,improved_query,truth_label,original_query,product_title,query_analysis,product_analysis
4,6014,B00LHSAARW,1,,0,aa batteries 100 pack,"Rayovac AA Alkaline Double A Batteries, 60 Count",**Main Intent:** \n- The user is looking to pu...,**Product Specifications:**\n\n1. **Battery Ty...
9,32814,B0812ZHY5N,1,,0,"dewalt 8v max cordless screwdriver kit, gyrosc...","ENERTWIST Cordless Screwdriver, 8V Max 10Nm El...",**Main Intent:** \n- Purchase a Dewalt product...,**Key Entities:**\n\n1. **Product Name**: ENER...
15,58953,B085F42SV6,1,,0,kodak photo paper 8.5 x 11 glossy,"Kodak photo paper 8.5 x 11 matte, 100 count 39...",**Main Intent:** The user is looking to purcha...,**Product Specifications:**\n\n* **Brand:** KO...
21,58953,B000EZ0CTK,1,,0,kodak photo paper 8.5 x 11 glossy,"Kodak Photo Paper for inkjet printers, Matte F...",**Main Intent:** The user intends to purchase ...,**Product Specifications:**\n\n* **Type**: Pho...
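
All four disagreements listed above go the same way: `is_accurate` is 1 while `truth_label` is 0, so the model accepted an "E" label that should have been rejected. A cross-tabulation makes that direction visible at a glance. The sketch below is illustrative only: it rebuilds a small frame from the ten rows displayed earlier rather than using the full 24-row `output_df`, and the names `shown`, `confusion`, and `agreement` are hypothetical.

```python
import pandas as pd

# is_accurate / truth_label values copied from the ten rows displayed earlier
# (the notebook's output_df holds 24 rows; this is only the visible subset).
shown = pd.DataFrame({
    "is_accurate": [1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
    "truth_label": [1, 1, 1, 1, 0, 1, 1, 0, 0, 0],
})

# Rows: truth label, columns: LLM verdict.
confusion = pd.crosstab(shown.truth_label, shown.is_accurate)

# Share of rows where the LLM verdict matches the truth label.
agreement = (shown.is_accurate == shown.truth_label).mean()

print(confusion)
print(f"agreement on the visible subset: {agreement:.0%}")
```

Running the same two lines against `output_df` directly would give the full picture; the off-diagonal cell in the `truth_label == 0` row is the one to watch.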


In [11]:
# Query analysis for the first mis-evaluated record (the Rayovac 60 count product)
print(output_df.loc[
    output_df.is_accurate != output_df.truth_label,
    ["query_analysis", "product_analysis"]].iloc[0].query_analysis)

**Main Intent:** 
- The user is looking to purchase a large quantity of AA batteries.

**Key Entities:**
1. **Product Type:** AA batteries
2. **Quantity:** 100 pack

**Potential Ambiguities or Areas for Clarification:**
1. **Battery type specification**: No specific battery type (e.g., alkaline, nickel-metal hydride) is mentioned.
2. **Brand preference**: The user does not specify a preferred brand.
3. **Expiration date or shelf life**: There is no mention of the need for fresh batteries or a specific expiration date.

**Semantic Specifications:**
- The user is looking for AA batteries in a package size of 100 units.
- No other specifications (e.g., price, color) are mentioned.


In [12]:
# Product analysis for the same mis-evaluated record
print(output_df.loc[
    output_df.is_accurate != output_df.truth_label,
    ["query_analysis", "product_analysis"]].iloc[0].product_analysis)

**Product Specifications:**

1. **Battery Type**: Double A (AA)
2. **Quantity**: 60 count
3. **Brand**: Rayovac
4. **Power Longevity**: Up to 10 years
5. **Leak Prevention**: Designed to prevent damaging leaks
6. **Country of Origin**: Made in the USA with US and global parts

**Key Entities:**

1. **Product Name**: Rayovac AA Alkaline Double A Batteries
2. **Battery Size**: AA
3. **Packaging Quantity**: 60 count
4. **Brand**: Rayovac
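
Both analyses extract the quantity correctly (100 pack in the query, 60 count in the product), yet the final verdict still came out as accurate. One possible cross-check alongside the LLM verdict is a small deterministic comparison of stated quantities. The following is a minimal sketch under that assumption, not part of the pipeline above; `pack_size` and `quantity_conflict` are hypothetical helpers, and the regular expression only covers the "N pack" / "N count" / "N-Count" forms seen in this sample.

```python
import re

# Hypothetical helper pattern: matches "100 pack", "60 Count", "100-Count".
# Forms such as "pack of 100" are deliberately not handled in this sketch.
_PACK_SIZE = re.compile(r"(\d+)[\s-]*(?:pack|count|ct)\b", re.IGNORECASE)


def pack_size(text):
    """Return the first stated pack size as an int, or None if none is found."""
    match = _PACK_SIZE.search(text)
    return int(match.group(1)) if match else None


def quantity_conflict(query, title):
    """True only when both sides state a quantity and the numbers differ."""
    q, t = pack_size(query), pack_size(title)
    return q is not None and t is not None and q != t


print(quantity_conflict("aa batteries 100 pack",
                        "Rayovac AA Alkaline Double A Batteries, 60 Count"))
print(quantity_conflict("aa batteries 100 pack",
                        "ACDelco 100-Count AA Batteries, Maximum Power ..."))
```

A missing quantity on either side is treated as no conflict, in line with the task note that a query mentioning something the product information omits does not invalidate the "E" label.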


In [13]:
# Records for which the LLM proposed a reformulated query
new_queries = output_df.loc[output_df.improved_query.notnull()]
for _, row in new_queries.iterrows():
    print(f"Product: {row['product_title']}:")
    print(f"Original query was: {row['original_query']}")
    print(f"New query: {row['improved_query']}")
    print("---\n")

Product: Amazon Basics 100 Pack AAA High-Performance Alkaline Batteries, 10-Year Shelf Life, Easy to Open Value Pack:
Original query was: aa batteries 100 pack
New query: 'AAA batteries 100 pack'
---

Product: DEWALT XTREME 12V MAX Cordless Screwdriver, 1/4-Inch, Tool Only (DCF601B):
Original query was: dewalt 8v max cordless screwdriver kit, gyroscopic
New query: 'Dewalt 12v max cordless screwdriver kit, gyroscopic stabilization'
---

Product: DEWALT DCB095 8V MAX Battery Charger:
Original query was: dewalt 8v max cordless screwdriver kit, gyroscopic
New query: 'Dewalt 8V Max cordless screwdriver kit, stabilizing'
---
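
Some of the reformulated queries still carry terms from the original query that the product title does not support, for example "cordless screwdriver kit" for a battery charger. A rough way to surface such cases for a second look is to list the query tokens that never appear in the title. This is only a sketch: `tokens` and `unsupported_terms` are hypothetical helpers, the check looks at the title alone rather than the full product information, and a term missing from the title does not by itself make the label wrong.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; punctuation and quotes are dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unsupported_terms(query, title):
    """Query tokens that never appear in the product title, in query order."""
    title_tokens = set(tokens(title))
    return [t for t in tokens(query) if t not in title_tokens]


# (improved_query, product_title) pairs copied from the output above.
pairs = [
    ("'AAA batteries 100 pack'",
     "Amazon Basics 100 Pack AAA High-Performance Alkaline Batteries, "
     "10-Year Shelf Life, Easy to Open Value Pack"),
    ("'Dewalt 12v max cordless screwdriver kit, gyroscopic stabilization'",
     "DEWALT XTREME 12V MAX Cordless Screwdriver, 1/4-Inch, Tool Only (DCF601B)"),
    ("'Dewalt 8V Max cordless screwdriver kit, stabilizing'",
     "DEWALT DCB095 8V MAX Battery Charger"),
]

for query, title in pairs:
    print(f"{query} -> {unsupported_terms(query, title)}")
```

An empty list, as for the AAA batteries query, suggests the reformulation is fully grounded in the title; a non-empty list is a prompt to re-verify the new query against the full product information.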

