In [11]:
%reload_ext autoreload
%autoreload 2


# Week 5: Systematically Improving Your RAG Application

> **Prerequisites**: Please make sure that you've run `1. Generate Dataset.ipynb` to generate the dataset that we'll be using in this notebook. It'll also help you get familiar with the data that we're working with in this case study.

In this notebook, we'll explore some of the limitations of semantic search and how we can use metadata filtering to improve the retrieval of complex queries.

## Why this matters

As user queries become more complex, it becomes increasingly important to extract the specific criteria they contain. To do so, we need query understanding: extracting requirements from a query and mapping them to the specific filters we support.

Let's take a simple example: "I want a medium-sized black cotton T-shirt under $50". This requires multiple criteria to be extracted: size, color, material and price. To support this, we need indexes and data that back each of these filters.
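As a rough sketch of what "extracting filters" means for this example, the query can be reduced to a small structured object that is then applied to catalog rows. The `ProductFilters` class, its field names and the sample rows below are illustrative assumptions; they are not part of the dataset used later in this notebook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductFilters:
    """Hypothetical structured form of a shopping query."""

    size: Optional[str] = None
    color: Optional[str] = None
    material: Optional[str] = None
    max_price: Optional[float] = None

    def matches(self, item: dict) -> bool:
        # A filter left as None places no constraint on the item.
        if self.size is not None and item.get("size") != self.size:
            return False
        if self.color is not None and item.get("color") != self.color:
            return False
        if self.material is not None and item.get("material") != self.material:
            return False
        if self.max_price is not None and item.get("price", float("inf")) >= self.max_price:
            return False
        return True


# "I want a medium-sized black cotton T-shirt under $50"
filters = ProductFilters(size="M", color="Black", material="Cotton", max_price=50)

# Made-up catalog rows for illustration only.
catalog = [
    {"title": "Basic Tee", "size": "M", "color": "Black", "material": "Cotton", "price": 25.0},
    {"title": "Premium Tee", "size": "M", "color": "Black", "material": "Cotton", "price": 80.0},
    {"title": "Sport Tee", "size": "M", "color": "Black", "material": "Polyester", "price": 30.0},
]

matching_titles = [item["title"] for item in catalog if filters.matches(item)]
print(matching_titles)
```

Only the first row satisfies every constraint. The other two look similar to the query but fail on price and material respectively.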

## What you'll learn

Through hands-on experiments with our e-commerce dataset, you'll discover how to:

1. Identify the limitations of semantic search
- See why semantic search alone falls short for complex queries
- Understand the impact of missing structured metadata
- Measure baseline performance using recall and MRR metrics

2. Extend existing metadata fields using LLMs
- Extend your product catalog with new metadata fields systematically
- Validate that newly generated attributes conform to your existing taxonomy
- Scale metadata generation across your entire catalog while maintaining consistency

3. Implement Query Understanding
- Extract structured filters from natural language queries
- Map arbitrary user requirements to your predefined taxonomy
- Implement a retrieval pipeline that combines semantic search with metadata filtering


By the end of this notebook, you'll have a good understanding of how to use language models to implement query understanding on top of semantic search. More specifically, you'll know how to measure and quantify the improvement in performance that these additional filters bring, resulting in more accurate and useful recommendations.

## Limitations Of Semantic Search

### Ingesting Our Dataset

We'll use the `ivanleomk/ecommerce-taxonomy` dataset that we previously uploaded to Hugging Face. We want to use the same dataset so that the results are consistent.

Our dataset contains the following fields:

- `id` : The numeric id of the item
- `image` : This is an image of the item
- `brand` : The brand of the item
- `title` : The title of the item
- `description` : A description of the item
- `category` : The category of the item
- `subcategory` : The subcategory of the item
- `product_type` : The product type of the item
- `attributes` : A JSON-encoded list of name/value attribute pairs for the item
- `material` : The material of the item
- `pattern` : The pattern of the item
- `price` : The price of the item

Let's now see how we can load these items into a LanceDB table and how retrieval falls short when we only use semantic search with no metadata filtering.

In [3]:
from datasets import load_dataset

ds = load_dataset("ivanleomk/ecommerce-taxonomy")["train"]
ds[0]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=768x1024>,
 'title': 'Lace Detail Sleeveless Top',
 'brand': 'H&M',
 'description': "Elevate your casual wardrobe with this elegant sleeveless top featuring intricate lace detailing at the neckline. Perfect for both day and night, it's crafted from a soft, breathable fabric for all-day comfort.",
 'category': 'Women',
 'subcategory': 'Tops',
 'product_type': 'Tank Tops',
 'attributes': '[{"name": "Sleeve Length", "value": "Sleeveless"}, {"name": "Neckline", "value": "Crew Neck"}]',
 'material': 'Cotton',
 'pattern': 'Solid',
 'id': 1,
 'price': 181.04}

In [4]:
from lancedb.pydantic import LanceModel, Vector
import lancedb
from lancedb.embeddings import get_registry


# Create Embedding Function
func = get_registry().get("openai").create(name="text-embedding-3-small")


# Define a Model that will be used as the schema for our collection
class Item(LanceModel):
    id: int
    title: str
    description: str = func.SourceField()
    brand: str
    category: str
    subcategory: str
    product_type: str
    attributes: str
    material: str
    pattern: str
    price: float
    vector: Vector(func.ndims()) = func.VectorField()


db = lancedb.connect("./lancedb")
table_name = "items"

if table_name not in db.table_names():
    table = db.create_table(table_name, schema=Item, mode="overwrite")
    entries = []
    for row in ds:
        entries.append(
            {
                "id": row["id"],
                "title": row["title"],
                "description": row["description"],
                "brand": row["brand"],
                "category": row["category"],
                "subcategory": row["subcategory"],
                "product_type": row["product_type"],
                "attributes": row["attributes"],
                "material": row["material"],
                "pattern": row["pattern"],
                "price": row["price"],
            }
        )

    table.add(entries)

table = db.open_table(table_name)

Now, let's see what we get back when a query carries specific requirements and we rely on semantic search alone.

In this case, we're using a user query of `I need a polyester sweatshirt under a 100 bucks`. Answering it properly requires three filters: material, product type and price.

Looking at the results below, most of the retrieved items don't satisfy the user's query:

- We're getting back items that are made of cotton instead of polyester
- We're getting back items that are not sweatshirts
- We're getting back items that cost over $100


In [5]:
user_query = "I need a polyester sweatshirt under a 100 bucks"

In [6]:
import pandas as pd

items = [
    {
        "title": item["title"],
        "description": item["description"],
        "category": item["category"],
        "subcategory": item["subcategory"],
        "material": item["material"],
        "price": item["price"],
    }
    for item in table.search(query=user_query).limit(10).to_list()
]

pd.DataFrame(items)

| | title | description | category | subcategory | material | price |
|---|---|---|---|---|---|---|
| 0 | Nike Air Sweatshirt | Stay comfortable and stylish with the Nike Air... | Women | Tops | Polyester | 96.37 |
| 1 | Women's Ribbed Long Sleeve Sweater | This elegant ribbed sweater offers a snug fit,... | Women | Tops | Cotton | 237.07 |
| 2 | Sleeveless Button-Down Blouse | This classic sleeveless button-down blouse is ... | Women | Tops | Cotton | 353.57 |
| 3 | Embellished Graphic T-Shirt | This stylish oversized T-shirt from Replay fea... | Women | Tops | Cotton | 176.58 |
| 4 | White Crew Neck T-Shirt | This classic white crew neck t-shirt offers a ... | Women | Tops | Cotton | 229.9 |
| 5 | Classic Women's Crew Neck T-Shirt | This timeless crew neck t-shirt offers a relax... | Women | Tops | Cotton | 14.6 |
| 6 | Long Sleeve Crew Neck T-Shirt | This classic long sleeve crew neck t-shirt com... | Women | Tops | Cotton | 364.7 |
| 7 | Striped Long Sleeve Top | This classic striped long sleeve top features ... | Women | Tops | Cotton | 81.09 |
| 8 | Silk T-Shirt | This luxurious silk T-shirt features a soft, l... | Women | Tops | Silk | 99.63 |
| 9 | Striped Short Sleeve T-Shirt | This classic striped short-sleeve t-shirt is p... | Women | Tops | Cotton | 266.22 |
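To put a number on this, we can count how many of the ten results satisfy the material and price constraints. This is only a quick sanity check: the `(material, price)` pairs are copied by hand from the output above, and the product type constraint is left out because that column isn't part of this output.

```python
# (material, price) pairs copied from the ten results shown above.
results = [
    ("Polyester", 96.37),
    ("Cotton", 237.07),
    ("Cotton", 353.57),
    ("Cotton", 176.58),
    ("Cotton", 229.9),
    ("Cotton", 14.6),
    ("Cotton", 364.7),
    ("Cotton", 81.09),
    ("Silk", 99.63),
    ("Cotton", 266.22),
]

polyester = [r for r in results if r[0] == "Polyester"]
under_100 = [r for r in results if r[1] < 100]
both = [r for r in results if r[0] == "Polyester" and r[1] < 100]

print(f"polyester: {len(polyester)}/{len(results)}")
print(f"under $100: {len(under_100)}/{len(results)}")
print(f"both constraints: {len(both)}/{len(results)}")
```

Only one of the ten results meets both constraints, even though every result is a plausible semantic neighbour of the query.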


Let's now see what happens when we filter on the material, the product type and the price of the item.

In [7]:
import lancedb

db = lancedb.connect("./lancedb")
table = db.open_table("items")

In [None]:
import pandas as pd

items = [
    {
        "title": item["title"],
        "description": item["description"],
        "category": item["category"],
        "subcategory": item["subcategory"],
        "material": item["material"],
        "product_type": item["product_type"],
        "price": item["price"],
    }
    for item in table.search(query=user_query)
    .limit(10)
    # Apply a prefilter so that only polyester sweatshirts under $100 are searched
    .where(
        "material='Polyester' AND product_type='Sweatshirts' AND price < 100",
        prefilter=True,
    )
    .to_list()
]

pd.DataFrame(items)

| | title | description | category | subcategory | material | product_type | price |
|---|---|---|---|---|---|---|---|
| 0 | Nike Air Sweatshirt | Stay comfortable and stylish with the Nike Air... | Women | Tops | Polyester | Sweatshirts | 96.37 |


With these three filters applied to the results, we're now getting back items that fit the user's specific requirements: polyester sweatshirts under $100. As our dataset grows and we support more items, we can progressively add more filters, such as brand or occasion.
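Once filters come from user queries instead of being typed by hand, the filter string has to be assembled programmatically. The helper below is a minimal sketch of that step; `build_where_clause` and `quote` are hypothetical names, and doubling single quotes is standard SQL escaping that we're assuming the filter parser accepts.

```python
from typing import Optional


def quote(value: str) -> str:
    # Double any single quotes so the value stays inside one SQL string literal.
    return "'" + value.replace("'", "''") + "'"


def build_where_clause(
    material: Optional[str] = None,
    product_type: Optional[str] = None,
    max_price: Optional[float] = None,
) -> Optional[str]:
    """Hypothetical helper: join whichever filters are set with AND."""
    clauses = []
    if material is not None:
        clauses.append(f"material = {quote(material)}")
    if product_type is not None:
        clauses.append(f"product_type = {quote(product_type)}")
    if max_price is not None:
        clauses.append(f"price < {max_price}")
    # Return None when there is nothing to filter on.
    return " AND ".join(clauses) if clauses else None


where = build_where_clause("Polyester", "Sweatshirts", 100)
print(where)
```

The resulting string has the same shape as the one passed to `.where(..., prefilter=True)` above. When the helper returns `None`, the `.where` call would simply be skipped.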

As queries become more complex, metadata filtering will only become more important. Now let's see how we can use LLMs to automatically label and extract data to support a new category of filters that we'd like to support. 

## LLM Metadata Generation

> If you'd like to skip the labelling process, you can download the labelled dataset `ivanleomk/labelled-ecommerce-taxonomy` from Hugging Face [here](https://huggingface.co/datasets/ivanleomk/labelled-ecommerce-taxonomy).

We've seen how semantic search falls short when we have complex queries that require results to be not just relevant but also valid.

In the example above, valid items are those that meet the user's requirements and have the following characteristics:

- Material: Polyester
- Product Type: Sweatshirts
- Price: Under $100

Imagine that we now want to allow users to search for items based on an occasion - say a wedding or interview that they need to attend. We don't have this information in our current dataset and getting a human to label this would be a tedious process.

Therefore, we can leverage LLMs to automatically label and extract this information for us. To make things easier, we've already defined an `occasions` field in our `taxonomy.yml` file.

We'll label our entire dataset in three steps:

1. First we'll read in the taxonomy file to get a list of valid occasions. By reading from a config file, we can prevent potential errors such as typos and ensure that the occasions are valid when we generate them and use them to perform filtering. 

2. Next, we'll use a language model and batch requests to label each item in our dataset with the appropriate occasions using its metadata and its image.

3. Finally, we'll save this as a new column in our dataset and upload it to Hugging Face to save.

Once we've done so, we'll then work towards implementing an occasion filter in our retrieval pipeline and see how it compares to semantic search. 

### Generating Labels

Our taxonomy file contains a list of occasions that items can be worn for. Note that it's entirely possible for an item to be worn for multiple occasions, e.g. a black dress can be worn for both formal and casual occasions.

Therefore, we'll allow the model to output a list of occasions that the item can be worn for.

In [9]:
from helpers import process_taxonomy_file

# Read in taxonomy data
taxonomy_data = process_taxonomy_file("./taxonomy.yml")

In [10]:
from pydantic import BaseModel, field_validator, ValidationInfo
import instructor
from openai import AsyncOpenAI
from asyncio import Semaphore, wait_for
from tqdm.asyncio import tqdm_asyncio as asyncio
import tempfile
from rich import print


# Define a Pydantic model to validate the occasions
class Occasions(BaseModel):
    occasions: list[str]

    @field_validator("occasions")
    def validate_occasions(cls, v, info: ValidationInfo):
        if not v:
            raise ValueError("Occasions cannot be empty")

        context = info.context
        occasions = context["taxonomy_data"]["occasions"]

        # Since the model can output a list of occasions, we need to check that each occasion is valid
        for occasion in v:
            if occasion not in occasions:
                raise ValueError(f"Invalid occasion: {occasion}")
        return v


client = instructor.from_openai(AsyncOpenAI())


async def generate_occasions(
    item: dict,
    client: instructor.AsyncInstructor,
    semaphore: Semaphore,
    taxonomy_data: dict,
):
    async with semaphore:
        with tempfile.NamedTemporaryFile(suffix=".jpg") as temp_file:
            # Save the PIL image to the temporary file
            item["image"].save(temp_file.name)
            occasions = await wait_for(
                client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[
                        {
                            "role": "user",
                            "content": [
                                """
                                You're an expert AI system that excels at identifying potential occasions that an item can be worn for.

                                Here is a list of valid occasions that we'd like you to consider:
                                : {{ taxonomy_data['occasions'] }}
                                
                                Consider the following details about the item:
                                
                                Title: {{ title }}
                                Description: {{ description }}
                                Brand: {{ brand }}
                                Material: {{ material }}
                                Pattern: {{ pattern }}
                                Attributes: {{ attributes }}
                                
                                Please analyze these details along with the image to determine appropriate occasions that this specific item in the image can be worn for.
                            """,
                                instructor.Image.from_path(temp_file.name),
                            ],
                        }
                    ],
                    response_model=Occasions,
                    context={
                        "taxonomy_data": taxonomy_data,
                        "title": item["title"],
                        "description": item["description"],
                        "brand": item["brand"],
                        "material": item["material"],
                        "pattern": item["pattern"],
                        "attributes": item["attributes"],
                    },
                ),
                timeout=30,  # 30 second timeout
            )

            return {
                **item,
                "occasions": occasions.occasions,
            }


sem = Semaphore(10)
print(await generate_occasions(ds[0], client, sem, taxonomy_data))

We can see that we've now got a list of occasions that the item can be worn for, so let's apply this to our dataset.

We'll label all of the items in the dataset, save the result to Hugging Face and then ingest it into a new LanceDB table.
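When labelling a whole catalog, a single timeout or validation failure shouldn't throw away the rest of the batch. The sketch below shows one way to combine a concurrency limit, a per-item timeout and failure collection using only the standard library. `label_item` is a stand-in for the real model call, and this uses plain `asyncio.gather` rather than the `tqdm` wrapper, so there is no progress bar.

```python
import asyncio


async def label_item(item: dict) -> dict:
    # Stand-in for the real labelling call; fails for one item on purpose.
    await asyncio.sleep(0.01)
    if item["id"] == 2:
        raise ValueError("simulated labelling failure")
    return {**item, "occasions": ["Casual"]}


async def label_with_limit(item: dict, semaphore: asyncio.Semaphore, timeout: float):
    # The semaphore caps how many calls are in flight at once.
    async with semaphore:
        return await asyncio.wait_for(label_item(item), timeout=timeout)


async def label_all(items: list, max_concurrency: int = 10, timeout: float = 30.0):
    semaphore = asyncio.Semaphore(max_concurrency)
    results = await asyncio.gather(
        *(label_with_limit(item, semaphore, timeout) for item in items),
        return_exceptions=True,  # collect failures instead of raising on the first one
    )
    labelled = [r for r in results if not isinstance(r, Exception)]
    failed = [item for item, r in zip(items, results) if isinstance(r, Exception)]
    return labelled, failed


labelled, failed = asyncio.run(label_all([{"id": 1}, {"id": 2}, {"id": 3}]))
print(len(labelled), "labelled;", len(failed), "failed")
```

Inside a notebook, where an event loop is already running, `await label_all(...)` would replace the `asyncio.run(...)` call. The `failed` list can then be retried or inspected separately.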

In [21]:
from datasets import Dataset
import json

coros = [
    generate_occasions(item, client, sem, taxonomy_data)
    for item in ds
    if item["category"]
]
occasions = await asyncio.gather(*coros)

labelled_dataset = Dataset.from_list(
    [
        {
            **item,
            "occasions": json.dumps(item["occasions"]),
        }
        for item in occasions
    ]
)
labelled_dataset.push_to_hub("ivanleomk/labelled-ecommerce-taxonomy")

100%|██████████| 191/191 [01:05<00:00,  2.91it/s]
Map: 100%|██████████| 191/191 [00:00<00:00, 16796.21 examples/s]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 303.21ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:03<00:00,  3.58s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/ivanleomk/labelled-ecommerce-taxonomy/commit/c098a983271ba27ece46fe780ac004093eccffa6', commit_message='Upload dataset', commit_description='', oid='c098a983271ba27ece46fe780ac004093eccffa6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/ivanleomk/labelled-ecommerce-taxonomy', endpoint='https://huggingface.co', repo_type='dataset', repo_id='ivanleomk/labelled-ecommerce-taxonomy'), pr_revision=None, pr_num=None)

### Reingesting Our Data

Now that we've labelled our dataset, let's load it back in and ingest it into a new LanceDB table.

In [11]:
from datasets import load_dataset

labelled_dataset = load_dataset("ivanleomk/labelled-ecommerce-taxonomy")["train"]

In [12]:
from lancedb.pydantic import LanceModel, Vector
import lancedb
from lancedb.embeddings import get_registry


# Create Embedding Function
func = get_registry().get("openai").create(name="text-embedding-3-small")


# Define a Model that will be used as the schema for our collection
class Item(LanceModel):
    id: int
    title: str
    description: str = func.SourceField()
    brand: str
    category: str
    subcategory: str
    product_type: str
    attributes: str
    occasions: str
    material: str
    pattern: str
    price: float
    vector: Vector(func.ndims()) = func.VectorField()


db = lancedb.connect("./lancedb")
table_name = "labelled_items"

if table_name not in db.table_names():
    labelled_table = db.create_table(table_name, schema=Item, mode="overwrite")
    entries = []
    for row in labelled_dataset:
        entries.append(
            {
                "id": row["id"],
                "title": row["title"],
                "description": row["description"],
                "brand": row["brand"],
                "category": row["category"],
                "subcategory": row["subcategory"],
                "product_type": row["product_type"],
                "attributes": row["attributes"],
                "occasions": row["occasions"],
                "material": row["material"],
                "pattern": row["pattern"],
                "price": row["price"],
            }
        )

    labelled_table.add(entries)

labelled_table = db.open_table(table_name)

In [13]:
# Verify that we've inserted the right number of rows into the dataset
labelled_table.count_rows()

191
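One consequence of this schema is that `occasions` is stored as a JSON-encoded string rather than as a list, so an occasion filter has to decode it (or match on the raw string) before comparing. Below is a minimal sketch of a post-retrieval check, assuming rows are plain dicts. The occasion names are placeholders and not necessarily values from `taxonomy.yml`.

```python
import json

# Rows shaped like our table: `occasions` holds a JSON-encoded list.
# Titles and occasion names here are made up for illustration.
rows = [
    {"id": 1, "title": "Lace Top", "occasions": json.dumps(["Casual", "Date Night"])},
    {"id": 2, "title": "Blazer", "occasions": json.dumps(["Work", "Interview"])},
    {"id": 3, "title": "Hoodie", "occasions": json.dumps(["Casual"])},
]


def matches_any_occasion(row: dict, wanted: set) -> bool:
    # Decode the stored string back into a list before comparing.
    return bool(wanted & set(json.loads(row["occasions"])))


wanted = {"Interview", "Date Night"}
matching_ids = [row["id"] for row in rows if matches_any_occasion(row, wanted)]
print(matching_ids)
```

Decoding first avoids the false positives that a plain substring match can produce when one occasion name is contained in another.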

## Query Understanding

Now that we've generated these new occasion labels as metadata using a language model, let's see how we can use them to improve our retrieval evals.

We'll do so in three steps:

1. **Generate Synthetic Queries**: We'll generate queries that require metadata filtering and measure the recall and MRR of semantic search.

2. **Extract Query Filters**: Once we've generated these queries and calculated an initial baseline, we'll implement a function that takes a user's query and maps it to a list of predefined filters.

3. **Evaluate**: We'll then evaluate the performance of our retrieval pipeline with metadata filtering against our initial baseline, using metrics such as recall and MRR.

Once we've done so, we'll see how metadata filtering improves the average recall and MRR of the retrieval pipeline by removing irrelevant items.


### Generating Synthetic Queries

When users come to our application, more often than not, they'll come with specific requirements in mind.

Even if they're just looking for a casual t-shirt, they'll have preferences such as:

- material
- price
- specific categories/occasions that they're looking to wear these items for

We can simulate these by generating queries that contain these requirements as seen below.

In [14]:
from datasets import load_dataset

ds = [item for item in load_dataset("ivanleomk/labelled-ecommerce-taxonomy")["train"]]

In [16]:
from pydantic import BaseModel
from openai import AsyncOpenAI
import instructor
import random
from rich import print
from pydantic_evals import Case

# Define possible user intents
user_intent = [
    "searching for clothing that matches their style and needs",
    "shopping for a specific occasion",
    "looking for a particular type of clothing",
]

# Define potential filters that can be used in queries
# (each filter appears once so random.sample can't pick a duplicate)
potential_filters = [
    "material",
    "price",
    "category",
    "subcategory",
    "product_type",
    "occasions",
    "pattern",
    "attributes",
]


class UserQuery(BaseModel):
    chain_of_thought: str
    query: str


async def generate_query(
    client, item, user_intent, potential_filters, semaphore: Semaphore
):
    async with semaphore:
        with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
            # Save the image to the temp file
            item["image"].save(tmp.name)

            # Create the chat completion with a timeout
            response = await wait_for(
                client.chat.completions.create(
                    model="gpt-4o",
                    response_model=UserQuery,
                    messages=[
                        {
                            "role": "system",
                            "content": """You are helping generate realistic shopping queries where the provided item would be the perfect answer.
                            The generated query should naturally lead to recommending this specific item.""",
                        },
                        {
                            "role": "user",
                            "content": [
                                """
Generate a natural shopping query where this item would be the perfect recommendation.
The query should match someone who is {{user_intent}}.

Item Details:
- Title: {{title}}
- Description: {{description}}
- Brand: {{brand}}
- Material: {{material}}
- Pattern: {{pattern}}
- Attributes: {{attributes}}
- Occasions: {{occasions}}
- Price: {{price}}
- Category: {{category}}
- Subcategory: {{subcategory}}
- Product Type: {{product_type}}

Consider these aspects when generating the query: {{potential_filters}}

Requirements:
- Query should be 20-30 words
- Conversational tone
- The query should describe the aspects above that make the item above a perfect match for the user's requirements
- Try to mention things which might be synonyms for the item and avoid mentioning it directly. Instead we should use specific attributes that the item has in order to make it a good fit. Make sure to use the exact attribute name so that it's unambiguous
- for the price range, keep it to 15 bucks on either side of the price max
- If there is a slight mismatch between the item's title and the subcategory/product type, lean towards the subcategory/product type. (Eg. Title is a "Black Women's Top" but the subcategory is "Sweater", then make the query about a "Black Sweater")

For an office blouse that costs $120:
"Need something elegant for my new corporate job. Looking for a silk top with long sleeves and a modest neckline, under $150." ( Within the price range here )

For casual wear that costs $65:
"Shopping for my weekend brunches. Need a cotton top that's both comfy and stylish, maybe with some interesting pattern - ideally something between 40-100 bucks if possible." ( 65 is less than 69 )

Remember: The query should describe what someone would be looking for if this exact item would be their perfect match!
                                """,
                                instructor.Image.from_path(tmp.name),
                            ],
                        },
                    ],
                    context={
                        "user_intent": user_intent,
                        "potential_filters": potential_filters,
                        "title": item["title"],
                        "description": item["description"],
                        "brand": item["brand"],
                        "material": item["material"],
                        "pattern": item["pattern"],
                        "attributes": item["attributes"],
                        "price": item["price"],
                        "category": item["category"],
                        "subcategory": item["subcategory"],
                        "product_type": item["product_type"],
                        "occasions": item["occasions"],
                    },
                ),
                timeout=30,  # 30 second timeout
            )
            return Case(
                inputs=response.query,
                expected_output=[item["id"]],
                metadata={
                    **{k: v for k, v in item.items() if k not in {"image", "id"}}
                },
            )


sem = Semaphore(10)
client = instructor.from_openai(AsyncOpenAI())
user_intent = random.choice(user_intent)
filters = random.sample(potential_filters, random.randint(2, len(potential_filters)))
print(await generate_query(client, ds[0], user_intent, filters, sem))

In [32]:
from tqdm.asyncio import tqdm_asyncio as asyncio


coros = [generate_query(client, item, user_intent, filters, sem) for item in ds[50:70]]

queries = await asyncio.gather(*coros)

100%|██████████| 20/20 [00:10<00:00,  1.82it/s]


We'll then save our queries to a Pydantic Evals `Dataset` and read from it down the line.

In [None]:
from pydantic_evals import Dataset

dataset = Dataset(cases=queries)
dataset.to_file("./data/queries.yml")



### Calculating Initial Baseline

Now that we've generated an initial set of queries, let's see how semantic search performs on them.
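The `get_metrics_at_k` helper imported below isn't shown in this notebook, so as a rough sketch of what the two metrics measure, here is one common way to define recall@k and MRR@k. The helper's actual implementation may differ in details such as how it handles empty labels.

```python
def recall_at_k(predictions: list, labels: list, k: int) -> float:
    """Fraction of relevant ids that appear in the top-k predictions."""
    if not labels:
        return 0.0
    top_k = set(predictions[:k])
    return sum(1 for label in labels if label in top_k) / len(labels)


def mrr_at_k(predictions: list, labels: list, k: int) -> float:
    """Reciprocal rank of the first relevant id within the top-k, else 0."""
    relevant = set(labels)
    for rank, prediction in enumerate(predictions[:k], start=1):
        if prediction in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical retrieval result: the expected item (id 7) is ranked third.
predictions = [12, 3, 7, 44, 9]
labels = [7]

print(recall_at_k(predictions, labels, k=1))
print(recall_at_k(predictions, labels, k=5))
print(mrr_at_k(predictions, labels, k=5))
```

Since each synthetic query has exactly one expected item, recall@k for a single query is either 0 or 1, and MRR@k rewards ranking that item closer to the top.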


In [22]:
from pydantic_evals import Dataset

dataset = Dataset.from_file("./data/queries.yml")

In [23]:
from helpers import get_metrics_at_k, task
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
from lancedb.table import Table
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import logfire
import asyncio


@dataclass
class RagMetricsEvaluator(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> dict[str, float]:
        predictions = ctx.output
        labels = ctx.expected_output
        metrics = get_metrics_at_k(
            metrics=["mrr", "recall"], sizes=[1, 3, 5, 10, 15, 25]
        )
        return {
            metric: score_fn(predictions, labels)
            for metric, score_fn in metrics.items()
        }


async def retrieve_results(
    question: str,
    table: Table,
    pool: ThreadPoolExecutor,
    max_k=25,
    reranker=None,
):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        pool,
        partial(task, user_query=question, table=table, max_k=max_k, reranker=reranker),
    )


logfire.configure(
    send_to_logfire=True,
    environment="experimentation",
    service_name="query-understanding",
    console=False,
)

<logfire._internal.main.Logfire at 0x10ed1f5e0>

In [62]:
import concurrent.futures

dataset.evaluators = []
dataset.add_evaluator(RagMetricsEvaluator())

db = lancedb.connect("./lancedb")
original_table = db.open_table("labelled_items")

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    retrieval_result_without_filters = await dataset.evaluate(
        partial(retrieve_results, table=original_table, pool=executor, max_k=25)
    )

In [63]:
from tabulate import tabulate


def format_results(result):
    return tabulate(
        [[item, round(value, 4)] for item, value in result.averages().scores.items()],
        headers=["Metric", "Score"],
    )


print(format_results(retrieval_result_without_filters))

### Extracting Query Filters

We can see that semantic search performs decently on our initial set of queries: at k=15, we have a recall@15 of ~0.79 and an mrr@15 of ~0.75.

But if we look at the returned items themselves, you'll notice that some should not have been returned for their queries.

Eg. Query : "Looking for a chic blouse for dinner dates and evening events. Needs to be long sleeve with floral pattern and crew neck, ideally under $50."

- First item returned has a price of 89 which is above the user's budget
- Within the 25 returned items, we had a few items that were of different subcategories such as Tank Tops.

In many other examples, this shows up as items from the wrong category, items with a price outside the user's budget, or simply items that don't match the user's requirements.

Query understanding is one way that we can address this.

We use `instructor` here because it lets us use Pydantic to validate the extracted filters. That gives us an easy way to check that the filter values are valid.

Passing the taxonomy in through a validation context (`ValidationInfo`) rather than hard-coding it lets us reuse the same model for other datasets with different taxonomies. Note that the validator below reads the `taxonomy_data` loaded earlier in the notebook.
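To illustrate why passing the taxonomy in makes the check reusable, here is a minimal standard-library sketch of the same hierarchical validation (category, then subcategory, then product type) written as a plain function. `validate_filters` is a hypothetical name, and the trimmed taxonomy below only mimics the nested shape we assume the real `taxonomy_map` has.

```python
def validate_filters(filters: dict, taxonomy_map: dict) -> list:
    """Return a list of problems; an empty list means the filters are valid."""
    category = filters.get("category")
    if category not in taxonomy_map:
        return [f"Invalid category: {category}"]

    subcategory = filters.get("subcategory")
    if subcategory not in taxonomy_map[category]:
        return [f"Invalid subcategory {subcategory} for category {category}"]

    # Product types are only valid relative to the chosen subcategory.
    valid_types = taxonomy_map[category][subcategory]["product_type"]
    return [
        f"Invalid product type: {product_type}"
        for product_type in filters.get("product_type", [])
        if product_type not in valid_types
    ]


# Hypothetical, heavily trimmed taxonomy; the real one comes from taxonomy.yml.
taxonomy_map = {
    "Women": {
        "Tops": {"product_type": ["Tank Tops", "Sweatshirts", "Blouses"]},
    }
}

ok = validate_filters(
    {"category": "Women", "subcategory": "Tops", "product_type": ["Tank Tops"]},
    taxonomy_map,
)
bad = validate_filters(
    {"category": "Women", "subcategory": "Tops", "product_type": ["Jeans"]},
    taxonomy_map,
)
print(ok, bad)
```

Because the taxonomy is an argument, the same function works unchanged against a different catalog. Returning descriptive messages mirrors how validation errors can be fed back to the model for a retry.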



In [32]:
from typing import Optional
from pydantic import BaseModel, model_validator


class Attribute(BaseModel):
    name: str
    values: list[str]


class QueryFilters(BaseModel):
    attributes: list[Attribute]
    material: Optional[list[str]]
    min_price: Optional[float] = None
    max_price: Optional[float] = None
    subcategory: str
    category: str
    product_type: list[str]
    occasions: list[str]

    @model_validator(mode="after")
    def validate_attributes(self):
        # Validate category exists in taxonomy
        if self.category not in taxonomy_data["taxonomy_map"]:
            raise ValueError(
                f"Invalid category: {self.category}. Valid categories are {taxonomy_data['taxonomy_map'].keys()}"
            )

        # Validate subcategory exists under category
        if self.subcategory not in taxonomy_data["taxonomy_map"][self.category]:
            raise ValueError(
                f"Invalid subcategory {self.subcategory} for category {self.category}. Valid subcategories are {taxonomy_data['taxonomy_map'][self.category]}"
            )

        # Validate product types
        valid_types = taxonomy_data["taxonomy_map"][self.category][self.subcategory][
            "product_type"
        ]
        for product_type in self.product_type:
            if product_type not in valid_types:
                raise ValueError(
                    f"Invalid product type: {product_type}. Valid product types are {valid_types}"
                )

        # Validate attributes
        valid_attrs = taxonomy_data["taxonomy_map"][self.category][self.subcategory][
            "attributes"
        ]
        for attr in self.attributes:
            if attr.name not in valid_attrs:
                raise ValueError(f"Invalid attribute name: {attr.name}")
            for value in attr.values:
                if value not in valid_attrs[attr.name]:
                    raise ValueError(
                        f"Invalid value {value} for attribute {attr.name}. Valid values are {valid_attrs[attr.name]}"
                    )

        # Validate occasions
        for occasion in self.occasions:
            if occasion not in taxonomy_data["occasions"]:
                raise ValueError(
                    f"Invalid occasion: {occasion}. Valid Occasions are {taxonomy_data['occasions']}"
                )

        # Validate materials if provided
        if self.material:
            for material in self.material:
                if material not in taxonomy_data["materials"]:
                    raise ValueError(
                        f"Invalid material: {material}. Valid Materials are {taxonomy_data['materials']}"
                    )

        return self

Now that we've finished defining our Pydantic model, let's see how we can use it to extract filters from a user's query.

In [33]:
from openai import OpenAI
import instructor
from helpers import process_taxonomy_file
from rich import print

client = instructor.from_openai(OpenAI())
taxonomy_data = process_taxonomy_file("taxonomy.yml")
query = "I want a Tank-Top that's got a short sleeve or sleeveless which is under 100 bucks for an interview"

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that extracts user requirements from a query. Refer to this taxonomy for valid categories, subcategories, product types and attributes: {{taxonomy}}. If a filter isn't needed, just return an empty list. Refer to the occasions list for valid occasions if the user is looking for an occasion specific item: {{occasions}}. Refer to the materials list for valid materials if the user is looking for a specific material: {{materials}} else just return an empty list. Attributes with the same name should just be grouped together. If the user is looking for a specific attribute, just return the attribute name and the values that the user is looking for",
        },
        {"role": "user", "content": query},
    ],
    context={
        "taxonomy_map": taxonomy_data["taxonomy_map"],
        "taxonomy": taxonomy_data["taxonomy"],
        "occasions": taxonomy_data["occasions"],
        "materials": taxonomy_data["materials"],
    },
    response_model=QueryFilters,
)

print(resp)

In [69]:
import openai
import lancedb
import pandas as pd


async def extract_query_filters(
    query: str, client: openai.AsyncOpenAI, taxonomy_data: dict
) -> QueryFilters:
    return await client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": """
                    You are a helpful assistant that extracts user requirements from a query.
                    
                    Use these references:
                    - Taxonomy: {{ taxonomy }}
                    - Valid occasions: {{ occasions }} 
                    - Valid materials: {{ materials }}
                    
                    Guidelines:
                    - If a filter isn't needed, return an empty list
                    - Only add attributes and filters that a user has mentioned explicitly
                    - Only use values from the provided taxonomy, occasions and materials lists. 
                    - If the attribute exists on multiple types, make sure that you only look at the specific types listed under the subcategory you have chosen
                    - If the user hasn't mentioned a specific product type, lets just use all of them
                    - if the user gives a range (Eg. around 50), just give a buffer of 20 on each side (Eg. 30-70)
                    - if the user gives a vague price (Eg. I have a high budget), just set max price to 1000
                    - only classify an item as unisex if the user has explicitly mentioned it and default to Women's categories by default.
                    - If you're looking at blouses, make sure to include tank tops along the way and vice versa
                    - if the user mentions user bottoms and doesn't specify a specific length - let's include both short and long bottoms such as jeans, shorts and pants
                    - make sure to look carefully at the user's query to determine if they've specified a specific fit - eg. regular, relaxed, cropped. ( Relaxed and Relaxed should always go together)

                    Extract the requirements and format them according to the QueryFilters model.
                """,
            },
            {"role": "user", "content": query},
        ],
        context={
            "taxonomy_map": taxonomy_data["taxonomy_map"],
            "taxonomy": taxonomy_data["taxonomy"],
            "occasions": taxonomy_data["occasions"],
            "materials": taxonomy_data["materials"],
        },
        response_model=QueryFilters,
    )

We then define some tests to make sure the filters can be extracted as expected.

In [55]:
import instructor
import openai

client = instructor.from_openai(openai.AsyncOpenAI())
taxonomy_data = process_taxonomy_file("taxonomy.yml")

db = lancedb.connect("./lancedb")
table = db.open_table("labelled_items")

test_query = "I need a polyester sweatshirt under a 100 bucks"
generated_filter = await extract_query_filters(test_query, client, taxonomy_data)
print(generated_filter)

Since LanceDB doesn't support prefiltering by `list[str]`, we need to manually filter the retrieved items on `product_type`, `material`, `occasions` and `attributes`.

This is known as post-filtering: the remaining filters are applied in Python after the vector search has returned its candidates. We do it this way because it's the simplest way to handle these filters.
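
To make the idea concrete, here is a minimal sketch of post-filtering on toy data. The `post_filter` helper and the `candidates` list are hypothetical and only stand in for rows returned by a vector search; the notebook's actual implementation follows in the next cell.

```python
def post_filter(
    items: list[dict], materials: list[str], occasions: list[str]
) -> list[dict]:
    """Keep items matching any requested material and any requested occasion.

    An empty filter list means "no constraint" for that field.
    """
    kept = []
    for item in items:
        if materials and item["material"] not in materials:
            continue
        if occasions and not any(o in item["occasions"] for o in occasions):
            continue
        kept.append(item)
    return kept


# Toy candidates standing in for rows returned by the vector search
candidates = [
    {"id": 1, "material": "Cotton", "occasions": ["Everyday Wear"]},
    {"id": 2, "material": "Polyester", "occasions": ["Activewear", "Everyday Wear"]},
    {"id": 3, "material": "Polyester", "occasions": ["Partywear"]},
]

filtered = post_filter(candidates, materials=["Polyester"], occasions=["Activewear"])
print([item["id"] for item in filtered])
```

One trade-off to keep in mind: because post-filtering runs after the search has already been limited to `max_k` candidates, it can return fewer than `max_k` items, which is why a generous `max_k` helps.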

In [None]:
import json


def retrieve_and_filter(query: str, table, filters: QueryFilters, max_k=100):
    query_parts = []

    # We prefilter on category and subcategory (always provided) and on price (when given)
    query_parts.append(f"category='{filters.category}'")
    query_parts.append(f"subcategory='{filters.subcategory}'")

    if filters.min_price:
        query_parts.append(f"price >= {filters.min_price}")
    if filters.max_price:
        query_parts.append(f"price <= {filters.max_price}")

    query_string = " AND ".join(query_parts)
    items = (
        table.search(query=query)
        .where(query_string, prefilter=True)
        .limit(max_k)
        .to_list()
    )

    items = [
        {
            **item,
            "attributes": json.loads(item["attributes"]),
            "occasions": json.loads(item["occasions"]),
        }
        for item in items
    ]

    if filters.product_type:
        items = [item for item in items if item["product_type"] in filters.product_type]

    if filters.material:
        items = [item for item in items if item["material"] in filters.material]

    if filters.occasions:
        items = [
            item
            for item in items
            if any(occasion in item["occasions"] for occasion in filters.occasions)
        ]

    if filters.attributes:
        for attr in filters.attributes:
            if not attr.values:
                continue
            curr_items = []
            for item in items:
                attr_name = attr.name
                attr_values = attr.values
                item_attr_values = item["attributes"]
                for item_attr in item_attr_values:
                    if (
                        item_attr["name"] == attr_name
                        and item_attr["value"] in attr_values
                    ):
                        curr_items.append(item)
                        break

            items = curr_items

    return items


results = retrieve_and_filter(test_query, table, generated_filter)
pd.DataFrame(results)

| | id | title | description | brand | category | subcategory | product_type | attributes | occasions | material | pattern | price | vector | _distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53 | Nike Air Sweatshirt | Stay comfortable and stylish with the Nike Air... | Nike | Women | Tops | Sweatshirts | [{'name': 'Sleeve Length', 'value': 'Long Slee... | [Everyday Wear, Casual Outings, Activewear, Sp... | Polyester | Solid | 96.37 | [0.038344867527484894, 0.02495894581079483, -0... | 1.212469 |


### Evaluating Retrieval

Next, we want to evaluate the retrieval performance of our pipeline. To do so, we'll take synthetic queries that require metadata filtering and compare recall and MRR before and after applying these filters.

We reuse the same queries as the baseline so that the two runs are directly comparable.
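
As a refresher, recall@k and MRR@k can be sketched as follows. This is a minimal illustration with hypothetical helper names (`recall_at_k`, `mrr_at_k`); the evaluators attached to `dataset` earlier in the course may be implemented differently.

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mrr_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Reciprocal rank of the first relevant id within the top k, else 0."""
    for rank, item_id in enumerate(retrieved[:k], start=1):
        if item_id in relevant:
            return 1.0 / rank
    return 0.0


# Toy example: the single relevant item (id 53) is ranked second
retrieved_ids = [7, 53, 12]
relevant_ids = {53}

print(recall_at_k(retrieved_ids, relevant_ids, k=1))  # relevant item not in top 1
print(recall_at_k(retrieved_ids, relevant_ids, k=3))  # relevant item in top 3
print(mrr_at_k(retrieved_ids, relevant_ids, k=3))     # first hit at rank 2
```

Under these definitions, recall@1 and mrr@1 are identical whenever each query has exactly one relevant item, which matches how the two metrics coincide in the results tables in this notebook.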

In [None]:
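import asyncio
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from functools import partial

from lancedb.table import Table
from pydantic_evals.dataset import set_eval_attribute


async def generate_filters_and_retrieve_items(
    user_query, table: Table, max_k: int, pool: ThreadPoolExecutor
) -> list[int]: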
    filters = await extract_query_filters(user_query, client, taxonomy_data)
    set_eval_attribute(
        "filters",
        filters.model_dump(),
    )

    loop = asyncio.get_running_loop()
    retrieval_results = await loop.run_in_executor(
        pool,
        partial(
            retrieve_and_filter,
            query=user_query,
            table=table,
            filters=filters,
            max_k=max_k,
        ),
    )

    return [item["id"] for item in retrieval_results]


with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    retrieval_result_with_filters = await dataset.evaluate(
        partial(
            generate_filters_and_retrieve_items,
            table=original_table,
            max_k=25,
            pool=executor,
        )
    )

In [89]:
import pandas as pd

scores = []
experiment_results = [retrieval_result_without_filters, retrieval_result_with_filters]

for result in experiment_results:
    result_scores = {}
    for score_name, score in result.averages().scores.items():
        result_scores[score_name] = round(score, 3)
    scores.append(result_scores)

df = pd.DataFrame(scores, index=["Without Filters", "With Filters"])

# Calculate percentage improvement
pct_improvement = {}
for metric in df.columns:
    base = df.loc["Without Filters", metric]
    improved = df.loc["With Filters", metric]
    pct_improvement[metric] = round(((improved - base) / base) * 100, 1)

# Create a more compact display with improvement percentages
result_df = pd.DataFrame(
    {
        "Without Filters": df.loc["Without Filters"],
        "With Filters": df.loc["With Filters"],
        "% Improvement": pct_improvement,
    }
)

display(result_df)

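| Metric | Without Filters | With Filters | % Improvement |
|------------|-----------------|--------------|---------------|
| mrr@1 | 0.395 | 0.816 | 106.6 |
| mrr@3 | 0.456 | 0.851 | 86.6 |
| mrr@5 | 0.472 | 0.856 | 81.4 |
| mrr@10 | 0.49 | 0.86 | 75.5 |
| mrr@15 | 0.492 | 0.86 | 74.8 |
| mrr@25 | 0.495 | 0.86 | 73.7 |
| recall@1 | 0.395 | 0.816 | 106.6 |
| recall@3 | 0.553 | 0.895 | 61.8 |
| recall@5 | 0.632 | 0.921 | 45.7 |
| recall@10 | 0.763 | 0.947 | 24.1 |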


### Results Analysis

We can see that structured metadata extraction substantially improves both MRR (Mean Reciprocal Rank) and recall. More concretely, recall@5, 10 and 15 improve by ~30% on average, while mrr@5, 10 and 15 improve by ~77% on average.

 
This substantial improvement occurs because metadata filtering effectively narrows the search space to only relevant items based on structured attributes like price, material, and product type. For example, with filters, our recall@10 reaches 0.95, which exceeds the recall@25 (0.84) of the baseline semantic search without filters. 

This means we can retrieve more relevant items with fewer results when using structured metadata, leading to a more efficient and accurate retrieval system.

This demonstrates that structured metadata extraction is an effective approach for product search and filtering. In the next portion, we'll look at how we can improve on retrieval performance by improving the captioning of our images.


### Better Captioning

We can improve retrieval performance further by improving the captioning of our items, that is, the text we embed for each one. Previously, we simply embedded the item's description. We can do better by concatenating metadata such as price, category, subcategory and occasions to it before embedding.

Let's see how this impacts our retrieval performance. We'll do so in two steps:

1. First, we'll re-ingest our data so that the embedded text uses the new format
2. Second, we'll benchmark retrieval on the new table against the table that embeds only the original description

As usual, we'll use recall and MRR to quantify the improvement, with k = [1,3,5,10,15,25]

In [76]:
from datasets import load_dataset

ds = [item for item in load_dataset("ivanleomk/labelled-ecommerce-taxonomy")["train"]]
ds[0]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=768x1024>,
 'title': 'Lace Detail Sleeveless Top',
 'brand': 'H&M',
 'description': "Elevate your casual wardrobe with this elegant sleeveless top featuring intricate lace detailing at the neckline. Perfect for both day and night, it's crafted from a soft, breathable fabric for all-day comfort.",
 'category': 'Women',
 'subcategory': 'Tops',
 'product_type': 'Tank Tops',
 'attributes': '[{"name": "Sleeve Length", "value": "Sleeveless"}, {"name": "Neckline", "value": "Crew Neck"}]',
 'material': 'Cotton',
 'pattern': 'Solid',
 'id': 1,
 'price': 181.04,
 'occasions': '["Everyday Wear", "Casual Outings", "Smart Casual", "Dinner Dates", "Partywear"]'}

In [77]:
from rich import print
import json


def format_text(item):
    attributes = json.loads(item["attributes"])
    attributes_str = "\n  -".join(
        [f"{attr['name']}: {attr['value']}" for attr in attributes]
    )
    return f"""
title: {item["title"]}
description: {item["description"]}
brand: {item["brand"]}
category: {item["category"]}
subcategory: {item["subcategory"]}
product_type: {item["product_type"]}
price: {item["price"]}
occasions: {",".join(json.loads(item["occasions"]))}
attributes: {attributes_str}
""".strip()


print(format_text(ds[0]))

In [78]:
from lancedb.pydantic import LanceModel, Vector
import lancedb
from lancedb.embeddings import get_registry


# Create Embedding Function
func = get_registry().get("openai").create(name="text-embedding-3-small")


# Define a Model that will be used as the schema for our collection
class ItemWithConcatendatedDescription(LanceModel):
    id: int
    title: str
    description: str
    text: str = func.SourceField()
    brand: str
    category: str
    subcategory: str
    product_type: str
    attributes: str
    occasions: str
    material: str
    pattern: str
    price: float
    vector: Vector(func.ndims()) = func.VectorField()


db = lancedb.connect("./lancedb")
table_name = "concat_items"

if table_name not in db.table_names():
    concat_items = db.create_table(
        table_name, schema=ItemWithConcatendatedDescription, mode="overwrite"
    )
    entries = []
    for row in ds:
        entries.append(
            {
                "id": row["id"],
                "title": row["title"],
                "description": row["description"],
                "brand": row["brand"],
                "category": row["category"],
                "subcategory": row["subcategory"],
                "product_type": row["product_type"],
                "attributes": row["attributes"],
                "material": row["material"],
                "pattern": row["pattern"],
                "price": row["price"],
                "text": format_text(row),
                "occasions": row["occasions"],
            }
        )

    concat_items.add(entries)

concat_table = db.open_table(table_name)

We want to see the impact of better captioning on retrieval performance.

To do so, we run the same evaluation dataset as before and only change which table we query, passing the table in as a parameter.
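
The parameterization relies on `functools.partial`, which binds some arguments up front and leaves the rest to be supplied later. Here is a minimal sketch with a hypothetical `retrieve_ids` stand-in and toy dictionaries in place of real LanceDB tables:

```python
from functools import partial


def retrieve_ids(query: str, table: dict[str, list[int]], max_k: int) -> list[int]:
    """Hypothetical stand-in for a retrieval function: look up pre-ranked ids."""
    return table.get(query, [])[:max_k]


# Two toy "tables" standing in for the description-only and concatenated indexes
toy_tables = {
    "items": {"polyester sweatshirt": [12, 53, 7]},
    "concat_items": {"polyester sweatshirt": [53, 12, 7]},
}

top_ids = {}
for name, toy_table in toy_tables.items():
    # Bind the table and max_k up front; the task now only needs the query
    task = partial(retrieve_ids, table=toy_table, max_k=2)
    top_ids[name] = task("polyester sweatshirt")

print(top_ids)
```

Because the evaluation only ever calls the task with a query, swapping the table is the single change between the two runs, which keeps the comparison fair.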

In [81]:
tables = [db.open_table("items"), db.open_table("concat_items")]

results = []

for query_table in tables:
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        retrieval_result = await dataset.evaluate(
            partial(
                retrieve_results,
                table=query_table,
                max_k=25,
                pool=executor,
            )
        )
        results.append(retrieval_result)

In [82]:
for result in results:
    print(format_results(result))

With the concatenated text alone, the final table below shows MRR improving by roughly 22-36% and recall by roughly 10-13% over the description-only baseline. Notice that recall@25 rises from 0.84 to 0.95, which means the richer text surfaces relevant items that didn't appear in the top 25 before. Let's see what happens when we combine this with our metadata filtering approach.

In [87]:
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    retrieval_result = await dataset.evaluate(
        partial(
            generate_filters_and_retrieve_items,
            table=concat_table,
            max_k=25,
            pool=executor,
        )
    )


In [88]:
print(format_results(retrieval_result))

Here are the final results visualised in a single table.

| Metric     | Description | Concatenated Info | Concatenated + Query Understanding |
|------------|-------------|-------------------|-----------------------------------|
| **mrr@1**  | 0.39 | 0.53 (+35.9%) | 0.82 (+110.3%) |
| **mrr@3**  | 0.46 | 0.56 (+21.7%) | 0.85 (+84.8%) |
| **mrr@5**  | 0.47 | 0.59 (+25.5%) | 0.85 (+80.8%) |
| **mrr@10** | 0.49 | 0.60 (+22.4%) | 0.86 (+75.5%) |
| **mrr@15** | 0.49 | 0.61 (+24.5%) | 0.86 (+75.5%) |
| **mrr@25** | 0.49 | 0.61 (+24.5%) | 0.86 (+75.5%) |
| **recall@1**  | 0.39 | 0.53 (+35.9%) | 0.82 (+110.3%) |
| **recall@3**  | 0.55 | 0.61 (+10.9%) | 0.89 (+61.8%) |
| **recall@5**  | 0.63 | 0.71 (+12.7%) | 0.89 (+41.3%) |
| **recall@10** | 0.76 | 0.84 (+10.5%) | 0.95 (+25.0%) |
| **recall@15** | 0.79 | 0.87 (+10.1%) | 0.95 (+20.3%) |
| **recall@25** | 0.84 | 0.95 (+13.1%) | 0.95 (+13.1%) |



## Conclusion

In this notebook, we looked at how combining semantic search with metadata filtering improves retrieval performance - in our specific case, we achieved ~30% higher recall@5,10,15 and ~77% higher MRR@5,10,15 with query understanding. 

By enriching item descriptions with structured metadata and mapping user queries to the same taxonomy through query understanding, we improved retrieval, and we used the evaluation framework established in Week 1 to quantitatively measure those improvements. By identifying query patterns (Week 4) and adding relevant metadata filters, we ensure our system can handle real-world queries that require both relevance and specific criteria.

In short, better retrieval requires both improved search and relevant indexes.

In the next notebook, we'll take this even further by implementing text-to-SQL capabilities, allowing our system to answer complex questions about product availability and user preferences. This sets us up for Week 6, where we'll explore how to handle tool calling reliably by measuring the performance of our function calling.