# 🔍 Predicting Item Prices from Descriptions (Part 3)
---
- Data Curation & Preprocessing
- Model Benchmarking – Traditional ML vs LLMs
- ➡️E5 Embeddings & RAG
- Fine-Tuning GPT-4o Mini
- Evaluating LLaMA 3.1 8B Quantized
- Fine-Tuning LLaMA 3.1 with QLoRA
- Evaluating Fine-Tuned LLaMA
- Summary & Leaderboard

---

# 🧠 Part 3: E5 Embeddings & RAG

- 🧑‍💻 Skill Level: Advanced
- ⚙️ Hardware: ⚠️ GPU required for embeddings (400K items) - use Google Colab
- 🛠️ Requirements: 🔑 HF Token, OpenAI API Key
- Tasks:
    - Preprocessed item descriptions
    - Generated and stored embeddings in ChromaDB
    - Trained XGBoost on embeddings, pushed to HF Hub, and ran predictions
    - Predicted prices with GPT-4o Mini using RAG

Is Word2Vec enough for XGBoost, or do contextual E5 embeddings perform better?

Does retrieval improve price prediction for GPT-4o Mini?

Let’s find out.

⚠️ This notebook assumes basic familiarity with RAG and contextual embeddings.
We use the same E5 embedding space for both XGBoost and GPT-4o Mini with RAG, enabling a fair comparison.
Embeddings are stored and queried via ChromaDB — no LangChain is used for creation or retrieval.

---
📢 Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)

In [None]:
# Install required packages in Google Colab
%pip install -q tqdm huggingface_hub numpy sentence-transformers datasets chromadb xgboost openai

In [None]:
# imports

import math
import chromadb
import re
import joblib
import os
from tqdm import tqdm
import gc
from huggingface_hub import login, HfApi
import numpy as np
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from google.colab import userdata
from xgboost import XGBRegressor
from openai import OpenAI
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Mount Google Drive to access persistent storage

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Google Colab User Data
# Ensure you have set the following in your Google Colab environment:
openai_api_key = userdata.get("OPENAI_API_KEY")
hf_token = userdata.get('HF_TOKEN')

In [None]:
openai = OpenAI(api_key=openai_api_key)
login(hf_token, add_to_git_credential=True)

# Configuration
ROOT = "/content/drive/MyDrive/deal_finder"
CHROMA_PATH = f"{ROOT}/chroma"

In [None]:
# Helper class for evaluating model predictions

GREEN = "\033[92m"
YELLOW = "\033[93m"
RED = "\033[91m"
RESET = "\033[0m"
COLOR_MAP = {"red":RED, "orange": YELLOW, "green": GREEN}

class Tester:

    def __init__(self, predictor, data, title=None, size=250):
        self.predictor = predictor
        self.data = data
        self.title = title or predictor.__name__.replace("_", " ").title()
        self.size = size
        self.guesses = []
        self.truths = []
        self.errors = []
        self.sles = []
        self.colors = []

    def color_for(self, error, truth):
        if error<40 or error/truth < 0.2:
            return "green"
        elif error<80 or error/truth < 0.4:
            return "orange"
        else:
            return "red"

    def run_datapoint(self, i):
        datapoint = self.data[i]
        guess = self.predictor(datapoint)
        truth = datapoint["price"]
        error = abs(guess - truth)
        log_error = math.log(truth+1) - math.log(guess+1)
        sle = log_error ** 2
        color = self.color_for(error, truth)
        # title = datapoint["text"].split("\n\n")[1][:20] + "..."
        self.guesses.append(guess)
        self.truths.append(truth)
        self.errors.append(error)
        self.sles.append(sle)
        self.colors.append(color)
        # print(f"{COLOR_MAP[color]}{i+1}: Guess: ${guess:,.2f} Truth: ${truth:,.2f} Error: ${error:,.2f} SLE: {sle:,.2f} Item: {title}{RESET}")

    def chart(self, title):
        # max_error = max(self.errors)
        plt.figure(figsize=(12, 8))
        max_val = max(max(self.truths), max(self.guesses))
        plt.plot([0, max_val], [0, max_val], color='deepskyblue', lw=2, alpha=0.6)
        plt.scatter(self.truths, self.guesses, s=3, c=self.colors)
        plt.xlabel('Ground Truth')
        plt.ylabel('Model Estimate')
        plt.xlim(0, max_val)
        plt.ylim(0, max_val)
        plt.title(title)

        # Add color legend
        from matplotlib.lines import Line2D
        legend_elements = [
            Line2D([0], [0], marker='o', color='w', label='Accurate (green)', markerfacecolor='green', markersize=8),
            Line2D([0], [0], marker='o', color='w', label='Medium error (orange)', markerfacecolor='orange', markersize=8),
            Line2D([0], [0], marker='o', color='w', label='High error (red)', markerfacecolor='red', markersize=8)
        ]
        plt.legend(handles=legend_elements, loc='upper right')

        plt.show()


    def report(self):
        average_error = sum(self.errors) / self.size
        rmsle = math.sqrt(sum(self.sles) / self.size)
        hits = sum(1 for color in self.colors if color=="green")
        title = f"{self.title} Error=${average_error:,.2f} RMSLE={rmsle:,.2f} Hits={hits/self.size*100:.1f}%"
        self.chart(title)

    def run(self):
        self.error = 0
        for i in range(self.size):
            self.run_datapoint(i)
        self.report()

    @classmethod
    def test(cls, function, data):
        cls(function, data).run()


## 📥 Load Dataset

In [None]:
# If you hit "NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported", run:
# %pip install -U datasets

In [None]:
HF_USER = "lisekarimi"
DATASET_NAME = f"{HF_USER}/pricer-data"

dataset = load_dataset(DATASET_NAME)
train = dataset['train']
test = dataset['test']

In [None]:
print(train[0]["text"])

In [None]:
print(train[0]["price"])

## 📦 Embed + Save Training Data to Chroma
- No LangChain used.
- We use `intfloat/e5-small-v2` for embeddings:
    - Fast, high-quality, retrieval-tuned
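    - **Requires a `passage:` prefix** on each input (E5 models expect a `query:` or `passage:` prefix; this notebook uses `passage:` for both stored items and test items)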
- We embed item descriptions and store them in ChromaDB, with price saved as metadata.

In [None]:
# Load embedding model
model_embedding = SentenceTransformer("intfloat/e5-small-v2", device='cuda')

In [None]:
# Init Chroma
client = chromadb.PersistentClient(path=CHROMA_PATH)
collection = client.get_or_create_collection(name="price_items")

In [None]:
# Format description function (no price in text)
def description(item):
    text = item["text"].replace("How much does this cost to the nearest dollar?\n\n", "")
    text = text.split("\n\nPrice is $")[0]
    return f"passage: {text}"

description(train[0])
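
As a quick check of what `description` keeps and drops, the sketch below runs the same logic on a made-up record. The sample text is invented; its layout is inferred from the strings that `description` strips, so treat it as an assumption about the dataset format.

```python
def description(item):
    # Same logic as the notebook cell: drop the question and the price suffix
    text = item["text"].replace("How much does this cost to the nearest dollar?\n\n", "")
    text = text.split("\n\nPrice is $")[0]
    return f"passage: {text}"

# Hypothetical datapoint, shaped like the prompt format the function expects
sample = {
    "text": (
        "How much does this cost to the nearest dollar?\n\n"
        "Acme Cordless Drill\n18V drill with two batteries\n\n"
        "Price is $89.00"
    ),
    "price": 89.0,
}

doc = description(sample)
print(doc)
```

The price never reaches the embedded text, so it cannot leak into the vectors; it is stored only as metadata.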

In [None]:
batch_size = 300    # how many items to insert into Chroma at once
encode_batch_size = 1024  # how many items to encode at once in GPU memory

for i in tqdm(range(0, len(train), batch_size), desc="Processing batches"):

    end_idx = min(i + batch_size, len(train))

    # Collect documents and metadata
    documents = [description(train[j]) for j in range(i, end_idx)]
    metadatas = [{"price": train[j]["price"]} for j in range(i, end_idx)]
    ids = [f"doc_{j}" for j in range(i, end_idx)]

    # GPU batch encoding
    vectors = model_embedding.encode(
        documents,
        batch_size=encode_batch_size,
        show_progress_bar=False,
        normalize_embeddings=True
    ).tolist()

    # Insert into Chroma
    collection.add(
        ids=ids,
        documents=documents,
        embeddings=vectors,
        metadatas=metadatas
    )

print("✅ Embedding and storage to ChromaDB completed.")
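
Because the loop passes `normalize_embeddings=True`, every stored vector has unit length. For unit vectors, squared Euclidean distance and cosine similarity are tied by the identity ||a - b||² = 2 - 2·cos(a, b), so ranking neighbors by either measure gives the same order. The sketch below checks that identity on toy 3-dimensional vectors; the vectors and helper names are illustrative only, and whether the collection actually uses squared L2 depends on its Chroma configuration.

```python
import math

def normalize(v):
    # Scale a vector to unit length
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # For unit vectors the dot product equals the cosine similarity
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

lhs = squared_l2(a, b)
rhs = 2 - 2 * cosine(a, b)
print(f"squared L2 = {lhs:.6f}, 2 - 2*cos = {rhs:.6f}")
```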

In [None]:
# Now flush and clean
print("🧹 Cleaning up and saving ChromaDB...")
client = None
gc.collect()

Our ChromaDB is currently saved in a persistent Google Drive path; for a production-ready app, we recommend uploading it to AWS S3 for better reliability and scalability.

🧩 Now that we've generated the E5 embeddings, let's use them for both **XGBoost regression** and **GPT-4o Mini with RAG**.

## 📈 Embedding-Based Regression with XGBoost

In [None]:
# Step 1: Load vectors and prices from Chroma
result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
prices = [meta['price'] for meta in result['metadatas']]

In [None]:
# Step 2: Train XGBoost model
xgb_model = XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1, verbosity=0)
xgb_model.fit(vectors, prices)

In [None]:
# Step 3: Serialize XGBoost model locally for Hugging Face upload
MODEL_DIR = os.path.join(ROOT, "models")
MODEL_FILENAME = "xgboost_model.pkl"
LOCAL_MODEL = os.path.join(MODEL_DIR, MODEL_FILENAME)

os.makedirs(MODEL_DIR, exist_ok=True)
joblib.dump(xgb_model, LOCAL_MODEL)

In [None]:
# Step 4: Push serialized XGBoost model to Hugging Face Hub
api = HfApi(token=hf_token)
REPO_NAME = "smart-deal-finder-models"
REPO_ID = f"{HF_USER}/{REPO_NAME}"

# Create the model repo if it doesn't exist
api.create_repo(repo_id=REPO_ID, repo_type="model", private=True, exist_ok=True)

# Upload the saved model
api.upload_file(
    path_or_fileobj=LOCAL_MODEL,
    path_in_repo=MODEL_FILENAME,
    repo_id=REPO_ID,
    repo_type="model"
)

In [None]:
# Step 5: Define the predictor
def xgb_predictor(datapoint):
    doc = description(datapoint)
    vector = model_embedding.encode([doc], normalize_embeddings=True)[0]
    return max(0, xgb_model.predict([vector])[0])

🔔 Reminder: In Part 2, XGBoost with Word2Vec (non-contextual embeddings) achieved:
- Avg. Error: ~$107
- RMSLE: 0.83
- Accuracy: 29.20%

🧪 Now, let’s see if contextual embeddings improve XGBoost.

In [None]:
# Step 6: Run the Tester on a subset of test data
tester = Tester(xgb_predictor, test)
tester.run()

Xgb Predictor Error=$110.68 RMSLE=0.93 Hits=30.4%

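Average error and hit rate are close to the Word2Vec run (~$107 → $110.68 and 29.2% → 30.4%), while RMSLE is somewhat worse (0.83 → 0.93). In this setup, switching to contextual embeddings didn’t yield performance gains for XGBoost.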

## 🚰 Retrieval-Augmented Pipeline – GPT-4o Mini

- Preprocess: clean the input text (`description(item)`)
- Embed: generate the embedding vector (`get_embedding(item)`)
- Retrieve: find similar items in ChromaDB (`find_similars(item)`)
- Build Prompt: create the LLM prompt from the retrieved context and the masked target (`format_context`, `mask_price_value`, `build_messages`)
- Predict: get the price estimate from the LLM (`gpt_4o_mini_rag(item)`)

In [None]:
test[1]

In [None]:
# Step 1: Preprocess test item text
# (uses the same `description(item)` function as during training)
description(test[1])

In [None]:
# Step 2: Embed a test item
# (normalized, to match how the stored vectors were encoded)
def get_embedding(item):
    return model_embedding.encode([description(item)], normalize_embeddings=True)

In [None]:
# Step 3: Query Chroma for similar items
def find_similars(item):
    results = collection.query(query_embeddings=get_embedding(item).astype(float).tolist(), n_results=5)
    documents = results['documents'][0][:]
    prices = [m['price'] for m in results['metadatas'][0][:]]
    return documents, prices
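
Conceptually, the query returns the `n_results` stored items whose vectors are closest to the query vector. The sketch below does the same thing with a linear scan over a tiny in-memory list. It is an illustration only: `top_k` and `store` are hypothetical names, the 2-dimensional vectors are invented, and Chroma uses its own index rather than a brute-force scan.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, store, k=2):
    """Return (documents, prices) of the k stored items nearest to `query`."""
    ranked = sorted(store, key=lambda row: squared_l2(query, row["embedding"]))[:k]
    return [r["document"] for r in ranked], [r["price"] for r in ranked]

# Toy store: document text, embedding, and price metadata per item
store = [
    {"document": "passage: cordless drill", "embedding": [0.9, 0.1], "price": 89.0},
    {"document": "passage: drill bit set", "embedding": [0.8, 0.3], "price": 25.0},
    {"document": "passage: garden hose", "embedding": [0.1, 0.9], "price": 30.0},
]

nearest_docs, nearest_prices = top_k([1.0, 0.0], store, k=2)
print(nearest_docs, nearest_prices)
```

As in `find_similars`, the result is a pair of parallel lists, so each retrieved document lines up with its price.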

In [None]:
documents, prices = find_similars(test[1])
documents, prices

In [None]:
# Step 4: Format similar items as context
def format_context(similars, prices):
    message = "To provide some context, here are some other items that might be similar to the item you need to estimate.\n\n"
    for similar, price in zip(similars, prices):
        message += f"Potentially related product:\n{similar}\nPrice is ${price:.2f}\n\n"
    return message

In [None]:
print(format_context(documents, prices))
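
For readers without the ChromaDB collection at hand, this standalone sketch shows the shape of the context block on two invented neighbors. The function body is copied from the cell above; the documents and prices are made up.

```python
def format_context(similars, prices):
    message = "To provide some context, here are some other items that might be similar to the item you need to estimate.\n\n"
    for similar, price in zip(similars, prices):
        message += f"Potentially related product:\n{similar}\nPrice is ${price:.2f}\n\n"
    return message

# Invented neighbors standing in for a real retrieval result
toy_docs = ["passage: cordless drill", "passage: drill bit set"]
toy_prices = [89.0, 25.0]

context = format_context(toy_docs, toy_prices)
print(context)
```

Each neighbor keeps its `passage:` prefix because the stored documents are the prefixed strings.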

In [None]:
# Step 5: Mask the price in the test item
def mask_price_value(text):
    return re.sub(r"(\n\nPrice is \$).*", r"\1", text)
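
The sketch below applies the masking regex to an invented record, then applies the two `replace` calls that `build_messages` uses afterwards, to show what text finally reaches the model. The sample string is hypothetical and assumes the prompt layout implied by the regex.

```python
import re

def mask_price_value(text):
    # Keep the "Price is $" marker, drop whatever follows it on that line
    return re.sub(r"(\n\nPrice is \$).*", r"\1", text)

sample = (
    "How much does this cost to the nearest dollar?\n\n"
    "Acme Cordless Drill\n\n"
    "Price is $89.00"
)

masked = mask_price_value(sample)

# Same clean-up that build_messages performs on the masked text
prompt = masked.replace(" to the nearest dollar", "").replace("\n\nPrice is $", "")
print(repr(masked))
print(repr(prompt))
```

The true price is removed before the prompt is built, so the model only sees the prices of the retrieved neighbors.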

In [None]:
# Step 6: Build LLM messages
def build_messages(datapoint, similars, prices):

    system_message = "You estimate prices of items. Reply only with the price, no explanation."

    context = format_context(similars, prices)

    prompt = mask_price_value(datapoint["text"])
    prompt = prompt.replace(" to the nearest dollar", "").replace("\n\nPrice is $", "")

    user_prompt = context + "And now the question for you:\n\n" + prompt

    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Price is $"}
    ]

In [None]:
build_messages(test[1], documents, prices)

In [None]:
# Step 7: Run prediction
def get_price(s):
    s = s.replace('$','').replace(',','')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

def gpt_4o_mini_rag(item):
    documents, prices = find_similars(item)
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_messages(item, documents, prices),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)
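
To see how `get_price` behaves on different reply shapes, here is a standalone run on a few invented model replies. The function body matches the cell above; the replies are made up.

```python
import re

def get_price(s):
    # Strip "$" and thousands separators, then take the first number found
    s = s.replace('$', '').replace(',', '')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

replies = ["89.99", "$1,234.50", "around 45 dollars", "no idea"]
parsed = [get_price(r) for r in replies]
print(parsed)
```

A reply with no digits falls back to 0, which the `Tester` then scores as a large error rather than raising an exception.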

In [None]:
print(test[1]["price"])
print(gpt_4o_mini_rag(test[1]))

🔔 Reminder: In Part 2, GPT-4o Mini (without RAG) achieved:
- Avg. Error: ~$99
- RMSLE: 0.75
- Accuracy: 44.8%

🧪 Let’s find out if RAG can boost GPT-4o Mini’s price prediction capabilities.
  

In [None]:
Tester.test(gpt_4o_mini_rag, test)

Gpt 4O Mini Rag Error=$59.54 RMSLE=0.42 Hits=69.2%

🎉 **GPT-4o Mini + RAG shows clear gains:**  
Average error dropped from **$99 → $59.54**, RMSLE from **0.75 → 0.42**, and accuracy rose from **44.8% → 69.2%**.  

Adding retrieval-based context led to a strong performance boost for GPT-4o Mini.
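
To put the gains in relative terms, this short sketch computes the fractional reduction in average error and RMSLE from the figures reported above. `relative_drop` is a hypothetical helper, and the Part 2 error is approximate (~$99), so the first percentage is approximate too.

```python
def relative_drop(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before

error_drop = relative_drop(99.0, 59.54)  # Part 2 baseline is approximate
rmsle_drop = relative_drop(0.75, 0.42)

print(f"Average error reduced by {error_drop:.1%}")
print(f"RMSLE reduced by {rmsle_drop:.1%}")
```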

Now the question is — can fine-tuning push it even further, surpass RAG, and challenge larger models?

🔜 See you in the [next notebook](https://github.com/lisekarimi/lexo/blob/main/09_part4_ft_gpt4omini.ipynb)