# Testing mixedbread model with normalization
We run everything up to the embedding step exactly as in notebook 02.

In [1]:
from utils import load_processed_data
docs_for_splitter, questions_ground_truth = load_processed_data("../data/processed/squad_processed.pkl")

from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer
from langchain_text_splitters import RecursiveCharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=512, chunk_overlap=75
)
splits = splitter.split_documents(docs_for_splitter)

This time, we normalize the embedding vectors and use cosine similarity when retrieving documents.
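As a quick sanity check on what normalization changes, here is a minimal sketch using random vectors as hypothetical stand-ins for real embeddings (the names `query`, `docs`, and `l2_normalize` are illustrative and not part of this notebook). It shows that for unit-length vectors, squared L2 distance and cosine similarity are related by `||d - q||^2 = 2 - 2 * cos(d, q)`, so both produce the same ranking.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for one query embedding and five document embeddings.
query = rng.normal(size=8)
docs = rng.normal(size=(5, 8))


def l2_normalize(x):
    """Scale each vector (along the last axis) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


q = l2_normalize(query)
d = l2_normalize(docs)

# For unit vectors, cosine similarity is just the dot product.
cosine = d @ q
squared_l2 = np.sum((d - q) ** 2, axis=1)

# Identity for unit vectors: ||d - q||^2 = 2 - 2 * cos(d, q)
identity_holds = np.allclose(squared_l2, 2 - 2 * cosine)

# Highest cosine first vs. smallest distance first.
rank_by_cosine = np.argsort(-cosine)
rank_by_l2 = np.argsort(squared_l2)
```

The rankings only differ between the two metrics when the vectors are not normalized, which is the setup this notebook compares against.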

In [2]:
import os
from langchain_chroma import Chroma

persist_directory = "./chroma/03_mxb_with_normalization"
embed = HuggingFaceEmbeddings(model_name="mixedbread-ai/mxbai-embed-large-v1",
                              encode_kwargs={'normalize_embeddings': True})
# Reuse the persisted index if one exists; otherwise embed the splits and build it.
if os.path.exists(persist_directory):
    print("Loading existing embeddings...")
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embed
    )
else:
    print("No existing index found. Generating embeddings...")
    vectorstore = Chroma.from_documents(
        documents=splits,
        embedding=embed,
        persist_directory=persist_directory,
        # Use cosine distance in the HNSW index.
        collection_metadata={"hnsw:space": "cosine"}
    )

retriever = vectorstore.as_retriever()

No existing index found. Generating embeddings...


In [3]:
from utils import evaluate_retrieval

results = evaluate_retrieval(questions_ground_truth, retriever)

Starting evaluation on 2265 questions...


100%|██████████| 2265/2265 [09:29<00:00,  3.98it/s]


--- Evaluation Results ---
MRR: 0.8201
Hit Rate@1: 74.35%
Hit Rate@3: 88.26%
Hit Rate@5: 91.83%
Hit Rate@7: 94.35%
Hit Rate@10: 95.63%





We see a marginal improvement over the non-normalized vectors with L2 distance.
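For reference, the metrics reported above can be computed from the rank at which the first relevant chunk appears for each question. The sketch below is an assumption about how such metrics are typically calculated, not the actual `evaluate_retrieval` implementation in `utils`, which is not shown here; the function name `mrr_and_hit_rates` and the toy `ranks` list are hypothetical.

```python
def mrr_and_hit_rates(first_hit_ranks, ks=(1, 3, 5, 7, 10)):
    """Compute MRR and Hit Rate@k.

    first_hit_ranks: for each question, the 1-based rank of the first
    relevant retrieved chunk, or None if no relevant chunk was retrieved.
    """
    n = len(first_hit_ranks)
    # Reciprocal rank is 1/rank for a hit and 0 for a miss.
    mrr = sum(1.0 / r for r in first_hit_ranks if r is not None) / n
    # Hit Rate@k is the fraction of questions with a relevant chunk in the top k.
    hit_rates = {
        k: sum(1 for r in first_hit_ranks if r is not None and r <= k) / n
        for k in ks
    }
    return mrr, hit_rates


# Toy example: four questions, one of which has no relevant chunk retrieved.
ranks = [1, 2, None, 4]
mrr, hits = mrr_and_hit_rates(ranks)
```

In this toy example the reciprocal ranks are 1, 0.5, 0, and 0.25, so the MRR is their mean.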