# Testing mixedbread model


In [1]:
from utils import load_processed_data
docs_for_splitter, questions_ground_truth = load_processed_data("../data/processed/squad_processed.pkl")

Splitting the paragraphs into the docs that will be embedded. Here I used the mixedbread model's own tokenizer with RecursiveCharacterTextSplitter, so chunk sizes are measured in the model's tokens. This way I hope to use the whole context window and minimize the number of sentences that are split across chunks.

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer
from langchain_text_splitters import RecursiveCharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=512, chunk_overlap=75
)
splits = splitter.split_documents(docs_for_splitter)

print(f"Number of chunks: {len(splits)}")


Number of chunks: 581
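
To see how the chunk count relates to the source paragraphs, one option is to count chunks per paragraph. This is only a sketch: it assumes every chunk keeps the `para_id` metadata of the paragraph it came from, and `split_paragraphs` and `toy_para_ids` are hypothetical names with made-up values, not part of the notebook.

```python
from collections import Counter


def split_paragraphs(para_ids):
    """Map each paragraph id that produced more than one chunk to its chunk count."""
    counts = Counter(para_ids)
    return {pid: n for pid, n in counts.items() if n > 1}


# In the notebook this list could be built as:
#   para_ids = [doc.metadata["para_id"] for doc in splits]
# The ids below are toy values for illustration only.
toy_para_ids = [0, 1, 1, 2]
multi = split_paragraphs(toy_para_ids)
print(multi)
```

An empty result would mean that no paragraph exceeded the chunk size.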


For the first test we will look at how the mixedbread embedding model performs.<br>
Above we see that 581 chunks came from 580 documents. Since almost every paragraph fits into the context window as a single chunk, the overlap barely makes a difference here.<br>
Embedding the documents:

In [3]:
import os
from langchain_chroma import Chroma

persist_directory = "./chroma/02_mxb"
embed = HuggingFaceEmbeddings(model_name="mixedbread-ai/mxbai-embed-large-v1")
if os.path.exists(persist_directory):
    print("Loading existing embeddings...")
    vectorstore = Chroma(
        persist_directory=persist_directory, 
        embedding_function=embed
    )
else:
    print("No existing index found. Generating embeddings...")
    vectorstore = Chroma.from_documents(
        documents=splits,
        embedding=embed,
        persist_directory=persist_directory
    )

retriever = vectorstore.as_retriever()

Loading existing embeddings...


A quick test shows that the correct paragraph is retrieved in first place.

In [4]:
question_idx = 1003  # named question_idx to avoid shadowing the built-in id()
sample = questions_ground_truth[question_idx]
print(f"Question: {sample['question']}")
print(f"Target paragraph id: {sample['ground_truth_para_id']}")

print(retriever.invoke(sample['question']))

Question: What did Regis McKenna call the "1984" ad that was aired during the Super Bowl?
Target paragraph id: 270
[Document(id='a080ab18-1134-45a5-b1db-5d1fb99cf51d', metadata={'para_id': 270, 'article': 'Macintosh'}, page_content='After the Lisa\'s announcement, John Dvorak discussed rumors of a mysterious "MacIntosh" project at Apple in February 1983. The company announced the Macintosh 128K—manufactured at an Apple factory in Fremont, California—in October 1983, followed by an 18-page brochure included with various magazines in December. The Macintosh was introduced by a US$1.5 million Ridley Scott television commercial, "1984". It most notably aired during the third quarter of Super Bowl XVIII on January 22, 1984, and is now considered a "watershed event" and a "masterpiece." Regis McKenna called the ad "more successful than the Mac itself." "1984" used an unnamed heroine to represent the coming of the Macintosh (indicated by a Picasso-style picture of the computer on her whit
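
The raw `Document` list is hard to scan, so a compact check might compare the `para_id` of each result against the target instead. The helper below is a hypothetical sketch, not part of `utils`; only the first id (270) comes from the output above, the remaining ids are invented for illustration.

```python
def rank_of_target(retrieved_para_ids, target_para_id):
    """Return the 1-based rank of the target paragraph, or None if it was not retrieved."""
    for rank, para_id in enumerate(retrieved_para_ids, start=1):
        if para_id == target_para_id:
            return rank
    return None


# In the notebook the ids could be collected as:
#   docs = retriever.invoke(question)
#   retrieved_para_ids = [doc.metadata["para_id"] for doc in docs]
# Only 270 is taken from the output above; the other ids are made up.
example_ids = [270, 268, 271, 12]
rank = rank_of_target(example_ids, 270)
print(f"Target found at rank: {rank}")
```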

In [6]:
from utils import evaluate_retrieval

results = evaluate_retrieval(questions_ground_truth, retriever)

Starting evaluation on 2265 questions...


100%|██████████| 2265/2265 [10:00<00:00,  3.77it/s]


--- Evaluation Results ---
MRR: 0.8140
Hit Rate@1: 73.42%
Hit Rate@3: 88.21%
Hit Rate@5: 92.01%
Hit Rate@7: 93.82%
Hit Rate@10: 95.23%
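
For reference, these metrics are commonly defined as follows: MRR averages the reciprocal of the rank at which the target paragraph first appears (0 if it is missing), and Hit Rate@k is the share of questions whose target appears in the top k results. The sketch below follows those standard definitions on toy data. It is not the code of `evaluate_retrieval`, whose implementation is not shown here, and all function names and ids are hypothetical.

```python
def reciprocal_rank(ranked_ids, target_id):
    """Return 1/rank of the first match, or 0.0 if the target is absent."""
    for rank, para_id in enumerate(ranked_ids, start=1):
        if para_id == target_id:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids, target_id, k):
    """Return True if the target is among the top k results."""
    return target_id in ranked_ids[:k]


def summarize(runs, ks=(1, 3, 5, 7, 10)):
    """Compute MRR and Hit Rate@k over (ranked_ids, target_id) pairs."""
    n = len(runs)
    mrr = sum(reciprocal_rank(ids, target) for ids, target in runs) / n
    hit_rates = {
        k: sum(hit_at_k(ids, target, k) for ids, target in runs) / n
        for k in ks
    }
    return mrr, hit_rates


# Toy rankings (made-up ids): target at rank 1, rank 2, missing, and rank 4.
toy_runs = [
    ([270, 5, 9], 270),
    ([5, 270, 9], 270),
    ([5, 9, 11], 270),
    ([4, 6, 7, 270], 270),
]
mrr, hit_rates = summarize(toy_runs, ks=(1, 3))
print(f"MRR: {mrr:.4f}")
for k, rate in hit_rates.items():
    print(f"Hit Rate@{k}: {rate:.2%}")
```

Under these definitions, Hit Rate@1 counts only questions whose target is ranked first, while MRR also gives partial credit to targets found at lower ranks, so MRR is never lower than Hit Rate@1.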



