# Making some noise
So far the metrics have come out very high, which is expected since there is only so much data to sift through. The idea here is to add text that is similar to the existing corpus both in general topic and in writing style, so I will pull 19 Wikipedia articles and chunk them. With that noise in place, we will test retrieval both with and without the cross-encoder reranker.

In [1]:
import wikipedia
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer
from langchain_core.documents import Document

similar_articles = [
    "Operating_system", "Central_processing_unit", "Network_security", 
    "Logic_gate", "Relational_database", "Virtual_machine", 
    "Compiler", "Data_structure", "Ethernet", "Cryptography", 
    "Internet_protocol_suite", "Algorithm", "Embedded_system", 
    "Random-access_memory", "Microprocessor", "Cloud_computing",
    "Flash_memory", "Integrated_circuit", "Silicon_Valley"
]

# Same as in notebook 04
tokenizer = AutoTokenizer.from_pretrained("google/embeddinggemma-300m")
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=512, chunk_overlap=75
)

noise_docs = []
print(f"Fetching {len(similar_articles)} new articles for noise")

for title in similar_articles:
    try:
        page = wikipedia.page(title, auto_suggest=False)
        chunks = splitter.split_text(page.content)
        
        for chunk in chunks:
            noise_docs.append(Document(
                page_content=chunk,
                metadata={
                    # Placeholder id so a noise chunk never counts as a hit
                    # (assumes no real paragraph uses 9999)
                    "para_id": 9999,
                    "article": title,
                    "is_noise": True
                }
            ))
    except Exception as e:
        print(f"Skipping {title}: {e}")

print(f"Created {len(noise_docs)} chunks of thematic noise from wikipedia articles.")

Fetching 19 new articles for noise


Token indices sequence length is longer than the specified maximum sequence length for this model (2077 > 2048). Running this sequence through the model will result in indexing errors


Created 369 chunks of thematic noise from wikipedia articles.


With 369 chunks, noise makes up just under 40% of all chunks in the index. For now we treat every noise chunk as a wrong answer, even though some questions might be answerable from this noise.

In [2]:
from utils import load_processed_data
docs_for_splitter, questions = load_processed_data()
splits = splitter.split_documents(docs_for_splitter)

In [3]:
ready_chunks = splits + noise_docs
print(f"Chunks for Chroma: {len(ready_chunks)}")

Chunks for Chroma: 950
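
As a quick check of the noise share mentioned earlier, the two counts printed above (369 noise chunks, 950 chunks in total) can be plugged in directly. This is only arithmetic on the printed numbers; the variable names are made up for illustration.

```python
# Counts copied from the cell outputs above
noise_chunks = 369
total_chunks = 950

# Everything that is not noise comes from the original dataset
original_chunks = total_chunks - noise_chunks
noise_share = noise_chunks / total_chunks

print(f"Original chunks: {original_chunks}")   # 581
print(f"Noise share: {noise_share:.1%}")       # 38.8%
```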


Everything is now ready for creating the vector store. After that, the next step is testing.

In [2]:
import os
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

persist_directory = "./chroma/06_testing_with_noise"
embed = HuggingFaceEmbeddings(model_name="google/embeddinggemma-300m",
                              encode_kwargs={'normalize_embeddings': True})
if os.path.exists(persist_directory):
    print("Loading existing embeddings...")
    vectorstore = Chroma(
        persist_directory=persist_directory, 
        embedding_function=embed
    )
else:
    print("No existing index found. Generating embeddings...")
    vectorstore = Chroma.from_documents(
        documents=ready_chunks,
        embedding=embed,
        persist_directory=persist_directory,
        collection_metadata={"hnsw:space": "cosine"}
    )

Loading existing embeddings...


In [1]:
from utils import load_mini_question_set
questions_mini = load_mini_question_set()

In [3]:
from langchain_classic.retrievers import ContextualCompressionRetriever
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

retriever = vectorstore.as_retriever()

model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=model, top_n=10)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, 
    base_retriever=retriever
)

In [4]:
from utils import evaluate_retrieval

In [5]:
results_no_rerank = evaluate_retrieval(questions_mini, retriever)

Starting evaluation on 350 questions...


100%|██████████| 350/350 [00:35<00:00,  9.99it/s]


--- Evaluation Results ---
MRR: 0.8088
Hit Rate@1: 72.29%
Hit Rate@3: 87.71%
Hit Rate@5: 91.71%
Hit Rate@7: 94.57%
Hit Rate@10: 95.71%





In [6]:
results_rerank = evaluate_retrieval(questions_mini, compression_retriever)

Starting evaluation on 350 questions...


100%|██████████| 350/350 [02:19<00:00,  2.51it/s]


--- Evaluation Results ---
MRR: 0.9060
Hit Rate@1: 86.57%
Hit Rate@3: 95.14%
Hit Rate@5: 95.71%
Hit Rate@7: 95.71%
Hit Rate@10: 95.71%
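
`evaluate_retrieval` lives in `utils` and is not shown in this notebook, so the following is only a minimal sketch of how MRR and Hit Rate@k are commonly computed, not the actual implementation. It assumes each question has a single target `para_id` and that retrieval yields a ranked list of `para_id`s. The names `reciprocal_rank` and `hit_at_k` and the toy data are hypothetical. Since noise chunks carry `para_id` 9999, they can never match a target and only push the correct paragraph further down the ranking.

```python
def reciprocal_rank(retrieved_ids, target_id):
    """Return 1/rank of the first match, or 0.0 if the target is absent."""
    for rank, para_id in enumerate(retrieved_ids, start=1):
        if para_id == target_id:
            return 1.0 / rank
    return 0.0


def hit_at_k(retrieved_ids, target_id, k):
    """True if the target appears among the first k retrieved ids."""
    return target_id in retrieved_ids[:k]


# Toy data (hypothetical): (ranked para_ids, target para_id); 9999 marks noise
runs = [
    ([12, 9999, 40], 12),    # target at rank 1
    ([9999, 7, 9999], 7),    # noise pushes the target to rank 2
    ([9999, 9999, 3], 5),    # target not retrieved at all
]

mrr = sum(reciprocal_rank(ids, target) for ids, target in runs) / len(runs)
hit_rate_1 = sum(hit_at_k(ids, target, 1) for ids, target in runs) / len(runs)
hit_rate_3 = sum(hit_at_k(ids, target, 3) for ids, target in runs) / len(runs)

print(f"MRR: {mrr:.4f}")
print(f"Hit Rate@1: {hit_rate_1:.2%}")
print(f"Hit Rate@3: {hit_rate_3:.2%}")
```

In the second toy run the target is still found, but a noise chunk ranked above it halves its reciprocal rank and removes it from Hit Rate@1.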





These results make sense. With noise added, the simple system without reranking dropped 6% on the top-1 metric. The progression from top 1 to top 10 is also slower with noise, but the difference at top 10 with and without noise is only 1%. In other words, the added noise muddied the waters just enough to lower top 1 and top 3, but if we retrieve enough documents we will almost always get the target paragraph.<br>
Comparing the reranker systems with and without noise, the difference is still noticeable at top 1, but at top 3 the decrease is negligible. This makes sense: as mentioned above, with this amount of noise the right document still ends up among the 10 retrieved, so a sufficiently effective reranker can cut through the noise and bring the target paragraph to the top.
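
The identical Hit Rate@10 of 95.71% in both runs above is consistent with the reranker only reordering the candidates handed to it by the base retriever: it can move the target up, but it cannot recover a paragraph the base retriever missed. Below is a toy sketch of that behaviour. It assumes a stand-in scoring function, `toy_score` (plain word overlap), in place of the real cross-encoder, and the query and candidate texts are hypothetical.

```python
def toy_score(query, text):
    """Stand-in for a cross-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def rerank(query, candidates, top_n):
    """Reorder candidates by score (highest first) and keep the top_n."""
    ranked = sorted(candidates, key=lambda c: toy_score(query, c), reverse=True)
    return ranked[:top_n]


query = "what does a compiler translate"

# Candidates as a base retriever might return them: target last, noise first
candidates = [
    "flash memory stores data without power",
    "an operating system manages hardware",
    "a compiler can translate source code into machine code",
]

reranked = rerank(query, candidates, top_n=3)
print(reranked[0])  # the compiler sentence moves from position 3 to position 1

# A relevant text that was never retrieved cannot appear after reranking
not_retrieved = "compiler front ends translate source text into tokens"
print(not_retrieved in reranked)  # False
```

This is why the reranker helps most at top 1 and top 3: its ceiling is whatever the base retriever already managed to place among its candidates.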