# Overview

This notebook implements a Contextual RAG system that combines vector-based similarity search with keyword-based BM25 retrieval. The approach aims to leverage the strengths of both methods to improve the overall quality and relevance of document retrieval.

# Motivation
Traditional retrieval methods often rely on either semantic understanding (vector-based) or keyword matching (BM25). Each approach has its strengths and weaknesses. Fusion retrieval aims to combine these methods to create a more robust and accurate retrieval system that can handle a wider range of queries effectively. The aim of this notebook is to compare a Contextual Retrieval implementation with a "simple/traditional" implementation.

# Key Components
 - "m-ric/huggingface_doc_qa_eval" Hugging face dataset
 - Pinecone Vector store for embedding storage
 - OpenAI embeddings
 - OpenAI summary model and generation model (Can be any other model)
 - BM25 index creation for keyword-based retrieval
 - Custom fusion retrieval function that combines both methods

# Method Details

## Document Preprocessing
1. The dataset "m-ric/huggingface_doc_qa_eval" is loaded and filtered to keep only high-quality question/answer pairs (standalone_score >= 4)
2. Documents are split into chunks using RecursiveCharacterTextSplitter with:
   - Chunk size: 800 characters
   - Overlap: 200 characters
   - Custom markdown separators to maintain document structure
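
The effect of the chunk size and overlap settings can be illustrated with a minimal sketch. It assumes a plain character window rather than the separator-aware logic of RecursiveCharacterTextSplitter, and the function name `split_with_overlap` is hypothetical (it does not appear in the notebook).

```python
def split_with_overlap(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# Toy input: 2000 characters, so three overlapping windows are expected.
sample = "a" * 2000
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])
```

The real splitter additionally prefers to cut at the markdown separators, so its chunks are usually shorter than the 800-character limit.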

## Document Contextualization
1. Each chunk is enriched with contextual information using an OpenAI GPT model (gpt-3.5-turbo in this notebook):
   - A prompt template guides the model to analyze how each chunk relates to its parent document
   - Generated context is concise (3-4 sentences) and captures the chunk's role within the broader document
   - The context is prepended to the chunk text for enhanced retrieval
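
The prepending step can be sketched as follows. The separator matches the one used later in the notebook; the helper name `contextualize` and the fallback for a missing context are assumptions made for illustration (the notebook itself skips chunks that have no context).

```python
from typing import Optional


def contextualize(chunk: str, context: Optional[str]) -> str:
    """Prepend generated context to a chunk, using the notebook's separator."""
    if not context:
        # Assumption: fall back to the bare chunk so list positions stay aligned.
        return chunk
    return f"{context} \n\n {chunk}"


example = contextualize(
    chunk="Use the Convert Space to convert weights.",
    context="Addresses converting PyTorch weights to safetensors.",
)
print(example)
```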

## Vector Store Creation
1. OpenAI embeddings (text-embedding-3-small model) are used to create vector representations of:
   - Regular chunks (without context)
   - Contextualized chunks (with prepended context)
2. Two separate Pinecone vector stores (ServerlessSpec) are created:
   - One for regular chunks
   - One for contextualized chunks
   - Both use cosine similarity metric and 1536 dimensions
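
For readers unfamiliar with the metric, here is a small pure-Python sketch of cosine similarity. The vectors are toy values with 3 dimensions instead of 1536, and `cosine_sim` is a hypothetical helper; Pinecone computes this internally.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # same direction, different length
doc_orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

print(cosine_sim(query, doc_same_direction))
print(cosine_sim(query, doc_orthogonal))
```

Because only the direction matters, the longer vector scores as high as an identical one, while the orthogonal vector scores zero.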

## BM25 Index Creation
1. Two BM25Okapi indexes are created using NLTK word tokenization:
   - One for regular chunks
   - One for contextualized chunks
2. This enables keyword-based retrieval alongside vector-based methods
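
To make the keyword side concrete, the sketch below scores a tiny corpus with a BM25 formula. It is an illustration under stated assumptions, not the rank_bm25 implementation: it uses whitespace tokenization instead of NLTK, the smoothed IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))`, and commonly used values for `k1` and `b`. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query tokens."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    "convert weights to safetensors".split(),
    "paper pages on the hub".split(),
    "datasets server api healthcheck".split(),
]
scores = bm25_scores("safetensors weights".split(), corpus)
print(scores)
```

Only the first document contains the query terms, so it is the only one with a non-zero score.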

## Fusion Retrieval Function
The fusion_rank_search function combines multiple retrieval approaches:

1. Initial Retrieval:
   - Performs both vector-based (Pinecone) and BM25-based retrieval
   - Gets top-k (default 20) results from each method
   - Normalizes scores from both methods to a common scale (0-1)

2. Score Combination:
   - Weighted combination using the weight_sparse (alpha) parameter
   - Aggregates scores for documents appearing in both result sets
   - Normalizes combined scores by the number of methods that retrieved each document

3. Reranking:
   - Uses BAAI/bge-reranker-v2-m3 model to rerank the combined results; in the notebook this step runs in evaluate_rag_system on the output of fusion_rank_search
   - Query-document pairs are scored by the reranker
   - Final ranking is based on reranker scores

4. Returns the top-k (default 5) documents after reranking
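
The normalization and score combination steps can be sketched in plain Python. This is a simplified illustration, not the notebook's function: `min_max` and `fuse` are hypothetical names, scores are normalized within each candidate set, and ties in the input are mapped to 1.0 to avoid dividing by zero.

```python
def min_max(scores: dict[int, float]) -> dict[int, float]:
    """Rescale scores to the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: when all scores are equal, treat them all as 1.0.
        return {idx: 1.0 for idx in scores}
    return {idx: (s - lo) / (hi - lo) for idx, s in scores.items()}


def fuse(sparse: dict[int, float], dense: dict[int, float],
         weight_sparse: float = 0.1, k: int = 5) -> list[tuple[int, float]]:
    """Weighted fusion of sparse (BM25) and dense (vector) scores per chunk id."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    fused = {}
    for idx in set(sparse_n) | set(dense_n):
        parts = []
        if idx in sparse_n:
            parts.append(weight_sparse * sparse_n[idx])
        if idx in dense_n:
            parts.append((1 - weight_sparse) * dense_n[idx])
        # Divide by the number of methods that retrieved this chunk.
        fused[idx] = sum(parts) / len(parts)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy scores keyed by chunk id.
ranked = fuse(sparse={0: 12.0, 1: 3.0, 2: 0.0}, dense={1: 0.9, 3: 0.5, 0: 0.7})
print(ranked)
```

With `weight_sparse=0.1` the dense score dominates, so chunk 1 (the best vector match) ranks above chunk 0 (the best BM25 match).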

## Evaluation
BERTScore metrics are used to compare the effectiveness of the regular and contextualized retrieval approaches.

# Benefits of this Approach
1. Improved Retrieval Quality: By combining semantic and keyword-based search, the system can capture both conceptual similarity and exact keyword matches.
2. Flexibility: The alpha parameter allows for adjusting the balance between vector and keyword search based on specific use cases or query types.
3. Robustness: The combined approach can handle a wider range of queries effectively, mitigating weaknesses of individual methods.
4. Customizability: The system can be easily adapted to use different vector stores or keyword-based retrieval methods.

# Conclusion
Fusion retrieval represents a powerful approach to document search that combines the strengths of semantic understanding and keyword matching. By leveraging both vector-based and BM25 retrieval methods, it offers a more comprehensive and flexible solution for information retrieval tasks. This approach has potential applications in various fields where both conceptual similarity and keyword relevance are important, such as academic research, legal document search, or general-purpose search engines.
Averaged results show slightly better performance for contextual retrieval than for regular retrieval. Several parameters can be tuned (chunk size, chunk overlap, alpha for the fusion score calculation) and affect the final result.


In [None]:
# !pip install sentence_transformers -qU
!pip install rank_bm25 -qU
!pip install datasets -qU
!pip install pinecone[grpc] -qU
!pip install langchain_core -qU
!pip install langchain -qU
!pip install langchain_groq -qU
!pip install langchain-google-genai -qU
!pip install langchain-openai  -qU
# ==0.2.9
!pip install bert-score  -qU

# Importing libraries

In [None]:
import numpy as np
import nltk
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity
from datasets import load_dataset
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
import pinecone
import pandas as pd # for dataframe
import getpass
from google.colab import userdata
import os

In [None]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

# Loading dataset

In [None]:
# Load dataset from Hugging Face
dataset = load_dataset("m-ric/huggingface_doc_qa_eval")

In [None]:
df = pd.DataFrame(dataset['train'])
print(df.head())

                                             context  \
0   `tokenizers-linux-x64-musl`\n\nThis is the **...   
1  !--Copyright 2023 The HuggingFace Team. All ri...   
2   Paper Pages\n\nPaper pages allow people to fi...   
3   Datasets server API\n\n> API on 🤗 datasets\n\...   
4  !--Copyright 2022 The HuggingFace Team. All ri...   

                                            question  \
0  What architecture is the `tokenizers-linux-x64...   
1  What is the purpose of the BLIP-Diffusion mode...   
2  How can a user claim authorship of a paper on ...   
3  What is the purpose of the /healthcheck endpoi...   
4  What is the default context window size for Lo...   

                                              answer  \
0                          x86_64-unknown-linux-musl   
1  The BLIP-Diffusion model is designed for contr...   
2  By clicking their name on the corresponding Pa...   
3                          Ensure the app is running   
4                                         127 

## **Taking only best question/answer pairs**

In [None]:
best_answers_df = df[df['standalone_score'] >= 4]
print(best_answers_df.head())

                                             context  \
0   `tokenizers-linux-x64-musl`\n\nThis is the **...   
1  !--Copyright 2023 The HuggingFace Team. All ri...   
2   Paper Pages\n\nPaper pages allow people to fi...   
3   Datasets server API\n\n> API on 🤗 datasets\n\...   
4  !--Copyright 2022 The HuggingFace Team. All ri...   

                                            question  \
0  What architecture is the `tokenizers-linux-x64...   
1  What is the purpose of the BLIP-Diffusion mode...   
2  How can a user claim authorship of a paper on ...   
3  What is the purpose of the /healthcheck endpoi...   
4  What is the default context window size for Lo...   

                                              answer  \
0                          x86_64-unknown-linux-musl   
1  The BLIP-Diffusion model is designed for contr...   
2  By clicking their name on the corresponding Pa...   
3                          Ensure the app is running   
4                                         127 

In [None]:
best_answers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   context            65 non-null     object
 1   question           65 non-null     object
 2   answer             65 non-null     object
 3   source_doc         65 non-null     object
 4   standalone_score   65 non-null     int64 
 5   standalone_eval    65 non-null     object
 6   relatedness_score  65 non-null     int64 
 7   relatedness_eval   65 non-null     object
 8   relevance_score    65 non-null     int64 
 9   relevance_eval     65 non-null     object
dtypes: int64(3), object(7)
memory usage: 5.2+ KB


# **Logging into Hugging Face**

In [None]:
from datasets import Dataset
from huggingface_hub import login


hf_token = userdata.get("HuggingFace")
if not hf_token:
  # Login to Hugging Face (you'll need your token)
  hf_token = input("Please enter your Hugging Face token: ")
login(hf_token)


# **Saving best_answers_df to Hugging Face to prevent change**

In [None]:
best_answers_ds = Dataset.from_pandas(best_answers_df)
# Push to Hugging Face Hub
best_answers_ds.push_to_hub(
    "AIEnthusiast369/hf_doc_qa_eval_best_answers",
    private=False
)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/AIEnthusiast369/hf_doc_qa_eval_best_answers/commit/a392db43b5c76874b43811b2b4e75b39f47be7d3', commit_message='Upload dataset', commit_description='', oid='a392db43b5c76874b43811b2b4e75b39f47be7d3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/AIEnthusiast369/hf_doc_qa_eval_best_answers', endpoint='https://huggingface.co', repo_type='dataset', repo_id='AIEnthusiast369/hf_doc_qa_eval_best_answers'), pr_revision=None, pr_num=None)

# Extract contexts from the dataset

In [None]:
texts = best_answers_df['context'].tolist()

# **Setting up Embedding model**

## **sentence-transformers**

In [None]:
# # load ' sentence-transformers/all-MiniLM-L6-v2' embedding model from Hugging Face
# from transformers import AutoTokenizer, AutoModel
# model_name = 'sentence-transformers/all-MiniLM-L6-v2'
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# max_seq_length = tokenizer.model_max_length
# embedding_model = AutoModel.from_pretrained(model_name)


## **openai**

In [None]:
openai_api_key = userdata.get("OPENAI_API_KEY")
if not openai_api_key:
  openai_api_key = getpass.getpass("Please enter your OPENAI API KEY: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

In [None]:
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

max_seq_length = embedding_model.embedding_ctx_length
# index_dimensions = embedding_model.dimensions
index_dimensions = 1536 # default setting of text-embedding-3-small
print(f'max_seq_length:{max_seq_length}, index_dimensions:{index_dimensions}')

max_seq_length:8191, index_dimensions:1536


# Defining text splitter

### openai

In [None]:
MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]
# Use RecursiveCharacterTextSplitter to split documents into chunks
chunk_overlap = 200
chunk_size = 800
print('chunk_size',chunk_size)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=MARKDOWN_SEPARATORS,
)

chunk_size 800


# **Defining ProcessedDocument & Chunk**

In [None]:
class Chunk:
    def __init__(self, text: str):
        self.text = text
        self.context = None

class ProcessedDocument:
    def __init__(self, text: str, chunks: list[Chunk]):
        self.text = text
        self.chunks = chunks


In [None]:
docs_processed: list[ProcessedDocument] = []
for text in texts:
    # text = doc.page_content  # Extract the text content from the Document
    chunks = text_splitter.split_text(text)  # Split the text into chunks (strings)
    print(f"Number of chunks for document #{len(docs_processed)}: {len(chunks)}")
    processed_doc = ProcessedDocument(
        text,
        [Chunk(chunk_text) for chunk_text in chunks]
    )
    docs_processed.append(processed_doc)
print(f"Number of Processed document: {len(docs_processed)}")

Number of chunks for document #0: 1
Number of chunks for document #1: 6
Number of chunks for document #2: 5
Number of chunks for document #3: 2
Number of chunks for document #4: 12
Number of chunks for document #5: 5
Number of chunks for document #6: 29
Number of chunks for document #7: 2
Number of chunks for document #8: 40
Number of chunks for document #9: 26
Number of chunks for document #10: 5
Number of chunks for document #11: 3
Number of chunks for document #12: 16
Number of chunks for document #13: 3
Number of chunks for document #14: 7
Number of chunks for document #15: 1
Number of chunks for document #16: 22
Number of chunks for document #17: 2
Number of chunks for document #18: 20
Number of chunks for document #19: 27
Number of chunks for document #20: 24
Number of chunks for document #21: 20
Number of chunks for document #22: 40
Number of chunks for document #23: 16
Number of chunks for document #24: 1
Number of chunks for document #25: 2
Number of chunks for document #26: 1

In [None]:
# Count total chunks
total_chunks = sum(len(doc.chunks) for doc in docs_processed)
print(f"Total number of chunks across all documents: {total_chunks}")

Total number of chunks across all documents: 882


# **Define summary chain**

In [None]:
from langchain.prompts import PromptTemplate
from google.colab import userdata

### **OpenAI**

In [None]:
from langchain_openai import ChatOpenAI


model_chat_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model=model_chat_name)
sum_provider = 'OPENAI'

In [None]:
prompt_template = ChatPromptTemplate.from_messages([
    ("system",
            """You are an AI assistant specializing in document summarization and contextualization. Your task is to provide brief, relevant context for a specific chunk of text based on a larger document. Here's how to proceed:
"""),
    ("human", """
First, carefully read and analyze the following document:

<document>
{document}
</document>

Now, consider this specific chunk of text from the document:

<chunk>
{chunk}
</chunk>

Your goal is to provide a concise context for this chunk, situating it within the whole document. Follow these guidelines:

1. Analyze how the chunk relates to the overall document's themes, arguments, or narrative.
2. Identify the chunk's role or significance within the broader context of the document.
3. Determine what information from the rest of the document is most relevant to understanding this chunk.

Compose your response as follows:
- Provide 3-4 sentences maximum of context.
- Begin directly with the context, without any introductory phrases.
- Use language like "Focuses on..." or "Addresses..." to describe the chunk's content.
- Ensure the context would be helpful for improving search retrieval of the chunk.

Important notes:
- Do not use phrases like "this chunk" or "this section" in your response.
- Do not repeat the chunk's content verbatim; provide context from the rest of the document.
- Avoid unnecessary details; be succinct and relevant.
- Do not include any additional commentary or meta-discussion about the task itself.

 Remember, your goal is to provide clear, concise, and relevant context that situates the given chunk within the larger document.
            """
     )
])


In [None]:
def create_context_chain(llm):
    return prompt_template | llm

context_chain = create_context_chain(llm)

In [None]:
def get_context(text: str, chunk: str) -> str:
    if len(chunk.strip()) <= 0 or len(text.strip()) <= 0:
        print(f"Chunk or text is empty")
        raise Exception("Chunk or text is empty")
    context= context_chain.invoke({"document": text, "chunk": chunk})
    return context.content

In [None]:
def generate_context(docs_processed: list[ProcessedDocument]):
    for i, doc in enumerate(docs_processed):
        print(f'processing document index {i}')
        for chunk in doc.chunks:
            # print(chunk.text)
            context: str = get_context(text= doc.text, chunk= chunk.text)
            chunk.context = context
            # print(f"chunk with context: Context: \n\n {chunk.context} \n\n Chunk: {chunk.text}")

# **Testing chain**

In [None]:
page = """
 Convert weights to safetensors

PyTorch model weights are commonly saved and stored as `.bin` files with Python's [`pickle`](https://docs.python.org/3/library/pickle.html) utility. To save and store your model weights in the more secure `safetensor` format, we recommend converting your weights to `.safetensors`.
The easiest way to convert your model weights is to use the [Convert Space](https://huggingface.co/spaces/diffusers/convert), given your model weights are already stored on the Hub. The Convert Space downloads the pickled weights, converts them, and opens a Pull Request to upload the newly converted `.safetensors` file to your repository.
<Tip warning={true}>
For larger models, the Space may be a bit slower because its resources are tied up in converting other models. You can also try running the [convert.py](https://github.com/huggingface/safetensors/blob/main/bindings/python/convert.py) script (this is what the Space is running) locally to convert your weights.
Feel free to ping [@Narsil](https://huggingface.co/Narsil) for any issues with the Space.
</Tip>
"""
chunk = """
Convert weights to safetensors
PyTorch model weights are commonly saved and stored as `.bin` files with Python's [`pickle`](https://docs.python.org/3/library/pickle.html) utility. To save and store your model weights in the more secure `safetensor` format, we recommend converting your weights to `.safetensors`.
The easiest way to convert your model weights is to use the [Convert Space](https://huggingface.co/spaces/diffusers/convert), given your model weights are already stored on the Hub. The Convert Space downloads the pickled weights, converts them, and opens a Pull Request to upload the newly converted `.safetensors` file to your repository.
<Tip warning={true}>
For larger models, the Space may be a bit slower because its resources are tied up in converting other models. You can also try running the [convert.py](https://github.com/huggingface/safetensors/blob/main/bindings/python/convert.py) script (this is what the Space is running) locally to convert your weights.
Feel free to ping [@Narsil](https://huggingface.co/Narsil) for any issues with the Space.
</Tip>
"""

In [None]:
test_context = get_context(text = page, chunk=chunk)

In [None]:
print(test_context)

The document discusses converting PyTorch model weights saved as `.bin` files with `pickle` to a more secure `safetensor` format by using the Convert Space tool or running a conversion script locally. It emphasizes the importance of converting weights to `.safetensors` for security reasons. Additionally, the document provides a tip about potential delays in using the Convert Space tool due to resource constraints and offers an alternative method for conversion.


In [None]:
# temp_docs = docs_processed[1:2]
# generate_context(temp_docs)
generate_context(docs_processed)

processing document index 0
processing document index 1
processing document index 2
processing document index 3
processing document index 4
processing document index 5
processing document index 6
processing document index 7
processing document index 8
processing document index 9
processing document index 10
processing document index 11
processing document index 12
processing document index 13
processing document index 14
processing document index 15
processing document index 16
processing document index 17
processing document index 18
processing document index 19
processing document index 20
processing document index 21
processing document index 22
processing document index 23
processing document index 24
processing document index 25
processing document index 26
processing document index 27
processing document index 28
processing document index 29
processing document index 30
processing document index 31
processing document index 32
processing document index 33
processing document inde

## Save processed documents to file

In [None]:
import joblib
from datetime import datetime
from google.colab import files
import glob
import os

def save_download_object(object, filename):
    joblib.dump(object, filename)
    print(f"Saved object to {filename}")
    files.download(filename)
    print(f"Downloaded {filename}")

def create_timestamp() -> str:
    return datetime.now().strftime("%Y%m%d_%H%M%S")


def create_filename_timestamp(filename, extension = "joblib") -> str:
    timestamp = create_timestamp()
    return f"{filename}_{timestamp}.{extension}"

In [None]:
chunk_texts = []
document_texts = []
contexts = []

# Extract data from docs_processed
for doc in docs_processed:
    for chunk in doc.chunks:
        chunk_texts.append(chunk.text)
        contexts.append(chunk.context)
        document_texts.append(doc.text)

# Create dictionary for dataset
dataset_dict = {
    'chunk': chunk_texts,
    'document': document_texts,
    'context': contexts
}

# **Saving Context + Chunks to dataset**

In [None]:
# Convert to Hugging Face Dataset
dataset = Dataset.from_dict(dataset_dict)

# Push to Hugging Face Hub
dataset.push_to_hub(
    f"AIEnthusiast369/hf_doc_qa_eval_chunk_size_{chunk_size}_open_ai",  # Replace with your username and desired dataset name
    private=False  # Set to False if you want it public
)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/AIEnthusiast369/hf_doc_qa_eval_chunk_size_800_open_ai/commit/7be3854af236da891ed8ecbd7299e0c9f0a3299a', commit_message='Upload dataset', commit_description='', oid='7be3854af236da891ed8ecbd7299e0c9f0a3299a', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/AIEnthusiast369/hf_doc_qa_eval_chunk_size_800_open_ai', endpoint='https://huggingface.co', repo_type='dataset', repo_id='AIEnthusiast369/hf_doc_qa_eval_chunk_size_800_open_ai'), pr_revision=None, pr_num=None)

# **Loading chunks with context dataset**
*You need to run this only if the notebook timed out and you lost state*

In [None]:
chunked_dataset = load_dataset("AIEnthusiast369/hf_doc_qa_eval_chunk_size_800_open_ai")
chunks_from_ds=True

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/348 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/659k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/882 [00:00<?, ? examples/s]

In [None]:
if chunks_from_ds:
   best_answers_ds = load_dataset("AIEnthusiast369/hf_doc_qa_eval_best_answers", split="train")
   best_answers_df = best_answers_ds.to_pandas()


README.md:   0%|          | 0.00/647 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/289k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/65 [00:00<?, ? examples/s]

# **Creating contextualized chunks**

In [None]:
chunks_with_context = []
chunks_regular=[]

if chunks_from_ds:
  chuncked_ds = chunked_dataset['train']
  for i in range(len(chuncked_ds)):
      row = chuncked_ds[i]
      chunk = row['chunk']
      chunks_regular.append(chunk)
      context = row['context']
      if context:
              chunks_with_context.append(
                f"{context} \n\n {chunk}"
              )
else:
  for doc in docs_processed:
      for chunk in doc.chunks:
          chunks_regular.append(chunk.text)
          if chunk.context:  # Only include chunks that have a context
              chunks_with_context.append(
                f"{chunk.context} \n\n {chunk.text}"
              )
print(f'Len of regular chunks: {len(chunks_regular)}')
print(f'Len of chunks with context: {len(chunks_with_context)}')

Len of regular chunks: 882
Len of chunks with context: 882


# **Setting up Indexes**

In [None]:
def create_bm25(chunks: list[str]):
    print("Creating BM25 model...")
    tokenized_chunks = [nltk.word_tokenize(chunk) for chunk in chunks]
    bm25 = BM25Okapi(tokenized_chunks)

    return bm25

In [None]:
from pinecone import Pinecone, ServerlessSpec

pinecone_api_key = userdata.get("PINECONE_API_KEY")
if not pinecone_api_key:
  pinecone_api_key = input("Please enter your PINECONE API KEY: ")

spec=ServerlessSpec(
    cloud="aws",
    region="us-east-1"
  )

EMBEDDING_INDEX_CONTEXTUAL: str = "test-rag-openai-contextual"
EMBEDDING_INDEX_REGULAR: str = "test-rag-openai-regular"

pc = Pinecone(api_key=pinecone_api_key)

In [None]:
from typing import Any, List
from time import sleep

def wait_for_index(index_name):
    while True:
        desc = pc.describe_index(index_name)
        if desc['ready']:
            print("Index is ready!")
            break
        sleep(5)

def create_pinecone_indexes(pinecone, embedding_model, index_name: str, chunks: list[str], specs: ServerlessSpec, dimensions, index_names: List[str]) -> Any:

    if index_name not in index_names:
        pc.create_index(index_name, dimension=dimensions, metric="cosine", spec=specs)
        wait_for_index(index_name)

    # Connect to Pinecone indexes
    embedding_index = pc.Index(index_name)


    # Semantic Embeddings using a Pre-trained Transformer Model
    embeddings = embedding_model.embed_documents(chunks)
    # Store embeddings in Pinecone
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        embedding_index.upsert([(str(i), embedding, {"text": chunk})])

    print(f'len(embeddings)={len(embeddings)}, len(embeddings[0])={len(embeddings[0])}')
    return embedding_index
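
The function above upserts one vector per request. Since `upsert` takes a list of vectors, the records could instead be grouped into batches. The sketch below only builds the batches from toy data; the helper name `batched_vectors` and the batch size of 100 are assumptions, and each yielded batch would be passed to `embedding_index.upsert(...)` in the notebook.

```python
from typing import Iterator


def batched_vectors(chunks: list[str], embeddings: list[list[float]],
                    batch_size: int = 100) -> Iterator[list[tuple[str, list[float], dict]]]:
    """Yield lists of (id, embedding, metadata) tuples, at most batch_size per list."""
    batch = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        batch.append((str(i), embedding, {"text": chunk}))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


# Toy data standing in for real chunks and embeddings.
toy_chunks = [f"chunk {i}" for i in range(250)]
toy_embeddings = [[0.0, 1.0] for _ in toy_chunks]
batches = list(batched_vectors(toy_chunks, toy_embeddings))
print([len(b) for b in batches])
```

Ids stay identical to the one-at-a-time version (the chunk's position as a string), so lookups by index in the retrieval code would be unaffected.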


# **Creating Indexes**

In [None]:
# Names of the indexes that already exist in this Pinecone project
index_names = pc.list_indexes().names()
if not pc.has_index(EMBEDDING_INDEX_CONTEXTUAL):
   create_pinecone_indexes(pc, embedding_model, EMBEDDING_INDEX_CONTEXTUAL, chunks_with_context, spec, 1536, index_names)
if not pc.has_index(EMBEDDING_INDEX_REGULAR):
   create_pinecone_indexes(pc, embedding_model, EMBEDDING_INDEX_REGULAR, chunks_regular, spec, 1536, index_names)
bm25_regular = create_bm25(chunks_regular)
bm25_contextual = create_bm25(chunks_with_context)

Creating BM25 model...
Creating BM25 model...


# **Defining Reranker**

### **Hugging Face**

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RERANKER_MODEL = 'BAAI/bge-reranker-v2-m3'
tokenizer = AutoTokenizer.from_pretrained(RERANKER_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(RERANKER_MODEL)
model.eval()

def get_reranker_score(pairs):
    with torch.no_grad():
        inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
        scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
        print(f'reranker scores {scores}')
        return scores
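
`get_reranker_score` returns raw logits, which can be negative or larger than 1. If scores in the 0-1 range are preferred, a sigmoid can be applied; it is monotonic, so the ranking does not change. The sketch below uses toy logits, and `rank_by_score` is a hypothetical helper rather than part of the notebook.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rank_by_score(texts: list[str], logits: list[float]) -> list[tuple[str, float]]:
    """Pair each text with its sigmoid score and sort from most to least relevant."""
    scored = [(text, sigmoid(logit)) for text, logit in zip(texts, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy logits standing in for reranker output.
toy_texts = ["chunk A", "chunk B", "chunk C"]
toy_logits = [-2.3, 4.1, 0.0]
ranked = rank_by_score(toy_texts, toy_logits)
print(ranked)
```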


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/795 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.27G [00:00<?, ?B/s]

# **Fusion Rank Search**

In [None]:
from collections import defaultdict
def fusion_rank_search(
    query: str,
    bm25,
    chunks: list[str],
    model,
    embedding_index,
    weight_sparse: float,
    k: int = 5,
    reranker_cutoff: int = 20  # Number of top results to rerank
):
    # Get BM25 results
    tokenized_query = nltk.word_tokenize(query)
    bm25_scores = np.array(bm25.get_scores(tokenized_query))  # Already numpy array
    bm25_top_indices = np.argsort(bm25_scores)[::-1][:reranker_cutoff]

    # Get dense results using OpenAI embeddings
    query_embedding = model.embed_query(query)

    # Query Pinecone index
    dense_results = embedding_index.query(
        vector=query_embedding,
        top_k=reranker_cutoff,
        include_values=True
    )

    # Extract scores and indices from Pinecone results and convert to numpy arrays
    dense_scores = np.array([match['score'] for match in dense_results['matches']])
    dense_indices = np.array([int(match['id']) for match in dense_results['matches']])

    # Normalize scores (now all operations use numpy)
    bm25_scores_norm = (bm25_scores[bm25_top_indices] - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores))
    dense_scores_norm = (dense_scores - np.min(dense_scores)) / (np.max(dense_scores) - np.min(dense_scores))

    # Create combined results
    combined_results = {}

    # Add BM25 results
    for idx, score in zip(bm25_top_indices, bm25_scores_norm):
        combined_results[idx] = {'score': weight_sparse * score, 'count': 1}

    # Add dense results
    for idx, score in zip(dense_indices, dense_scores_norm):
        if idx in combined_results:
            combined_results[idx]['score'] += (1 - weight_sparse) * score
            combined_results[idx]['count'] += 1
        else:
            combined_results[idx] = {'score': (1 - weight_sparse) * score, 'count': 1}

    # Calculate final scores
    for idx in combined_results:
        combined_results[idx]['final_score'] = combined_results[idx]['score'] / combined_results[idx]['count']

    # Sort by final score
    sorted_results = sorted(combined_results.items(), key=lambda x: x[1]['final_score'], reverse=True)

    # Return top k results with their chunks
    final_results = []
    for idx, scores in sorted_results[:k]:
        final_results.append({
            'id': str(idx),
            'score': scores['final_score'],
            'metadata': {'text': chunks[idx]}
        })

    return final_results


# **Evaluate Rag**

In [None]:
from tqdm import tqdm
import pandas as pd
import bert_score # Import bert_score

def evaluate_rag_system(
    best_answers_df: pd.DataFrame,
    bm25,
    chunks: list[str],
    embedding_model,
    embedding_index,
    generate_amswer,
    weight_sparse: float,
    n_samples: int = None,  # Optional: limit number of samples for testing
    reranker_cutoff: int = 20
):


    # Initialize results storage
    results = []

    # Get subset of dataframe if n_samples is specified
    eval_df = best_answers_df.head(n_samples) if n_samples else best_answers_df

    # Lists to store all references and candidates for batch BERTScore computation
    all_references = []
    all_candidates = []

    # Iterate through questions and answers
    for idx, row in tqdm(eval_df.iterrows(), total=len(eval_df), desc="Evaluating Questions"):
        query = row['question']
        reference_answer = row['answer']

        try:
            # Get relevant context using fusion ranking
            retrieved_results = fusion_rank_search(
                query=query,
                bm25=bm25,
                chunks=chunks,
                model=embedding_model,
                embedding_index=embedding_index,
                k=5,
                weight_sparse=weight_sparse,  # use the caller's weight instead of a hard-coded 0.1
                reranker_cutoff=reranker_cutoff
            )

            # Prepare pairs for reranking
            pairs = [(query, result['metadata']['text']) for result in retrieved_results]

            # Get reranker scores - use them directly for final ranking
            rerank_scores = get_reranker_score(pairs)

            # Update results with reranker scores
            for result, rerank_score in zip(retrieved_results, rerank_scores):
                result['metadata']['rerank_score'] = float(rerank_score)
                # Use reranker score as the final score
                result['score'] = float(rerank_score)

            # Resort based on reranker scores
            retrieved_results.sort(key=lambda x: x['score'], reverse=True)

            # Prepare context for LLM
            context = "\n".join([res['metadata']['text'] for res in retrieved_results])

            # Generate answer using LLM
            generated_answer = generate_amswer(context, query)

            # Store answers for batch BERTScore computation
            all_references.append(reference_answer)
            all_candidates.append(generated_answer)

            # Store intermediate results
            result = {
                'question': query,
                'reference_answer': reference_answer,
                'generated_answer': generated_answer,
                'retrieved_contexts': [res['metadata']['text'] for res in retrieved_results],
                'context_scores': [res['score'] for res in retrieved_results]
            }
            results.append(result)

        except Exception as e:
            print(f"Error processing question {idx}: {str(e)}")
            continue

    # Calculate BERTScore for all pairs at once
    P, R, F1 = bert_score.score(
        all_candidates,
        all_references,
        lang="en",
        verbose=True,
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )

    # Add BERTScore metrics to results
    for idx, (p, r, f1) in enumerate(zip(P, R, F1)):
        results[idx].update({
            'bertscore_precision': p.item(),
            'bertscore_recall': r.item(),
            'bertscore_f1': f1.item()
        })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    # Calculate and print average scores
    avg_scores = {
        'Average BERTScore Precision': results_df['bertscore_precision'].mean(),
        'Average BERTScore Recall': results_df['bertscore_recall'].mean(),
        'Average BERTScore F1': results_df['bertscore_f1'].mean()
    }

    return results_df, avg_scores
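
The reranking step inside the loop above (score each query/chunk pair, overwrite the fusion score, re-sort, join the context) can be tried without loading a model. This is a minimal sketch: `rerank` is a hypothetical helper, and `overlap_scorer` is a toy stand-in for `get_reranker_score`, which in the notebook is backed by a real reranker model.

```python
def rerank(query, results, score_pairs):
    """Re-score retrieved results with `score_pairs` and sort them best-first."""
    pairs = [(query, res["metadata"]["text"]) for res in results]
    scores = score_pairs(pairs)
    for res, score in zip(results, scores):
        res["metadata"]["rerank_score"] = float(score)
        res["score"] = float(score)  # the reranker score replaces the fusion score
    return sorted(results, key=lambda res: res["score"], reverse=True)


def overlap_scorer(pairs):
    # Hypothetical scorer: counts the words shared by the query and the chunk.
    return [
        len(set(q.lower().split()) & set(text.lower().split()))
        for q, text in pairs
    ]


results = [
    {"id": "0", "score": 0.9, "metadata": {"text": "Pinecone stores vectors"}},
    {"id": "1", "score": 0.4, "metadata": {"text": "BM25 ranks documents by keyword overlap"}},
]
reranked = rerank("how does BM25 rank documents", results, overlap_scorer)

# The context passed to the LLM is the reranked chunk text, joined by newlines.
context = "\n".join(res["metadata"]["text"] for res in reranked)
print(context)
```

The chunk order in `context` follows the reranker, not the original fusion score, so the second chunk moves to the front here.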

In [None]:
def print_evaluation_results(results_df, avg_scores):
    print("\nAverage Scores:")
    for metric, score in avg_scores.items():
        print(f"{metric}: {score:.4f}")

    print("\nDetailed Results Sample (first 3):")
    for idx, row in results_df.head(3).iterrows():
        print("\nQuestion:", row['question'])
        print("Reference Answer:", row['reference_answer'])
        print("Generated Answer:", row['generated_answer'])
        print(f"BERTScore Precision: {row['bertscore_precision']:.4f}")
        print(f"BERTScore Recall: {row['bertscore_recall']:.4f}")
        print(f"BERTScore F1: {row['bertscore_f1']:.4f}")
        # print("\nRetrieved Contexts:")
        # for context, score in zip(row['retrieved_contexts'], row['context_scores']):
        #     print(f"Score: {score:.4f}")
        #     print(f"Context: {context[:200]}...")

# **Compare RAG Evaluations**

In [None]:
from typing import Tuple

def compare_rag_evaluations(best_answers_df: pd.DataFrame,
                          set1_params: dict,
                          set2_params: dict,
                          generate_amswer,
                          weight_sparse: float,
                          n_samples: int = None) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
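    Compare RAG evaluation results between two parameter sets.

    Args:
        best_answers_df: DataFrame with questions and reference answers
        set1_params: Dictionary with 'bm25', 'chunks', 'embedding_model' and
            'embedding_index' for the first evaluation (contextual chunks)
        set2_params: Dictionary with the same keys for the second evaluation
            (regular chunks)
        generate_amswer: Callable taking (context, query) and returning the
            generated answer as a string
        weight_sparse: Weight given to the sparse (BM25) scores in fusion ranking
        n_samples: Optional number of samples to evaluate

    Returns:
        Tuple of (comparison DataFrame, results for set 1, results for set 2)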
    """
    # Run evaluations for both sets
    results1_df, avg_scores1 = evaluate_rag_system(
        best_answers_df=best_answers_df,
        weight_sparse=weight_sparse,
        bm25=set1_params['bm25'],
        chunks=set1_params['chunks'],
        embedding_model=set1_params['embedding_model'],
        embedding_index=set1_params['embedding_index'],
        generate_amswer=generate_amswer,
        n_samples=n_samples
    )

    print_evaluation_results(results1_df, avg_scores1)

    results2_df, avg_scores2 = evaluate_rag_system(
        best_answers_df=best_answers_df,
        weight_sparse=weight_sparse,
        bm25=set2_params['bm25'],
        chunks=set2_params['chunks'],
        embedding_model=set2_params['embedding_model'],
        embedding_index=set2_params['embedding_index'],
        generate_amswer=generate_amswer,
        n_samples=n_samples
    )

    print_evaluation_results(results2_df, avg_scores2)
    # Create comparison DataFrame
    comparison = pd.DataFrame({
        'Metric': ['BERTScore Precision', 'BERTScore Recall', 'BERTScore F1'],
        'Contextual': [
            avg_scores1['Average BERTScore Precision'],
            avg_scores1['Average BERTScore Recall'],
            avg_scores1['Average BERTScore F1']
        ],
        'Regular': [
            avg_scores2['Average BERTScore Precision'],
            avg_scores2['Average BERTScore Recall'],
            avg_scores2['Average BERTScore F1']
        ]
    })

    # Calculate differences
    comparison['Difference'] = comparison['Contextual'] - comparison['Regular']

    # Calculate percentage difference
    # Formula: ((new - old) / old) * 100
    comparison['Difference %'] = ((comparison['Contextual'] - comparison['Regular']) / comparison['Regular'] * 100).round(2)

    # Format numbers to 4 decimal places
    for col in ['Contextual', 'Regular', 'Difference', 'Difference %']:
        comparison[col] = comparison[col].round(4)

    return comparison, results1_df, results2_df
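
The difference and percentage-difference columns can be checked by hand. The sketch below is plain Python rather than the pandas code above, and `compare` is a hypothetical helper. The input values are the average scores printed in the run output later in this notebook.

```python
# Average scores as reported in the run output of this notebook.
contextual = {
    "BERTScore Precision": 0.8333,
    "BERTScore Recall": 0.8992,
    "BERTScore F1": 0.8647,
}
regular = {
    "BERTScore Precision": 0.8332,
    "BERTScore Recall": 0.8977,
    "BERTScore F1": 0.8640,
}


def compare(new, old):
    """Build one row per metric: absolute and relative difference of new vs old."""
    rows = []
    for metric in new:
        diff = new[metric] - old[metric]
        rows.append({
            "Metric": metric,
            "Contextual": new[metric],
            "Regular": old[metric],
            "Difference": round(diff, 4),
            # Same formula as above: ((new - old) / old) * 100
            "Difference %": round(diff / old[metric] * 100, 2),
        })
    return rows


rows = compare(contextual, regular)
for row in rows:
    print(row)
```

On these numbers the largest gap is in recall, roughly 0.17% in favour of the contextual chunks. Precision and F1 differ by less than 0.1%.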

# **Defining Answer Generation Chain**

## **OpenAI**

In [None]:
prompt_template_answer = ChatPromptTemplate.from_messages([
    ("system",
            """You are an AI assistant specialized in answering user queries based solely on provided context. Your primary goal is to provide clear, concise, and relevant answers without adding, making up, or hallucinating any information.
            """
     ),
    ("human","""Now, consider the following context carefully:
      <context>
      {context}
      </context>

      Here is the user's query:
      <query>
      {query}
      </query>

      Before answering, please follow these steps:

      1. Analyze the user's query and the provided context:
        a. Identify the key elements of the user's query.
        b. Find and quote relevant information from the context.
        c. Explicitly link the quoted information to the query elements.
        d. Formulate a potential answer based only on the context.
        e. Explicitly check that your answer doesn't include any information not present in the context.
        f. If the context doesn't contain enough information to answer the query, note this.

      2. After your analysis process, provide your final answer or response. Do not include your analysis steps in your final answer or response, only the result.

      If the context does not contain enough information to answer the user's query confidently and accurately, your final response should be: "I do not have enough information to answer this question based on the provided context."

      Remember, it's crucial that your answer is based entirely on the given context. Do not add any external information or make assumptions beyond what is explicitly stated in the context.

    """)
])

In [None]:
from langchain_core.output_parsers import StrOutputParser

def create_answer_chain(llm):
  return prompt_template_answer | llm | StrOutputParser()

In [None]:
def get_generate_amswer(llm_chain):
    def generate_amswer(context, query):
        llm_response = llm_chain.invoke({
            "context": context,
            "query": query
        })
        return llm_response.content if hasattr(llm_response, 'content') else llm_response
    return generate_amswer
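
The `hasattr(..., 'content')` check lets the same wrapper work whether the chain ends in a string parser or returns a message object. A minimal sketch of that behaviour follows. `FakeMessage` and `FakeChain` are hypothetical stand-ins that mimic only the `.invoke()` call, and `get_generate_answer` is a renamed copy of the wrapper above.

```python
class FakeMessage:
    """Hypothetical message object exposing a .content attribute."""
    def __init__(self, content):
        self.content = content


class FakeChain:
    """Hypothetical chain: returns either a plain string or a message object."""
    def __init__(self, wrap_in_message):
        self.wrap_in_message = wrap_in_message

    def invoke(self, inputs):
        text = f"Q: {inputs['query']} | context chars: {len(inputs['context'])}"
        return FakeMessage(text) if self.wrap_in_message else text


def get_generate_answer(llm_chain):
    def generate_answer(context, query):
        response = llm_chain.invoke({"context": context, "query": query})
        # Message objects carry the text in .content; parsed output is already a str.
        return response.content if hasattr(response, "content") else response
    return generate_answer


from_string_chain = get_generate_answer(FakeChain(wrap_in_message=False))
from_message_chain = get_generate_answer(FakeChain(wrap_in_message=True))

answer_a = from_string_chain("some context", "What is BM25?")
answer_b = from_message_chain("some context", "What is BM25?")
print(answer_a == answer_b)
```

Either way the evaluation loop receives a plain string, which is what BERTScore expects as a candidate.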

# **Creating Answer Generation Chain**

In [None]:
answer_chain = create_answer_chain(llm)

In [None]:
embedding_index_contextual = pc.Index(EMBEDDING_INDEX_CONTEXTUAL)
embedding_index_regular = pc.Index(EMBEDDING_INDEX_REGULAR)


# **Running the RAG**

In [None]:
set1_params = {
    'embedding_index': embedding_index_contextual,
    'chunks': chunks_with_context,
    'bm25': bm25_contextual,
    'embedding_model': embedding_model  # Add your embedding model here
}

set2_params = {
    'embedding_index': embedding_index_regular,
    'chunks': chunks_regular,
    'bm25': bm25_regular,
    'embedding_model': embedding_model  # Add your embedding model here
}

# Run comparison
comparison_results, results1_df, results2_df = compare_rag_evaluations(
    best_answers_df=best_answers_df,
    weight_sparse=0.3,  # alpha: weight of the sparse (BM25) scores in fusion ranking
    set1_params=set1_params,
    set2_params=set2_params,
    generate_amswer=get_generate_amswer(answer_chain),
    n_samples=None  # Set to a number if you want to limit samples
)

# Display results as markdown table
print(comparison_results.to_markdown(index=False))
# The result DataFrames are saved to CSV in the "Saving Results to files" section

Evaluating Questions: 100%|██████████| 65/65 [27:38<00:00, 25.51s/it]
done in 153.76 seconds, 0.42 sentences/sec

Average Scores:
Average BERTScore Precision: 0.8333
Average BERTScore Recall: 0.8992
Average BERTScore F1: 0.8647

Detailed Results Sample (first 3):

Question: What architecture is the `tokenizers-linux-x64-musl` binary designed for?

Reference Answer: x86_64-unknown-linux-musl
Generated Answer: The `tokenizers-linux-x64-musl` binary is designed for the x86_64 architecture running the musl C library on Linux systems.
BERTScore Precision: 0.8575
BERTScore Recall: 0.9092
BERTScore F1: 0.8826

Question: What is the purpose of the BLIP-Diffusion model?

Reference Answer: The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.
Generated Answer: The purpose of the BLIP-Diffusion model is to enable zero-shot subject-driven generation, efficient fine-tuning for customized subjects with significant speedup, and flexible integration with other techniques like ControlNet and prompt-to-prompt for novel subject-driven 

Evaluating Questions: 100%|██████████| 65/65 [17:25<00:00, 16.09s/it]
done in 115.65 seconds, 0.56 sentences/sec

Average Scores:
Average BERTScore Precision: 0.8332
Average BERTScore Recall: 0.8977
Average BERTScore F1: 0.8640

Detailed Results Sample (first 3):

Question: What architecture is the `tokenizers-linux-x64-musl` binary designed for?

Reference Answer: x86_64-unknown-linux-musl
Generated Answer: The provided context does not contain specific information regarding the architecture for which the `tokenizers-linux-x64-musl` binary is designed. Therefore, the answer based on the given context is: "I do not have enough information to answer this question based on the provided context."
BERTScore Precision: 0.8254
BERTScore Recall: 0.8858
BERTScore F1: 0.8545

Question: What is the purpose of the BLIP-Diffusion model?

Reference Answer: The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.
Generated Answer: The purpose of the BLIP-Diffusion model is to enable zero-shot subject-driven generation and control-gui

# **Saving Results to files**

In [None]:
comparison_results_file = create_filename_timestamp(filename='comparison_results', extension="csv")
comparison_results.to_csv(comparison_results_file, index=False)
results1_df_file = create_filename_timestamp(filename='contextual_rag_results', extension="csv")
results1_df.to_csv(results1_df_file, index=False)
results2_df_file = create_filename_timestamp(filename='regular_rag_results', extension="csv")
results2_df.to_csv(results2_df_file, index=False)
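
`create_filename_timestamp` is defined elsewhere in the notebook and is not shown in this section. As a rough illustration only, a helper of this kind might look like the sketch below. The name `timestamped_filename`, the timestamp format, and the `now` parameter are assumptions, not the notebook's actual implementation.

```python
from datetime import datetime


def timestamped_filename(filename, extension, now=None):
    """Return e.g. 'comparison_results_20240102_030405.csv' (format is an assumption)."""
    now = now or datetime.now()
    return f"{filename}_{now.strftime('%Y%m%d_%H%M%S')}.{extension}"


# Passing a fixed datetime makes the result reproducible.
example = timestamped_filename("comparison_results", "csv", now=datetime(2024, 1, 2, 3, 4, 5))
print(example)
```

Adding a timestamp to the filename keeps results from repeated runs from overwriting each other.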

In [None]:
files.download(comparison_results_file)
files.download(results1_df_file)
files.download(results2_df_file)
