Before running the notebook, make sure the following dependencies are installed:

1. LlamaIndex
    * Open source framework to connect your LLMs with external data sources.
        * PDFs, websites, APIs.
        * Supports popular LLMs and vector databases.
    * Simplifies indexing and querying data.

2. Qdrant
    * Open source vector database optimized for querying high-dimensional vectors.
    * Provides fast similarity search with filtering.
    * Use Docker to self-host a Qdrant vector database:
    * Install Docker Engine (if it is not already installed).
    * Start the Docker container for Qdrant:
    * `docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant`
        * `docker run`: Initiates a new container.
        * `-p 6333:6333 -p 6334:6334`: Maps the host machine's ports 6333 and 6334 to the same ports on the container.
        * `-v $(pwd)/qdrant_storage:/qdrant/storage:z`:
            * `$(pwd)/qdrant_storage` is a directory on the host (the current working directory with a subdirectory named qdrant_storage).
            * `/qdrant/storage` is the corresponding directory inside the container where Qdrant will store its data.
            * The `:z` suffix sets the SELinux context for the volume (on SELinux-enabled systems), allowing the container to access the mounted volume securely.
        * `qdrant/qdrant`: Specifies the Docker image to use, which is `qdrant/qdrant` from Docker Hub set up by the Qdrant team.

3. Ollama
    * Provides a platform to run LLMs locally, giving you control over your data and model usage.
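
Once the Qdrant container has been started, it can be useful to confirm that something is listening on the mapped port before running the rest of the notebook. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper introduced here for illustration, and it only checks that a TCP connection succeeds, not that the service behind it is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# 6333 is one of the ports mapped by the docker command above.
status = "reachable" if is_port_open("localhost", 6333) else "not reachable"
print(f"Port 6333 is {status}")
```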

In [1]:
# Jupyter already runs an asyncio event loop, so asynchronous calls made from
# a notebook cell (as our RAG pipeline does) would try to start a nested loop.

# nest_asyncio is a library that patches asyncio to allow such nested use,
# which asyncio does not permit by default.

import nest_asyncio

# This line patches asyncio so that asyncio.run() and loop.run_until_complete() can be
# called while an event loop is already running, as is the case in a notebook.

nest_asyncio.apply()
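
To see the problem that `nest_asyncio` works around, the sketch below reproduces it with plain `asyncio`. It is meant to be run as a standalone script (outside Jupyter, and without `nest_asyncio` applied); the coroutine names `inner` and `outer` are made up for this illustration.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message is the `RuntimeError` raised by the nested `asyncio.run()` call.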

# RAG - 1

## Set up the Qdrant vector database

Set up a connection to Qdrant (our local vector database), where we will store and retrieve the vector embeddings for this RAG pipeline.

* Collections in Qdrant are like tables in databases, 
    * where each collection can hold a set of vectors. 
* Here, "chat_with_docs" is intended to store document embeddings 
    * to support query-based information retrieval.
* `client = qdrant_client.QdrantClient(...)` initializes a `QdrantClient` instance, 
    * connecting it to a Qdrant server running locally.

In [None]:
import qdrant_client

collection_name="chat_with_docs" # used later to create a vector store

client = qdrant_client.QdrantClient(
    host="localhost",
    port=6333
)

## Read Documents

* Next, we set up a document loader that reads files from a specified directory and 
    * extracts their contents for use in our RAG pipeline.
* This allows us to retrieve text from PDF files, 
    * which we’ll later transform into embeddings and 
    * store in the Qdrant vector database set up above.

In [4]:
from pathlib import Path
data_dir = Path(r"F:\cc_data\docs")

data_dir.absolute()

WindowsPath('F:/cc_data/docs')

* First, we import the `SimpleDirectoryReader` class from the `llama_index` library. 
* This class scans a directory, 
    * filters for specific file types, and 
    * loads document content into a format we can work with.
* We pass it `data_dir`, the directory path defined above where the documents are stored.
* The loader is a `SimpleDirectoryReader` instance, 
    * which is configured to load specific types of files (`.pdf`) from the given directory, recursively.
* At this point, the loader object has only been instantiated. 
    * We haven't read anything yet. 
    * The `load_data()` method reads the PDF file’s content and 
    * returns it in a structured format, which we store in the `docs` list. 
* Each entry in `docs` holds the text content of one page of the PDF document.

In [7]:
from llama_index.core import SimpleDirectoryReader

# input_dir_path = './docs'

loader = SimpleDirectoryReader(
    input_dir=data_dir,
    required_exts=[".pdf"],
    recursive=True,
)
docs = loader.load_data()

There are 32 pages in the document

In [6]:
type(docs), len(docs)

(list, 32)
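
As a rough illustration of the directory scan described above (recursive traversal plus an extension filter), the same selection can be approximated with `pathlib`. This is only a sketch of the idea, not LlamaIndex's implementation; `find_files` is a hypothetical helper, and the temporary files exist only for the demonstration.

```python
import tempfile
from pathlib import Path


def find_files(root: Path, exts: set[str], recursive: bool = True) -> list[Path]:
    """Return files under `root` whose suffix is in `exts`, sorted by path."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower() in exts
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.pdf").touch()
    (root / "sub" / "b.pdf").touch()
    (root / "notes.txt").touch()  # filtered out: not a .pdf

    found = [p.relative_to(root).as_posix() for p in find_files(root, {".pdf"})]
    print(found)  # ['a.pdf', 'sub/b.pdf']
```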

## Function to index data

* Before embedding the above data and indexing in the vector database, 
    * we need to write a function that can be invoked to do this.

* In this step, we'll define a function to create an index for our document embeddings, 
    * which will be stored in the Qdrant vector database.

* This index allows us to organize and search through the document embeddings efficiently, 
    * forming the backbone of our RAG app.

* Let's break down how this function initializes the vector store, 
    * configures the storage context, and 
    * creates an index from the loaded documents.

* The function below takes the documents as input 
    * (`docs`, which we created above using `SimpleDirectoryReader`).
* We initialize a `QdrantVectorStore` object by 
    * passing the Qdrant client created earlier and 
    * a name for the collection.
* Next, we configure storage settings by specifying this `vector_store` as the storage backend.
* Finally, we create an index by embedding each document in `documents` and 
    * storing it in the Qdrant vector store.

In [8]:
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

def create_index(documents):

    vector_store = QdrantVectorStore(client=client,
                                     collection_name=collection_name)
    
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    
    index = VectorStoreIndex.from_documents(documents,
                                            storage_context=storage_context)
    
    return index
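
Conceptually, a vector index stores one embedding per chunk of text and answers queries by returning the stored chunks whose embeddings are closest to the query embedding. The toy in-memory store below sketches that idea with cosine similarity. It is a simplified stand-in, not how Qdrant or LlamaIndex work internally; `ToyVectorStore` and the 2-dimensional vectors are invented for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (vector, text) pairs in a list and searches them by brute force."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((list(vector), text))

    def search(self, query_vector: list[float], top_k: int = 2) -> list[tuple[float, str]]:
        scored = [(cosine_similarity(query_vector, v), t) for v, t in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add([1.0, 0.0], "chunk A")
store.add([0.0, 1.0], "chunk B")
store.add([0.9, 0.1], "chunk C")

results = store.search([1.0, 0.0], top_k=2)
print([text for _, text in results])  # ['chunk A', 'chunk C']
```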

## Load the embedding model and index data

* Now that we have defined the above function, we can actually index our data.

* In this step, we set up an embedding model from Hugging Face to 
    * convert our documents into vector embeddings, 
    * which we’ll then store in Qdrant using our index function.

* The `HuggingFaceEmbedding` class lets us use Hugging Face models to generate embeddings for text data. 
    * In this case, we use the `"BAAI/bge-large-en-v1.5"` model by the Beijing Academy of Artificial Intelligence (BAAI).
* Next, we configure `embed_model` as the default embedding model in `Settings`. 
    * This setting ensures that the same model is used throughout our RAG pipeline to maintain consistency in embedding generation.
* Finally, we invoke the `create_index` function we defined earlier, 
    * passing in `docs` (the list of loaded documents). 
    * As discussed above, this function converts each document into an embedding using `embed_model` and stores the embeddings in Qdrant.

In [9]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5",
                                   trust_remote_code=True)

Settings.embed_model = embed_model

index = create_index(docs)



## Load the LLM
* In this step, we configure an LLM to handle the response generation step in our RAG pipeline.
* First, we import the `Ollama` class and instantiate it. 
    * Here, we’re using the `"llama3.2:1b"` model.
* We are also specifying a `request_timeout` of 120 seconds for requests 
    * to the LLM to ensure that the system doesn’t get stuck if the model takes too long to respond.
* Finally, we set the above LLM instance as the default language model in `Settings`, 
    * making it the primary model used in our RAG pipeline.

In [18]:
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3.2:1b", request_timeout=120.0)

Settings.llm = llm

## Define the prompt template

* In this step, we create a prompt template that defines a consistent format 
    * to guide the LLM about the context it should look at while answering the query.

* First, we import the `PromptTemplate` class, 
    * which lets us define reusable prompt formats for the LLM.
* Next, we define the template as a string, 
    * which is then passed to the `PromptTemplate` class to initialize its object.

In [13]:
from llama_index.core import PromptTemplate

template = """Context information is below:
              ---------------------
              {context_str}
              ---------------------
              Given the context information above, I want you to think
              step by step to answer the query in a crisp manner.
              In case you don't know the answer, say 'I don't know!'
            
              Query: {query_str}
        
              Answer:"""

qa_prompt_tmpl = PromptTemplate(template)
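
At query time, the two placeholders are filled with the retrieved context and the user's question. The sketch below illustrates the substitution with plain `str.format` on a shortened copy of the template; it stands in for what the query engine does with the prompt template and is not LlamaIndex code. The names `demo_template` and `retrieved_chunks` and the chunk texts are made up for this example.

```python
# Shortened, hypothetical copy of the template structure used above.
demo_template = (
    "Context information is below:\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer:"
)

# Placeholder chunks standing in for the retrieved, re-ranked context.
retrieved_chunks = ["First retrieved chunk.", "Second retrieved chunk."]

prompt = demo_template.format(
    context_str="\n\n".join(retrieved_chunks),
    query_str="What exactly is DSPy?",
)
print(prompt)
```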

## Reranking

* After retrieval, the selected chunks might need further refinement 
    * to ensure the most relevant information is prioritized.

* In this re-ranking step, a more sophisticated model (often a `cross-encoder`) 
    * evaluates the initial list of retrieved chunks 
    * alongside the `query` to assign a relevance score to each chunk.

* This process rearranges the chunks so that the most relevant ones are prioritized for the response generation.

* Here, we use a cross-encoder to re-rank the document chunks.
* Also, we limit the output to the top 3 most relevant chunks based on the model’s scoring.

In [14]:
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", 
    top_n=3
)
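
The mechanics of re-ranking (score every retrieved chunk against the query, sort, keep the top `n`) can be sketched without a model. In the example below, a simple word-overlap score is used as a deliberately crude stand-in for the cross-encoder; `overlap_score` and `toy_rerank` are hypothetical helpers, and a real cross-encoder would produce far better relevance scores.

```python
def overlap_score(query: str, chunk: str) -> float:
    """Fraction of query words that also appear in the chunk (toy relevance score)."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def toy_rerank(query: str, chunks: list[str], top_n: int = 3) -> list[tuple[float, str]]:
    """Score each chunk against the query and keep the `top_n` highest-scoring ones."""
    scored = [(overlap_score(query, chunk), chunk) for chunk in chunks]
    # list.sort is stable, so chunks with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


candidates = [
    "the weather is nice today",
    "dspy is a framework for programming language models",
    "qdrant stores vectors",
    "what dspy is and how it works",
]

top = toy_rerank("what is dspy", candidates, top_n=2)
for score, chunk in top:
    print(f"{score:.2f}  {chunk}")
```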

## Query the document
* Finally, we utilize the index created above to set up a query engine, 
    * which will use our indexed documents and 
    * re-ranking model to process user queries.

* The query engine integrates the 
    * retrieval, 
    * re-ranking, and 
    * prompt-based response generation steps.

* First, we convert the previously created `index` into a query engine.
* We specify that the engine should retrieve the top 10 most similar document chunks based on vector similarity to the query (`similarity_top_k=10`). 
* This forms the initial set of chunks for answering the query.
* The re-ranking step is then added to refine the retrieved chunks. 
    * The `rerank` model evaluates these chunks and keeps the most relevant ones for generating a response.
* Next, we add the prompt template to our query engine.
* Finally, we execute a query using the configured query engine. 
* In this case, "What exactly is DSPy?" is our sample query.

In [19]:
query_engine = index.as_query_engine(similarity_top_k=10,
                                     node_postprocessors=[rerank])

query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": qa_prompt_tmpl}
)

response = query_engine.query("What exactly is DSPy?")

In [20]:
from IPython.display import Markdown, display

display(Markdown(str(response)))

DSPy stands for Declarative Specifier for Programming Interfaces, which refers to a programming model used for abstracting and customizing natural language prompts (signatures) in deep learning pipelines.


# Metrics for RAG evaluation

* Typically, when evaluating a RAG system, we do not have access to human-annotated datasets or reference answers since 
    * the downstream application can be HIGHLY specific due to the capabilities of LLMs.
* So, we prefer self-contained or reference-free metrics that 
    * capture the “quality” of the generated response, 
    * which is precisely what matters in RAG applications.
* [Ragas Paper](https://arxiv.org/pdf/2309.15217)

## Faithfulness

* Faithfulness measures whether the generated response $a(q)$ “stays true” or is "faithful" to the retrieved context $c(q)$. 
* In other words, faithfulness checks that all claims or information presented in the answer 
    * can be directly inferred from the retrieved context.

* A **high** faithfulness score means the generated text uses 
    * ONLY the information provided in the retrieved documents 
    * without introducing irrelevant or fabricated details, i.e., hallucinations.

* To measure faithfulness, 
    * we can use a multi-step approach in which another LLM breaks down 
    * the generated answer into distinct statements, 
        * each representing a focused assertion.
    * The goal of this breakdown is to simplify longer, 
        * complex sentences into smaller, verifiable statements.
    * Each statement is then checked against the retrieved context $c(q)$ to decide whether it is supported.
* Faithfulness score:
    * $S$: Total number of statements.
    * $V$: Number of statements supported by the context.
    * Faithfulness Score: 
        $$F = \frac{V}{S}$$ 
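
Once each statement has a supported / not supported verdict, the score is a simple ratio. The sketch below assumes the verdicts have already been produced (for example by a judge LLM) and only shows the arithmetic; `faithfulness_score` and the example verdicts are hypothetical.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """F = V / S, with S = number of statements and V = number marked as supported."""
    if not verdicts:
        raise ValueError("at least one statement is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for four statements extracted from one answer:
# three are supported by the retrieved context, one is not.
statement_verdicts = [True, True, False, True]

print(faithfulness_score(statement_verdicts))  # 0.75
```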



## Answer Relevance

* Answer Relevance measures whether the generated answer $a(q)$
    * directly addresses the user’s query in a meaningful and complete way.
* This metric focuses on how well the response addresses the question, not on its factual accuracy, 
    * so an answer can be factually correct and still score low if it misses the point of the question.
* It discourages answers that:
    * may be technically correct but are either 
        * too broad, 
        * partially off-topic, or 
        * padded with unnecessary information.
* How is it different from `Faithfulness`?
    * Faithfulness evaluates the answer with respect to the retrieved context. 
    * Answer Relevance evaluates the answer with respect to the question, by comparing the question with proxy questions generated from the answer.

* Steps
    * Generate proxy questions:
        * For each answer $a(q)$, prompt an LLM to generate $n$ alternative questions $q_1, \dots, q_n$ that the answer could be responding to.
    * Calculate similarity scores:
        * $$AR = \frac{1}{n} \sum_{i=1}^n \text{similarity}(q, q_i)$$
        * where $n$ is the number of generated questions.
        * $\text{similarity}(q, q_i)$: cosine similarity between the embeddings of $q$ and $q_i$.
* A high AR score indicates that the generated answer is well-aligned with the original question, 
    * as it can match a variety of questions that reflect the same intent.
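
Given embeddings for the original question and for each generated question, the AR formula is just the mean cosine similarity. The sketch below uses tiny hand-written 2-dimensional vectors in place of real embeddings, so the numbers are purely illustrative; `answer_relevance` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec: list[float], proxy_vecs: list[list[float]]) -> float:
    """AR = mean cosine similarity between q and each generated question q_i."""
    if not proxy_vecs:
        raise ValueError("at least one generated question is required")
    total = sum(cosine_similarity(question_vec, v) for v in proxy_vecs)
    return total / len(proxy_vecs)


# Made-up embeddings: one identical, one orthogonal, one in between.
q = [1.0, 0.0]
proxies = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ar = answer_relevance(q, proxies)
print(round(ar, 3))  # 0.569
```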

## Context Relevance

* "How relevant is my context?" 
    * Did we not check this in `Faithfulness`? Not quite: faithfulness compares the answer with the context, while context relevance compares the context with the question.
* Context relevance measures how well suited the retrieved context $c(q)$ is to answering the specific question $q$. 
* Thus, it should discourage cases where 
    * irrelevant details are included in the context that can mislead the LLM during the generation stage.
* $$CR = \frac{\text{number of relevant sentences in } c(q)}{\text{total number of sentences in } c(q)}$$
* A high CR score indicates that
    * the majority of the sentences in the retrieved context are directly relevant to the question.

## Answer Correctness

* Answer correctness considers two key aspects:
    * the factual similarity between the generated answer and the ground truth, and 
    * the semantic similarity between them.
* This requires 2 models:
    * **critic LLM** 
        * determines factual correctness by comparing the statements in 
            * the generated answer and 
            * the ground truth answer.
    * **embedding model**
        * computes the embeddings for the generated answer and the ground truth,
        * and then measures the cosine similarity between them. 
        * This helps in determining the semantic similarity.
* The critic LLM is given a prompt along the lines of the one below to classify statements:

```
Given a ground truth and an answer, analyze each statement in the answer and classify them in one of the following categories:

- TP (true positive): statements that are present in both the answer and the ground truth,
- FP (false positive): statements present in the answer but not found in the ground truth,
- FN (false negative): relevant statements found in the ground truth but omitted in the answer.

Factual correctness score = TP / (TP + 0.5 * (FP + FN))
```
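
The factual correctness formula in the prompt above is the F1 score over statements. The sketch below computes it from TP/FP/FN counts and then blends it with a semantic similarity value. How the two parts are combined is not specified in these notes, so the weighted average and the default `factual_weight` of 0.75 are assumptions made for illustration, as are the function names and example numbers.

```python
def factual_correctness(tp: int, fp: int, fn: int) -> float:
    """TP / (TP + 0.5 * (FP + FN)), i.e. the F1 score over statements."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(factual: float, semantic: float, factual_weight: float = 0.75) -> float:
    """Weighted average of the two aspects (the weighting is an assumption)."""
    return factual_weight * factual + (1.0 - factual_weight) * semantic


# Hypothetical counts from the critic LLM: 3 TP, 1 FP, 1 FN.
factual = factual_correctness(tp=3, fp=1, fn=1)
print(factual)  # 0.75

# Hypothetical cosine similarity between answer and ground-truth embeddings.
semantic = 0.9
score = answer_correctness(factual, semantic)
print(round(score, 4))  # 0.7875
```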

## Context Recall

## Context Precision