## Library Imports

In [None]:
%load_ext autoreload 
%autoreload 2
import os
import nest_asyncio

nest_asyncio.apply()

### Variables

In [None]:
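QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
OLLAMA_PORT = int(os.getenv("OLLAMA_PORT", "11434"))
# The Ollama clients expect a full base URL; this assumes OLLAMA_HOST holds a bare hostname
OLLAMA_BASE_URL = f"http://{os.getenv('OLLAMA_HOST', 'localhost')}:{OLLAMA_PORT}"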
DATA_DIR = "../docs"
REQUIRED_EXTS = [".txt"]

## Set up the Qdrant vector DB

- We connect to Qdrant and name the collection that will store all the vector embeddings
- These vector embeddings will be indexed for efficient search

In [None]:
import qdrant_client

collection_name = "rag_cc"
client = qdrant_client.QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)

### Read the documents from a directory

- Load the data with LlamaIndex's `SimpleDirectoryReader`

In [None]:
from llama_index.core import SimpleDirectoryReader

# Recursively read every file with a required extension under DATA_DIR
loader = SimpleDirectoryReader(
    input_dir=DATA_DIR, required_exts=REQUIRED_EXTS, recursive=True
)
docs = loader.load_data()

In [None]:
type(docs), len(docs)

## Create an Index

In [None]:
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext


def create_index(documents):
    # Create a QdrantVectorStore instance
    vector_store = QdrantVectorStore(client=client, collection_name=collection_name)

    # Configure storage settings by specifying the vector store as the storage backend
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Create an index by embedding each document and storing it in the vector store
    index = VectorStoreIndex.from_documents(
        documents=documents, storage_context=storage_context
    )
    return index

### Load the embedding model and index the data 

- The steps are hidden inside `create_index`, but this is what happens under the hood:
    - The documents are split into chunks
    - The embedding model creates an embedding for each chunk
    - The embeddings are indexed and stored in the vector store
- Later, we fetch the stored chunks whose embeddings are most similar to the embedding of the user query

- **The Qdrant container must be running for the indexing to work.**
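
To make the chunk, embed, index, and retrieve steps concrete, here is a minimal sketch in plain Python. Everything in it is a stand-in chosen for illustration: `chunk_text`, `embed`, `cosine`, and `retrieve` are hypothetical helpers, and the bag-of-words "embedding" only approximates what a real embedding model such as the one loaded below produces.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=6):
    """Toy chunker: split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, toy_index, top_k=1):
    """Return the top_k chunks ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(toy_index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


text = (
    "Qdrant stores vectors for similarity search. "
    "Ollama serves local language models on your machine."
)
chunks = chunk_text(text, chunk_size=6)                  # 1. chunk
toy_index = [(chunk, embed(chunk)) for chunk in chunks]  # 2. embed and 3. index
top_chunks = retrieve("where are vectors stored", toy_index, top_k=1)  # 4. retrieve
print(top_chunks)
```

In the real pipeline, the vector store replaces `toy_index` and performs the similarity search itself.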

In [None]:
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.core import Settings

embed_model = FastEmbedEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
)
# Add the embedding model to the settings, to be used by the index creation process
Settings.embed_model = embed_model
index = create_index(docs)

### Load the LLM 

- With the vector database in place, we load an LLM that takes the user query and the relevant retrieved context and generates a response

In [None]:
from llama_index.llms.ollama import Ollama

llm = Ollama(model="gemma3n:e2b", base_url=OLLAMA_BASE_URL, request_timeout=60)
Settings.llm = llm

### Define the Prompt Template 

- The prompt template tells the LLM how to generate a response from the query and the retrieved context

In [None]:
from llama_index.core import PromptTemplate

template = """Context information is below:
              ---------------------
              {context_str}
              ---------------------
              Given the context information above I want you to think
              step by step to answer the query in a crisp manner,
              in case you don't know the answer say 'I don't know!'
            
              Query: {query_str}
        
              Answer:"""

qa_prompt_tmpl = PromptTemplate(template)
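
As a rough illustration of what happens to the template at query time, the sketch below fills the two placeholders with plain `str.format`. The shortened `toy_template` and the sample chunks are made up for this example; the query engine does the substitution itself, so this only mimics the idea.

```python
# Shortened copy of the template, kept only to show the two placeholders
toy_template = (
    "Context information is below:\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer:"
)

# Made-up retrieved chunks; in the notebook these come from the vector store
retrieved_chunks = [
    "Startups are funded in batches.",
    "Founders in a batch help each other.",
]

filled_prompt = toy_template.format(
    context_str="\n\n".join(retrieved_chunks),
    query_str="Why fund startups in batches?",
)
print(filled_prompt)
```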

### Reranking 

- Given the user query, the retriever returns the top_k most similar context chunks
- To refine this list further, we use a reranker model
- A reranker is a more expensive model (often a cross-encoder) that scores each retrieved chunk together with the query for relevance; we then keep only the top_n highest-scoring chunks
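
The two-stage idea (cheap retrieval of top_k, then a finer rerank down to top_n) can be sketched without any model. The scoring functions `word_overlap` and `phrase_match` below are hypothetical stand-ins: the first plays the role of vector similarity, the second the role of a cross-encoder.

```python
def retrieve_then_rerank(query, chunks, retrieval_score, rerank_score, top_k=3, top_n=2):
    """Stage 1: keep top_k chunks by a cheap score. Stage 2: re-score them, keep top_n."""
    candidates = sorted(chunks, key=lambda c: retrieval_score(query, c), reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:top_n]


def word_overlap(query, chunk):
    """Hypothetical cheap retrieval score: number of shared words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def phrase_match(query, chunk):
    """Hypothetical reranker score: bonus when the chunk contains the exact query phrase."""
    return word_overlap(query, chunk) + (5 if query.lower() in chunk.lower() else 0)


toy_chunks = [
    "startups in a batch share funding advice",
    "funding of startups is hard",
    "batch funding helps startups",
    "the weather is nice",
]
best = retrieve_then_rerank("batch funding", toy_chunks, word_overlap, phrase_match)
print(best)
```

The first stage ties the first and third chunk on word overlap; the second stage moves the chunk containing the exact phrase to the front.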

In [None]:
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=2)

### Query the Document 

- The query engine integrates the retrieval, re-ranking, and prompt-based response generation steps.

In [None]:
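# Retrieve the 10 most similar chunks, then let the reranker keep the best top_n
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[rerank])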
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt_tmpl})

In [None]:
response = query_engine.query(
    """How did the structure of funding startups in batches contribute to the success and growth of the Y Combinator program and the startups involved?"""
)

In [None]:
from IPython.display import Markdown, display

display(Markdown(str(response)))


## Generating Evaluation Dataset using Ragas

- Gut feeling is not a reliable way to judge a RAG system.
- Running evaluations along the way shows what actually works and what doesn't.

A RAG pipeline can fail at several stages:

- Chunking might split the documents in ways that are imprecise or not useful.
- The retrieval model might not always fetch the most relevant document.
- The generative model might misinterpret the context, leading to inaccurate or misleading answers.

### Load the Knowledge Base

In [None]:
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

- In RAG, ground truths are often unavailable: the system may be applied to a very specific domain, so reference answers are tricky to obtain
- Hence we evaluate RAG systems with reference-free metrics that capture the **quality** of the generated response, which is precisely what matters in a RAG application.
- These metrics rely on:
    - question (q)
    - retrieved context (c(q))
    - generated response/answer (a(q))

- We look at the following metrics here:
    - Faithfulness: Is the generated response (a(q)) faithful to the retrieved context c(q)?
        - A high faithfulness score means the generated text uses **ONLY** the information provided in the retrieved documents, without irrelevant or hallucinated content

    - Answer Relevance: Is the generated response (a(q)) relevant to the user query in a meaningful and complete way?
        - A high score means the response fully covers the user's intent and provides information specific to the question asked. This metric discourages responses that may be technically correct but are too broad, partially off-topic, or padded with unnecessary information.

In [None]:
loader = DirectoryLoader(
    "../docs/paul_graham",
)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=20)
documents = loader.load_and_split(text_splitter=text_splitter)

### Setting up Models

In [None]:
from langchain_ollama import ChatOllama
from langchain_ollama import OllamaEmbeddings

generator_llm = ChatOllama(model="phi3:3.8b", base_url=OLLAMA_BASE_URL)
critic_llm = ChatOllama(model="llama3.2:1b", base_url=OLLAMA_BASE_URL)
ollama_emb = OllamaEmbeddings(model="nomic-embed-text", base_url=OLLAMA_BASE_URL)

### Creating Ragas Testset Generator 

In [None]:
# generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
# dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

In [None]:
from ragas.testset import TestsetGenerator
import pandas as pd

In [None]:
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, ollama_emb)

In [None]:
# FAILS, because of some dependency issues with langchain
# distribution = {"simple":0.6, "reasoning":0.3, "multi_context":0.245}
# testset = generator.generate_with_langchain_docs(documents, testset_size=10, query_distribution=distribution, raise_exceptions=True)

In [None]:
# Load the testset from a file
# test_df = testset.to_pandas().dropna()
test_df = pd.read_csv("../docs/paul_graham/test_data_paul_graham.csv").dropna()

- The function below accepts the query engine and a question, and returns the answer along with the context chunks the engine used to generate it

In [None]:
def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "context": [c.node.get_content() for c in response.source_nodes],
    }

In [None]:
from datasets import Dataset
from tqdm.auto import tqdm

test_questions = test_df["question"].values

responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]

In [None]:
for i, response in enumerate(responses):
    print(response.keys())
    break

In [None]:
dataset_dict = {
    "question": test_questions,
    "answer": [response["answer"] for response in responses],
    "contexts": [response["context"] for response in responses],
    "ground_truth": test_df["ground_truth"].values.tolist(),
}

ragas_eval_dataset = Dataset.from_dict(dataset_dict)

## Metric Computation

In [None]:
# critic_llm and ollama_emb from "Setting up Models" are reused below, so no
# further Ollama imports are needed (re-importing would shadow the earlier names)
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_correctness,
    context_recall,
    context_precision,
)

### Faithfulness
- $$F = \frac{V}{S}$$
    - S is the total number of assertive statements extracted from a(q), i.e. statements that make a specific claim in the response.
    - V is the number of statements from S that can be verified against the retrieved context.

- Goal: treating the retrieved documents as the source of truth, how trustworthy are the LLM responses?
    - A high faithfulness score indicates that most or all statements in the answer are verifiable within the context, meaning the answer closely aligns with the information provided by the retrieval engine.
    - F evaluates the **quality of the answer**
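
A small worked example of the ratio, assuming the per-statement verdicts have already been produced by some judge (in practice a critic LLM). The function name `faithfulness_score` is hypothetical and is not the Ragas implementation.

```python
def faithfulness_score(statement_verdicts):
    """F = V / S, where each verdict is True if the statement is supported by the context."""
    if not statement_verdicts:
        return 0.0  # assumption: an answer with no statements scores 0
    return sum(statement_verdicts) / len(statement_verdicts)


# Four statements extracted from an answer; three are verifiable in the context
verdicts = [True, True, False, True]
f_score = faithfulness_score(verdicts)
print(f_score)  # 3 / 4 = 0.75
```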

### Answer Relevance
- $$AR = \frac{1}{n} \sum_{i=1}^{n} \operatorname{sim}(q, q_i)$$
    - $q$ is the original question
    - $q_i$ is the $i$-th of $n$ questions generated from the answer
    - $\operatorname{sim}$ is the cosine similarity between the embeddings of the two questions

- Goal: treating the LLM's answer as given, how well is the **answer aligned to the original question**? A good answer can be traced back to a variety of questions that all reflect the same intent.
    - A high AR score indicates the answer is well-aligned to the original question.
    - AR evaluates **the quality of the answer**
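
The averaging step can be sketched as follows. The vectors are tiny hand-made stand-ins for real question embeddings, and `answer_relevance` is a hypothetical helper written for this illustration only.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question and the generated ones."""
    sims = [cos_sim(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


q_vec = [1.0, 0.0]                     # embedding of the original question (toy)
generated = [[2.0, 0.0], [0.0, 3.0]]   # one aligned question, one unrelated question
ar_score = answer_relevance(q_vec, generated)
print(ar_score)  # (1.0 + 0.0) / 2 = 0.5
```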

### Context Relevance
- $$ CR = \frac{\text{Number of extracted sentences}}{\text{Total number of sentences in } c(q)} $$
    - The denominator counts all sentences retrieved from the vector DB
    - The numerator counts only the extracted sentences, i.e. those actually required to answer the question
- Goal: assuming the LLM answer is correct, how relevant was the context for producing that answer?
    - A high score means the retrieved context is highly relevant for generating the specific answer.
    - CR evaluates the **quality of the context**

### Answer Correctness
- $$ FAC = \frac{TP}{TP + 0.5 \times (FP + FN)} $$
    - TP: Statements that are present in both the answer and the ground truth
    - FP: Statements that are present in the answer but not found in the ground truth
    - FN: Statements that are present in the ground truth but missing from the answer
- Goal: Check the answer's correctness both factually and semantically
    - **Ground truth is a requirement** for this metric.
    - **A critic LLM** checks factual correctness by comparing the generated answer with the ground truth (GT)
    - An embedding model computes embeddings for the generated answer and the GT; the cosine of the angle between the two embeddings gives their semantic similarity
    - FAC evaluates **the quality of the answer wrt ground truth**
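
The factual part of the formula is an F1 score over statements. The sketch below computes it with exact string matching as a deliberately crude stand-in for the critic LLM; `classify_statements` and `factual_f1` are hypothetical names, and the statements are made up.

```python
def classify_statements(answer_statements, truth_statements):
    """Count TP, FP, FN using exact matching (assumption: a critic LLM does this in practice)."""
    answer, truth = set(answer_statements), set(truth_statements)
    tp = len(answer & truth)   # in both the answer and the ground truth
    fp = len(answer - truth)   # in the answer only
    fn = len(truth - answer)   # in the ground truth only
    return tp, fp, fn


def factual_f1(tp, fp, fn):
    """FAC = TP / (TP + 0.5 * (FP + FN))."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


answer_statements = ["A", "B", "C", "X"]   # toy statements from the generated answer
truth_statements = ["A", "B", "C", "D"]    # toy statements from the ground truth
tp, fp, fn = classify_statements(answer_statements, truth_statements)
fac = factual_f1(tp, fp, fn)
print(tp, fp, fn, fac)  # 3 1 1 0.75
```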



### Context Recall
- $$ \text{Context Recall} = \frac{\text{Number of GT sentences that can be attributed to the context}}{\text{Number of sentences in GT}}$$
    - Both counts refer to sentences of the ground_truth answer

- Goal: Check how much of the ground_truth answer is covered by the retrieved context
    - **Ground truth is a requirement** for this metric.
    - **A critic LLM** judges, sentence by sentence, whether the ground truth answer can be attributed to the retrieved context
    - Context Recall evaluates the **quality of the retrieved context wrt ground truth**

### Context Precision
- Context precision measures **whether the relevant items in the contexts are ranked high**.
    - It checks how precise the fetched context is.
- Goal: Check the precision of the fetched context
    - Given the question, the ground truth answer, and the retrieved context, verify whether each context chunk was useful in arriving at the given answer.
    - **Ground Truth & Critic LLM are a requirement**
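
One common way to turn "relevant items are ranked high" into a number is an average-precision style score: take precision@k at every rank k that holds a relevant chunk and average those values. The sketch below assumes this formulation and assumes the per-chunk relevance flags come from a judge such as the critic LLM; it is an illustration, not the Ragas source code.

```python
def context_precision(relevance_flags):
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)  # precision@k at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0


# Same two relevant chunks, ranked first versus ranked last
good_ranking = context_precision([True, True, False])   # (1/1 + 2/2) / 2
poor_ranking = context_precision([False, True, True])   # (1/2 + 2/3) / 2
print(good_ranking, poor_ranking)
```

Both rankings contain the same relevant chunks, yet the one that places them first scores higher, which is the behavior this metric is meant to reward.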

In [None]:
metrics = [faithfulness, answer_correctness, context_recall, context_precision]

evaluation_result = evaluate(
    llm=critic_llm, embeddings=ollama_emb, dataset=ragas_eval_dataset, metrics=metrics
)

In [None]:
eval_scores_df = pd.DataFrame(evaluation_result.scores)
eval_scores_df

### Summary
- The evaluation mainly checks the quality of the retrieved context and of the generated answer via different metrics.
- The metrics fall into two groups: reference-free metrics and metrics that need a ground truth (GT).
    - Evaluation **without GT**:
        - Faithfulness: Evaluates the **quality of the answer**, assuming the **retrieved context is correct**
        - Answer Relevance: Evaluates the **quality of the answer** via the semantic similarity of **generated questions** to the original question
        - Context Relevance: Evaluates the **quality of the context**, assuming the **generated answer is correct**; a **critic LLM** determines how many sentences of the retrieved context are necessary to arrive at the generated answer

    - Evaluation **wrt GT**:
        - Answer Correctness: Evaluates the **quality of the answer** by verifying its **factual correctness via a critic LLM** as well as its **semantic similarity** to the **ground truth**
        - Context Recall: Evaluates the **quality of the context** by checking which sentences of the **ground_truth can be attributed to the retrieved context (via a critic LLM)**
        - Context Precision: Evaluates the **quality of the context** by having a **critic LLM** judge how precisely the fetched context supports the GT for the given question
