## Library Imports 

In [None]:
%load_ext autoreload 
%autoreload 2
import os
import nest_asyncio

nest_asyncio.apply()

### Variables

In [None]:
QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", 6333))
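OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
OLLAMA_PORT = int(os.getenv("OLLAMA_PORT", 11434))
# The Ollama clients expect a full URL; this assumes OLLAMA_HOST is a bare hostname
OLLAMA_BASE_URL = f"http://{OLLAMA_HOST}:{OLLAMA_PORT}"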
DATA_DIR = "../docs"
REQUIRED_EXTS = [".pdf"]

## Set up the Qdrant vector DB

- We connect to the Qdrant server and name the collection that will store all the vector embeddings
- These vector embeddings will be indexed for efficient search

In [None]:
import qdrant_client

collection_name = "rag_cc"
client = qdrant_client.QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)

### Read the documents from a directory

- Load the data with LlamaIndex's `SimpleDirectoryReader`

In [None]:
from llama_index.core import SimpleDirectoryReader

# Recursively load every file under DATA_DIR that has one of the required extensions
loader = SimpleDirectoryReader(
    input_dir=DATA_DIR, required_exts=REQUIRED_EXTS, recursive=True
)
docs = loader.load_data()

In [None]:
type(docs), len(docs)

## Create an Index

In [None]:
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext


def create_index(documents):
    # Create a QdrantVectorStore instance
    vector_store = QdrantVectorStore(
        client=client,
        collection_name=collection_name,
    )
    # Configure storage settings by specifying the vector store as the storage backend
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Create an index by embedding each document and storing it in the vector store
    index = VectorStoreIndex.from_documents(
        documents=documents, storage_context=storage_context
    )
    return index

### Load the embedding model and index the data

- Although none of it is visible, this is what happens under the hood:
    - The documents are split into chunks
    - The embedding model converts each chunk into an embedding vector
    - The embeddings are indexed and stored in the vector store
- At query time, the chunks whose embeddings are most similar to the user query's embedding are retrieved

- **The Qdrant container must be running for the indexing to work.**
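
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the splitter LlamaIndex applies internally; `chunk_text`, `chunk_size`, and `overlap` are hypothetical names chosen for illustration, and sizes are counted in whitespace-separated words rather than tokens.

```python
def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten eleven twelve"
chunks = chunk_text(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some words twice.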

In [None]:
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings

embed_model = OllamaEmbedding(
    model_name="qllama/multilingual-e5-small:latest",
    base_url=OLLAMA_BASE_URL,
)
# Add the embedding model to the settings, to be used by the index creation process
Settings.embed_model = embed_model
index = create_index(docs)

### Load the LLM

- With the vector database populated, we load the LLM, which takes the user query and the relevant context and generates a response

In [None]:
from llama_index.llms.ollama import Ollama

llm = Ollama(model="gemma3n:e2b", base_url=OLLAMA_BASE_URL, request_timeout=60)
Settings.llm = llm

### Define the Prompt Template

- We use a prompt template so that the LLM generates its response from the query and the retrieved context

In [None]:
from llama_index.core import PromptTemplate

template = """Context information is below:
              ---------------------
              {context_str}
              ---------------------
              Given the context information above, I want you to think
              step by step to answer the query in a crisp manner.
              In case you don't know the answer, say 'I don't know!'
            
              Query: {query_str}
        
              Answer:"""

qa_prompt_tmpl = PromptTemplate(template)
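
To show what the LLM eventually receives, the sketch below fills a shortened copy of the template with Python's built-in `str.format`. LlamaIndex performs the substitution itself at query time, so this only mimics the effect; `example_template`, `retrieved_chunks`, and the sample strings are made up for illustration.

```python
# A shortened copy of the template, with the same two placeholders
example_template = (
    "Context information is below:\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer:"
)

# Stand-ins for the chunks that retrieval and re-ranking would return
retrieved_chunks = [
    "Qdrant stores the embeddings.",
    "Ollama serves the LLM locally.",
]

filled_prompt = example_template.format(
    context_str="\n\n".join(retrieved_chunks),
    query_str="Where are the embeddings stored?",
)
print(filled_prompt)
```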

### Reranking

- Based on the user query, the query engine returns the top_k contexts most similar to the query
- To refine this selection further, we use a reranker model
- A reranker is a model (often a cross-encoder) that evaluates each retrieved chunk together with the query and assigns it a relevance score; we then keep the top_n highest-scoring chunks

In [None]:
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)
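
The score-then-truncate logic of re-ranking can be sketched without any model. In the example below a toy word-overlap score stands in for the cross-encoder, so the numbers are not what `SentenceTransformerRerank` would produce; `overlap_score` and `rerank_chunks` are hypothetical helpers.

```python
def overlap_score(query, chunk):
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank_chunks(query, chunks, top_n=2, score_fn=overlap_score):
    """Score every (query, chunk) pair and keep the top_n best chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:top_n]


candidates = [
    "the weather is sunny today",
    "qdrant is a vector database",
    "a vector database stores embeddings for search",
    "ollama runs models locally",
]
top_chunks = rerank_chunks("what is a vector database", candidates, top_n=2)
for score, chunk in top_chunks:
    print(f"{score:.2f}  {chunk}")
```

A real cross-encoder reads the query and the chunk jointly, which is slower than comparing precomputed embeddings. That is why it is applied only to the small candidate list that retrieval returns.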

### Query the Document 

- The query engine integrates the retrieval, re-ranking, and prompt-based response generation steps.

In [None]:
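# Retrieve the 10 most similar chunks, then let the reranker keep the best top_n
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[rerank])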
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt_tmpl})

In [None]:
response = query_engine.query(
    "What is this pdf about? Answer only based on the content of the file and not the file path. Remember DDoS stands for Daily Dose of Data Science not Distributed Denial of Service Attacks"
)

In [None]:
from IPython.display import Markdown, display

display(Markdown(str(response)))

### Summary

- The end-to-end RAG process is:
    - Chunk the documents
    - Embed the chunks into vectors (the encoder side of the pipeline)
    - Index the vectors
    - Store the vectors in the vector DB
    - Convert the user query into a vector using the same embedding model
    - Query the vector DB for the approximate nearest neighbours (ANNs) of the query vector
    - Retrieve the chunks that correspond to those vectors
    - Re-rank the retrieved chunks
    - Feed the top_n chunks, together with the query, to the LLM as context (the decoder side of the pipeline)
    - Generate the response to the user query from that context
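
The retrieval steps in the list above can be sketched in plain Python. This is a toy under stated assumptions: `embed` is a bag-of-words count vector standing in for the embedding model, and `retrieve` scans every chunk exhaustively, whereas a vector DB such as Qdrant uses an approximate index to avoid a full scan. All names are hypothetical.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy embedding: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine(u, v):
    """Cosine similarity, defined here as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, chunks, vocabulary, top_k=2):
    """Embed the query with the same function as the chunks, return the top_k."""
    query_vec = embed(query, vocabulary)
    scored = [(cosine(query_vec, embed(c, vocabulary)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


corpus = [
    "cats chase mice",
    "dogs chase cats",
    "stocks fell sharply",
]
vocab = sorted({word for chunk in corpus for word in chunk.split()})
results = retrieve("cats and mice", corpus, vocab, top_k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

The query and the chunks must go through the same `embed` function, otherwise their vectors would not live in the same space and the similarity scores would be meaningless.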

### A bit of maths 

- To measure vector similarity we use cosine similarity:
    - Cosine similarity, cos(θ), is the cosine of the angle between two vectors, so it measures their orientation rather than their magnitude.
    - It ranges from -1 to 1: 1 means the vectors point in the same direction (high similarity), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions (high dissimilarity).
    - It is often preferable to the raw dot product, |A||B|cos(θ), because the dot product grows with the vectors' magnitudes: a vector with a large magnitude yields a large dot product (positive or negative, depending on the angle) however well the directions match. Dividing the dot product by |A||B| removes that influence and gives the cosine similarity.
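
A small numeric example, using made-up 2D vectors, shows the two measures disagreeing. It computes cos(θ) = A·B / (|A||B|) directly; `dot_product` and `cosine_similarity` are illustrative helpers, not library functions.

```python
import math


def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    # Divide the dot product by both magnitudes to keep only the angle
    return dot_product(u, v) / (
        math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    )


query = [1.0, 1.0]
aligned_short = [0.5, 0.5]   # same direction as the query, small magnitude
off_long = [10.0, 0.0]       # 45 degrees away from the query, large magnitude
opposite = [-1.0, -1.0]      # points the other way

for name, vec in [
    ("aligned_short", aligned_short),
    ("off_long", off_long),
    ("opposite", opposite),
]:
    print(name, dot_product(query, vec), round(cosine_similarity(query, vec), 4))
```

The dot product ranks `off_long` above `aligned_short` (10 versus 1) purely because of its length, while cosine similarity ranks `aligned_short` first (1.0 versus roughly 0.71) because it points in exactly the same direction as the query.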