# 03. Core RAG Pipeline

This notebook connects the pieces: **Vector Store** (Knowledge) + **Llama** (Reasoning).

**Steps:**
1.  **Load Vector Store**: Connect to the persistent ChromaDB.
2.  **Set Up Retriever**: Configure how documents are fetched.
3.  **Set Up LLM**: Connect to Ollama (Llama 3).
4.  **Create Chain**: Build the RAG retrieval chain.
5.  **Test**: Ask questions and verify citations.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_ollama import ChatOllama
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Configuration
DB_DIR = "data/chroma_db"
EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"
LLM_MODEL = "llama3"  # Must match a model name shown by `ollama list`

## 1. Load Vector Store & Retriever
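
As a rough illustration of what a `similarity` retriever with a given `k` does, the sketch below ranks a few toy vectors against a query by cosine similarity and keeps the top `k`. The vectors, the document names, and the helper `top_k` are made up for illustration. The real search runs inside Chroma, and its distance metric depends on how the collection was created.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, doc_vecs, k):
    """Return the names of the k vectors most similar to the query."""
    q = normalize(query_vec)
    # For unit-length vectors, cosine similarity is just the dot product.
    scores = {
        name: sum(a * b for a, b in zip(q, normalize(vec)))
        for name, vec in doc_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy 3-dimensional "embeddings" (hypothetical values).
toy_docs = {
    "attention": [0.9, 0.1, 0.0],
    "convolution": [0.1, 0.9, 0.0],
    "recurrence": [0.5, 0.5, 0.1],
}
result = top_k([1.0, 0.2, 0.0], toy_docs, k=2)
print(result)  # ['attention', 'recurrence']
```

Asking for more results than there are documents simply returns all of them, which is also how a small collection behaves with `k=5`.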

In [None]:
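# Use the same embedding model and settings that were used to build the index;
# otherwise query vectors and stored vectors are not comparable.
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL,
    model_kwargs={"device": "cpu"},  # Change to "cuda" if a GPU is available
    encode_kwargs={"normalize_embeddings": True},  # Produce unit-length vectors
)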

vector_store = Chroma(
    persist_directory=DB_DIR,
    embedding_function=embedding_model
)

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

print("Retriever ready.")

## 2. Set Up LLM (Ollama)

In [None]:
llm = ChatOllama(
    model=LLM_MODEL,
    temperature=0.1, # Keep temperature low for factual answers
)
# Constructing the client does not query the model; connection errors surface on the first call.
print(f"LLM configured: {LLM_MODEL}")

## 3. Build RAG Chain
We use the standard "stuff" chain, which concatenates all retrieved documents and inserts them into the prompt's `{context}` slot, so the LLM sees them in a single call.
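
Conceptually, "stuffing" amounts to joining the text of every retrieved document and substituting the result for `{context}` in the prompt. The sketch below mimics that with plain Python. `FakeDoc` and `stuff_documents` are hypothetical stand-ins, not LangChain APIs, and the blank-line separator is an assumption about the default formatting.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def stuff_documents(docs, template, separator="\n\n"):
    """Join all document texts and place them in the template's {context} slot."""
    context = separator.join(doc.page_content for doc in docs)
    return template.format(context=context)

docs = [
    FakeDoc("Self-attention relates all positions of a sequence."),
    FakeDoc("The decoder attends to the encoder output."),
]
prompt_text = stuff_documents(docs, "Context:\n{context}")
print(prompt_text)
```

Because every chunk goes into one prompt, the prompt length grows with `k` and with the chunk size, and both have to stay within the model's context window.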

In [None]:
system_prompt = (
    "You are an expert research assistant specializing in AI, Machine Learning, and Data Science. "
    "Use the retrieved context below to answer the user's question. "
    "If the answer is not in the context, say you don't know. "
    "Keep answers technical and concise.\n\n"
    "Context:\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

## 4. Run Queries
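
Several retrieved chunks often come from the same page, so a printed source list can contain repeats. Here is a minimal sketch of collapsing them, assuming each document carries `source` and `page` metadata keys. `unique_sources` is a hypothetical helper, and the documents are stand-ins built with `SimpleNamespace`.

```python
from types import SimpleNamespace

def unique_sources(docs):
    """Return (source, page) pairs in first-seen order, without duplicates."""
    seen = []
    for doc in docs:
        key = (doc.metadata.get("source"), doc.metadata.get("page"))
        if key not in seen:
            seen.append(key)
    return seen

# Stand-in documents; real ones would come from response["context"].
fake_context = [
    SimpleNamespace(metadata={"source": "paper.pdf", "page": 2}),
    SimpleNamespace(metadata={"source": "paper.pdf", "page": 2}),
    SimpleNamespace(metadata={"source": "paper.pdf", "page": 4}),
]
for source, page in unique_sources(fake_context):
    print(f"- {source} (Page {page})")
```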

In [None]:
def ask(query):
    """Run the RAG chain on a query and print the answer with its sources."""
    print(f"\nQuestion: {query}")
    response = rag_chain.invoke({"input": query})

    print("\nAnswer:")
    print(response["answer"])

    # response["context"] holds the retrieved documents that were passed to the LLM
    print("\nSources:")
    for doc in response["context"]:
        print(f"- {doc.metadata.get('source')} (Page {doc.metadata.get('page')})")

# Example question (assumes the "Attention Is All You Need" paper has been ingested)
ask("What are the benefits of self-attention mechanisms?")

In [None]:
ask("Explain the difference between encoder and decoder blocks in the Transformer.")