<a href="https://colab.research.google.com/github/acastellanos-ie/NLP-MBDS-EN/blob/main/07_rag/RAG_practice_step_by_step.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing a Step-by-Step RAG Practice with LangChain

Welcome to this interactive notebook where we will build a **Retrieval-Augmented Generation (RAG)** system!

In previous practices, we explored Extractive Question Answering and standalone Large Language Models (LLMs) like LLaMA-2 acting as chatbots. While powerful, LLMs have a crucial limitation: they are prone to **hallucinations** (inventing facts) and lack access to private or recent data not seen during their training.

**RAG** solves this issue by combining two components:
1.  **Retrieval Component**: Searches a custom knowledge base (like your own PDF documents, databases, or websites) for relevant information based on a user's question.
2.  **Generation Component**: A powerful LLM takes the retrieved information as "context" and uses it to formulate a precise, well-reasoned answer.
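
To make the two roles concrete, here is a minimal, library-free sketch of retrieve-then-generate. The keyword-overlap scoring and the names `toy_retrieve` and `toy_build_prompt` are illustrative assumptions, not part of LangChain; the notebook itself uses dense embeddings for retrieval and a real LLM for generation.

```python
# Toy illustration of retrieve-then-generate; all names here are hypothetical.
knowledge_base = [
    "FAISS is a library for efficient similarity search over vectors.",
    "Gradio builds simple web interfaces for machine learning demos.",
    "TinyLlama is a compact open language model with 1.1B parameters.",
]

def toy_retrieve(question, documents, k=1):
    """Rank documents by how many lowercase words they share with the question."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def toy_build_prompt(question, context_docs):
    """Place the retrieved text in front of the question, as a RAG prompt does."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "What is FAISS used for?"
top_docs = toy_retrieve(question, knowledge_base)
rag_prompt = toy_build_prompt(question, top_docs)
print(rag_prompt)
```

The string in `rag_prompt` is what a generator would receive: the question plus only the most relevant piece of the knowledge base.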

To build this efficiently, we will use **[LangChain](https://python.langchain.com/)**, a popular framework for building LLM-powered applications out of composable components.

### In this notebook, we will:
- Set up the environment and install necessary libraries.
- **Step 1**: Load and chunk a custom document to create our Knowledge Base.
- **Step 2**: Create vector Embeddings and a Vector Store (FAISS) for lightning-fast retrieval.
- **Step 3**: Initialize an efficient generative LLM using 4-bit Quantization (to fit comfortably on free hardware).
- **Step 4**: Assemble the RAG chain using the LangChain Expression Language (LCEL).
- **Step 5**: Map our fully functional RAG app to a beautiful interactive Web UI using Gradio!

Ensure that you have the **GPU runtime** activated:
(Runtime -> Change runtime type -> Hardware accelerator -> GPU (T4 is perfect))
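
If you want to confirm the runtime from code, a small check along these lines may help. The helper name `describe_accelerator` is hypothetical; it only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and it falls back gracefully when PyTorch is not importable.

```python
import importlib.util

def describe_accelerator():
    """Return a short message about GPU availability (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment."
    import torch
    if torch.cuda.is_available():
        return f"GPU detected: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; switch the runtime type to GPU."

print(describe_accelerator())
```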

## Setup: Installing Dependencies

Let's install all the specialized tools we need. This includes LangChain components, HuggingFace transformers, FAISS (vector DB), and Gradio (UI).

*Note: We are installing `bitsandbytes` and `accelerate` to load the LLM efficiently using quantization.*

In [None]:
!pip install -Uqqq langchain langchain-community langchain-huggingface langchain-text-splitters
!pip install -Uqqq sentence-transformers faiss-cpu beautifulsoup4
!pip install -Uqqq transformers accelerate bitsandbytes
!pip install -Uqqq gradio

## Step 1: Document Loading and Chunking

To build our custom knowledge base, we need a document. For this example, let's scrape a Wikipedia article using LangChain's handy `WebBaseLoader`.

However, LLMs have a **context window limit**: they can only process a fixed number of tokens at a time (e.g., a couple of thousand tokens for small models). To work within it, we must split our long document into smaller, manageable pieces called **chunks**.
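
The effect of a chunk size and an overlap can be seen in a simplified sketch. The helper `chunk_text` below is hypothetical and uses plain fixed-size character windows; `RecursiveCharacterTextSplitter`, used in the next cell, is smarter because it prefers to split at paragraph and sentence boundaries.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so text near a boundary is always seen together with some of its surroundings.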

In [None]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the document (You can change this URL to any article you like!)
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
loader = WebBaseLoader(url)
data = loader.load()

print(f"Loaded {len(data)} document(s).")
print(f"Original character count: {len(data[0].page_content)}")

# 2. Split the document into chunks
# We use RecursiveCharacterTextSplitter which tries to keep paragraphs and sentences together.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # Maximum chunk size, in characters
    chunk_overlap=150, # Characters shared by consecutive chunks, so context at a boundary is not lost
    add_start_index=True
)
docs = text_splitter.split_documents(data)

print(f"\nSplit into {len(docs)} chunks.")
print(f"Example Chunk:\n{docs[10].page_content[:300]}...")

## Step 2: Embeddings and Vector Store (The Retriever)

Now we have our text chunks. How do we quickly search through them when a user asks a question?

We use an **Embedding Model** to convert each text into a fixed-size numeric vector. Texts with similar meanings end up as vectors pointing in similar directions. We store these vectors in a **Vector Store** (like FAISS) so we can run fast "similarity searches".
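
As a rough illustration of "similar direction", the sketch below ranks a few made-up 3-dimensional vectors by cosine similarity. The vectors and the `cosine_similarity` helper are assumptions for demonstration only; a real embedding model such as `all-MiniLM-L6-v2` produces 384-dimensional vectors, and a vector store may use a different distance measure internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings"; the numbers are not produced by any model.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "spreadsheet": [0.0, 0.1, 0.9],
}

query_vector = toy_vectors["cat"]
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(query_vector, toy_vectors[word]),
    reverse=True,
)
print(ranked)
```

A similarity search does the same thing at scale: embed the query, then return the stored items whose vectors score highest against it.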

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Select the embedding model
# all-MiniLM-L6-v2 is an excellent, compact embedding model built by SentenceTransformers
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# 2. Create the FAISS Vector Index
# This processes all 'docs' through the embedding model and builds the searchable database
print("Generating embeddings and indexing into FAISS. This may take a minute...")
vectorstore = FAISS.from_documents(docs, embeddings)

# Create the Retrieval interface
retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) # Retrieve the top 3 most relevant chunks
print("Indexing Complete!")

In [None]:
# Let's test the retriever standalone!
test_query = "What is machine learning?"
relevant_docs = retriever.invoke(test_query)
print(f"\nRetrieved {len(relevant_docs)} docs for the query '{test_query}'.")

for doc in relevant_docs:
    print("\n---")
    print("Content: ", doc.page_content)


## Step 3: Generator Setup (The LLM)

This is the brain that will formulate the final answer.
Instead of a gated model that requires you to accept a usage policy first, we will use an openly available one: **`TinyLlama/TinyLlama-1.1B-Chat-v1.0`**.
Despite its 'Tiny' name, it's very competent for instruction-following.

To keep its memory footprint small in Colab, we load it in **4-bit precision** using the `bitsandbytes` library.
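
A back-of-the-envelope calculation shows why lower precision helps. This sketch counts the weights only; the helper `weight_memory_gb` is hypothetical, and it ignores activations, the generation cache, and the small overhead of quantization constants, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9

num_parameters = 1.1e9  # TinyLlama-1.1B, per its model name

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(num_parameters, bits):.2f} GB")
```

Under these assumptions the weights shrink from roughly 4.40 GB in float32 to about 0.55 GB in 4-bit.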

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Configuration for 4-bit Quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print(f"Loading model tokenizer and weights ({model_id})...")
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto" # Automatically maps to GPU if available
)

# Build the HuggingFace Generation Pipeline
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,     # Sampling must be enabled for temperature to take effect
    temperature=0.1,    # A low temperature keeps answers focused and close to the context
    max_new_tokens=256, # Max length of the answer it generates
    repetition_penalty=1.1,
    return_full_text=False # We only want the generated answer, not the prompt echoed back
)

# Wrap the pipeline so LangChain can converse with it
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
print("LLM loaded and pipeline wrapped!")
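
To see what the `temperature` setting in the pipeline does, the sketch below applies a temperature-scaled softmax to three made-up token scores. The helper `softmax_with_temperature` and the scores are illustrative assumptions, not the internals of the `transformers` pipeline.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(f"temperature={temperature}: {[round(p, 3) for p in probs]}")
```

At a temperature of 0.1 almost all probability mass lands on the top-scoring token, which is why low values make the output nearly deterministic.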

## Step 4: Putting It All Together (The RAG Chain)

We have our Retriever (FAISS) and our Generator (TinyLlama). Now we use LangChain to wire them together.

We'll define a **Prompt Template** that instructs the LLM:
"Here is some context. Use it to answer the question. If you don't know the answer, just say you don't know."

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# 1. Define the prompt as a plain string template
# Note: The <|system|>, <|user|>, <|assistant|> markers are specific to TinyLlama-Chat.
# We use PromptTemplate (not ChatPromptTemplate) because HuggingFacePipeline is a
# text-completion LLM: a chat prompt would be flattened to text with a "Human: " prefix.
prompt_template = """<|system|>
You are an intelligent assistant. Use the following contextual information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
{context}</s>
<|user|>
{input}</s>
<|assistant|>
"""

prompt = PromptTemplate.from_template(prompt_template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# 2. Build the RAG chain: retriever -> prompt -> LLM -> plain-string output
rag_chain = (
    {"context": retriever | format_docs, "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("RAG Pipeline is ready!")
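
Conceptually, the prompt step in the chain above just substitutes the retrieved text and the question into the template's placeholders. A simplified sketch using plain `str.format`, with a shortened stand-in template and placeholder chunk text (both assumptions for illustration):

```python
# Shortened stand-in for the notebook's template; same {context} and {input} placeholders.
toy_template = "Context:\n{context}\n\nQuestion: {input}\nAnswer:"

# Placeholder text standing in for retrieved chunks.
retrieved_chunks = ["Chunk about topic A.", "Chunk about topic B."]

filled_prompt = toy_template.format(
    context="\n\n".join(retrieved_chunks),  # same joining rule as format_docs
    input="What is topic A?",
)
print(filled_prompt)
```

The LLM never sees the vector store directly; it only sees this final string.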

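The chain above returns only the answer string. To also see which chunks the answer was based on, let's build a variant that returns the retrieved documents alongside the answer, and then test it programmatically: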

In [None]:
from operator import itemgetter
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# We assume that prompt, llm, retriever and format_docs already exist from the previous step.

# Step A: Define the answer-generation branch
# This sub-chain takes the dictionary with the raw documents and the question,
# formats the docs into a string, and generates the answer.
answer_chain = (
    RunnablePassthrough.assign(
        context=lambda x: format_docs(x["context"])  # Format docs to a string ONLY for the LLM
    )
    | prompt
    | llm
    | StrOutputParser()
)

# Step B: Build the main chain that returns everything
# 1. 'context': retrieves documents (keeps them as Document objects)
# 2. 'input': passes the question through
# 3. .assign(answer=...): adds the 'answer' key computed by answer_chain
rag_chain_with_sources = (
    RunnableParallel(
        {"context": itemgetter("input") | retriever, "input": itemgetter("input")}
    )
    .assign(answer=answer_chain)
)

In [None]:
user_question = "Who formulated the concept of weak AI and strong AI?"

# We invoke the chain with a dictionary, as configured with itemgetter("input")
result = rag_chain_with_sources.invoke({"input": user_question})

print("QUESTION:", user_question)
print("\n--- LLM ANSWER ---")
# In plain LCEL the answer is usually the output itself if you don't use .assign,
# but with our new structure, 'answer' is a key of the output dictionary.
print(result["answer"])

print("\n--- CITED SOURCES (Context) ---")
# result['context'] now holds the list of original Document objects
for i, doc in enumerate(result['context'], 1):
    # .page_content is LangChain's standard attribute for the document text
    content_preview = doc.page_content.replace("\n", " ")[:150]
    print(f"Source {i} snippet: {content_preview}...")

## Step 5: Interactive Chat UI with Gradio

Testing with Python output is great for developers, but applications are built for end-users. We'll wrap our LangChain logic in a `Gradio` Web UI.

We define a helper function (`chat_with_rag`) that Gradio will trigger every time the user clicks submit.
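
The contract of such a helper is small: it receives the new message plus the chat history and returns a string. The sketch below exercises that shape without Gradio or a model; `EchoChain` and `make_chat_fn` are hypothetical names, and `EchoChain` merely imitates the `.invoke()` output format of the RAG chain.

```python
class EchoChain:
    """Hypothetical stand-in exposing the same .invoke() shape as the RAG chain."""
    def invoke(self, inputs):
        return {"input": inputs["input"], "answer": f"  You asked: {inputs['input']}  "}

def make_chat_fn(chain):
    """Build a (message, history) -> str callback around any chain-like object."""
    def chat_fn(message, history):
        # history is accepted because the chat UI passes it, but this sketch ignores it
        return chain.invoke({"input": message})["answer"].strip()
    return chat_fn

chat_fn = make_chat_fn(EchoChain())
print(chat_fn("What is the Turing test?", history=[]))
```

Testing the callback against a stub like this makes it easy to debug the UI wiring separately from the model.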

In [None]:
import gradio as gr

def chat_with_rag(message, history):
    # Run the RAG chain (the variant that also returns its sources) on the user's message
    response = rag_chain_with_sources.invoke({"input": message})

    # We fetch the answer string from the output dictionary
    answer = response["answer"]
    return answer.strip()

# Create the Gradio interface
demo = gr.ChatInterface(
    fn=chat_with_rag,
    title="My First RAG App 🚀",
    description="Ask me anything about Artificial Intelligence! My knowledge is powered by our FAISS index and TinyLlama.",
    examples=["What is the Turing test?", "Who are the pioneers of AI?", "Explain deep learning briefly."],
)

# Launch the Web UI
demo.launch(debug=True, share=True) # share=True gives us a nice public link!

### Congratulations!

You've successfully built a complete RAG pipeline from the following building blocks:
- **LangChain** for chaining logical blocks.
- **FAISS** alongside Dense Embeddings for high-speed retrieval.
- A **4-bit quantized model** (`TinyLlama`) for running generation locally with a small memory footprint.
- **Gradio** for serving a beautiful front-end.

**Challenge**: Return to **Step 1**, pick a different URL (like a Wikipedia article on Quantum Computing or the history of ancient Rome), restart the runtime, and run all the cells again to change your chatbot's Knowledge Base!