<a href="https://colab.research.google.com/github/acastellanos-ie/NLP-MBDS-EN/blob/main/07_rag/RAG_practice_step_by_step.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing a Step-by-Step RAG Practice with LangChain

Welcome to this interactive notebook where we will build a **Retrieval-Augmented Generation (RAG)** system!

In previous practices, we explored Extractive Question Answering and standalone Large Language Models (LLMs) like LLaMA-2 acting as chatbots. While powerful, LLMs have two crucial limitations: they are prone to **hallucinations** (inventing facts), and they lack access to private or recent data not seen during training.

**RAG** addresses both limitations by combining two components:
1.  **Retrieval Component**: Searches a custom knowledge base (like your own PDF documents, databases, or websites) for relevant information based on a user's question.
2.  **Generation Component**: A powerful LLM takes the retrieved information as "context" and uses it to formulate a precise, well-reasoned answer.

To build this efficiently, we will use **[LangChain](https://python.langchain.com/)**, a framework that provides ready-made building blocks (document loaders, text splitters, vector store wrappers, and chains) for applications powered by LLMs.

### In this notebook, we will:
- Set up the environment and install necessary libraries.
- **Step 1**: Load and chunk a custom document to create our Knowledge Base.
- **Step 2**: Create vector Embeddings and a Vector Store (FAISS) for lightning-fast retrieval.
- **Step 3**: Initialize a small generative LLM using 4-bit Quantization (so it fits in the memory of free hardware).
- **Step 4**: Assemble the retrieval chain using LangChain.
- **Step 5**: Map our fully functional RAG app to a beautiful interactive Web UI using Gradio!

Ensure that you have the **GPU runtime** activated:
(Runtime -> Change runtime type -> Hardware accelerator -> GPU (T4 is perfect))

## Setup: Installing Dependencies

Let's install all the specialized tools we need. This includes LangChain components, HuggingFace transformers, FAISS (vector DB), and Gradio (UI).

*Note: We are installing `bitsandbytes` and `accelerate` to load the LLM efficiently using quantization.*

In [14]:
# Fix dependencies and install a stable 5.x version of Gradio
!pip install -Uqqq "huggingface-hub<1.0" "transformers>=4.40.0"
!pip install -Uqqq langchain langchain-community langchain-huggingface langchain-text-splitters
!pip install -Uqqq langchain-classic sentence-transformers faiss-cpu beautifulsoup4
!pip install -Uqqq accelerate bitsandbytes
!pip install -Uqqq "gradio<6.0.0"

## Step 1: Document Loading and Chunking

To build our custom knowledge base, we need a document. For this example, let's scrape a Wikipedia article using LangChain's handy `WebBaseLoader`.

**The Rationale (Why Chunk?):** LLMs have a strict limit on how much text they can process at once, known as the **context window**. If we pass an entire Wikipedia page to the LLM, the input will exceed that limit, and the model will either fail or produce unreliable output. To solve this, we must split our long document into smaller, bite-sized pieces called **chunks**.

For this practice, we use `RecursiveCharacterTextSplitter`. Notice the `chunk_overlap` parameter? We use an overlap so that when a sentence is cut at a chunk boundary, the next chunk repeats the last characters of the previous one, which reduces the loss of context.
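To make the effect of overlap concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is a simplification for illustration only: `chunk_text` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it does not try to break on paragraph or sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so text near a boundary appears in both chunks.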

In [3]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the document (You can change this URL to any article you like!)
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
loader = WebBaseLoader(url)
data = loader.load()

print(f"Loaded {len(data)} document(s).")
print(f"Original character count: {len(data[0].page_content)}")

# 2. Split the document into chunks
# We use RecursiveCharacterTextSplitter which tries to keep paragraphs and sentences together.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # Maximum size of each chunk, in characters
    chunk_overlap=150, # Characters shared between consecutive chunks to preserve context at boundaries
    add_start_index=True
)
docs = text_splitter.split_documents(data)

print(f"\nSplit into {len(docs)} chunks.")
print(f"Example Chunk:\n{docs[10].page_content[:300]}...")



Loaded 1 document(s).
Original character count: 209343

Split into 298 chunks.
Example Chunk:
Various subfields of AI research are centered around particular goals and the use of particular tools. The traditional goals of AI research include learning, reasoning, knowledge representation, planning, natural language processing, perception, and support for robotics.[a] To reach these goals, AI ...
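
To put the context-window limit in numbers, here is a rough back-of-the-envelope estimate. It rests on two assumptions that are not measured in this notebook: about 4 characters per token (a common rule of thumb for English text, not an exact tokenizer count) and a 2,048-token context window for the small model used later.

```python
CHARS_PER_TOKEN = 4            # assumption: rough rule of thumb for English text
CONTEXT_WINDOW_TOKENS = 2048   # assumption: context length of the model used later


def estimate_tokens(num_chars: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Very rough token estimate from a character count."""
    return num_chars // chars_per_token


article_tokens = estimate_tokens(209_343)  # character count printed above
chunk_tokens = estimate_tokens(1000)       # chunk_size used by the splitter

# Budget for one RAG call: 3 retrieved chunks plus up to 256 generated tokens
# (the prompt template's own instructions are not counted here)
rag_call_tokens = 3 * chunk_tokens + 256

print("Whole article:", article_tokens, "tokens (estimated)")
print("One chunk:", chunk_tokens, "tokens (estimated)")
print("One RAG call:", rag_call_tokens, "tokens (estimated)")
print("Article fits in context window:", article_tokens <= CONTEXT_WINDOW_TOKENS)
```

Under these assumptions the full article is far larger than the context window, while a handful of chunks plus the generated answer fits with room to spare.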


## Step 2: Embeddings and Vector Store (The Retriever)

Now we have our text chunks. How do we quickly search through them when a user asks a question?

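**The Rationale (What are Embeddings?):** We use an **Embedding Model** to convert text into fixed-length lists of numbers (vectors). Think of it as plotting sentences as points on a 3D graph: texts with similar meanings (e.g., 'dog' and 'puppy') end up close together. Real embeddings work the same way but with many more dimensions (the model used below produces 384-dimensional vectors).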

By storing these vectors in a **Vector Store** (like FAISS), our system can mathematically compare a user's question to thousands of chunks and instantly return the most relevant ones. No need to keyword-match!
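
As an illustration of "mathematically comparing" vectors, the sketch below ranks a few hand-made 3-dimensional vectors by cosine similarity to a query vector. The vectors are invented for the example (they do not come from a real embedding model), and cosine similarity is only one common measure; the vector store in this notebook may use a different metric, such as Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings": similar meanings point in similar directions
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "spreadsheet": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # pretend this is the embedded user question


def top_k(query, vectors, k):
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


print(top_k(query_vector, toy_vectors, k=2))
```

The retriever in the next cell plays the role of `top_k`, with `k` set through `search_kwargs`.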

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Select the embedding model
# all-MiniLM-L6-v2 is an excellent, compact embedding model built by SentenceTransformers
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# 2. Create the FAISS Vector Index
# This processes all 'docs' through the embedding model and builds the searchable database
print("Generating embeddings and indexing into FAISS. This may take a minute...")
vectorstore = FAISS.from_documents(docs, embeddings)

# Create the Retrieval interface
retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) # Retrieve the top 3 most relevant chunks
print("Indexing Complete!")




Generating embeddings and indexing into FAISS. This may take a minute...
Indexing Complete!


In [5]:
# Let's test the retriever standalone!
test_query = "What is machine learning?"
relevant_docs = retriever.invoke(test_query)
print(f"\nRetrieved {len(relevant_docs)} docs for the query '{test_query}'.")

for doc in relevant_docs:
  print("\n---")
  print("Content: ", doc.page_content)



Retrieved 3 docs for the query 'What is machine learning?'.

---
Content:  Learning
Machine learning is the study of programs that can improve their performance on a given task automatically.[42] It has been a part of AI from the beginning.[e]

---
Content:  Glossary
Glossary
vte
Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.[1]

---
Content:  Evaluating approaches to AI
No established unifying theory or paradigm has guided AI research for most of its history.[aa] The unprecedented success of statistical machine learning in the 2010s eclipsed all other approaches (so much so 

## Step 3: Generator Setup (The LLM)

**The Rationale (The Brain of RAG):** The Generator is the 'brain' of our system. It reads the retrieved chunks (context) and writes a fluid, human-like answer.

We are using an open, lightweight model called **`TinyLlama/TinyLlama-1.1B-Chat-v1.0`**. We chose this because loading larger models such as LLaMA-2 7B on a free Colab GPU may exhaust the available memory and crash the session.

**What is Quantization?** To make the model use less memory, we apply a technique called **4-bit Quantization** via the `bitsandbytes` library. It stores the model's 'weights' (its learned parameters) with 4 bits each instead of the usual 16 or 32, so the model fits comfortably on a modest GPU, usually at the cost of only a small drop in answer quality.
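
The sketch below gives a feel for the savings and for what "compressing weights" means. The memory figures count weights only and ignore activations and any layers kept in higher precision. The toy `quantize_absmax` function is a deliberately simple scheme chosen for illustration; it is not the algorithm `bitsandbytes` uses, which relies on more refined 4-bit formats.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 1.1e9  # approximate parameter count of a 1.1B model

print(f"16-bit weights: {weight_memory_gb(NUM_PARAMS, 16):.2f} GB")
print(f" 4-bit weights: {weight_memory_gb(NUM_PARAMS, 4):.2f} GB")


def quantize_absmax(weights, bits=4):
    """Toy scheme: map floats to small signed integers using one shared scale."""
    levels = 2 ** (bits - 1) - 1          # 7 for signed 4-bit integers
    scale = max(abs(w) for w in weights) / levels
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


toy_weights = [0.70, -0.33, 0.10, -0.02]
codes, scale = quantize_absmax(toy_weights)
restored = dequantize(codes, scale)

print("Integer codes:", codes)
print("Restored weights:", [round(w, 2) for w in restored])
```

The restored values are close to, but not identical to, the originals: that small rounding error is the price paid for the memory reduction.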

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Configuration for 4-bit Quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print(f"Loading model tokenizer and weights ({model_id})...")
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto" # Automatically maps to GPU if available
)

# Build the HuggingFace Generation Pipeline
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,     # Sampling must be enabled for temperature to take effect
    temperature=0.1,    # We keep temperature low in RAG so answers stay close to the retrieved context
    max_new_tokens=256, # Max length of the answer it generates
    repetition_penalty=1.1,
    return_full_text=False # We only want the generated answer, not the prompt echoed back
)

# Wrap the pipeline so LangChain can converse with it
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
print("LLM loaded and pipeline wrapped!")

Loading model tokenizer and weights (TinyLlama/TinyLlama-1.1B-Chat-v1.0)...



Passing `generation_config` together with generation-related arguments=({'temperature', 'max_new_tokens', 'repetition_penalty'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.


LLM loaded and pipeline wrapped!


## Step 4: Putting It All Together (The RAG Chain)

We have our Retriever (FAISS) and our Generator (TinyLlama). Now we use LangChain to wire them together.

We'll define a **Prompt Template** that instructs the LLM:
"Here is some context. Use it to answer the question. If you don't know the answer, just say you don't know."

In [7]:
from langchain_classic.prompts import PromptTemplate
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains import create_retrieval_chain

# 1. Define the Prompt exactly as the LLM expects it
prompt_template = """<|system|>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.

Context: {context}</s>
<|user|>
{input}</s>
<|assistant|>
"""
prompt = PromptTemplate.from_template(prompt_template)

# 2. Build the Document Chain
document_chain = create_stuff_documents_chain(llm, prompt)

# 3. Build the actual RAG Retrieval Chain
rag_chain = create_retrieval_chain(retriever, document_chain)

print("RAG Pipeline is ready using legacy chain builders!")

RAG Pipeline is ready using legacy chain builders!
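
Conceptually, the "stuff" step concatenates the retrieved chunks and substitutes them into the `{context}` placeholder, while the user's question fills `{input}`. The sketch below imitates that with plain string formatting. The shortened template, the `stuff_prompt` helper, and the blank-line separator are assumptions made for illustration, not LangChain's actual implementation.

```python
# Simplified stand-in for the prompt template defined above
TOY_TEMPLATE = (
    "Use the following context to answer the question.\n"
    "If you don't know the answer, just say that you don't know.\n\n"
    "Context: {context}\n\n"
    "Question: {input}\n"
    "Answer:"
)


def stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return TOY_TEMPLATE.format(context=context, input=question)


# Hypothetical retrieved chunks (in the real chain these come from the retriever)
retrieved_chunks = [
    "Machine learning is the study of programs that improve with experience.",
    "It has been a part of AI from the beginning.",
]

filled_prompt = stuff_prompt(retrieved_chunks, "What is machine learning?")
print(filled_prompt)
```

The LLM only ever sees the final filled-in string, which is why the number and size of retrieved chunks must fit within its context window.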


Let's test the RAG Chain programmatically to see if everything works:

In [8]:
user_question = "Who formulated the concept of weak AI and strong AI?"

# The chain expects a dictionary with the question stored under the 'input' key
result = rag_chain.invoke({"input": user_question})

print("QUESTION:", user_question)
print("\n--- LLM ANSWER ---")
print(result["answer"])

print("\n--- CITED SOURCES (Context) ---")
# Because we used create_retrieval_chain, LangChain automatically attaches the original documents
# used to formulate the answer under the 'context' key!
for i, doc in enumerate(result['context'], 1):
    content_preview = doc.page_content.replace("\n", " ")[:150]
    print(f"Source {i} snippet: {content_preview}...")

QUESTION: Who formulated the concept of weak AI and strong AI?

--- LLM ANSWER ---
The concept of weak AI and strong AI was first proposed by the philosopher John Searle in his book "Directions in the Philosophy of Mind" published in 1999. Searle argued that while it is possible for a machine to simulate human thought, it cannot truly understand or have other cognitive states like humans. He also suggested that there may be a threshold beyond which a machine would not be considered intelligent. This threshold is referred to as the "strong AI" concept, and it has since been redefined by researchers in the field of artificial intelligence.

--- CITED SOURCES (Context) ---
Source 1 snippet: ^  Searle presented this definition of "Strong AI" in 1999.[417] Searle's original formulation was "The appropriately programmed computer really is a ...
Source 2 snippet: History of AI  Crevier, Daniel (1993). AI: The Tumultuous Search for Artificial Intelligence. New York, NY: BasicBooks. ISBN 0-465

## Step 5: Interactive Chat UI with Gradio

Testing with Python output is great for developers, but applications are built for end-users. We'll wrap our LangChain logic in a `Gradio` Web UI.

We define a helper function (`chat_with_rag`) that Gradio will trigger every time the user clicks submit.
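
Before wiring anything to the UI, it can help to see what the chat history looks like after one turn. The sketch below dry-runs a submit callback without Gradio: `fake_rag_answer` is a hypothetical stand-in for the real chain call, and `respond_sketch` is an illustrative variant of the callback that returns a new list instead of modifying the history in place.

```python
def fake_rag_answer(message: str) -> str:
    # Hypothetical stand-in for the real RAG chain call
    return f"(answer to: {message})"


def respond_sketch(message, chat_history):
    """Return an empty string to clear the textbox, plus the updated history."""
    updated = chat_history + [
        {"role": "user", "content": message},
        {"role": "assistant", "content": fake_rag_answer(message)},
    ]
    return "", updated


cleared_box, history_after_turn = respond_sketch("What is the Turing test?", [])
for turn in history_after_turn:
    print(turn["role"], "->", turn["content"])
```

Each turn adds two dictionaries to the history: one for the user message and one for the assistant reply.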

In [31]:
import gradio as gr
import warnings
import os

# Suppress deprecation and user warnings
warnings.filterwarnings("ignore")
os.environ["USER_AGENT"] = "ColabRAGAssistant/1.0"

# 1. Real RAG Function
def chat_with_rag(message, history):
    # history is passed as a list of dicts in modern Gradio
    response = rag_chain.invoke({"input": message})
    return response["answer"].strip()

# 2. Modern ChatGPT-style Theme
custom_theme = gr.themes.Default(
    primary_hue="zinc",
    neutral_hue="zinc",
    font=[gr.themes.GoogleFont("Inter"), "sans-serif"],
    text_size="md",
    radius_size="lg",
).set(
    body_background_fill="#ffffff",
    block_background_fill="transparent",
    block_border_width="0px",
    input_background_fill="#f4f4f5",
)

# 3. Updated CSS for a Premium UI experience
css = """
.gradio-container { max-width: 800px !important; margin: 0 auto !important; }
#chatbot { border: none !important; background: transparent !important; }

/* Fix user message text visibility: dark text on light gray background */
.message.user {
    background-color: #f4f4f5 !important;
    border: none !important;
}
.message.user p, .message.user span, .message.user .message-row {
    color: #0f0f0f !important;
}

.message.bot { background-color: transparent !important; border: none !important; }
.form { border: 1px solid #e5e7eb !important; border-radius: 16px !important; box-shadow: 0 4px 15px rgba(0,0,0,0.05) !important; }

/* Remove scroll from the input textbox */
.input-row textarea {
    overflow: hidden !important;
    resize: none !important;
}

/* Adjusting the submit button size and alignment */
.input-row button {
    max-width: 50px !important;
    min-width: 50px !important;
    height: 45px !important;
    padding: 0 !important;
    display: flex !important;
    align-items: center !important;
    justify-content: center !important;
}
"""

# 4. Interface Setup
with gr.Blocks(theme=custom_theme, css=css) as demo:
    gr.HTML("<div style='text-align: center; margin-bottom: 20px;'>"
            "<h1 style='font-weight: 800; font-size: 2.8em; color: #3b82f6; letter-spacing: -0.025em; margin-bottom: 0px;'>🧠 RAG Assistant</h1>"
            "</div>")

    with gr.Column():
        # Input area at the TOP
        with gr.Row(elem_classes="input-row"):
            msg = gr.Textbox(
                show_label=False,
                placeholder="Ask me anything...",
                container=False,
                scale=10
            )
            submit_btn = gr.Button("➤", variant="primary", scale=1)

        # Examples right below the input
        gr.Examples(
            examples=["What is the Turing test?", "Who is John Searle?", "How does AI work?"],
            inputs=msg
        )

        # Conversation below
        chatbot = gr.Chatbot(elem_id="chatbot", show_label=False, type="messages")

    def respond(message, chat_history):
        bot_message = chat_with_rag(message, chat_history)
        chat_history.append({"role": "user", "content": message})
        chat_history.append({"role": "assistant", "content": bot_message})
        return "", chat_history

    submit_btn.click(respond, [msg, chatbot], [msg, chatbot])
    msg.submit(respond, [msg, chatbot], [msg, chatbot])

if __name__ == "__main__":
    demo.launch(share=True, show_error=False)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://eb35d36e4b486b1e3d.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


### Congratulations!

You've successfully built a working end-to-end RAG pipeline from widely used open-source tools:
- **LangChain** for chaining logical blocks.
- **FAISS** alongside Dense Embeddings for high-speed retrieval.
- **4-Bit Quantized Models** (`TinyLlama`) executing LLM logic locally and quickly.
- **Gradio** for serving a beautiful front-end.

**Challenge**: Return to **Step 1**, pick a different URL (like a Wikipedia article on Quantum Computing or the history of ancient Rome), restart the runtime, and run all the cells again to change your chatbot's Knowledge Base!