# 🧠 Week 5-6, Notebook 7: Foundations of Retrieval-Augmented Generation (RAG)

**Module:** LLMs, Prompt Engineering & RAG  
**Project:** Build the Knowledge Core for the Manufacturing Copilot

---

Large Language Models are incredibly powerful, but they have two fundamental limitations:
1.  **Knowledge Cut-off:** Their knowledge is frozen at the point in time when they were trained. They are not aware of recent events or information.
2.  **No Access to Private Data:** They have no knowledge of your company's internal, domain-specific data, such as technical manuals, maintenance logs, or safety procedures.

**Retrieval-Augmented Generation (RAG)** is a technique designed to address both limitations. RAG connects a pre-trained LLM to an external, up-to-date knowledge base, so that it answers a question by first **retrieving** relevant documents and then **generating** an answer that is instructed to rely *only* on the information found in those documents.

This is the core technology that will power our Manufacturing Copilot's ability to act as a knowledgeable expert, answering questions about our factory's specific procedures and equipment.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

1.  **Explain the RAG Architecture:** Clearly describe the two main stages of a RAG pipeline (Offline Indexing and Online Retrieval/Generation) and the key components within each.
2.  **Compare RAG vs. Fine-Tuning:** Articulate the fundamental trade-offs between RAG and fine-tuning, and identify which approach is better suited for different manufacturing scenarios.
3.  **Build a Simple RAG System:** Implement a basic, end-to-end RAG pipeline from scratch using Hugging Face models for retrieval and generation.
4.  **Identify RAG's Risks and Mitigations:** Recognize the most common failure modes of RAG systems (e.g., poor retrieval, hallucination) and describe practical strategies to address them.

## 🏗️ The RAG Architecture: A High-Level View

A RAG system can be conceptually divided into two main stages, much like a library and a librarian.

### Stage 1: Offline Indexing (Building the Library)

This is the preparatory stage where you build your knowledge base. It is typically done once and then updated periodically as your source documents change.

1.  **Load & Chunk:** Your source documents (e.g., PDFs of maintenance manuals, text files of incident reports, SharePoint pages) are loaded and split into smaller, more manageable chunks. A common approach is to split them into paragraphs or sections.
2.  **Embed & Store:** Each chunk of text is passed through an **embedding model** (like the sentence transformers we used before) to convert it into a numerical vector. These embeddings, along with the original text, are then stored in a specialized database called a **Vector Store**. This vector store is optimized for fast similarity searches.

<br>
<p align="center">
  <img src="https://i.imgur.com/yV1n4k2.png" width="700">
</p>
<br>
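
To make the **Load & Chunk** step concrete, here is a minimal sketch of paragraph-based chunking with overlap. The function name `chunk_paragraphs` and the parameters `max_chars` and `overlap` are illustrative assumptions, not part of any library, and real pipelines often count tokens rather than characters.

```python
def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of a chunk are repeated at the start of the
    next chunk, unless that would push the new chunk over the budget.
    A single paragraph longer than max_chars is kept whole.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        if current and len("\n\n".join(current + [para])) > max_chars:
            # The current chunk is full: store it and start a new one.
            chunks.append("\n\n".join(current))
            carry = current[-overlap:] if overlap > 0 else []
            fits = len("\n\n".join(carry + [para])) <= max_chars
            current = carry if fits else []
        current = current + [para]
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical three-paragraph document, used only for illustration.
sample_text = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
for i, chunk in enumerate(chunk_paragraphs(sample_text, max_chars=90, overlap=1)):
    print(i, repr(chunk))
```

The overlap means a sentence that depends on the previous paragraph still has that paragraph nearby when the chunk is retrieved on its own.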

### Stage 2: Online Retrieval & Generation (Consulting the Librarian)

This stage happens in real-time whenever a user asks a question.

1.  **Retrieve:** The user's question is converted into an embedding using the *same* embedding model used for indexing. The system then queries the Vector Store to find the document chunks whose embeddings are most semantically similar (i.e., closest in vector space) to the question's embedding.
2.  **Augment & Generate:** The text from these retrieved chunks is then inserted into a prompt along with the original question. This "augmented" prompt is sent to the LLM. The prompt essentially commands the LLM: *"Answer the user's question based only on the following context."* The LLM then generates a final answer that is grounded in the retrieved information.

<br>
<p align="center">
  <img src="https://i.imgur.com/3yS0hH5.png" width="700">
</p>
<br>
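
The two online steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the three-dimensional vectors are made up by hand (real embeddings have hundreds of dimensions), and `cosine_similarity`, `retrieve`, and `build_prompt` are hypothetical helper names rather than library functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, top_k=1):
    """Return the top_k (score, text) pairs; index is a list of (text, vector)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, chunks):
    """Insert the retrieved chunks and the question into one augmented prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the user's question based only on the following context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\nAnswer:"
    )


# Hand-made toy "embeddings", for illustration only.
toy_index = [
    ("Hydraulic press: perform LOTO, then inspect seals.", [0.9, 0.1, 0.0]),
    ("Conveyor belt: check for obstructions first.", [0.1, 0.9, 0.0]),
    ("Robotic arm: grease major joints monthly.", [0.0, 0.2, 0.9]),
]
toy_query_vec = [0.8, 0.2, 0.1]

top = retrieve(toy_query_vec, toy_index, top_k=1)
prompt = build_prompt("How do I service the hydraulic press?", [text for _, text in top])
print(prompt)
```

A vector store performs the same nearest-neighbor lookup, but with index structures that avoid comparing the query against every stored vector.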

## ⚙️ Building a Mini-RAG System from Scratch

Let's build a small-scale but complete RAG system to see these concepts in action. We will create a tiny knowledge base consisting of three Standard Operating Procedures (SOPs) for our factory and then use it to answer a question.

In [None]:
# Hands-On: Setup and Library Imports
# Ensure necessary libraries are installed. In a real project, add these to your requirements.txt.
try:
    from sentence_transformers import SentenceTransformer, util
    from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
except ImportError:
    print("Installing required libraries: sentence-transformers, transformers, torch")
    import sys
    !{sys.executable} -m pip install -q sentence-transformers transformers torch
    from sentence_transformers import SentenceTransformer, util
    from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer

import pandas as pd
import torch

# Set device for model loading
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

### Step 1: Prepare the Knowledge Base (The "Indexing" Stage)

First, we define our source documents. In this simple case, our "documents" are just strings in a Pandas DataFrame. We will then use a sentence transformer model to create an embedding for each document. This process simulates loading, chunking (though our docs are already small), and embedding.

In [None]:
# Hands-On: Indexing the Knowledge Base
# Our knowledge base: A few snippets from different Standard Operating Procedures (SOPs).
# In a real system, this would come from loading and chunking PDF or text files.
knowledge_base_df = pd.DataFrame([
    {
        "doc_id": "SOP-HYD-001",
        "text": "For hydraulic press maintenance, the first step is always to perform a full lockout-tagout (LOTO) procedure. Once the machine is de-energized, inspect all primary and secondary seals for any signs of leakage. After inspection, verify that the main pressure gauges read within the safe operating range of 500-550 PSI. Top off hydraulic fluid if the level is below the minimum indicator line."
    },
    {
        "doc_id": "SOP-CONV-003",
        "text": "To troubleshoot a conveyor belt stoppage, first check for any obvious physical obstructions on the line. If the path is clear, verify that the motor's thermal overload switch has not been tripped. If the motor is operational, the next step is to inspect the belt for proper tension and check the alignment of all guide sensors."
    },
    {
        "doc_id": "SOP-ROBO-002",
        "text": "Standard preventive maintenance for a 6-axis robotic arm requires greasing all major joints on a monthly basis. Furthermore, the torque sensors for each motor must be recalibrated quarterly to ensure precision. The alignment of the vision system camera should also be validated weekly against the calibration grid."
    },
])

# Load a sentence-transformer model to be our "retriever".
# This model is specifically trained to create embeddings that are good for semantic search.
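# SentenceTransformer expects a device string ("cuda" or "cpu"), not the integer index used by `pipeline`.
retriever_device = "cuda" if torch.cuda.is_available() else "cpu"
retriever_model = SentenceTransformer("all-MiniLM-L6-v2", device=retriever_device)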

# Create the embeddings for our knowledge base. This is the core of the indexing process.
# In a real system, these embeddings would be stored in a vector database like FAISS, Chroma, or Pinecone.
corpus_embeddings = retriever_model.encode(knowledge_base_df.text.tolist(), convert_to_tensor=True)

print("Knowledge base indexed successfully!")
print(f"Shape of our corpus embeddings: {corpus_embeddings.shape} (Documents, Embedding Dimension)")

### Step 2: Ask a Question (The "Retrieval & Generation" Stage)

Now, a user asks a question. We will walk through the full online process:
1.  Embed the user's question using the same `retriever_model`.
2.  Use semantic search to find the most relevant document from our knowledge base.
3.  Construct a detailed prompt that includes the retrieved information.
4.  Use a generative LLM to create a final answer based on the prompt.

In [None]:
# Hands-On: Retrieval
# The user's question
question = "What are the steps to fix a hydraulic press that is not working?"

# --- 1. Embed the user's question ---
query_embedding = retriever_model.encode(question, convert_to_tensor=True)

# --- 2. Find the most relevant document ---
# We use cosine similarity to find the document embedding that is closest to the query embedding.
# `util.semantic_search` handles this for us efficiently.
top_k = 1 # We'll retrieve the single best document for simplicity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]

# The result 'hits' contains a list of dictionaries with 'corpus_id' and 'score'.
retrieved_doc_index = hits[0]['corpus_id']
retrieved_doc = knowledge_base_df.iloc[retrieved_doc_index]

retrieved_context = retrieved_doc['text']
retrieved_doc_id = retrieved_doc['doc_id']
retrieval_score = hits[0]['score']

print(f"--- Retrieval Results for Query: '{question}' ---")
print(f"Most relevant document found: {retrieved_doc_id} (Similarity Score: {retrieval_score:.4f})")
print("\nRetrieved Context:\n", retrieved_context)
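
With `top_k` greater than 1, the retrieved chunks have to be merged into a single context string before prompting. The sketch below shows one possible way to do that, assuming hits shaped like the `corpus_id`/`score` dictionaries described above. The helper name `pack_context`, the character budget, the score cutoff, and the example scores are all illustrative assumptions.

```python
def pack_context(hits, documents, max_chars=1500, min_score=0.0):
    """Join the best-scoring documents into one context string.

    hits:      list of {'corpus_id': int, 'score': float} for one query
    documents: list of {'doc_id': str, 'text': str}, in corpus order
    Documents below min_score are dropped; documents that would push the
    context past max_chars are skipped.
    """
    blocks = []
    used = 0
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if hit["score"] < min_score:
            break  # hits are sorted, so everything after this is weaker
        doc = documents[hit["corpus_id"]]
        block = f"[{doc['doc_id']}] {doc['text']}"
        separator = 2 if blocks else 0  # length of the "\n\n" joiner
        if used + separator + len(block) > max_chars:
            continue
        blocks.append(block)
        used += separator + len(block)
    return "\n\n".join(blocks)


# Shortened example documents and made-up scores, for illustration only.
example_docs = [
    {"doc_id": "SOP-HYD-001", "text": "Perform LOTO before hydraulic press maintenance."},
    {"doc_id": "SOP-CONV-003", "text": "Check the conveyor for obstructions."},
    {"doc_id": "SOP-ROBO-002", "text": "Grease robotic arm joints monthly."},
]
example_hits = [
    {"corpus_id": 1, "score": 0.31},
    {"corpus_id": 0, "score": 0.74},
    {"corpus_id": 2, "score": 0.05},
]
print(pack_context(example_hits, example_docs, max_chars=200, min_score=0.2))
```

Prefixing each block with its document ID keeps the source visible to the generator, which helps when the prompt asks it to cite where an answer came from.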

In [None]:
# Hands-On: Augment and Generate
# --- 3. Construct the Augmented Prompt ---
# This is a critical step. The prompt must clearly instruct the LLM to use ONLY the provided context.
rag_prompt = f"""
You are a helpful and precise manufacturing assistant. Your task is to answer the user's question based *only* on the provided context.
Do not use any other information. If the context does not contain the answer, say "I'm sorry, the provided document does not contain the answer to this question."
Cite the document ID in your answer.

Context from document {retrieved_doc_id}:
"{retrieved_context}"

Question: {question}

Answer:
"""

# --- 4. Generate the Final Answer ---
# We load a generative model to produce the final answer.
# Flan-T5 is instruction-tuned, which makes it a reasonable lightweight choice for a RAG generator.
# The "base" size is small, so expect short and sometimes imprecise answers.
generator_model_name = "google/flan-t5-base"
try:
    generator_tokenizer = AutoTokenizer.from_pretrained(generator_model_name)
    generator_model = AutoModelForSeq2SeqLM.from_pretrained(generator_model_name)

    generator = pipeline(
        'text2text-generation',
        model=generator_model,
        tokenizer=generator_tokenizer,
        device=device,
        max_length=200
    )
    print("--- Generating Final Answer ---")
    final_answer = generator(rag_prompt)
    print("\n--- Final Answer from RAG System ---")
    print(final_answer[0]['generated_text'])

except Exception as e:
    print(f"Could not load or run the generator model. Error: {e}")
    print("\n--- Final Answer from RAG System ---")
    print("Error: Generator model not available.")
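
The retrieval and generation steps above can be wrapped into one function. The sketch below is one way to do it, not a definitive design: `answer_question`, `build_rag_prompt`, and the `min_score` threshold of 0.3 are assumptions made for illustration, and the two demo functions are stand-ins so the example runs without downloading any model.

```python
REFUSAL = "I'm sorry, the provided document does not contain the answer to this question."


def build_rag_prompt(question, doc_id, context):
    """Assemble an augmented prompt in the same spirit as `rag_prompt` above."""
    return (
        "Answer the user's question based only on the provided context. "
        f'If the context does not contain the answer, say "{REFUSAL}" '
        "Cite the document ID in your answer.\n\n"
        f'Context from document {doc_id}:\n"{context}"\n\n'
        f"Question: {question}\n\nAnswer:"
    )


def answer_question(question, retrieve_fn, generate_fn, min_score=0.3):
    """Run retrieve -> augment -> generate.

    retrieve_fn(question) -> (doc_id, text, score)
    generate_fn(prompt)   -> generated text
    If the best retrieval score is below min_score (an arbitrary value here,
    to be tuned on your own data), skip generation and return the refusal.
    """
    doc_id, text, score = retrieve_fn(question)
    if score < min_score:
        return {"answer": REFUSAL, "source": None, "score": score}
    prompt = build_rag_prompt(question, doc_id, text)
    return {"answer": generate_fn(prompt).strip(), "source": doc_id, "score": score}


# Stand-ins for the real retriever and generator, for illustration only.
def demo_retrieve(question):
    return "SOP-HYD-001", "The first step is always a full lockout-tagout (LOTO) procedure.", 0.74


def demo_generate(prompt):
    return " Perform a full LOTO procedure first (SOP-HYD-001). "


result = answer_question(
    "What is the first step of hydraulic press maintenance?", demo_retrieve, demo_generate
)
print(result["answer"], "| source:", result["source"])
```

In this notebook, `retrieve_fn` could wrap the `util.semantic_search` call and `generate_fn` could be `lambda p: generator(p)[0]['generated_text']`. Passing both in as functions also makes the pipeline easy to test without loading models.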

## 🤔 RAG vs. Fine-Tuning: Which One to Choose?

A common point of confusion is when to use RAG versus when to fine-tune a model. They solve different problems.

| Aspect | Retrieval-Augmented Generation (RAG) | Fine-Tuning |
| :--- | :--- | :--- |
| **Primary Goal** | **Injecting Knowledge:** Providing the model with up-to-date, specific facts. | **Teaching a Skill:** Changing the model's behavior, style, or task format. |
| **Knowledge Source** | External (Vector Database, Document Store) | Internal (Stored in the model's weights) |
| **Updating Knowledge** | **Easy & Fast:** Simply update the documents in the vector store and re-index. | **Hard & Slow:** Requires creating a new dataset and re-running the entire training process. |
| **Hallucination Risk** | **Lower:** Answers are grounded in retrieved text. It's easier to verify the source. | **Higher:** The model can still invent facts or blend training data in undesirable ways. |
| **Best For...** | Fact-based Q&A, knowledge-intensive tasks, systems requiring auditable sources. | Learning a new style (e.g., "speak like a senior engineer"), a new output format, or adapting to a highly specialized vocabulary. |
| **Manufacturing Example** | Building a chatbot that can answer questions by quoting directly from the latest SOPs. | Creating a chatbot that can summarize maintenance logs in the specific bullet-point format used by your company. |

**Key Takeaway:** RAG and fine-tuning are not mutually exclusive. A very powerful pattern is to **fine-tune a model to be better at RAG**—for example, fine-tuning it to better follow instructions like "answer only from the context."

## ⚠️ Common RAG Risks and How to Mitigate Them

While powerful, RAG systems are not infallible. It's crucial to be aware of the common failure modes and to design your system to be resilient to them.

-   **Risk 1: Poor Retrieval ("Lost in the Library")**
    -   **Problem:** The retriever fails to find the relevant documents, providing the generator with useless or incorrect context. In practice, this is often the most common failure mode.
    -   **Mitigations:**
        -   **Better Embeddings:** Use a higher-quality embedding model, or even fine-tune the embedding model on your specific domain data.
        -   **Better Chunking:** Experiment with different document chunking strategies (e.g., by paragraph, by section, or even overlapping chunks).
        -   **Data Cleaning:** Ensure your source documents are clean, well-structured, and free of irrelevant "noise."

-   **Risk 2: Stale Information ("Outdated Books")**
    -   **Problem:** The source documents in your knowledge base are out of date, so the system gives answers that faithfully reflect the indexed documents but are obsolete.
    -   **Mitigations:**
        -   **Automated Indexing Pipeline:** Implement a regular, automated process to re-index documents from their source of truth (e.g., a document management system).

-   **Risk 3: Generation Hallucination ("Creative Librarian")**
    -   **Problem:** The LLM ignores the retrieved context and generates an answer based on its own internal (and potentially incorrect) knowledge.
    -   **Mitigations:**
        -   **Strong Prompts:** Use very strict prompts that explicitly command the model to *only* use the provided context and to state when the answer is not present.
        -   **Answer Verification:** Add a post-processing step that checks the generated answer against the retrieved source documents to ensure it is factually supported. This can even be done with another LLM call.

-   **Risk 4: Context Window Limitations ("Too Many Books")**
    -   **Problem:** The retrieved documents are too long to fit into the LLM's context window, forcing you to truncate them and potentially lose key information.
    -   **Mitigations:**
        -   **Smarter Chunking:** Use smaller, more focused document chunks.
        -   **Re-ranking:** Retrieve more documents than you need (e.g., 10), then use a second model (a "re-ranker", typically slower per document but more accurate than the embedding-based retriever) to select the top-k most relevant ones to pass to the main LLM.
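
As a rough illustration of the **Answer Verification** mitigation, the sketch below scores how much of an answer's vocabulary also appears in the retrieved context. This is a crude lexical check under stated assumptions: `support_ratio`, the tiny stopword list, and any cutoff you apply to the result are hypothetical. It can flag answers that clearly ignore the context, but it cannot confirm that an answer is factually correct.

```python
import re

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "for", "on", "then", "it"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_ratio(answer, context):
    """Fraction of the answer's distinct content words that occur in the context."""
    answer_words = tokenize(answer) - STOPWORDS
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)


context = "Perform a full lockout-tagout (LOTO) procedure, then inspect all seals for leakage."
grounded = "Perform the LOTO procedure and inspect the seals."
ungrounded = "Replace the pump motor and reset the PLC."

print(support_ratio(grounded, context))
print(support_ratio(ungrounded, context))
```

A low ratio is a signal to withhold the answer or route it for review. A stricter check would use a second LLM call, as noted in the list above.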

## Congratulations and Next Steps!

You've just built your first, albeit simple, Retrieval-Augmented Generation system from scratch!

You now understand the core, two-stage architecture that powers modern RAG applications:
1.  **Offline Indexing:** Converting a knowledge base into a searchable vector index.
2.  **Online Retrieval & Generation:** Retrieving relevant context and using it to generate an informed answer.

In the next notebook, **`08_rag_implementation.ipynb`**, we will take this foundation and build upon it. We'll move from our simple in-memory tensor of embeddings to a dedicated vector database optimized for similarity search, and explore how to build a more robust and scalable RAG pipeline.

Let's keep building!