# Bridging Knowledge with Retrieval-Augmented Generation (RAG)

---

Now that our LLM can remember key decisions and maintain coherent conversations, it's time to give it **real knowledge** about *your* tasks, *your* calendar, and *your* project docs.

Right now, your **TaskFriend** app is limited to what it hears during the chat. But what if it could:

* Pull in your upcoming calendar events?
* Read your task list from a database?
* Reference your company's onboarding guide?

That's where **Retrieval-Augmented Generation (RAG)** comes in.

With RAG, your LLM doesn't rely on pre-trained knowledge alone: it retrieves relevant information from external sources and uses it to generate accurate, personalized responses.


## The story so far...

**Scenario:**
Users love chatting with **TaskFriend** because it helps them plan their day. The features you've built so far, like streaming responses, multi-turn conversations, and a professional system prompt, have been a great hit! However, there's still a gap:

**TaskFriend** is unable to provide satisfactory answers to questions like:

> "What tasks do I have due this week?"

Or:

> "Can I reschedule my presentation prep if I go to the gym tomorrow morning?"

This is not because **TaskFriend** isn't smart enough, but because it has **no access to the user's actual data**. It remembers what was *said* in the conversation, but not what's *true* in the user's world.

## Goals

* Understand how RAG extends LLM knowledge beyond pre-training
* Build a document retrieval system using embeddings and vector search
* Inject retrieved context into prompts to generate informed responses
* Handle private, dynamic, or frequently updated information

## Initializing the environment

### Setting up the API key

Before we start work in any notebook, we'll need to load the [API key for Model Studio](https://modelstudio.console.alibabacloud.com/?tab=globalset#/efm/api_key). This ensures that we can call the APIs of the Qwen models we'll be using throughout this course.

> If you're unsure about how to find your **Model Studio** API key, refer to the `00 Setting Up the Environment` file.

In [None]:
# Load Model Studio API key
import os
from config.load_key import load_key
load_key(
    confirmation=False
)

### Setting up the LLM and embedding model

We set up Alibaba Cloud's `qwen-plus` as the LLM and DashScope's `text-embedding-v3` as the embedding model.

For this lesson, we'll be using `OpenAILike` instead of `OpenAI`, which we were using before this. `OpenAILike` is a **LlamaIndex-specific wrapper** designed for OpenAI-compatible models, including:

* Model Studio
* DashScope
* vLLM
* Ollama
* Local LLMs with OpenAI-compatible APIs


> **Note:** DashScope takes `https://dashscope-intl.aliyuncs.com/api/v1` as its API endpoint instead of the `https://dashscope-intl.aliyuncs.com/compatible-mode/v1` we've been using so far.

In [None]:
# Set global settings
import time
import logging
import dashscope
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.dashscope import DashScopeEmbedding
from llama_index.llms.openai_like import OpenAILike
from pathlib import Path

logging.getLogger().setLevel(logging.ERROR)

# DashScope uses https://dashscope-intl.aliyuncs.com/api/v1
# instead of https://dashscope-intl.aliyuncs.com/compatible-mode/v1
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

Settings.llm = OpenAILike(
    model="qwen-plus",
    api_base="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    is_chat_model=True
)

Settings.embed_model = DashScopeEmbedding(
    model_name="text-embedding-v3",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    encoding_format="float"
)

print("✅ Global parameters set!")

# Limitations of Standalone LLMs

---

Before diving into Retrieval-Augmented Generation (RAG), it's important to understand the **inherent limitations of standalone large language models (LLMs)**. While LLMs are remarkably capable at generating fluent, coherent text, they are not omniscient or perfectly reliable. Their behavior is shaped entirely by pre-training data and prompt input, meaning they lack dynamic access to new, private, or real-time information.

Understanding these limitations helps us make informed decisions about how to enhance LLMs for real-world applications like **TaskFriend**.

## Key limitations

### Knowledge cutoffs

Most LLMs are trained on static datasets with a fixed knowledge cutoff date. For example:

* **Alibaba Cloud's Qwen3:** April 2025
* **OpenAI's GPT 4.1:** June 2024
* **Google's Gemini 2.5 Pro:** Jan 2025
* **Anthropic's Claude 4 Opus:** March 2025

This means they **cannot know about events, products, or research published after that date**.

> **📌 Example:** Ask a base LLM, "Who won the 2025 UEFA Champions League?" and it will either guess or invent an answer.

Even if the model is powerful, its knowledge is frozen in time.

### No access to private or internal data

LLMs are not connected to your personal task list, company wiki, or internal CRM. Unless explicitly provided, they have **zero awareness** of:

* Your calendar
* Your project notes
* Company policies
* Customer records

This makes them **useless for personalized or enterprise tasks** without augmentation.

> **🔐 Security Note:** This isolation is actually a *feature* for privacy, but it means we must *intentionally* connect them to data when needed.

### Hallucinations

When an LLM lacks sufficient information, it may **confidently generate false or fabricated content**, a phenomenon known as *hallucination*.

> **🚨 Example:**  
> **User:** "What's the deadline for the Q2 report?"  
> **LLM:** "The Q2 report is due on April 15."  
> **Reality:** No such report exists.

This is dangerous in productivity, legal, medical, or customer-facing applications.

## Alternatives and their drawbacks

To overcome these limitations, several strategies exist, each with tradeoffs in cost, scalability, and maintenance.

| Approach          | Description                                  | Limitations |
|-------------------|----------------------------------------------|-------------|
| **Prompt Engineering** | Crafting prompts to guide behavior (e.g., system prompts, few-shot examples) | Limited by context window; static; fragile to input changes |
| **Fine-tuning**   | Retraining the model on new data to internalize knowledge or style | Expensive; hard to update; risks overfitting; not versionable |
| **Pure Retrieval** | Returning relevant documents or snippets without generation | Doesn't synthesize answers; requires user to read; no natural language output |

While each approach has its place, none offers a perfect balance of **accuracy, freshness, cost, and ease of maintenance**.


## The spectrum of LLM enhancement: Context vs. model optimization

There's a fundamental tradeoff in how we enhance LLMs:

| Axis | Description |
|------|-------------|
| **Model Optimization** | Changing the model itself (e.g., fine-tuning, distillation, pre-training) |
| **Context Optimization** | Keeping the model fixed, but enriching the input context (e.g., RAG, prompt engineering, retrieval) |

This leads to a strategic spectrum:  

<div style="text-align: center;">
  <img src="images/LMP-C01_03-Model Engineering Matrix.gif" style="max-width: 800px;" />
  <br>
  <small>LLM optimization matrix</small>
  <br>
  <small><i>Source: <a href="https://platform.openai.com/docs/guides/optimizing-llm-accuracy" target="_blank">OpenAI - Optimizing LLM Accuracy</a></i></small>
</div>

# What is Retrieval-Augmented Generation (RAG)?

---

**Retrieval-Augmented Generation (RAG)** is a powerful technique that enhances large language models (LLMs) by integrating external knowledge into the generation process. In simple terms, RAG allows an AI model to look up relevant information from a knowledge base before generating a response. This makes the answers more accurate, up-to-date, and grounded in real-world data.

RAG is a hybrid approach that combines two key components:

* **Retrieval**: Finding the most relevant pieces of information from a large dataset.
* **Generation**: Using a language model to craft a natural language response based on that retrieved information.

This combination allows RAG to overcome some of the limitations of standalone LLMs, such as outdated knowledge or the tendency to hallucinate.

## Why RAG matters

Traditional LLMs are trained on massive datasets, but once deployed, their knowledge is static. They cannot access real-time or private data, which limits their usefulness in many applications. RAG solves this by allowing the model to dynamically pull in the most relevant information at the time of inference.

| Problem | Without RAG | With RAG |
|--------|-------------|---------|
| "What tasks are due this week?" | Can't answer (no access to data) | Retrieves actual task list |
| "I need to break down my project" | General advice only | Uses real project notes |
| "Are there any company policies on remote work?" | Might hallucinate | Pulls from HR docs |


This makes RAG especially valuable in:
* Enterprise environments (e.g., internal knowledge bases).
* Research and academic settings (e.g., answering questions from scientific papers).
* Customer support (e.g., answering queries using product documentation).

With RAG, our LLM transforms from a *reactive chatbot* into a **proactive knowledge assistant**!


## How RAG works: The core pipeline

```mermaid
graph LR
    classDef frameworkStyle fill:#ffffff,stroke:#1f77b4,stroke-width:2px;
    
    subgraph flowchart[RAG pipeline]
        subgraph "Your data"
            A[(Database)]
            B[Document]
            C[API]
        end

        U((User))
        I[Index]
        L[LLM]

        A -- structured --> I
        B -- unstructured --> I
        C -- programmatic --> I

        U -- query --> I
        I -- "prompt +<br>query +<br>relevant data" --> L
        L -- response --> U
    end
    
    class flowchart frameworkStyle;
```

The RAG system operates in **three distinct stages**:


### Stage 1: Retrieval

When a user asks a question, the system first needs to find the most relevant pieces of information. This is done using a **retrieval model** that converts the query into a numerical representation (embedding) and searches a database of pre-embedded documents for the most similar matches.

For example, if the user asks `"What is the capital of France?"`, the system might retrieve a document that says `"Paris is the capital of France."`
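
To make the retrieval step concrete, here is a minimal sketch that ranks a few documents against a query. It assumes a toy word-count "embedding" as a stand-in for a real embedding model such as `text-embedding-v3`; `toy_embed`, `cosine`, and `retrieve` are hypothetical helpers written for illustration only and are not part of LlamaIndex or DashScope.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)  # Counter yields 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=2):
    """Return the top_k (score, document) pairs most similar to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


sample_docs = [
    "Berlin is the capital of Germany.",
    "Machine learning is a subset of artificial intelligence.",
    "Paris is the capital of France.",
]

results = retrieve("What is the capital of France?", sample_docs, top_k=2)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

A real embedding model captures meaning rather than shared words, so it can match a query to a document even when they have no vocabulary in common. The ranking logic (embed the query, score every document, keep the best matches) stays the same.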

### Stage 2: Augmentation

Once the system retrieves relevant documents, it injects them directly into the prompt as **context**. This transforms a generic query into a data-rich instruction the LLM can act on.

Example:

```mermaid
graph LR
    A["User Query:<br>'What is on my shopping list?'"] --> D((Prompt))
    B["Retrieved Context:<br>'- Bread<br>- Milk<br>- Jam'"] --> D
    C[LLM] --> Response["Response:<br>'Your shopping list is:<br>bread, milk, jam'"]
    
    subgraph "Augmented Prompt"
        D --> E["'Based on the following:<br>- Bread<br>- Milk<br>- Jam<br>Answer: What is on my shopping list?'"]
    end

    E --> C
```
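
The augmentation step in the diagram can be sketched in a few lines of plain Python. This is only an illustration of the idea: `build_augmented_prompt` is a hypothetical helper, and the prompt wording is an assumption. LlamaIndex uses its own internal prompt templates when you call a query engine.

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Combine retrieved context and the user question into a single prompt."""
    # Render each retrieved chunk as a bullet so the LLM sees them as separate facts.
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Based on the following:\n"
        f"{context}\n"
        f"Answer: {question}"
    )


prompt = build_augmented_prompt(
    "What is on my shopping list?",
    ["Bread", "Milk", "Jam"],
)
print(prompt)
```

The LLM never sees the knowledge base itself, only this combined prompt. That is why the quality of the retrieved chunks largely determines the quality of the answer.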

### Stage 3: Generation

The LLM generates a natural language response **grounded in the retrieved data**, reducing hallucinations and increasing accuracy.





# Understanding Embeddings and Vector Search

At the heart of RAG is vector search, powered by embeddings: dense numerical representations of text.

## What is an embedding?

An embedding is a fixed-length vector (e.g., length `1024`) that represents the semantic meaning of a piece of text. Similar texts have similar embeddings.

> 💡 Analogy: Think of documents as books in a library. Embeddings are like GPS coordinates: they help us find the closest matches.

## How does vector search work?

* The user query is converted into an embedding.
* The system searches a vector index for the most similar embeddings.
* The top matching documents are returned and used to augment the prompt.

This process is powered by **cosine similarity**, a way to measure how alike two pieces of text are in meaning, even if their words differ.


**Why "angle" matters more than "distance"**

Imagine each document (and the query) lives as a point in a high-dimensional space, say `1024` dimensions. You can't visualize that, but here's the key idea:

**Cosine similarity** looks at the angle between two vectors, not how far apart they are.

* If two vectors point in nearly the same direction → high similarity
* If they point in opposite directions → low similarity

> **Pro tip:**  
> Think of embeddings like arrows shot from the origin.  
> Even if one arrow is longer (e.g., a longer document), what matters is where it's aiming.  
> Two arrows aiming in the same direction represent similar meanings, and cosine similarity captures that.
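
The arrow analogy can be checked numerically. The sketch below uses made-up two-dimensional vectors (real embeddings have far more dimensions) and a hypothetical `cosine_similarity` helper to show that the score depends on direction, not on length.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative vectors only; these are not real embeddings.
short_arrow = [1.0, 2.0]
long_arrow = [3.0, 6.0]         # same direction as short_arrow, three times longer
sideways_arrow = [2.0, -1.0]    # perpendicular to short_arrow
opposite_arrow = [-1.0, -2.0]   # points the other way

print("same direction:", round(cosine_similarity(short_arrow, long_arrow), 3))
print("perpendicular: ", round(cosine_similarity(short_arrow, sideways_arrow), 3))
print("opposite:      ", round(cosine_similarity(short_arrow, opposite_arrow), 3))
```

The longer arrow scores the same as a copy of the short one would, which is why a long document and a short query can still be a close match.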

In [None]:
# Step 1: Use configured embedding model
embed_model = Settings.embed_model

# Step 2: Sample documents
docs = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Berlin is the capital of Germany.",
    "Tokyo is the capital of Japan.",
    "Machine learning is a subset of artificial intelligence."
]

# Step 3: Import and plot
from functions.vector_visualization import plot_vector_search

query = "What is the capital city of France?"

plot_vector_search(embed_model, docs, query)

# Building the RAG System Step-by-Step

Let's walk through how to build a working RAG system using `llama_index`, `DashScope`, and our local `documents`.

## Step 1: Load the Documents

LlamaIndex provides the `SimpleDirectoryReader`, which we will use to load files from the `./docs/taskfriend` directory.

> **Note:** Files may be separated into multiple pieces by `SimpleDirectoryReader`.  
> For our example, the embedder takes a maximum of 10 pieces.

In [None]:
# Load the documents
documents = SimpleDirectoryReader(
    input_dir="./docs/taskfriend",
    required_exts=[".pdf"],
    recursive=False
).load_data()

print(f"\n📄 Raw documents loaded: {len(documents)}")
for doc in documents:
    print(f" - {Path(doc.metadata['file_path']).name} (Text len: {len(doc.text)})")

## Step 2: Build and Save the Index

Next, we use LlamaIndex's `VectorStoreIndex.from_documents()` function to build a vector index from the documents we loaded, and persist it to disk. 

> **Pro tip:** Persisting to disk helps improve the speed of our RAG since we don't need to rebuild the index every time.

In [None]:
# Build index from documents
print("Creating index...", end="", flush=True)
start_time = time.time()

index = VectorStoreIndex.from_documents(
    documents,
    embed_model=Settings.embed_model
)

load_time = time.time() - start_time
print(f" Done ✓ ({load_time:.1f} seconds)")

# Save index
index.storage_context.persist("knowledge_base/taskfriend")
print("✅ Index built and saved")
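
On later runs, the persisted index can be loaded instead of rebuilt. The following is a minimal sketch, not the course's own helper: `has_persisted_index` and `load_or_build_index` are hypothetical names, and checking for `docstore.json` assumes LlamaIndex's default on-disk layout. `StorageContext.from_defaults(persist_dir=...)` and `load_index_from_storage` are the LlamaIndex calls for reloading; they rely on the same `Settings.embed_model` being configured as when the index was built.

```python
from pathlib import Path


def has_persisted_index(persist_dir):
    """Return True if persist_dir looks like a saved index (assumes default file names)."""
    return (Path(persist_dir) / "docstore.json").is_file()


def load_or_build_index(documents, persist_dir="knowledge_base/taskfriend"):
    """Load the index from disk if it exists; otherwise build and persist it."""
    # Imported here so the helper can be defined before LlamaIndex is needed.
    from llama_index.core import (
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if has_persisted_index(persist_dir):
        # Reuse the stored embeddings; no embedding API calls are made for the documents.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)

    # First run: embed the documents, then save the result for next time.
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir)
    return index
```

Calling `index = load_or_build_index(documents)` would then replace the build-and-save cell. Keep in mind that a loaded index does not notice changes to the source files, so delete the persisted folder (or rebuild) whenever the documents change.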

In [None]:
from llama_index.core import SimpleDirectoryReader
import logging

logging.getLogger().setLevel(logging.ERROR)

documents = SimpleDirectoryReader(
    input_dir="./docs/taskfriend",
    required_exts=[".pdf"],
    recursive=False
).load_data()

print("Raw chunks:")
for i, doc in enumerate(documents):
    print(f"\n--- Chunk {i+1} ---\n")
    print(doc.text)

## Step 3: Query the RAG System

Now, use the `index.as_query_engine()` function to create the `query_engine`. Then, we'll build a wrapper for multi-turn conversations and call it from our **TaskFriend** app.

In [None]:
from taskfriend.chat import chat_interface, wrap_rag_for_chat

# Build the query engine (used to implement RAG)
query_engine = index.as_query_engine(
    streaming=True,
    llm=Settings.llm,
)

# 📝 Define and initialize full_conversation
full_conversation = []

def get_rag_response(question, query_engine):
    """Send a question to the RAG query engine and return the answer as text."""
    try:
        # 🔍 Query the RAG engine
        response = query_engine.query(question)

        # 🧠 Extract the answer (fall back to str() for response types without `.response`)
        if hasattr(response, 'response'):
            answer = response.response
        else:
            answer = str(response)

        return answer

    except Exception as e:
        print(f"[RAG Error] {e}")
        return "[Error retrieving response]"


# Wrap function for compatibility
wrapped_rag = wrap_rag_for_chat(
    get_rag_response,
    query_engine=query_engine,
)

# Start chat with RAG
chat_interface(
    full_conversation=full_conversation,
    # client=client,
    call_llm_fn=wrapped_rag,
)

Now, try asking your model the following questions:

```
"What tasks are due today?"
"What tasks are due this week?"
```

Congratulations! You've successfully created your first RAG!  
The model can now read from the `tasks.pdf` file in `./docs/taskfriend`, giving you answers about the tasks you have. Here's a table of the tasks (if you can't find `tasks.pdf`):


| ID | Task | Type | Due | Status | Notes |
|----|------|------|-----|--------|-------|
| 01 | Finalize Q3 OKRs by 3pm | One-off | Today | Pending | Collaborate with department heads to align on measurable objectives, lay out solid plan to achieve objectives and assign responsibility to team members. |
| 02 | Prepare presentation for client review | One-off | This Week | Pending | Focus on deliverables from Q2, highlight success metrics, and outline next steps. Obtain client feedback on presentation and tweak direction based on client preferences. |
| 03 | Onboard new team member | One-off | Today | Done | Schedule intro meetings with team members, send welcome email with onboarding checklist, assign mentor for first 30 days.<br>Karen was assigned to be the mentor for the new team member. |
| 04 | Review team feedback survey results | One-off | This Week | Pending | Analyze anonymous feedback from recent engagement survey and identify top 3 pain points and 2 strengths. |
| 05 | Update Project Phoenix roadmap | One-off | Today | Pending | Sync with project leads to reflect latest timelines, milestones, and resource allocations, taking into account the latest changes to supply-chain disruptions. |
| 06 | Schedule 1:1s with team | One-off | This Month | Pending | Book 30-minute slots via calendar invite with each team member over the next 4 weeks, focus agenda on career paths, workload balance, and feedback. |
| 07 | Call bank regarding home loan | One-off | This Week | Pending | Contact customer service to inquire about refinancing options, and compare current interest rate with market rates, inquire about early repayment penalties and eligibility for better terms based on current market conditions. |
| 08 | Weekly report: Project Phoenix | Recurring | This Week | Started | Compile progress on deliverables, blockers, and resource usage, share report with stakeholders via DingTalk every Friday EOD. |
| 09 | Develop 3-year plan | One-off | This Year | Pending | Based on company strategy shifts and market trends, draft a long-term vision for the team, and present draft at annual planning retreat. |
| 10 | Write thank-you letter to penpal in Korea | One-off | Today | Started | Thank penpal in Korea for the help they provided when you needed advice on planning a trip to Norway, and remember to ask them about their newborn son, Edwin. |


However, you'll notice that your RAG is not perfect - the answer it gave you isn't representative of all the tasks you have. And as you continue to talk to **TaskFriend**, you'll realize that there are some questions it still can't answer correctly. We'll cover this in the next chapter.

# What's next?

## Quiz yourself!

<details>
<summary style="cursor: pointer; padding: 12px; border: 1px solid #dee2e6; border-radius: 6px;">
<b>1. Which of the following is a key limitation of standalone LLMs that RAG helps solve?</b>  

<ul>
    <li>A) High API costs  </li>
    <li>B) Inability to generate fluent text  </li>
    <li>C) Lack of access to private or real-time data  </li>
    <li>D) Slow inference speed</li>
</ul>

**View answer →**
</summary>

<div style="margin-top: 10px; padding: 15px; border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Correct answer:** C) Lack of access to private or real-time data  
📝 **Explanation:**
* RAG enables LLMs to retrieve and use up-to-date, user-specific information (e.g., tasks, calendars).

</div>
</details>

<br>

<details>
<summary style="cursor: pointer; padding: 12px; border: 1px solid #dee2e6; border-radius: 6px;">
<b>2. In the RAG pipeline, what happens during the "Augmentation" stage?</b>  

<ul>
    <li>A) Embeddings are retrained  </li>
    <li>B) The model fine-tunes on new documents  </li>
    <li>C) The user is shown raw search results  </li>
    <li>D) Retrieved documents are added to the prompt as context</li>
</ul>

**View answer →**
</summary>

<div style="margin-top: 10px; padding: 15px; border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Correct answer:** D) Retrieved documents are added to the prompt as context  
📝 **Explanation:**
* This allows the LLM to generate responses grounded in actual data.

</div>
</details>

## Takeaways

* **Limitations of standalone LLMs**
    * **Knowledge cutoffs** mean LLMs cannot know about events or data after their training date (e.g., Qwen3: April 2025).
    * **No access to private or internal data**: LLMs don't see your calendar, tasks, or company docs unless explicitly provided.
    * **Hallucinations** occur when models lack information and invent plausible-sounding but false answers.
    * **Alternatives have tradeoffs**:
      * *Prompt engineering*: Limited by context window.
      * *Fine-tuning*: Expensive, hard to update.
      * *Pure retrieval*: Returns raw text, no natural language synthesis.
    * **RAG solves these** by dynamically injecting real, up-to-date, user-specific context at inference time.

<br>

* **RAG systems**
    * **RAG bridges the gap** between general knowledge and specific, private, or real-time data.
    * It combines two powerful components:
      - **Retrieval**: Find relevant documents from a knowledge base.
      - **Generation**: Use an LLM to generate a natural language response based on retrieved content.
    * **RAG transforms LLMs** from static chatbots into dynamic knowledge assistants.
    * It enables accurate, personalized responses to questions like:
      - "What tasks are due this week?"
      - "Can I reschedule my presentation prep?"
    * **RAG is context optimization**, not model optimization: the LLM stays fixed, but the input is enriched.

<br>

* **Embeddings and vector search**
    * **Embeddings** are dense numerical vectors (e.g., length 1024) that represent semantic meaning.
    * **Similar texts have similar embeddings**, which enables semantic search beyond keyword matching.
    * **Vector search** finds the most relevant documents by comparing query and document embeddings.
    * **Cosine similarity** measures semantic alignment by the *angle* between vectors, not distance:
      - Small angle → high similarity
      - Opposite directions → low similarity
    * **Embeddings allow the system to understand** that "capital of France" and "Paris" are related, even if the words don't match exactly.

<br>

* **Building a RAG system**
    * **Step 1: Load documents** using tools like `SimpleDirectoryReader` to ingest PDFs, text files, or APIs.
    * **Step 2: Build a vector index** by converting documents into embeddings and storing them for fast retrieval.
    * **Step 3: Persist the index** to disk so it doesn't need to be rebuilt every time.
    * **Step 4: Query with augmentation** by retrieving relevant context and injecting it into the prompt.
    * **The RAG pipeline**:
      1. **Retrieval**: Convert query to embedding → find top-matching documents.
      2. **Augmentation**: Add retrieved content to the prompt as context.
      3. **Generation**: LLM generates a grounded, accurate response.
    * **RAG is iterative**: your first version may not be perfect, but it's a foundation for improvement.