<a target="_blank" href="https://colab.research.google.com/github/VectorInstitute/fed-rag/blob/main/docs/notebooks/no_encode_rag_with_mcp.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

_(NOTE: If running on Colab, you will need to supply a WandB API key in addition to your Hugging Face token. You'll also need to change the runtime to a T4 GPU.)_

# Build a NoEncode RAG System with an MCP Knowledge Store

## Introduction

Traditional RAG systems have three components: a retriever, a knowledge store, and a generator. The retriever encodes the user's query and uses that encoding to retrieve relevant knowledge chunks from the knowledge store, whose contents were previously encoded by the same retriever. The user query and the retrieved knowledge chunks are then passed to the LLM generator, which produces the response to the original query.

With NoEncode RAG systems, knowledge is still kept in a knowledge store and retrieved for responses to user queries, but there is no encoding step at all. Instead of pre-computing embeddings, NoEncode RAG systems query knowledge sources directly using natural language.

### Key Differences

**Traditional RAG:**
- Documents → Embed → Vector Store
- Query → Embed → Vector Search → Retrieve → Generate

**NoEncode RAG:**
- Knowledge Sources (MCP servers, APIs, databases)
- Query → Direct Natural Language Query → Retrieve → Generate

_**NOTE:** Knowledge sources may be traditional RAG systems themselves, and thus, these would involve encoding. However, the main RAG system does not handle encoding of queries or knowledge chunks at all._
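
To make the contrast concrete, here is a toy sketch in plain Python. It is not how fed-rag implements either pipeline: the bag-of-words `embed` function, the `glossary_source` stand-in for an MCP server, and the sample documents are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter
from typing import Callable

DOCS = [
    "RAFT is a fine tuning recipe for retrieval augmented generation",
    "Kendra is an enterprise search service",
]


def embed(text: str) -> Counter:
    """Toy 'encoder': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Traditional RAG: documents are embedded ahead of time, and the query is
# embedded at retrieval time by the same encoder.
INDEX = [(embed(doc), doc) for doc in DOCS]


def traditional_retrieve(query: str, top_k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(INDEX, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:top_k]]


# NoEncode RAG: the raw natural-language query is forwarded to each source.
# The RAG system itself never embeds anything.
def glossary_source(query: str) -> list[str]:
    """Hypothetical knowledge source; how it searches is its own business."""
    glossary = {"raft": DOCS[0], "kendra": DOCS[1]}
    return [text for key, text in glossary.items() if key in query.lower()]


def no_encode_retrieve(
    query: str, sources: list[Callable[[str], list[str]]]
) -> list[str]:
    results: list[str] = []
    for source in sources:
        results.extend(source(query))
    return results


print(traditional_retrieve("what is raft"))
print(no_encode_retrieve("What is RAFT?", [glossary_source]))
```

In the second path, the only contract between the RAG system and a source is "string in, text out", which is what lets MCP servers, APIs, and databases be swapped in as sources.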

### Model Context Protocol (MCP)

MCP provides a standardized way for AI systems to connect to external tools and data sources. In our NoEncode RAG system, MCP servers act as live knowledge sources that can be queried directly with natural language. An MCP knowledge store acts as the MCP client host that creates connections to these servers and retrieves context from them.

### Outline

In this cookbook, we will stand up two MCP knowledge sources, use them as part of an MCP knowledge store, and finally build an `AsyncNoEncodeRAGSystem` that allows us to query these sources.

1. MCP Knowledge Source 1: an AWS Kendra Index MCP Server
2. MCP Knowledge Source 2: a LlamaCloud MCP Server
3. Create an MCP Knowledge Store (using our two built sources)
4. Assemble a NoEncode RAG System

## MCP Knowledge Source 1: an AWS Kendra Index MCP Server

Here, we make use of one of the many officially supported [AWS MCP servers](https://github.com/awslabs/mcp?tab=readme-ov-file#available-servers) offered by [AWS Labs](https://github.com/awslabs), namely their [AWS Kendra Index MCP Server](https://github.com/awslabs/mcp/tree/main/src/amazon-kendra-index-mcp-server).

AWS Kendra is an enterprise search service powered by machine learning. It can search across various data sources including documents, FAQs, knowledge bases, and websites, providing intelligent answers to natural language queries.

In [1]:
import os

from mcp import StdioServerParameters
from fed_rag.knowledge_stores.no_encode import MCPStdioKnowledgeSource

In [2]:
server_params = StdioServerParameters(
    command="docker",
    args=[
        "run",
        "--rm",
        "--interactive",
        "--init",  # important!
        "--env-file",
        f"{os.getcwd()}/.env",
        "awslabs/amazon-kendra-index-mcp-server:latest",
    ],
)

mcp_source = MCPStdioKnowledgeSource(
    name="awslabs.amazon-kendra-index-mcp-server",
    server_params=server_params,
    tool_name="KendraQueryTool",
    query_param_name="query",
    tool_call_kwargs={
        "indexId": "572aca26-16be-44df-84d3-4d96d778f120",
        "region": "ca-central-1",
    },
)
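
To see how `tool_name`, `query_param_name`, and `tool_call_kwargs` fit together, the sketch below builds the JSON-RPC `tools/call` request that an MCP client sends to a server. It assumes the knowledge source merges the query, keyed by `query_param_name`, with the fixed `tool_call_kwargs` to form the tool arguments. That merging logic and the `build_tool_call_request` helper are illustrative assumptions, not fed-rag's actual implementation, and the index ID is a placeholder.

```python
import json
from typing import Any


def build_tool_call_request(
    tool_name: str,
    query_param_name: str,
    query: str,
    tool_call_kwargs: dict[str, Any],
    request_id: int = 1,
) -> dict[str, Any]:
    """Hypothetical helper: assemble an MCP `tools/call` JSON-RPC request."""
    # The user's query goes under the parameter name the tool expects;
    # the remaining arguments stay the same for every call.
    arguments = {query_param_name: query, **tool_call_kwargs}
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call_request(
    tool_name="KendraQueryTool",
    query_param_name="query",
    query="What is RAFT?",
    tool_call_kwargs={"indexId": "<your-index-id>", "region": "ca-central-1"},
)
print(json.dumps(request, indent=2))
```

Only the value under `query` changes between calls, which is why the index ID and region can be fixed once when the knowledge source is constructed.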

### Using the default converter of `~mcp.CallToolResult` to `~fed_rag.KnowledgeNode`

In [3]:
call_tool_result = await mcp_source.retrieve("What is RAFT?")

In [4]:
knowledge_node = mcp_source.call_tool_result_to_knowledge_node(
    call_tool_result
)
knowledge_node.text_content[:500]

'{"query": "What is RAFT?", "total_results_count": 4, "results": [{"id": "48f58734-f2c8-4a65-9cce-3540b914fb52-c413e6f3-def2-4ee1-96b7-c4844fb735f5", "type": "ANSWER", "document_title": "raft.pdf", "document_uri": "https://fed-rag-mcp-cookbook.s3.ca-central-1.amazonaws.com/raft.pdf", "score": "HIGH", "excerpt": "In this paper, we present Retrieval Augmented\\nFine Tuning (RAFT), a training recipe which improves the model\\u2019s ability\\nto answer questions in \\"open-book\\" in-domain settings. In t'

As we can see, the default conversion is not ideal: it places the entire JSON payload, including IDs, document URIs, and scores, into the node's text content. Only the excerpt text should be passed to the LLM generator, so we define our own converter function to extract it.

According to the source code for this server, a successful call of `KendraQueryTool` yields a `CallToolResult` whose text content items each carry a `text` attribute holding a JSON string with a `results` key. The value of `results` is a list of result items, each containing the `excerpt` field that we want.
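
The payload shape described above can be explored offline with the standard library alone. The `sample_text` below is a hand-written, abbreviated stand-in for the JSON string carried by one text content item. Its IDs and excerpts are made up for illustration and are not real Kendra output.

```python
import json
import re

# Hypothetical, abbreviated payload mimicking the structure described above.
sample_text = json.dumps(
    {
        "query": "What is RAFT?",
        "total_results_count": 2,
        "results": [
            {"id": "a", "type": "ANSWER", "excerpt": "RAFT is a\ntraining  recipe."},
            {"id": "b", "type": "DOCUMENT", "excerpt": "  It targets open-book\nsettings. "},
        ],
    }
)

data = json.loads(sample_text)
# Keep only the excerpts, collapsing newlines and repeated spaces.
excerpts = [
    re.sub(r"\s+", " ", item["excerpt"].strip()) for item in data["results"]
]
print(excerpts)
```

Everything except `excerpt` (IDs, result types, counts) is dropped, which keeps the context passed to the generator free of metadata it cannot use.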

In [5]:
import json
import re

from typing import Any
from mcp.types import CallToolResult
from fed_rag.data_structures import KnowledgeNode


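def kendra_index_converter_fn(
    result: CallToolResult, metadata: dict[str, Any] | None = None
) -> KnowledgeNode:
    """Build a KnowledgeNode from the excerpts of a KendraQueryTool result."""
    # NOTE: `metadata` is not used by this converter.
    excerpts = []
    for c in result.content:
        if c.type == "text":
            # each text content item holds a JSON string with a "results" list
            data = json.loads(c.text)
            for res in data["results"]:
                # collapse newlines and repeated whitespace within each excerpt
                excerpts.append(re.sub(r"\s+", " ", res["excerpt"].strip()))
    return KnowledgeNode(
        node_type="text", text_content="\n\n<<CHUNK_SEP>>\n\n".join(excerpts)
    )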

Let's test it out!

In [6]:
print(kendra_index_converter_fn(call_tool_result).text_content)

In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t

<<CHUNK_SEP>>

3 RAFT In this section, we present RAFT, a novel way of training LLMs for domain-specific open- book exams. We first introduce the classical technique of supervised fine-tuning, followed with the key takeaways from our experiments. Then, we introduce RAFT , a modified version of general instructio

<<CHUNK_SEP>>

, 2023; Xu 9 Preprint, Under Review et al., 2023; Liu et al., 2024). These works focus on constructing a combination of finetuning dataset for RAG and train a model to perform well on these tasks. In particular, in their settings, at test time, the domain or documents can be different tha

<<CHUNK_SEP>>

...Fine Tuning (RAFT), a training recipe which improves the mod

This is cleaner than the output of the default converter, so let's use it.

In [7]:
# update the converter function
mcp_source = mcp_source.with_converter(kendra_index_converter_fn)

## MCP Knowledge Source 2: a LlamaCloud MCP Server

In this part of the cookbook, we'll stand up an MCP server using [LlamaCloud](https://docs.llamaindex.ai/en/stable/llama_cloud/), an enterprise solution by LlamaIndex, by following their MCP [demo](https://github.com/run-llama/llamacloud-mcp?tab=readme-ov-file#llamacloud-as-an-mcp-server).

LlamaCloud provides document parsing, indexing, and retrieval capabilities. By exposing these through an MCP server, we can query processed documents directly using natural language without managing our own document processing pipeline.

## Create an MCP Knowledge Store

In [8]:
from fed_rag.knowledge_stores.no_encode import MCPKnowledgeStore

In [9]:
knowledge_store = MCPKnowledgeStore().add_source(mcp_source)

In [10]:
res = await knowledge_store.retrieve("What is RAFT", top_k=1)

## Assemble a NoEncode RAG System

In [11]:
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
import torch
from transformers.generation.utils import GenerationConfig

generation_cfg = GenerationConfig(
    do_sample=True,
    eos_token_id=151643,
    bos_token_id=151643,
    max_new_tokens=2048,
    top_p=0.9,
    temperature=0.6,
    cache_implementation="offloaded",
    stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
    model_name="Qwen/Qwen2.5-0.5B",
    load_model_at_init=False,
    load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
    generation_config=generation_cfg,
)

In [12]:
from fed_rag import AsyncNoEncodeRAGSystem, RAGConfig

rag_config = RAGConfig(top_k=2)
rag_system = AsyncNoEncodeRAGSystem(
    knowledge_store=knowledge_store,
    generator=generator,
    rag_config=rag_config,
)

In [15]:
res = await rag_system.query(query="What is RAFT?")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


In [16]:
print(res)

Assistant: Here's a concise response:

**What is RAFT?**

RAFT is a training recipe that improves the performance of language models on tasks like answering questions in "open-book" in-domain settings. It involves training the model to ignore documents that don't help in answering the question, similar to how a fine-tuned model might ignore irrelevant information during inference.

**Context**

In this paper, we present RAFT, a training recipe that improves the model's ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which is a key takeaway from our experiments.

**Summary**

In this section, we introduce RAFT, a novel way of training LLMs for domain-specific open- book exams. We first present the classical technique of supervised fine-tuning, followed by the key takeaways from our experiments. Then, we introduce RAFT