<a target="_blank" href="https://colab.research.google.com/github/VectorInstitute/fed-rag/blob/main/docs/notebooks/no_encode_rag_with_mcp.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

_(NOTE: if running on Colab, you will need to supply a WandB API Key in addition to your HFToken. Also, you'll need to change the runtime to a T4.)_

# Build a NoEncode RAG System with an MCP Knowledge Store

## Introduction

In traditional RAG systems, there are three components: a retriever, a knowledge store, and a generator. The retriever encodes a user's query and uses that encoding to retrieve relevant knowledge chunks from the knowledge store, which were themselves encoded by the retriever ahead of time. The user query and the retrieved knowledge chunks are then passed to the LLM generator, which produces the response to the original query.

With NoEncode RAG systems, knowledge is still kept in a knowledge store and retrieved for responses to user queries, but there is no encoding step at all. Instead of pre-computing embeddings, NoEncode RAG systems query knowledge sources directly using natural language.

### Key Differences

**Traditional RAG:**
- Documents → Embed → Vector Store
- Query → Embed → Vector Search → Retrieve → Generate

**NoEncode RAG:**
- Knowledge Sources (MCP servers, APIs, databases)
- Query → Direct Natural Language Query → Retrieve → Generate

_**NOTE:** Knowledge sources may be traditional RAG systems themselves, and thus, these would involve encoding. However, the main RAG system does not handle encoding of queries or knowledge chunks at all._
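
To make the contrast concrete, here is a minimal, self-contained sketch of the two retrieval flows. Everything in it is a toy assumption made for illustration: `embed` is a bag-of-words stand-in for a real encoder, and `toy_knowledge_source` is a hypothetical source that accepts a plain-text query. Neither is part of `fed_rag`.

```python
import math
from collections import Counter

DOCS = [
    "RAFT is a training recipe for domain-specific RAG.",
    "MCP standardizes how AI systems connect to tools.",
]


def embed(text: str) -> Counter:
    """Toy stand-in for an encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Traditional RAG: documents are embedded ahead of time,
# and the query is embedded at request time.
VECTOR_STORE = [(embed(doc), doc) for doc in DOCS]


def traditional_retrieve(query: str) -> str:
    query_vector = embed(query)
    best = max(VECTOR_STORE, key=lambda pair: cosine(query_vector, pair[0]))
    return best[1]


def toy_knowledge_source(query: str) -> str:
    """Hypothetical source: how it finds an answer is its own business."""
    words = set(query.lower().split())
    return max(DOCS, key=lambda doc: len(words & set(doc.lower().split())))


# NoEncode RAG: the main system hands the raw query string to the source.
def no_encode_retrieve(query: str) -> str:
    return toy_knowledge_source(query)
```

Note that `no_encode_retrieve` never touches `embed`; any encoding that happens is hidden behind the knowledge source.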

### Model Context Protocol (MCP)

MCP provides a standardized way for AI systems to connect to external tools and data sources. In our NoEncode RAG system, MCP servers act as live knowledge sources that can be queried directly with natural language. An MCP knowledge store acts as the MCP client host that creates connections to these servers and retrieves context from them.
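
Under the hood, MCP messages are JSON-RPC 2.0, and a tool invocation is a `tools/call` request. The sketch below builds such a request and reads the text items out of a response using only the standard library. The tool name `search_docs` and the sample response are made up for illustration; in this cookbook the MCP client libraries handle this exchange for us.

```python
import json


def build_tool_call_request(
    request_id: int, tool_name: str, arguments: dict
) -> dict:
    """Build a JSON-RPC 2.0 request that asks a server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_text(response: dict) -> list[str]:
    """Collect the text items from a tool call response."""
    return [
        item["text"]
        for item in response["result"]["content"]
        if item["type"] == "text"
    ]


# Hypothetical tool name and query, for illustration only.
request = build_tool_call_request(
    1, "search_docs", {"query": "What is RAFT?"}
)
print(json.dumps(request, indent=2))

# A hand-written response in the shape a server would send back.
sample_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "RAFT is a training recipe."}],
        "isError": False,
    },
}
print(extract_text(sample_response))
```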

### Outline

In this cookbook, we will stand up two MCP knowledge sources, use them as part of an MCP knowledge store, and finally build an `AsyncNoEncodeRAGSystem` that allows us to query these sources.

1. MCP Knowledge Source 1: an AWS Kendra Index MCP Server
2. MCP Knowledge Source 2: a LlamaCloud MCP Server
3. Create an MCP Knowledge Store (using our two built sources)
4. Assemble a NoEncode RAG System

## MCP Knowledge Source 1: an AWS Kendra Index MCP Server

Here, we make use of one of the many officially supported [AWS MCP servers](https://github.com/awslabs/mcp?tab=readme-ov-file#available-servers) offered by [AWS Labs](https://github.com/awslabs), namely their [AWS Kendra Index MCP Server](https://github.com/awslabs/mcp/tree/main/src/amazon-kendra-index-mcp-server).

AWS Kendra is an enterprise search service powered by machine learning. It can search across various data sources including documents, FAQs, knowledge bases, and websites, providing intelligent answers to natural language queries.

In [1]:
import os

from mcp import StdioServerParameters
from fed_rag.knowledge_stores.no_encode import MCPStdioKnowledgeSource

In [2]:
server_params = StdioServerParameters(
    command="docker",
    args=[
        "run",
        "--rm",
        "--interactive",
        "--init",  # important!
        "--env-file",
        f"{os.getcwd()}/.env",
        "awslabs/amazon-kendra-index-mcp-server:latest",
    ],
)

mcp_source = MCPStdioKnowledgeSource(
    name="awslabs.amazon-kendra-index-mcp-server",
    server_params=server_params,
    tool_name="KendraQueryTool",
    query_param_name="query",
    tool_call_kwargs={
        "indexId": "572aca26-16be-44df-84d3-4d96d778f120",
        "region": "ca-central-1",
    },
)

### Using the default converter from `mcp.CallToolResult` to `fed_rag.KnowledgeNode`

In [3]:
call_tool_result = await mcp_source.retrieve("What is RAFT?")

In [4]:
knowledge_nodes = mcp_source.call_tool_result_to_knowledge_nodes_list(
    call_tool_result
)

print("Number of results returned: ", len(knowledge_nodes), "\n")
print(
    "Text content of first returned node:\n",
    knowledge_nodes[0].text_content[:500],
)

Number of results returned:  1 

Text content of first returned node:
 {"query": "What is RAFT?", "total_results_count": 4, "results": [{"id": "4fd36a75-4a08-4c9d-9d98-0e92049ece06-2f7adc0b-3250-47c9-976d-dc4c01eff138", "type": "ANSWER", "document_title": "raft.pdf", "document_uri": "https://fed-rag-mcp-cookbook.s3.ca-central-1.amazonaws.com/raft.pdf", "score": "HIGH", "excerpt": "In this paper, we present Retrieval Augmented\nFine Tuning (RAFT), a training recipe which improves the model\u2019s ability\nto answer questions in \"open-book\" in-domain settings. In t


As we can see, the default conversion is not ideal: it wraps the entire JSON payload in a single node. Only the excerpt text should be passed to the LLM generator, so we define our own converter function to extract it.

According to the [source code](https://github.com/awslabs/mcp/blob/36ae951bb8cfed234f69f6336fe2463eb4e08587/src/amazon-kendra-index-mcp-server/awslabs/amazon_kendra_index_mcp_server/server.py#L150) for this server, a successful call of `KendraQueryTool` yields a `CallToolResult` whose text content item has a `text` attribute holding a JSON string with a `results` key. The value of `results` is a list of result items, each containing an `excerpt` field, which is what we want.
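
As a quick illustration of that payload shape, the sketch below parses a hand-written JSON string and pulls out the excerpts. The sample values are invented for this example and only mirror the field names seen in the output above. `extract_excerpts` is a hypothetical helper, not part of the server or of `fed_rag`.

```python
import json
import re

# Hand-written sample mimicking the field names of the tool's JSON payload.
sample_text = json.dumps(
    {
        "query": "What is RAFT?",
        "total_results_count": 2,
        "results": [
            {
                "type": "ANSWER",
                "document_title": "raft.pdf",
                "excerpt": "Retrieval Augmented\nFine Tuning (RAFT)",
            },
            {
                "type": "DOCUMENT",
                "document_title": "raft.pdf",
                "excerpt": "  3 RAFT\nIn this section  ",
            },
        ],
    }
)


def extract_excerpts(text: str) -> list[str]:
    """Return each result's excerpt with runs of whitespace collapsed."""
    data = json.loads(text)
    return [
        re.sub(r"\s+", " ", item["excerpt"].strip())
        for item in data["results"]
    ]


print(extract_excerpts(sample_text))
```

Collapsing whitespace with `re.sub(r"\s+", " ", ...)` removes the hard line breaks that come from PDF extraction, which keeps the text passed to the generator compact.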

In [5]:
import json
import re

from typing import Any
from mcp.types import CallToolResult
from fed_rag.data_structures import KnowledgeNode


def kendra_index_converter_fn(
    result: CallToolResult, metadata: dict[str, Any] | None = None
) -> list[KnowledgeNode]:
    nodes = []
    for c in result.content:
        if c.type == "text":
            data = json.loads(c.text)
            for res in data["results"]:
                text_content = re.sub(r"\s+", " ", res["excerpt"].strip())
                nodes.append(
                    KnowledgeNode(
                        node_type="text",
                        text_content=text_content,
                        metadata=metadata,
                    )
                )
    return nodes

Let's test it out!

In [6]:
knowledge_nodes = kendra_index_converter_fn(call_tool_result)

print("Number of results returned: ", len(knowledge_nodes), "\n")
print(
    "Text content of first returned node:\n",
    knowledge_nodes[0].text_content[:500],
    "\n",
)
print(
    "Text content of second returned node:\n",
    knowledge_nodes[1].text_content[:500],
)

Number of results returned:  4 

Text content of first returned node:
 In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t 

Text content of second returned node:
 3 RAFT In this section, we present RAFT, a novel way of training LLMs for domain-specific open- book exams. We first introduce the classical technique of supervised fine-tuning, followed with the key takeaways from our experiments. Then, we introduce RAFT , a modified version of general instructio


This is better than what we had with the default converter, so let's use it.

In [7]:
# update the converter function
mcp_source = mcp_source.with_converter(kendra_index_converter_fn)

## MCP Knowledge Source 2: a LlamaCloud MCP Server

In this part of the cookbook, we'll stand up an MCP server using [LlamaCloud](https://docs.llamaindex.ai/en/stable/llama_cloud/), an enterprise solution by LlamaIndex, by following their MCP [demo](https://github.com/run-llama/llamacloud-mcp?tab=readme-ov-file#llamacloud-as-an-mcp-server).

LlamaCloud provides document parsing, indexing, and retrieval capabilities. By exposing these through an MCP server, we can query processed documents directly using natural language without managing our own document processing pipeline.

In [8]:
llama_cloud_server_params = StdioServerParameters(
    command="sh",
    args=[
        "-c",
        "cd /home/nerdai/OSS/llamacloud-mcp && poetry install && exec poetry run python mcp-server.py",
    ],
)

llama_cloud_mcp_source = MCPStdioKnowledgeSource(
    name="llama-index-server",
    server_params=llama_cloud_server_params,
    tool_name="LlamaCloudQueryTool",
    query_param_name="query",
)

In [9]:
res = await llama_cloud_mcp_source.retrieve("What is RALT?")

In [10]:
res

CallToolResult(meta=None, content=[TextContent(type='text', text='RALT stands for Retrieval-Augmented Language Model. It is a model that combines traditional language models with a retrieval mechanism to enhance performance on various natural language processing tasks. By incorporating a retrieval component, RALT can access external knowledge sources to improve its understanding and generation of text.\n\nHere is an example of how a Retrieval-Augmented Language Model (RALT) can be implemented using Python and the Hugging Face Transformers library:\n\n```python\nfrom transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n\n# Load the RALT model and tokenizer\nmodel_name = "your_RALT_model_name"\nmodel = AutoModelForSeq2SeqLM.from_pretrained(model_name)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# Define a prompt\nprompt = "Your prompt here"\n\n# Tokenize the prompt\ninputs = tokenizer(prompt, return_tensors="pt")\n\n# Generate text using the RALT model\noutput = model.

In [11]:
knowledge_nodes = (
    llama_cloud_mcp_source.call_tool_result_to_knowledge_nodes_list(res)
)

print("Number of results returned: ", len(knowledge_nodes), "\n")
print(
    "Text content of first returned node:\n",
    knowledge_nodes[0].text_content[:500],
)

Number of results returned:  1 

Text content of first returned node:
 RALT stands for Retrieval-Augmented Language Model. It is a model that combines traditional language models with a retrieval mechanism to enhance performance on various natural language processing tasks. By incorporating a retrieval component, RALT can access external knowledge sources to improve its understanding and generation of text.

Here is an example of how a Retrieval-Augmented Language Model (RALT) can be implemented using Python and the Hugging Face Transformers library:

```python
fro


## Create an MCP Knowledge Store

In [12]:
from fed_rag.knowledge_stores.no_encode import MCPKnowledgeStore
from sentence_transformers import CrossEncoder

In [13]:
def reranker_callback(
    nodes: list[KnowledgeNode], query: str
) -> list[tuple[float, KnowledgeNode]]:
    # NOTE: the model is reloaded on every call; cache it if the reranker runs often
    model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L2")
    # Pair the query with each passage and score every [query, passage] pair
    model_inputs = [[query, n.text_content] for n in nodes]
    scores = model.predict(model_inputs)

    # Sort the scores in decreasing order
    results = [(score, node) for score, node in zip(scores, nodes)]
    return sorted(results, key=lambda x: x[0], reverse=True)
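
The reranker's contract is simple: take the retrieved nodes and the query, and return `(score, node)` tuples sorted from best to worst. The sketch below shows the same contract with a word-overlap score swapped in for the cross-encoder, so it runs without downloading a model. `ToyNode` and `overlap_reranker` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Minimal stand-in for a knowledge node; only the text is modelled."""

    text_content: str


def overlap_reranker(
    nodes: list[ToyNode], query: str
) -> list[tuple[float, ToyNode]]:
    """Score each node by the fraction of query words it contains."""
    query_words = set(query.lower().split())
    scored = []
    for node in nodes:
        node_words = set(node.text_content.lower().split())
        if query_words:
            score = len(query_words & node_words) / len(query_words)
        else:
            score = 0.0
        scored.append((score, node))
    # Highest score first, matching the shape returned by reranker_callback
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


toy_nodes = [
    ToyNode("mcp servers expose tools"),
    ToyNode("raft is a training recipe"),
]
ranked = overlap_reranker(toy_nodes, "what is raft")
print(ranked[0][1].text_content)
```

A cross-encoder scores each query and passage jointly, so it is usually a far better relevance signal than word overlap. The toy scorer is here only to show the input and output shapes.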

In [14]:
knowledge_store = (
    MCPKnowledgeStore()
    .add_source(mcp_source)
    .add_source(llama_cloud_mcp_source)
    .with_reranker(reranker_callback)
)
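
Conceptually, a store with several sources has to send the query to each of them, pool the results, rerank, and keep the top `top_k`. The sketch below shows one plausible way to do that with `asyncio.gather`. It is an assumption made for illustration, not a description of how `MCPKnowledgeStore` is implemented, and all of the names in it are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

# A toy "source" is any async function mapping a query to a list of passages.
Source = Callable[[str], Awaitable[list[str]]]


async def kendra_like_source(query: str) -> list[str]:
    await asyncio.sleep(0)  # stands in for a network round trip
    return ["RAFT is a training recipe", "Kendra is a search service"]


async def llamacloud_like_source(query: str) -> list[str]:
    await asyncio.sleep(0)  # stands in for a network round trip
    return ["RAFT trains models to ignore distractor documents"]


def word_overlap_reranker(
    nodes: list[str], query: str
) -> list[tuple[float, str]]:
    """Toy scorer: number of query words that appear in the passage."""
    query_words = set(query.lower().split())
    scored = [
        (float(len(query_words & set(node.lower().split()))), node)
        for node in nodes
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


async def toy_store_retrieve(
    query: str, sources: list[Source], top_k: int
) -> list[tuple[float, str]]:
    # Query every source concurrently, then pool the returned passages
    batches = await asyncio.gather(*(source(query) for source in sources))
    nodes = [node for batch in batches for node in batch]
    # Rerank the pooled passages and keep only the best top_k
    return word_overlap_reranker(nodes, query)[:top_k]


# In a notebook cell:
# await toy_store_retrieve(
#     "what is raft", [kendra_like_source, llamacloud_like_source], top_k=2
# )
```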

In [15]:
res = await knowledge_store.retrieve("What is RAFT?", top_k=2)

In [16]:
res

[(0.7772398,
  KnowledgeNode(node_id='ca5afffd-ec50-4900-b25c-56f2e5b26e98', embedding=None, node_type=<NodeType.TEXT: 'text'>, text_content='...Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which...', image_content=None, metadata={'name': 'awslabs.amazon-kendra-index-mcp-server', 'tool_name': 'KendraQueryTool', 'query_param_name': 'query', 'tool_call_kwargs': {'indexId': '572aca26-16be-44df-84d3-4d96d778f120', 'region': 'ca-central-1'}, 'server_params': {'command': 'docker', 'args': ['run', '--rm', '--interactive', '--init', '--env-file', '/home/nerdai/Projects/fed-rag/docs/notebooks/.env', 'awslabs/amazon-kendra-index-mcp-server:latest'], 'env': None, 'cwd': None, 'encoding': 'utf-8', 'encoding_error_handler': 'strict'}})),
 (0.12683225,
  Knowledg

## Assemble a NoEncode RAG System

In [17]:
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
import torch
from transformers.generation.utils import GenerationConfig

generation_cfg = GenerationConfig(
    do_sample=True,
    eos_token_id=151643,
    bos_token_id=151643,
    max_new_tokens=2048,
    top_p=0.9,
    temperature=0.6,
    cache_implementation="offloaded",
    stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
    model_name="Qwen/Qwen2.5-3B",
    load_model_at_init=False,
    load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
    generation_config=generation_cfg,
)

In [18]:
from fed_rag import AsyncNoEncodeRAGSystem, RAGConfig

rag_config = RAGConfig(top_k=2)
rag_system = AsyncNoEncodeRAGSystem(
    knowledge_store=knowledge_store,
    generator=generator,
    rag_config=rag_config,
)
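
At a high level, a query against such a system is a retrieve-then-generate pipeline. The sketch below imitates that flow with toy components. The prompt template, the function names, and the canned passages are all assumptions made for illustration, and the real prompt used by `AsyncNoEncodeRAGSystem` may differ.

```python
import asyncio

# Hypothetical prompt template; the placeholders are filled in per query.
PROMPT_TEMPLATE = (
    "<context>\n{context}\n</context>\n<query>{query}</query>\n<response>"
)


async def toy_retrieve(query: str, top_k: int) -> list[tuple[float, str]]:
    """Stand-in for the knowledge store: canned, pre-ranked passages."""
    ranked = [
        (0.9, "first passage about the query topic"),
        (0.4, "second passage about the query topic"),
        (0.1, "an unrelated passage"),
    ]
    return ranked[:top_k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Join the retrieved passages and place them ahead of the query."""
    return PROMPT_TEMPLATE.format(context="\n".join(passages), query=query)


def toy_generate(prompt: str) -> str:
    """Stand-in for the LLM generator."""
    return f"(generated from a {len(prompt)}-character prompt)"


async def toy_query(
    query: str, top_k: int = 2
) -> tuple[str, list[tuple[float, str]]]:
    # 1. Retrieve: no query encoding happens in the main system
    source_nodes = await toy_retrieve(query, top_k)
    # 2. Augment: put the retrieved text into the prompt
    prompt = build_prompt(query, [text for _, text in source_nodes])
    # 3. Generate: return the answer together with its sources
    return toy_generate(prompt), source_nodes


# In a notebook cell: response, sources = await toy_query("What is RAFT?")
```

Returning the source nodes alongside the response mirrors the `source_nodes` attribute inspected at the end of this cookbook, which lets us check what the answer was grounded on.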

In [19]:
res = await rag_system.query(query="What is RAFT?")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [20]:
print(res)

RAFT stands for Retrieval-Augmented Fine-Tuning. It is a technique used in natural language processing that combines the power of retrieval-based methods with fine-tuning language models. This approach involves augmenting language models with retrieved information from external sources to enhance their performance on various NLP tasks. In RAFT, a language model is fine-tuned on a specific task while also incorporating information retrieved from external knowledge sources. This retrieved information helps the model to better understand the context of the task and make more informed predictions. RAFT can be implemented using Python and Hugging Face's Transformers library.


In [21]:
res.source_nodes

[SourceNode(score=0.7772397994995117, node=KnowledgeNode(node_id='3f170eec-31a5-4afb-a92e-ae8f8414c476', embedding=None, node_type=<NodeType.TEXT: 'text'>, text_content='...Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which...', image_content=None, metadata={'name': 'awslabs.amazon-kendra-index-mcp-server', 'tool_name': 'KendraQueryTool', 'query_param_name': 'query', 'tool_call_kwargs': {'indexId': '572aca26-16be-44df-84d3-4d96d778f120', 'region': 'ca-central-1'}, 'server_params': {'command': 'docker', 'args': ['run', '--rm', '--interactive', '--init', '--env-file', '/home/nerdai/Projects/fed-rag/docs/notebooks/.env', 'awslabs/amazon-kendra-index-mcp-server:latest'], 'env': None, 'cwd': None, 'encoding': 'utf-8', 'encoding_error_handler': 'strict'}}