<a target="_blank" href="https://colab.research.google.com/github/VectorInstitute/fed-rag/blob/main/docs/notebooks/no_encode_rag_with_mcp.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

_(NOTE: if running on Colab, you will need to supply a WandB API Key in addition to your HFToken. Also, you'll need to change the runtime to a T4.)_

# Build a NoEncode RAG System with an MCP Knowledge Store

## Introduction

In traditional RAG systems, there are three components: a retriever, a knowledge store, and a generator. A user's query is encoded by the retriever and used to retrieve relevant knowledge chunks from the knowledge store, which were themselves previously encoded by the retriever. The user query and the retrieved knowledge chunks are then passed to the LLM generator, which produces the response to the original query.

With NoEncode RAG systems, knowledge is still kept in a knowledge store and retrieved for responses to user queries, but there is no encoding step at all. Instead of pre-computing embeddings, NoEncode RAG systems query knowledge sources directly using natural language.

### Key Differences

**Traditional RAG:**
- Documents → Embed → Vector Store
- Query → Embed → Vector Search → Retrieve → Generate

**NoEncode RAG:**
- Knowledge Sources (MCP servers, APIs, databases)
- Query → Direct Natural Language Query → Retrieve → Generate

_**NOTE:** Knowledge sources may be traditional RAG systems themselves, and thus, these would involve encoding. However, the main RAG system does not handle encoding of queries or knowledge chunks at all._
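To make the NoEncode flow concrete, here is a minimal, dependency-free sketch. `ToyKnowledgeSource`, `build_prompt`, and `no_encode_query` are hypothetical names used only for illustration (they are not part of `fed_rag`), and the keyword match merely stands in for whatever search a real source performs. The point is that the main system hands the source a plain-text query and never computes an embedding itself.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToyKnowledgeSource:
    """Hypothetical stand-in for a live knowledge source (e.g. an MCP server)."""

    name: str
    passages: list[str]

    async def retrieve(self, query: str) -> list[str]:
        # The query arrives as plain text; no embedding is computed here.
        # Crude keyword match on words longer than three characters.
        terms = {w for w in query.lower().strip("?").split() if len(w) > 3}
        return [p for p in self.passages if terms & set(p.lower().split())]


def build_prompt(query: str, context: list[str]) -> str:
    """Join retrieved passages and the query into a single generator prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


async def no_encode_query(source: ToyKnowledgeSource, query: str) -> str:
    context = await source.retrieve(query)  # Query -> Retrieve
    return build_prompt(query, context)  # prompt that would go to the generator


source = ToyKnowledgeSource(
    name="toy",
    passages=["RAFT is a fine-tuning recipe.", "Kendra is a search service."],
)
# In a notebook, use `await no_encode_query(...)` instead of `asyncio.run(...)`.
prompt = asyncio.run(no_encode_query(source, "What is RAFT?"))
print(prompt)
```

In a real system the last step would pass the prompt to a generator LLM rather than print it.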

### Model Context Protocol (MCP)

MCP provides a standardized way for AI systems to connect to external tools and data sources. In our NoEncode RAG system, MCP servers act as live knowledge sources that can be queried directly with natural language. An MCP knowledge store acts as the MCP client host that creates connections to these servers and retrieves context from them.
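On the wire, an MCP tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request by hand to show where a natural-language query ends up. The helper `build_tool_call_request` and the tool name `search_docs` are hypothetical; in practice an MCP client library constructs and sends this message for you.

```python
from __future__ import annotations

import json
from typing import Any


def build_tool_call_request(
    request_id: int,
    tool_name: str,
    query_param_name: str,
    query: str,
    extra_arguments: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    # Start from any fixed arguments, then add the query under its parameter name.
    arguments = dict(extra_arguments or {})
    arguments[query_param_name] = query
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and extra argument, for illustration only.
request = build_tool_call_request(
    request_id=1,
    tool_name="search_docs",
    query_param_name="query",
    query="What is RAFT?",
    extra_arguments={"region": "ca-central-1"},
)
print(json.dumps(request, indent=2))
```

The query travels as an ordinary string argument, which is why the calling system needs no encoder of its own.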

### Outline

In this cookbook, we will stand up two MCP knowledge sources, use them as part of an MCP knowledge store, and finally build an `AsyncNoEncodeRAGSystem` that allows us to query these sources.

1. MCP Knowledge Source 1: an AWS Kendra Index MCP Server
2. MCP Knowledge Source 2: a LlamaCloud MCP Server
3. Create an MCP Knowledge Store (using our two built sources)
4. Assemble a NoEncode RAG System

## MCP Knowledge Source 1: an AWS Kendra Index MCP Server

Here, we make use of one of the many officially supported [AWS MCP servers](https://github.com/awslabs/mcp?tab=readme-ov-file#available-servers) offered by [AWS Labs](https://github.com/awslabs), namely their [AWS Kendra Index MCP Server](https://github.com/awslabs/mcp/tree/main/src/amazon-kendra-index-mcp-server).

AWS Kendra is an enterprise search service powered by machine learning. It can search across various data sources including documents, FAQs, knowledge bases, and websites, providing intelligent answers to natural language queries.

### Pre-requisite Steps

#### Create a Kendra Index

To use this MCP server, you need to create a new Kendra Index. Add an S3 data connector to it that contains the [RAFT paper](https://arxiv.org/pdf/2403.10131), and make sure to sync your index so that it's ready to be queried. Finally, fill in the information below regarding your Kendra Index:

In [1]:
# info regarding your kendra index which needs to be passed to the MCP tool call
kendra_index_info = {
    "indexId": "572aca26-16be-44df-84d3-4d96d778f120",
    "region": "ca-central-1",
}

#### Build the Kendra Index MCP Server Docker image

With our Kendra index in hand, we can now build a local MCP server that interacts with it. To do this, we take the following steps:

1. Clone the `awslabs/mcp` Github repo:

```sh
git clone https://github.com/awslabs/mcp.git
```

2. cd into the Kendra index src directory:

```sh
cd mcp/src/amazon-kendra-index-mcp-server
```

3. Locally build the Docker image:

```sh
docker build -t awslabs/amazon-kendra-index-mcp-server .
```

#### Configure AWS Credentials

Create a `.env` file in the same directory as this notebook, with your AWS credentials:

```sh
# .env file
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
```
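Because the Docker container reads its credentials from this file, a quick sanity check before launching the server can save a confusing failure later. The following is a minimal sketch, assuming the simple `KEY=VALUE` format shown above; `missing_env_keys` is a hypothetical helper, and the demo writes a throwaway file rather than touching your real `.env`.

```python
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_env_keys(env_path, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty in a KEY=VALUE file."""
    found = {}
    for raw in Path(env_path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        found[key.strip()] = value.strip()
    return [key for key in required if not found.get(key)]


# Demo with a throwaway file; point this at your real `.env` instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_env = Path(tmp) / ".env"
    demo_env.write_text(
        "# .env file\nAWS_ACCESS_KEY_ID=example\nAWS_SECRET_ACCESS_KEY=\n"
    )
    missing = missing_env_keys(demo_env)

print("Missing or empty keys:", missing)
```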

### Build the MCP Stdio Knowledge Source

In [2]:
import os
from mcp import StdioServerParameters
from fed_rag.knowledge_stores.no_encode import MCPStdioKnowledgeSource

In [3]:
server_params = StdioServerParameters(
    command="docker",
    args=[
        "run",
        "--rm",
        "--interactive",
        "--init",  # important to have in Jupyter Notebook
        "--env-file",
        f"{os.getcwd()}/.env",
        "awslabs/amazon-kendra-index-mcp-server:latest",
    ],
)

mcp_source = MCPStdioKnowledgeSource(
    name="awslabs.amazon-kendra-index-mcp-server",
    server_params=server_params,
    tool_name="KendraQueryTool",
    query_param_name="query",
    tool_call_kwargs=kendra_index_info,
)

Let's test out our new MCP source by invoking the `retrieve()` method with a specific query. This will return an `~mcp.CallToolResult` object.

In [4]:
call_tool_result = await mcp_source.retrieve("What is RAFT?")

#### Converting MCP Call Tool Results to Knowledge Nodes

MCP tool results are automatically converted to `KnowledgeNode` objects using a default converter in `MCPStdioKnowledgeSource`. This generic converter works for basic use cases but may not extract all valuable information from server-specific responses. Implement a custom converter to optimize knowledge extraction for your particular MCP server. Let's see the default converter in action first and determine if we need to create our own converter function.

In [5]:
# using the default converter function
knowledge_nodes = mcp_source.call_tool_result_to_knowledge_nodes_list(
    call_tool_result
)

print("Number of knowledge nodes created: ", len(knowledge_nodes), "\n")
print(
    "Text content of first created node:\n",
    knowledge_nodes[0].text_content[:500],
)

Number of knowledge nodes created:  1 

Text content of first created node:
 {"query": "What is RAFT?", "total_results_count": 4, "results": [{"id": "1881517b-ad6c-4c37-8fd1-f71018fe3f7c-02036f99-420a-457b-8bb3-edd7114d173d", "type": "ANSWER", "document_title": "raft.pdf", "document_uri": "https://fed-rag-mcp-cookbook.s3.ca-central-1.amazonaws.com/raft.pdf", "score": "HIGH", "excerpt": "In this paper, we present Retrieval Augmented\nFine Tuning (RAFT), a training recipe which improves the model\u2019s ability\nto answer questions in \"open-book\" in-domain settings. In t


As we can see, the default conversion is not ideal: the entire JSON payload ends up in a single node. We should pass only the `excerpt` text of each result to the LLM generator, which calls for a custom converter function.

According to the [source code](https://github.com/awslabs/mcp/blob/36ae951bb8cfed234f69f6336fe2463eb4e08587/src/amazon-kendra-index-mcp-server/awslabs/amazon_kendra_index_mcp_server/server.py#L150) for this server, a successful call of `KendraQueryTool` returns a `CallToolResult` whose text content (the `text` attribute of each `TextContent` item in `content`) is a JSON string containing a `results` key. The value for `results` is a list of `result_items`, each containing an `excerpt` field, which is ultimately what we want to pass to the LLM generator.

Let's create a custom converter function to do this now.

In [6]:
import json
import re

from typing import Any
from mcp.types import CallToolResult
from fed_rag.data_structures import KnowledgeNode


# signature of a converter function
def kendra_index_converter_fn(
    result: CallToolResult, metadata: dict[str, Any] | None = None
) -> list[KnowledgeNode]:
    nodes = []
    for c in result.content:
        if c.type == "text":  # only use ~mcp.TextContent
            data = json.loads(c.text)
            for res in data["results"]:
                # take only the content in the "excerpt" key
                text_content = re.sub(r"\s+", " ", res["excerpt"].strip())
                nodes.append(
                    # create the knowledge node
                    KnowledgeNode(
                        node_type="text",
                        text_content=text_content,
                        metadata=metadata,
                    )
                )
    return nodes
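The converter above assumes every text item is well-formed JSON with a `results` list. If you want it to tolerate unexpected payloads (for example, a plain-text error message, which is an assumption about possible failure modes rather than documented server behaviour), the parsing step could be factored out along these lines. `extract_excerpts` is a hypothetical helper that uses only the standard library.

```python
import json
import re


def extract_excerpts(payload_text: str) -> list[str]:
    """Return whitespace-normalised excerpts, or [] if the payload is unusable."""
    try:
        data = json.loads(payload_text)
    except json.JSONDecodeError:
        return []  # not JSON at all, e.g. a plain-text error message
    if not isinstance(data, dict):
        return []
    excerpts = []
    for item in data.get("results") or []:
        excerpt = item.get("excerpt") if isinstance(item, dict) else None
        if excerpt:
            # Collapse newlines and repeated spaces, as the converter above does.
            excerpts.append(re.sub(r"\s+", " ", excerpt.strip()))
    return excerpts


sample = json.dumps(
    {
        "query": "What is RAFT?",
        "results": [
            {"excerpt": "In this paper, we present Retrieval Augmented\nFine Tuning"},
            {"document_title": "raft.pdf"},  # no excerpt: skipped
        ],
    }
)
good = extract_excerpts(sample)
bad = extract_excerpts("Error: something went wrong")
print(good, bad)
```

Inside a converter, each returned string would then be wrapped in a `KnowledgeNode` exactly as in the cell above.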

Let's test out our custom converter as a standalone function on the previously obtained `call_tool_result`.

In [7]:
knowledge_nodes = kendra_index_converter_fn(call_tool_result)

print("Number of knowledge nodes created: ", len(knowledge_nodes), "\n")
print(
    "Text content of first created node:\n",
    knowledge_nodes[0].text_content[:500],
    "\n",
)
print(
    "Text content of second created node:\n",
    knowledge_nodes[1].text_content[:500],
)

Number of knowledge nodes created:  4 

Text content of first created node:
 In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t 

Text content of second created node:
 3 RAFT In this section, we present RAFT, a novel way of training LLMs for domain-specific open- book exams. We first introduce the classical technique of supervised fine-tuning, followed with the key takeaways from our experiments. Then, we introduce RAFT , a modified version of general instructio


This is much improved and should work better as context for the LLM generator. We can easily update our `mcp_source` to use this converter function.

In [8]:
# update the converter function
mcp_source = mcp_source.with_converter(kendra_index_converter_fn)

## MCP Knowledge Source 2: a LlamaCloud MCP Server

In this part of the cookbook, we'll stand up an MCP server using [LlamaCloud](https://docs.llamaindex.ai/en/stable/llama_cloud/)—an enterprise solution by LlamaIndex—by following their MCP [demo](https://github.com/run-llama/llamacloud-mcp?tab=readme-ov-file#llamacloud-as-an-mcp-server).

LlamaCloud provides document parsing, indexing, and retrieval capabilities. By exposing these through an MCP server, we can query processed documents directly using natural language without managing our own document processing pipeline.

### Pre-requisite Steps

The steps below follow from the setup instructions listed in the Github repo for the [llamacloud-mcp demo](https://github.com/run-llama/llamacloud-mcp).

This requires a LlamaCloud account. If you don't have one, you can create one by visiting <https://cloud.llamaindex.ai/>.

#### Create a LlamaCloud Index

Log in to LlamaCloud with your account and navigate to "Tool" > "Index" in the left sidebar. Click the "Create Index" button to create a new index. After creating the new index, upload the [RA-DIT paper](https://arxiv.org/pdf/2310.01352).

__NOTE__: You'll need to supply information on your new index in the next step.

#### Create a local MCP server to connect with the LlamaCloud Index

1. Clone the `llamacloud-mcp` demo Github repo:

```sh
git clone https://github.com/run-llama/llamacloud-mcp.git
```

2. cd into the `llamacloud-mcp` directory:

```sh
cd llamacloud-mcp
```

3. Update `mcp-server.py`:

```python
from dotenv import load_dotenv
from mcp.server.fastmcp import FastMCP
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
import os

load_dotenv()

mcp = FastMCP('llama-index-server')

@mcp.tool(name='LlamaCloudQueryTool')
def llama_index_query(query: str) -> str:
    """Search the llama-index documentation for the given query."""

    index = LlamaCloudIndex(
        name="<your-llamacloud-index-name>", 
        project_name="Default",  # change this if you didn't use default project name
        organization_id="<your-llamacloud-org-id>",
        api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
    )

    response = index.as_query_engine().query(query + " Be verbose and include code examples.")

    return str(response)

if __name__ == "__main__":
    mcp.run(transport="stdio")
```

4. Create a `.env` file in the `llamacloud-mcp` directory

```sh
# .env
LLAMA_CLOUD_API_KEY=<your-llamacloud-api-key>
OPENAI_API_KEY=<your-openai-api-key>
```

__NOTE__: This llamacloud-mcp demo builds a `~llama_index.QueryEngine()` which by default uses an OpenAI LLM.

### Build the MCP Stdio Knowledge Source

In [9]:
# change this to your actual path
path_to_llamacloud_mcp = "/home/nerdai/OSS"

In [10]:
llama_cloud_server_params = StdioServerParameters(
    command="sh",
    args=[
        "-c",
        f"cd {path_to_llamacloud_mcp}/llamacloud-mcp && poetry install && exec poetry run python mcp-server.py",
    ],
)

llama_cloud_mcp_source = MCPStdioKnowledgeSource(
    name="llama-index-server",
    server_params=llama_cloud_server_params,
    tool_name="LlamaCloudQueryTool",
    query_param_name="query",
)
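The shell command above interpolates `path_to_llamacloud_mcp` directly into a string passed to `sh -c`, which breaks if the path contains spaces or other shell metacharacters. One possible guard, sketched below, is to quote the directory with `shlex.quote` from the standard library. `build_launch_command` is a hypothetical helper, and the second path is made up for illustration.

```python
import posixpath
import shlex


def build_launch_command(repo_parent: str, repo_name: str = "llamacloud-mcp") -> str:
    """Build the `sh -c` command string with the repo directory safely quoted."""
    repo_dir = posixpath.join(repo_parent, repo_name)
    return (
        f"cd {shlex.quote(repo_dir)} "
        "&& poetry install "
        "&& exec poetry run python mcp-server.py"
    )


plain = build_launch_command("/home/nerdai/OSS")
spaced = build_launch_command("/home/me/my projects")  # hypothetical path
print(plain)
print(spaced)
```

The resulting string can be used as the second element of `args` in place of the f-string. For paths without special characters, the command is unchanged.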

In [11]:
res = await llama_cloud_mcp_source.retrieve("What is RALT?")

In [12]:
res

CallToolResult(meta=None, content=[TextContent(type='text', text='RALT stands for Retrieval-Augmented Language Model. It is a model that combines traditional language models with a retrieval mechanism to enhance performance on various natural language processing tasks. By incorporating a retrieval component, RALT models can access external knowledge sources to improve their understanding and generation capabilities.\n\nHere is an example of how a RALT model can be implemented using Python and the Hugging Face Transformers library:\n\n```python\nfrom transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n\n# Load the RALT model and tokenizer\nmodel_name = "your_RALT_model_name"\nmodel = AutoModelForSeq2SeqLM.from_pretrained(model_name)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# Define a prompt for the RALT model\nprompt = "Your prompt here"\n\n# Tokenize the prompt\ninputs = tokenizer(prompt, return_tensors="pt")\n\n# Generate output from the RALT model\noutput = mod

In [13]:
knowledge_nodes = (
    llama_cloud_mcp_source.call_tool_result_to_knowledge_nodes_list(res)
)

print("Number of results returned: ", len(knowledge_nodes), "\n")
print(
    "Text content of first returned node:\n",
    knowledge_nodes[0].text_content[:500],
)

Number of results returned:  1 

Text content of first returned node:
 RALT stands for Retrieval-Augmented Language Model. It is a model that combines traditional language models with a retrieval mechanism to enhance performance on various natural language processing tasks. By incorporating a retrieval component, RALT models can access external knowledge sources to improve their understanding and generation capabilities.

Here is an example of how a RALT model can be implemented using Python and the Hugging Face Transformers library:

```python
from transformers im


## Create an MCP Knowledge Store

In [14]:
from fed_rag.knowledge_stores.no_encode import MCPKnowledgeStore
from sentence_transformers import CrossEncoder

### Define a ReRanker

When `MCPKnowledgeStore` retrieves knowledge from multiple MCP sources, you can provide a `reranker_callback` function to rank and filter results by relevance. This optimization step ensures downstream components receive only the highest-quality, most contextually relevant information for a given query.

Below we'll use a `sentence_transformers.CrossEncoder` to rerank the nodes from our two MCP sources.

In [15]:
def reranker_callback(
    nodes: list[KnowledgeNode], query: str
) -> list[tuple[float, KnowledgeNode]]:
    model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L2")
    # Build [query, passage] pairs and predict a relevance score for each pair
    model_inputs = [[query, n.text_content] for n in nodes]
    scores = model.predict(model_inputs)

    # Sort the scores in decreasing order
    results = [(score, node) for score, node in zip(scores, nodes)]
    return sorted(results, key=lambda x: x[0], reverse=True)
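Any callable with the same shape (nodes and a query in, a descending list of `(score, node)` pairs out) can serve as a reranker. As an illustration that needs no model download, here is a sketch of a purely lexical reranker based on Jaccard token overlap. It is not a substitute for the cross-encoder's quality; `Node` is a hypothetical stand-in that exposes only the `text_content` attribute, and `overlap_reranker` is likewise an illustrative name.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in exposing only the attribute the reranker reads."""

    text_content: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_reranker(nodes: list[Node], query: str) -> list[tuple[float, Node]]:
    """Score each node by Jaccard overlap with the query; highest score first."""
    query_tokens = _tokens(query)
    scored = []
    for node in nodes:
        node_tokens = _tokens(node.text_content)
        union = query_tokens | node_tokens
        score = len(query_tokens & node_tokens) / len(union) if union else 0.0
        scored.append((score, node))
    # Sort on the score only, so nodes themselves never need to be comparable.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


nodes = [
    Node("Kendra is an enterprise search service"),
    Node("What is RAFT? RAFT is a training recipe"),
]
ranked = overlap_reranker(nodes, "What is RAFT?")
for score, node in ranked:
    print(round(score, 3), node.text_content)
```

A cheap reranker like this can be handy for smoke-testing the pipeline before switching to a learned model.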

In [16]:
# adding the re-ranker to the knowledge store is easy to do!
knowledge_store = (
    MCPKnowledgeStore()
    .add_source(mcp_source)
    .add_source(llama_cloud_mcp_source)
    .with_reranker(reranker_callback)
)
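Conceptually, a store with several sources and a reranker has to fan a query out to every source, merge what comes back, rerank, and keep the best `top_k` results. The sketch below shows that general pattern with `asyncio.gather` and toy sources. It is an assumption about the overall shape of the computation, not a description of how `MCPKnowledgeStore` is implemented, and all names in it are hypothetical.

```python
import asyncio


async def source_a(query: str) -> list[str]:
    return ["RAFT is a training recipe.", "Kendra indexes documents."]


async def source_b(query: str) -> list[str]:
    return ["RAFT trains models to ignore distractor documents."]


def keyword_reranker(texts: list[str], query: str) -> list[tuple[float, str]]:
    """Toy scorer: count occurrences of the query's last word in each text."""
    keyword = query.rstrip("?").split()[-1].lower()
    scored = [(float(text.lower().count(keyword)), text) for text in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


async def retrieve_from_all(sources, query, reranker, top_k):
    # Query every source concurrently; results come back in source order.
    batches = await asyncio.gather(*(source(query) for source in sources))
    merged = [text for batch in batches for text in batch]
    # Rerank the merged pool, then keep only the best `top_k` entries.
    return reranker(merged, query)[:top_k]


# In a notebook, use `await retrieve_from_all(...)` instead of `asyncio.run(...)`.
top_results = asyncio.run(
    retrieve_from_all([source_a, source_b], "What is RAFT?", keyword_reranker, top_k=2)
)
for score, text in top_results:
    print(score, text)
```

Because reranking happens over the merged pool, the final `top_k` can mix results from different sources.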

In [17]:
res = await knowledge_store.retrieve("What is RAFT?", top_k=2)

In [18]:
res

[(0.7772398,
  KnowledgeNode(node_id='4a032b54-4037-4ba8-ad6c-7cc9f810bcc0', embedding=None, node_type=<NodeType.TEXT: 'text'>, text_content='...Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which...', image_content=None, metadata={'name': 'awslabs.amazon-kendra-index-mcp-server', 'tool_name': 'KendraQueryTool', 'query_param_name': 'query', 'tool_call_kwargs': {'indexId': '572aca26-16be-44df-84d3-4d96d778f120', 'region': 'ca-central-1'}, 'server_params': {'command': 'docker', 'args': ['run', '--rm', '--interactive', '--init', '--env-file', '/home/nerdai/Projects/fed-rag/docs/notebooks/.env', 'awslabs/amazon-kendra-index-mcp-server:latest'], 'env': None, 'cwd': None, 'encoding': 'utf-8', 'encoding_error_handler': 'strict'}})),
 (0.18298462,
  Knowledg

## Assemble a NoEncode RAG System

Now that we have built our `MCPKnowledgeStore`, we can assemble our NoEncode RAG system. Recall that NoEncode RAG systems forgo the encoding step, so we don't require a retriever model as we did with traditional RAG systems. All we need is a generator!

In [19]:
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
import torch
from transformers.generation.utils import GenerationConfig

generation_cfg = GenerationConfig(
    do_sample=True,
    eos_token_id=151643,
    bos_token_id=151643,
    max_new_tokens=2048,
    top_p=0.9,
    temperature=0.6,
    cache_implementation="offloaded",
    stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
    model_name="Qwen/Qwen2.5-3B",
    load_model_at_init=False,
    load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
    generation_config=generation_cfg,
)

In [20]:
from fed_rag import AsyncNoEncodeRAGSystem, RAGConfig

rag_config = RAGConfig(top_k=2)
rag_system = AsyncNoEncodeRAGSystem(
    knowledge_store=knowledge_store,
    generator=generator,
    rag_config=rag_config,
)

In [21]:
res = await rag_system.query(query="What is RAFT?")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [22]:
# final RAG response
print(res)

RAFT stands for Retrieval-Augmented Fine-Tuning, a training recipe that enhances a language model's ability to answer questions in "open-book" in-domain settings. It involves fine-tuning the model using a combination of in-context retrieval augmentation and instruction tuning methods. By incorporating relevant information from external sources, RAFT improves the model's understanding of the context and generates more accurate responses.


In [23]:
# a peek at the retrieved source nodes from MCP knowledge store
for ix, sn in enumerate(res.source_nodes):
    print(
        f"SOURCE NODE {ix}:\nSCORE: {sn.score}\nSOURCE: {sn.metadata['name']}\nTEXT: {sn.text_content[:500]}\n\n"
    )

SOURCE NODE 0:
SCORE: 0.7772397994995117
SOURCE: awslabs.amazon-kendra-index-mcp-server
TEXT: ...Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which...


SOURCE NODE 1:
SCORE: 0.6331754326820374
SOURCE: llama-index-server
TEXT: RAFT stands for Retrieval-Augmented Fine-Tuning. It is a technique used in natural language processing to enhance language models by incorporating retrieved information during the fine-tuning process. This approach involves leveraging external knowledge sources to improve the model's performance on various tasks.

In RAFT, the language model is fine-tuned using a combination of in-context retrieval augmentation and instruction tuning methods. By retrieving relevant information from external sour




## In Summary

In this comprehensive notebook, we covered how to bring in context from MCP servers (or sources). More specifically, we went through:

- How to build and interact with an `MCPStdioKnowledgeSource`
- How to build and interact with an `MCPKnowledgeStore` that is connected to these sources
- How to then assemble a NoEncode RAG system that combines the `MCPKnowledgeStore` with a chosen generator LLM
- How to define and use a reranker callback to better prioritize the retrieved knowledge nodes from the multiple MCP sources

## What's Next

After assembling the `NoEncodeRAGSystem`, you can use it with any `GeneratorTrainers` (e.g., [HuggingFaceTrainerForRALT](http://127.0.0.1:8000/api_reference/trainers/huggingface/#src.fed_rag.trainers.huggingface.ralt.HuggingFaceTrainerForRALT)) to fine-tune the RAG system to better adapt to your MCP knowledge store.