# Agentic RAG with Llama Stack

This notebook shows how to integrate **Docling MCP** tools into the agentic RAG workflow available in Llama Stack.

We will use the Llama Stack framework. For an introduction to Llama Stack capabilities, including its current built-in tools, see the
[Llama Stack Demos on OpenDataHub](https://github.com/opendatahub-io/llama-stack-demos) repository.

This example will use the inline Milvus component available in the Llama Stack distributions.

### Tools:

We will use tools built into Llama Stack and tools from the Docling MCP server to perform tasks such as:
- [`mcp::docling`] convert a PDF file from a local or remote location into a unified document representation, the [DoclingDocument](https://docling-project.github.io/docling/concepts/docling_document/).
- [`mcp::docling`] chunk the document and ingest it into the Llama Stack vector database.
- [`builtin::rag/knowledge_search`] search the document using agentic RAG techniques.

## Pre-Requisites

Before starting this notebook, ensure that you have:
- Followed the instructions in the [README.md](./README.md) file to set up the following resources:
  - Inference model with Ollama
  - Llama Stack server with the Ollama template [distribution-starter](https://hub.docker.com/r/llamastack/distribution-starter)
  - Docling MCP server 

You may want to create a virtual environment to run this notebook, for instance with [uv](https://docs.astral.sh/uv/). Make sure to install the `llama-stack` optional dependencies as well as the `examples` dependency group:

```bash
uv venv
source .venv/bin/activate
uv sync --extra llama-stack --group examples
```


## Setting Up this Notebook

Rename or copy the [`.env.example`](./.env.example) file to create a new file called `.env`. Most environment variables already have default values suitable for running this notebook, and they match the setup described in the prerequisites, such as the Llama Stack server and Docling MCP endpoints.

```bash
cp .env.example .env
```

### Environment variables required for this notebook

- `BASE_URL`: the URL of the remote Llama Stack server. Defaults to `http://localhost:8321`.
- `INFERENCE_MODEL`: the generative AI model id. Defaults to the Meta Llama 3.2 3B Instruct model (`ollama/llama3.2:3b-instruct-fp16` in the `Settings` class below).
- `TEMPERATURE` (optional): the temperature to use during inference. Defaults to 0.0.
- `TOP_P` (optional): the top_p parameter to use during inference. Defaults to 0.95.
- `MAX_TOKENS` (optional): the maximum number of tokens that can be generated in the completion. Defaults to 4096.
- `STREAM` (optional): set this to True to stream the output of the model/agent and False otherwise. Defaults to False in the `Settings` class below.
- `USE_PROMPT_CHAINING` (optional): controls whether the task is sent as several separate prompts, one per step, or as a single turn. Defaults to True in the `Settings` class below.
- `DOCLING_MCP_URL`: the URL for the Docling MCP server. If the client does not find the tool registered to the llama-stack instance, it will use this URL to register it.
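
One way to picture the `USE_PROMPT_CHAINING` option is the small sketch below. It is only an illustration: the `build_prompts` helper and the wording of the combined prompt are hypothetical and are not part of this notebook or of Llama Stack. With chaining enabled, each step becomes its own user turn; otherwise the steps are merged into one numbered prompt.

```python
def build_prompts(steps: list[str], use_prompt_chaining: bool) -> list[str]:
    """Return one prompt per step when chaining, or a single combined prompt otherwise.

    Hypothetical helper for illustration only.
    """
    if use_prompt_chaining:
        # Prompt chaining: every step is sent as a separate turn.
        return list(steps)
    # Single turn: number the steps so the model can follow them in order.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return [f"Complete the following steps in order:\n{numbered}"]


steps = [
    "Convert the PDF document to DoclingDocument.",
    "Ingest the document.",
    "Answer the question using the vector database.",
]

chained_prompts = build_prompts(steps, use_prompt_chaining=True)
single_prompt = build_prompts(steps, use_prompt_chaining=False)

print(len(chained_prompts))
print(single_prompt[0])
```

Prompt chaining tends to be easier for small models because each turn asks for a single tool call, at the cost of more round trips.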

## Necessary Imports

In [None]:
import logging
import uuid

from llama_stack_client import Agent
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.event_logger import EventLogger
from pydantic import NonNegativeFloat, AnyHttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict
from rich.pretty import pprint

from src.utils import step_printer, user_printer

# set the logger
logger = logging.getLogger(__name__)
if not logger.hasHandlers():
    logger.setLevel(logging.INFO)
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.INFO)
    formatter = logging.Formatter("%(message)s")
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)


# access the environment variables
class Settings(BaseSettings):
    base_url: AnyHttpUrl = "http://localhost:8321"
    inference_model: str = "ollama/llama3.2:3b-instruct-fp16"
    max_tokens: int = 4096
    temperature: NonNegativeFloat = 0.0
    top_p: float = 0.95
    stream: bool = False
    use_prompt_chaining: bool = True

    docling_mcp_url: AnyHttpUrl = "http://host.containers.internal:8000/mcp"

    vdb_provider: str = "milvus"
    vdb_embedding: str = "all-MiniLM-L6-v2"
    vdb_embedding_dimension: int = 384
    # vdb_embedding_window: int = 256

    model_config = SettingsConfigDict(
        env_file=".env", env_file_encoding="utf-8", extra="ignore"
    )


settings = Settings()
pprint(settings)

## Setting Up the Server Connection

Establish the connection to your Llama Stack server.

In [4]:
client = LlamaStackClient(base_url=str(settings.base_url))
print(f"Connected to Llama Stack server @ {client.base_url}")

Connected to Llama Stack server @ http://localhost:8321/


## Initializing the Inference Parameters

Fetch the inference-related parameters from the corresponding environment variables and convert them to the format Llama Stack expects.

In [5]:
if settings.temperature > 0.0:
    strategy = {
        "type": "top_p",
        "temperature": settings.temperature,
        "top_p": settings.top_p,
    }
else:
    strategy = {"type": "greedy"}

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": settings.max_tokens,
}

print(
    f"Inference Parameters:\n\tSampling Parameters: {sampling_params}\n\tstream: {settings.stream}"
)

Inference Parameters:
	Sampling Parameters: {'strategy': {'type': 'greedy'}, 'max_tokens': 4096}
	stream: False
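
The same selection logic can be wrapped in a small function so it is easy to reuse and check without a running server. This is a minimal sketch: the name `build_sampling_params` and the extra validation are assumptions added for illustration, while the returned dictionary has the same shape as in the cell above.

```python
from typing import Any


def build_sampling_params(
    temperature: float, top_p: float, max_tokens: int
) -> dict[str, Any]:
    """Return sampling parameters in the same dict shape used in the cell above.

    Hypothetical helper for illustration only.
    """
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    if temperature > 0.0:
        # Non-zero temperature: sample with nucleus (top_p) sampling.
        strategy: dict[str, Any] = {
            "type": "top_p",
            "temperature": temperature,
            "top_p": top_p,
        }
    else:
        # Zero temperature: deterministic greedy decoding.
        strategy = {"type": "greedy"}
    return {"strategy": strategy, "max_tokens": max_tokens}


greedy_params = build_sampling_params(0.0, 0.95, 4096)
sampled_params = build_sampling_params(0.7, 0.95, 4096)
print(greedy_params)
print(sampled_params)
```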


## Validate that the Docling MCP tools are available in the Llama Stack instance

When a Llama Stack instance is redeployed, its tools may need to be re-registered. Conversely, if a toolgroup is already registered with a Llama Stack instance, trying to register another one with the same `toolgroup_id` raises an error.

For this reason, we recommend validating your tools and toolgroups first. The following code checks whether the `mcp::docling` toolgroup is registered and, if it is not, attempts to register it using the endpoint given in `DOCLING_MCP_URL`.

In [6]:
registered_tools = client.tools.list()
registered_toolgroups = [t.toolgroup_id for t in registered_tools]
if "mcp::docling" not in registered_toolgroups:
    client.toolgroups.register(
        toolgroup_id="mcp::docling",
        provider_id="model-context-protocol",
        mcp_endpoint={"uri": str(settings.docling_mcp_url)},
    )

registered_tools = client.tools.list()
registered_toolgroups = [t.toolgroup_id for t in registered_tools]
print(
    f"Your Llama Stack server is registered with the following tool groups @ {set(registered_toolgroups)} \n"
)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1/tools "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/tools "HTTP/1.1 200 OK"


Your Llama Stack server is registered with the following tool groups @ {'builtin::websearch', 'mcp::docling', 'builtin::rag'} 
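
The check-then-register pattern used above can be sketched in isolation. The `ensure_toolgroup` helper and the in-memory `fake_registry` below are hypothetical stand-ins so that the example runs without a Llama Stack server; they are not part of the client library.

```python
from typing import Callable, Iterable


def ensure_toolgroup(
    toolgroup_id: str,
    list_toolgroup_ids: Callable[[], Iterable[str]],
    register: Callable[[str], None],
) -> bool:
    """Register toolgroup_id only if it is missing.

    Returns True if a registration happened, False if it already existed.
    Hypothetical helper for illustration only.
    """
    if toolgroup_id in set(list_toolgroup_ids()):
        return False
    register(toolgroup_id)
    return True


# In-memory stand-in for the server-side registry (illustration only).
fake_registry = {"builtin::rag", "builtin::websearch"}

first_call = ensure_toolgroup("mcp::docling", lambda: fake_registry, fake_registry.add)
second_call = ensure_toolgroup("mcp::docling", lambda: fake_registry, fake_registry.add)
print(first_call, second_call, sorted(fake_registry))
```

Against a real server, `list_toolgroup_ids` could wrap the `client.tools.list()` call and `register` could wrap the `client.toolgroups.register(...)` call shown in the cell above. Calling the helper twice is safe because the second call finds the toolgroup and does nothing.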



## Defining our Agent - Prompt Chaining

We define an agent equipped with the **Docling MCP** tools together with the built-in `knowledge_search` tool. The agent should accomplish the following tasks in a multi-step, multi-tool approach:

1. Convert a PDF file into the `DoclingDocument` format.
2. Ingest the result into the vector database.
3. Search the vector database using an agentic, multi-step approach.

In [7]:
model_prompt = """You are a helpful assistant. You have access to a number of tools.
Whenever a tool is called, be sure to return the Response in a friendly and helpful tone.
"""

In [24]:
# define the name of the vectordb collection to use
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=settings.vdb_embedding,
    embedding_dimension=settings.vdb_embedding_dimension,
    provider_id=settings.vdb_provider,
)


# Create simple agent with tools
agent = Agent(
    client=client,
    model=settings.inference_model,  # model id, read from the INFERENCE_MODEL environment variable
    instructions=model_prompt,  # update the system prompt based on the model you are using
    tools=[
        dict(
            name="mcp::docling/convert_pdf_document_into_docling_document",
            args={},
        ),
        dict(
            name="mcp::docling/insert_document_to_vectordb",
            args={
                "vector_db_id": vector_db_id,
            },
        ),
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [
                    vector_db_id
                ],  # list of IDs of document collections to consider during retrieval
            },
        ),
    ],
    tool_config={"tool_choice": "auto"},
    sampling_params=sampling_params,
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/vector-dbs "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/tools?toolgroup_id=mcp%3A%3Adocling%2Fconvert_pdf_document_into_docling_document "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/tools?toolgroup_id=mcp%3A%3Adocling%2Finsert_document_to_vectordb "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/tools?toolgroup_id=builtin%3A%3Arag%2Fknowledge_search "HTTP/1.1 200 OK"


In [25]:
user_prompts = [
    "Convert the PDF document on https://arxiv.org/pdf/2206.01062 to DoclingDocument.",
    "Ingest the document.",
    "Answer with the document knowledge in the vectordb: How many pages were manually annotated in the dataset?",
]
session_id = agent.create_session(f"docling-session_{uuid.uuid4()}")

for i, prompt in enumerate(user_prompts):
    user_printer(prompt)
    response = agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
        stream=settings.stream,
    )
    if settings.stream:
        for log in EventLogger().log(response):
            log.print()
    else:
        step_printer(
            response.steps
        )  # print the steps of an agent's response in a formatted way.

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/13f24c2d-512a-43a6-96ac-dbb02a87b781/session "HTTP/1.1 200 OK"


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/13f24c2d-512a-43a6-96ac-dbb02a87b781/session/19caa645-d53a-442f-83cd-88f37f274799/turn "HTTP/1.1 200 OK"


👤 User Query:
Convert the PDF document on https://arxiv.org/pdf/2206.01062 to DoclingDocument.

---------- 📍 Step 1: InferenceStep ----------
🛠️ Tool call Generated:
Tool call: convert_pdf_document_into_docling_document, Arguments: {'source': 'https://arxiv.org/pdf/2206.01062'}

---------- 📍 Step 2: ToolExecutionStep ----------
🔧 Executing tool...


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/13f24c2d-512a-43a6-96ac-dbb02a87b781/session/19caa645-d53a-442f-83cd-88f37f274799/turn "HTTP/1.1 200 OK"



---------- 📍 Step 3: InferenceStep ----------
🤖 Model Response:
The PDF document has been successfully converted to a Docling document and stored in the local cache. The unique key for this document is '868f49ae1f0e66e82238a8aea43fd30b'. You can use this key to insert the document into a vector database or perform knowledge searches on it.

👤 User Query:
Ingest the document.

---------- 📍 Step 1: InferenceStep ----------
🛠️ Tool call Generated:
Tool call: insert_document_to_vectordb, Arguments: {'document_key': '868f49ae1f0e66e82238a8aea43fd30b', 'vector_db_id': 'my_vector_db'}

---------- 📍 Step 2: ToolExecutionStep ----------
🔧 Executing tool...


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/13f24c2d-512a-43a6-96ac-dbb02a87b781/session/19caa645-d53a-442f-83cd-88f37f274799/turn "HTTP/1.1 200 OK"



---------- 📍 Step 3: InferenceStep ----------
🤖 Model Response:
The document has been successfully ingested into the specified vector database. The unique identifier for this vector database is 'test_vector_db_90b20b43-ce39-481b-a95c-a997d2c7e9ca'. You can now perform knowledge searches on the converted PDF document using this vector database.

👤 User Query:
Answer with the document knowledge in the vectordb: How many pages were manually annotated in the dataset?

---------- 📍 Step 1: InferenceStep ----------
🛠️ Tool call Generated:
Tool call: knowledge_search, Arguments: {'query': 'number of pages manually annotated in dataset'}

---------- 📍 Step 2: ToolExecutionStep ----------
🔧 Executing tool...



---------- 📍 Step 3: InferenceStep ----------
🤖 Model Response:
According to the search results, 91104 annotation instances were created for the DocLayNet dataset. This corresponds to a total of 7059 pages with two annotations and 1591 pages with three annotations. Therefore, the answer is:

There are 91104 annotation instances in the DocLayNet dataset, which corresponds to 7059 pages with two annotations and 1591 pages with three annotations.

