# Multi-Agent Report Generation with AgentWorkflow

In this notebook, we will explore how to use an `AgentWorkflow` in LlamaIndex to create multi-agent systems. Specifically, we will create a system that can generate a report on a given topic.

The notebook supports two LLM backends: a `qwen3-8b` Small Language Model (SLM) served locally by LM Studio, and the state-of-the-art `gemini-2.5-flash` model hosted by Google. The code below uses Gemini by default; the LM Studio client is included but commented out. For the full list of LLM providers and models that LlamaIndex supports, and how to install and use them, see the [examples documentation](https://docs.llamaindex.ai/en/stable/examples/llm/openai/) or [LlamaHub](https://llamahub.ai/?tab=llms).

Note that if we wanted, each agent could have a different LLM, but for this example, we will use the same LLM for all agents.

## Prerequisites

Before running this notebook, ensure you have the following requirements met:

### 1. API Keys and Services

You'll need to obtain API keys for the following services:

- **Google Gemini API Key**: Required for the LLM functionality
  - Visit [Google AI Studio](https://aistudio.google.com/app/apikey) to get your API key
  - Set as `GOOGLE_API_KEY` environment variable

- **Tavily Search API Key**: Required for web search functionality
  - Visit [Tavily](https://tavily.com/) to create an account and get your API key
  - Set as `TAVILY_SEARCH_API_KEY` environment variable

### 2. Environment Setup

Create a `.env` file in your project root with the following variables:
```
GOOGLE_API_KEY=your_google_api_key_here
TAVILY_SEARCH_API_KEY=your_tavily_api_key_here
```
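Before running the rest of the notebook, it can help to confirm that both keys are actually set. The following is a minimal sketch, not part of the original notebook; `missing_env_vars` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration.

```python
import os

# Hypothetical helper: the two key names come from the .env file above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "TAVILY_SEARCH_API_KEY")


def missing_env_vars(names=REQUIRED_KEYS, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` after `load_dotenv()` returns an empty list when both keys are available, and the names of the missing keys otherwise.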

### 3. Python Dependencies

We leverage `uv` as a high-performance Python package and project manager. Once you have cloned this repository, just run `uv sync` to handle dependency management.

### 4. Directory Structure

The project has the following directory structure:
```
multi-agent-governance/
├── .env
├── agent_workflow_multi.ipynb
├── data/
│   ├── input/          # Place PDF files here
│   └── output/         # Converted markdown files will be stored here
└── database/
    └── vector_store/   # ChromaDB database will be created here
```
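If the data and database folders do not exist yet, they can be created up front. This is a small sketch using only the standard library; `ensure_project_dirs` is a hypothetical helper, and the folder names are taken from the tree above.

```python
from pathlib import Path


def ensure_project_dirs(root="."):
    """Create the data and database folders under `root` if they are missing."""
    root = Path(root)
    paths = [
        root / "data" / "input",
        root / "data" / "output",
        root / "database" / "vector_store",
    ]
    for path in paths:
        # parents=True creates intermediate folders; exist_ok=True makes this idempotent
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Running `ensure_project_dirs()` from the project root is safe to repeat, since existing folders are left untouched.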

### 5. Optional: Local LLM Setup

If you want to use a local LLM instead of Google Gemini:

- **LM Studio**: Download and install [LM Studio](https://lmstudio.ai/)
- **Model**: Download a compatible model (e.g., qwen3-8b)
- **Configuration**: Start LM Studio server on `http://127.0.0.1:1234/v1`

### 6. Input Data

- Place any PDF documents you want to process in the `data/input/` directory
- The notebook includes an example with `internet-history-09.pdf`

**Note**: This notebook demonstrates advanced multi-agent workflows and may take several minutes to complete, especially during the initial setup and PDF processing phases.


## Execution Variables Setup

This section defines the variables needed to run the RAG pipeline and the agentic flow. We also run a quick test against the model endpoint before proceeding.

In [1]:
# Load environment variables from .env file
import os
from dotenv import load_dotenv
load_dotenv()

# Environment variables for local LM studio inference
model = "qwen/qwen3-8b"
base_url = "http://127.0.0.1:1234/v1"
api_key = ""

# Environment variable for Google GenAI API inference
google_api_key = os.getenv("GOOGLE_API_KEY", "")
google_model = "gemini-2.5-flash"
google_model_eval = "gemini-2.5-pro"

# Environment variables for local data
dir_input = './data/input'
dir_output = './data/output'
dir_chromadb = './database/vector_store/'
chromadb_collection = 'internet_history'

In [2]:
# Fix for "RuntimeError: This event loop is already running"
import nest_asyncio
nest_asyncio.apply()

from llama_index.llms.lmstudio import LMStudio
from llama_index.core.base.llms.types import ChatMessage, MessageRole

# Initialize the LMStudio client with the model and base URL
#llm = LMStudio(
#    model_name=model,
#    base_url=base_url,
#    temperature=0.7,
#)

from llama_index.llms.google_genai import GoogleGenAI
# Initialize the Google GenAI client with the API key
llm = GoogleGenAI(
    model=google_model,
    api_key=google_api_key,  
)

In [3]:
# Test the LLM endpoint with a simple prompt
response = llm.complete("Write a paragraph on the history of the internet.")
print(str(response))

The internet's origins trace back to the Cold War era, specifically the U.S. Department of Defense's ARPANET in the late 1960s, designed as a resilient communication network using packet switching. This foundational network evolved with the development of TCP/IP protocols in the 1970s, which standardized communication and allowed disparate networks to connect, forming the true "internetwork." However, it was the advent of the World Wide Web in the early 1990s, spearheaded by Tim Berners-Lee, that truly democratized access. With user-friendly browsers like Mosaic and Netscape, the internet transformed from a research tool into a public utility, ushering in an era of rapid commercialization and global expansion that fundamentally reshaped communication, commerce, and information sharing, becoming the indispensable global nervous system it is today.


## System Design

Our system will have three agents:

1. A `ResearchAgent` that will search local data as well as the web for information on the given topic.
2. A `WriteAgent` that will write the report using the information found by the `ResearchAgent`.
3. A `ReviewAgent` that will review the report and provide feedback.

We will use the `AgentWorkflow` class to create a multi-agent system that will execute these agents in order.

While there are many ways to implement this system, in this case, we will use a few tools to help with the research and writing processes.

1. A `search_web` tool to search the web for information on the given topic.
2. A `query_data` tool to query local documents through a query engine (RAG).
3. A `record_notes` tool to record notes on the given topic.
4. A `write_report` tool to write the report using the information found by the `ResearchAgent`.
5. A `review_report` tool to review the report and provide feedback.

Utilizing the `Context` class, we can pass state between agents, and each agent will have access to the current state of the system.
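To make the idea of shared state concrete, here is a minimal sketch of two steps reading and writing one state dictionary. `InMemoryState` is a hypothetical stand-in written for this illustration, not the LlamaIndex `Context` class, and the note text is made up.

```python
import asyncio


class InMemoryState:
    """Hypothetical stand-in for a shared workflow context (not LlamaIndex's Context)."""

    def __init__(self, initial):
        self._data = dict(initial)

    async def get(self, key):
        return self._data[key]

    async def set(self, key, value):
        self._data[key] = value


async def research_step(ctx):
    # The first step records a note in the shared state.
    state = await ctx.get("state")
    state["research_notes"]["ARPANET"] = "Hypothetical note recorded by the research step."
    await ctx.set("state", state)


async def write_step(ctx):
    # The second step sees whatever the first step stored.
    state = await ctx.get("state")
    titles = ", ".join(state["research_notes"])
    state["report_content"] = f"Report based on notes: {titles}"
    await ctx.set("state", state)


async def run_pipeline():
    ctx = InMemoryState({"state": {"research_notes": {}, "report_content": ""}})
    await research_step(ctx)
    await write_step(ctx)
    return await ctx.get("state")
```

In a notebook, `await run_pipeline()` returns the final state; in a plain script, use `asyncio.run(run_pipeline())`.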


## RAG Pipeline

### Function convert_pdfs_to_markdown

The function takes two arguments: the directory containing PDF files and the directory where the converted Markdown files will be saved. The function checks if the output directory exists and creates it if necessary. It then iterates over all PDF files in the input directory, converts each to Markdown using the DocumentConverter class, and saves the result in the output directory.

In [4]:
from warnings import filterwarnings
from docling.document_converter import DocumentConverter

# Suppress warning from easyocr to avoid cluttering the output of the conversion process
filterwarnings(action="ignore", category=FutureWarning, module="easyocr") 


def convert_pdfs_to_markdown(pdf_dir, md_dir):
	if not os.path.exists(md_dir):
		os.makedirs(md_dir)

	pdf_files = [f for f in os.listdir(pdf_dir) if f.endswith('.pdf')]
	for pdf_file in pdf_files:
		pdf_path = os.path.join(pdf_dir, pdf_file)
		md_path = os.path.join(md_dir, f"{os.path.splitext(pdf_file)[0]}.md")

		if not os.path.exists(md_path):
			print(f"Converting `{pdf_file}` to Markdown ...")

			doc_converter = DocumentConverter()
			result = doc_converter.convert(source=pdf_path)
			
			with open(md_path, 'w', encoding='utf-8') as md_file:
				md_file.write(result.document.export_to_markdown())



### Execute the convert_pdfs_to_markdown function

Convert all PDFs in the specified input directory to Markdown and save them in the output directory. The function prints a message for each file it converts.

In [5]:
convert_pdfs_to_markdown(dir_input, dir_output)

Converting `internet-history-09.pdf` to Markdown ...




### Initialize the models and clients required to generate the vector database

We create an embedding model using the HuggingFace library, then read the converted Markdown documents from the output directory with a `SimpleDirectoryReader`.

Next, the code initializes a ChromaDB client and creates or retrieves a collection within the database. It sets up a vector store using the ChromaDB collection and a storage context with default settings. Finally, it creates a VectorStoreIndex from the loaded documents, using the embedding model for vectorization. The process concludes with a print statement indicating that the vector database has been successfully generated.

In [6]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)
import chromadb

chroma_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader(input_dir=dir_output).load_data()

chroma_client = chromadb.PersistentClient(path = dir_chromadb)
chroma_collection = chroma_client.get_or_create_collection(name=chromadb_collection)

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, embed_model=chroma_embed_model)

print("Vector database successfully generated!")

Vector database successfully generated!
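Once the documents are embedded, answering a query comes down to finding the stored vectors closest to the query vector. The sketch below shows that idea with plain Python lists. Cosine similarity and the three-dimensional toy vectors are assumptions chosen for illustration; the actual vector store may use a different distance metric and an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
toy_docs = {
    "packet_switching": [1.0, 0.0, 0.0],
    "email": [0.0, 1.0, 0.0],
    "arpanet": [0.8, 0.0, 0.6],
}
```

With these toy vectors, `top_k([0.9, 0.0, 0.1], toy_docs)` ranks `packet_switching` first and `arpanet` second.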


### Test a simple query to the vector database

In [7]:
test_query = "Who published the first paper on packet switching theory?"
result = index.as_query_engine(llm=llm).query(test_query)
print(f"Q: {test_query}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

Q: Who published the first paper on packet switching theory?
A: Leonard Kleinrock at MIT published the first paper on packet switching theory in July 1961.

Sources:


[("There is the operations and management aspect of a global and complex operational infrastructure. There is the social aspect, which resulted in a broad community of Internauts working together to create and evolve the technology. And there is the commercialization aspect, resulting in an extremely effective transition of research results into a broadly deployed and available information infrastructure.\n\nThe Internet today is a widespread information infrastructure, the initial prototype of what is often called the National (or Global or Galactic) Information Infrastructure. Its history is complex and involves many aspects - technological, organizational, and community. And its influence reaches not only to the technical fields of computer communications but throughout society as we move toward increasing use of online tools to accomplish electronic commerce, information acquisition, and community operations.\n\n## 2. ORIGINS OF THE INTERNET\n\nThe first recorded description of the

## DeepEval Integration for RAG/LLM Evaluation

Now we’ll set DeepEval metrics to evaluate our overall agentic application as well as elements of the system, including the RAG pipeline and the report writing built on LlamaIndex.

DeepEval's metrics are powered by LLM-as-a-judge. Here we override the default judge (OpenAI's `gpt-4o`) with the Google Gemini model `gemini-2.5-pro`. For this agentic application, we use G-Eval and Bias for overall system evaluation, three RAG metrics (Answer Relevancy, Faithfulness, and Contextual Relevancy) for the RAG pipeline, and the Summarization metric to assess the quality of the report writing process. These help us measure:

1. **Faithfulness**: Whether the RAG agent's output factually aligns with the contents of the retrieval context
2. **Answer Relevancy**: How relevant the answer of the RAG agent is to the question
3. **Contextual Relevancy**: How relevant the retrieved context of the RAG pipeline is to the question
4. **Bias Detection**: Whether the response contains biased content
5. **Summarization**: Whether the response is a factually accurate summary that retains the necessary details of the source text
6. **G-Eval**: Allows us to use a reasoning LLM as a "judge" that scores the final report generated by our agentic system against our own custom-defined criteria for overall performance.

These metrics will be used to assess the quality of our multi-agent system's outputs.
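As a rough illustration of how a ratio-style metric and its threshold interact, the sketch below computes a Contextual Relevancy score as the share of retrieved statements judged relevant. This follows the formula as described in the DeepEval documentation, but the function names and the statement counts are hypothetical; in practice the judge model decides which statements count as relevant.

```python
def contextual_relevancy(relevant_statements, total_statements):
    """Share of retrieved statements that are relevant to the question."""
    if total_statements == 0:
        return 0.0
    return relevant_statements / total_statements


def passes(score, threshold=0.8):
    """A metric passes when its score reaches the configured threshold."""
    return score >= threshold


# Hypothetical counts: 2 relevant statements out of 7 retrieved.
score = contextual_relevancy(2, 7)
print(round(score, 2), passes(score))
```

A low score here does not mean the answer is wrong. It means the retriever returned a lot of text that was not needed to answer the question.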


In [22]:
# Use native integration with Google Gemini for evaluation model
from deepeval.models import GeminiModel
eval_model = GeminiModel(
    model_name=google_model_eval,
    api_key=google_api_key
)
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    GEval,
)
from deepeval.test_case import LLMTestCase
from deepeval.test_case import LLMTestCaseParams
from deepeval import evaluate

metric_answer_relevancy = AnswerRelevancyMetric(
        model=eval_model,
        threshold=0.8)
metric_faithfulness = FaithfulnessMetric(
        model=eval_model,
        threshold=0.8)
metric_contextual_relevancy = ContextualRelevancyMetric(
        model=eval_model,
        threshold=0.8)

# Define a specific GEval metric for completeness
metric_completeness = GEval(
    name="Completeness",
    model=eval_model,
    evaluation_steps=[
        "Determine if the response answers every part of the input or question.",
        "Identify any missing elements, skipped sub-questions, or incomplete reasoning.",
        "Check whether the output provides sufficient detail for each aspect mentioned.",
        "Do not penalize for brevity if the coverage is complete and accurate."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
)

### Test RAG Evaluation Metrics for a simple query on the RAG database

In [9]:
# Run a test query against the indexed local data and evaluate the response
test_query = "Who published the first paper on packet switching theory?"
result = index.as_query_engine(llm=llm, similarity_top_k=3).query(test_query)
print(f"Q: {test_query}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

# Extract the actual output (generated answer) from the model
actual_output = result.response

# Extract the retrieved context used to generate the answer
retrieval_context = [source_node.get_content() for source_node in result.source_nodes]

# Create a test case object to evaluate the model's performance
test_case = LLMTestCase(
    input=test_query,  # the input question
    actual_output=actual_output,  # the model's generated answer
    retrieval_context=retrieval_context  # the supporting retrieved context
)

# Evaluate the RAG metrics for this test case
evaluate([test_case], [metric_answer_relevancy, metric_faithfulness, metric_contextual_relevancy])

# Print the evaluation results
print(metric_answer_relevancy.score)
print(metric_faithfulness.score) 
print(metric_contextual_relevancy.score) 

Q: Who published the first paper on packet switching theory?
A: Leonard Kleinrock at MIT published the first paper on packet switching theory in July 1961.

Sources:


[("There is the operations and management aspect of a global and complex operational infrastructure. There is the social aspect, which resulted in a broad community of Internauts working together to create and evolve the technology. And there is the commercialization aspect, resulting in an extremely effective transition of research results into a broadly deployed and available information infrastructure.\n\nThe Internet today is a widespread information infrastructure, the initial prototype of what is often called the National (or Global or Galactic) Information Infrastructure. Its history is complex and involves many aspects - technological, organizational, and community. And its influence reaches not only to the technical fields of computer communications but throughout society as we move toward increasing use of online tools to accomplish electronic commerce, information acquisition, and community operations.\n\n## 2. ORIGINS OF THE INTERNET\n\nThe first recorded description of the



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.8, strict: False, evaluation model: gemini-2.5-pro, reason: The score is 1.00. Excellent! The output was perfectly relevant and directly answered the user's question., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.8, strict: False, evaluation model: gemini-2.5-pro, reason: The score is 1.0 because the output is perfectly faithful to the provided context, with no contradictions found., error: None)
  - ❌ Contextual Relevancy (score: 0.2857142857142857, threshold: 0.8, strict: False, evaluation model: gemini-2.5-pro, reason: The score is 0.29 because while the context does contain the direct answer that 'Leonard Kleinrock at MIT published the first paper on packet switching theory in July 1961', it is surrounded by a large amount of irrelevant information about topics like the 'Galactic Network' concept, the 'plan for the ARPANET', and the introduction of 'electronic mail'., error: None)

For test case:

  - 

None
None
None


## Agentic System Setup

We will start by defining the set of tools available to our Agents.

In [14]:
from tavily import AsyncTavilyClient
from llama_index.core.workflow import Context
import asyncio

async def search_web(query: str) -> str:
    """Useful for using the web to answer questions."""
    client = AsyncTavilyClient(api_key=os.environ.get("TAVILY_SEARCH_API_KEY"))
    return str(await client.search(query))

async def evaluate_rag_async(query: str, actual_output: str, retrieval_context: list) -> None:
    """Asynchronously evaluate RAG metrics for a query and response."""
    try:
        # Create a test case object to evaluate the model's performance
        test_case = LLMTestCase(
            input=query,  # the input question
            actual_output=actual_output,  # the model's generated answer
            retrieval_context=retrieval_context  # the supporting retrieved context
        )
        
        # Evaluate the RAG metrics for this test case
        evaluate([test_case], [metric_answer_relevancy, metric_faithfulness, metric_contextual_relevancy])
        
        # Optional: Print evaluation results
        print(f"Evaluation completed for query: {query}")
        print(f"Answer Relevancy: {metric_answer_relevancy.score}")
        print(f"Faithfulness: {metric_faithfulness.score}")
        print(f"Contextual Relevancy: {metric_contextual_relevancy.score}")
        
    except Exception as e:
        print(f"Error during evaluation: {e}")

async def query_data(query: str) -> str:
    """Query local vector database for information on internet history."""
    result = index.as_query_engine(llm=llm).query(query)
    formatted_output = f"Q: {query}\nA: {result.response.strip()}\n\nSources:\n{[(n.text, n.metadata) for n in result.source_nodes]}"
    
    # Extract the actual output (generated answer) from the model
    actual_output = result.response
    # Extract the retrieved context used to generate the answer
    retrieval_context = [source_node.get_content() for source_node in result.source_nodes]
    
    # Run RAG evaluation asynchronously in the background (fire and forget)
    # asyncio.create_task(evaluate_rag_async(query, actual_output, retrieval_context))
    
    # Return the formatted output immediately without waiting for evaluation
    return str(formatted_output)

# State is read and written through ctx.store, matching the
# handler.ctx.store.get call used later to fetch the final report.
async def record_notes(ctx: Context, notes: str, notes_title: str) -> str:
    """Useful for recording notes on a given topic. Your input should be notes with a title to save the notes under."""
    current_state = await ctx.store.get("state")
    if "research_notes" not in current_state:
        current_state["research_notes"] = {}
    current_state["research_notes"][notes_title] = notes
    await ctx.store.set("state", current_state)
    return "Notes recorded."


async def write_report(ctx: Context, report_content: str) -> str:
    """Useful for writing a report on a given topic. Your input should be a markdown formatted report."""
    current_state = await ctx.store.get("state")
    current_state["report_content"] = report_content
    await ctx.store.set("state", current_state)
    return "Report written."


async def review_report(ctx: Context, review: str) -> str:
    """Useful for reviewing a report and providing feedback. Your input should be a review of the report."""
    current_state = await ctx.store.get("state")
    current_state["review"] = review
    await ctx.store.set("state", current_state)
    return "Report reviewed."

With our tools defined, we can now create our agents.

If the LLM you are using supports tool calling, you can use the `FunctionAgent` class. Otherwise, you can use the `ReActAgent` class.

Here, the name and description of each agent are used so that the system knows what each agent is responsible for and when to hand off control to the next agent.

In [15]:
from llama_index.core.agent.workflow import FunctionAgent, ReActAgent

test_agent_search = ReActAgent(
    tools=[search_web],
    llm=llm,
    system_prompt="You are a helpful assistant that can search the web for information.",
)

test_agent_query = ReActAgent(
    tools=[query_data],
    llm=llm,
    system_prompt=(
        "You are a helpful assistant that can query a local vector database for information on internet history. "
        "You should first search the local vector database with the query_data tool for information on the topic. "
        "You should return the response in a markdown format including the question, answer, and sources. "
    ),
)

research_agent = ReActAgent(
    name="ResearchAgent",
    description="Useful for searching the web for information on a given topic and recording notes on the topic.",
    system_prompt=(
        "You are the ResearchAgent that can search local data or the web for information on a given topic and record notes on the topic. "
        "You should first search the local vector database with the query_data tool for information on the topic if relevant to the information stored. "
        "If not sufficient, you should then search web with the search_web tool for information on the topic. "
        "You should always record notes on the topic using the record_notes tool. "
        "Once notes are recorded and once you are satisfied, you should always hand off control to the WriteAgent to write a report on the topic. "
        "You should have at least some notes on a topic before handing off control to the WriteAgent."
    ),
    llm=llm,
    tools=[query_data, search_web, record_notes],
    can_handoff_to=["WriteAgent"],
)

write_agent = ReActAgent(
    name="WriteAgent",
    description="Useful for writing a report on a given topic.",
    system_prompt=(
        "You are the WriteAgent that can write a report on a given topic. "
        "Your report should be in a markdown format. The content should be grounded in the research notes. "
        "Once the report is written, you should get feedback at least once from the ReviewAgent."
    ),
    llm=llm,
    tools=[write_report],
    can_handoff_to=["ReviewAgent", "ResearchAgent"],
)

review_agent = ReActAgent(
    name="ReviewAgent",
    description="Useful for reviewing a report and providing feedback.",
    system_prompt=(
        "You are the ReviewAgent that can review the written report and provide feedback. "
        "Your review should either approve the current report or request changes for the WriteAgent to implement. "
        "If you have feedback that requires changes, you should hand off control to the WriteAgent to implement the changes after submitting the review."
    ),
    llm=llm,
    tools=[review_report],
    can_handoff_to=["WriteAgent"],
)
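The `can_handoff_to` lists above form a small directed graph of allowed handoffs. The sketch below models that graph with a plain dictionary to check which transitions are permitted. It is only a model of the routing constraints, not how `AgentWorkflow` enforces them internally; `HANDOFFS`, `can_handoff`, and `reachable_agents` are hypothetical names.

```python
# Allowed handoffs, copied from the can_handoff_to arguments of the three agents.
HANDOFFS = {
    "ResearchAgent": ["WriteAgent"],
    "WriteAgent": ["ReviewAgent", "ResearchAgent"],
    "ReviewAgent": ["WriteAgent"],
}


def can_handoff(current, target, handoffs=HANDOFFS):
    """True if `current` is allowed to hand control directly to `target`."""
    return target in handoffs.get(current, [])


def reachable_agents(start, handoffs=HANDOFFS):
    """All agents that can eventually receive control when starting from `start`."""
    seen = set()
    stack = [start]
    while stack:
        agent = stack.pop()
        for target in handoffs.get(agent, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen
```

In this graph the `ResearchAgent` cannot hand off directly to the `ReviewAgent`; a report always passes through the `WriteAgent` first, and the `WriteAgent` can send control back to the `ResearchAgent` if more material is needed.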

## Testing a single agent

Use the test agents to confirm that the agents and tools work with the chosen model.

In [12]:
response = await test_agent_search.run(user_msg="What is the weather in San Francisco?")
print(str(response))

The weather in San Francisco is currently 57°F and partly cloudy, with a west wind at 10 mph and 77% humidity.


In [16]:
response = await test_agent_query.run(user_msg="Who published the first paper on packet switching theory?")
print(str(response))

Leonard Kleinrock at MIT published the first paper on packet switching theory in July 1961.


## Running the Workflow

With our agents defined, we can create our `AgentWorkflow` and run it.

In [17]:
from llama_index.core.agent.workflow import AgentWorkflow

agent_workflow = AgentWorkflow(
    agents=[research_agent, write_agent, review_agent],
    root_agent=research_agent.name,
    initial_state={
        "research_notes": {},
        "report_content": "Not written yet.",
        "review": "Review required.",
    },
)

As the workflow is running, we will stream the events to get an idea of what is happening under the hood.

In [18]:
from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    ToolCall,
    ToolCallResult,
    AgentStream,
)

workflow_message = (
    "Search for the history of internet and write me a report on it. "
    "Briefly describe the history of the internet, including the development of the internet, the development of the web, "
    "and the development of the internet in the 21st century."
)

handler = agent_workflow.run(user_msg=workflow_message)

current_agent = None
current_tool_calls = ""
async for event in handler.stream_events():
    if (
        hasattr(event, "current_agent_name")
        and event.current_agent_name != current_agent
    ):
        current_agent = event.current_agent_name
        print(f"\n{'='*50}")
        print(f"🤖 Agent: {current_agent}")
        print(f"{'='*50}\n")

    # if isinstance(event, AgentStream):
    #     if event.delta:
    #         print(event.delta, end="", flush=True)
    # elif isinstance(event, AgentInput):
    #     print("📥 Input:", event.input)
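    # Standalone `if`: with the block above commented out, an `elif` here would
    # chain onto the agent-change check and skip the first event of each agent.
    if isinstance(event, AgentOutput):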
        if event.response.content:
            print("📤 Output:", event.response.content)
        if event.tool_calls:
            print(
                "🛠️  Planning to use tools:",
                [call.tool_name for call in event.tool_calls],
            )
    elif isinstance(event, ToolCallResult):
        print(f"🔧 Tool Result ({event.tool_name}):")
        print(f"  Arguments: {event.tool_kwargs}")
        print(f"  Output: {event.tool_output}")
    elif isinstance(event, ToolCall):
        print(f"🔨 Calling Tool: {event.tool_name}")
        print(f"  With arguments: {event.tool_kwargs}")


🤖 Agent: ResearchAgent

📤 Output: Thought: The current language of the user is: English. I need to use a tool to help me answer the question. The user wants a report on the history of the internet. I should start by searching my local vector database for information on internet history.
Action: query_data
Action Input: {"query": "history of the internet, development of the internet, development of the web, internet in the 21st century"}
🛠️  Planning to use tools: ['query_data']
🔨 Calling Tool: query_data
  With arguments: {'query': 'history of the internet, development of the internet, development of the web, internet in the 21st century'}
🔧 Tool Result (query_data):
  Arguments: {'query': 'history of the internet, development of the internet, development of the web, internet in the 21st century'}
  Output: Q: history of the internet, development of the internet, development of the web, internet in the 21st century
A: The Internet's history is marked by a sustained commitment to resea



📤 Output: Thought: The current language of the user is: English. I have gathered the necessary information and recorded it as notes. Now I need to hand off to the `WriteAgent` to write the report based on these notes.
Action: handoff
Action Input: {'to_agent': 'WriteAgent', 'reason': 'The user requested a report on the history of the internet, and I have gathered and recorded the necessary information. The WriteAgent is best suited to compile this information into a report.'}
🛠️  Planning to use tools: ['handoff']
🔨 Calling Tool: handoff
  With arguments: {'to_agent': 'WriteAgent', 'reason': 'The user requested a report on the history of the internet, and I have gathered and recorded the necessary information. The WriteAgent is best suited to compile this information into a report.'}
🔧 Tool Result (handoff):
  Arguments: {'to_agent': 'WriteAgent', 'reason': 'The user requested a report on the history of the internet, and I have gathered and recorded the necessary information. The Write



📤 Output: Thought: The current language of the user is: English. I have successfully written the report on the history of the internet using the `write_report` tool. The report covers the development of the internet, the web, and its evolution in the 21st century, as requested by the user. I can now confirm that the task is complete.
Answer: I have successfully written the report on the history of the internet. You can find the content of the report in the `report_content` state.


Now, we can retrieve the final report in the system for ourselves.

In [19]:
state = await handler.ctx.store.get("state")
report_content = state["report_content"]
print("\nFinal Report Content:")
print("=" * 50)
print(report_content)


Final Report Content:
# The History of the Internet

The Internet's journey is a remarkable story of sustained research and development, evolving from its nascent stages to the pervasive global information infrastructure it is today. This evolution has been driven by technological innovation, strategic operational management, a collaborative community, and successful commercialization.

## Early Development of the Internet

The genesis of the Internet can be traced back to early research in packet switching and the development of the ARPANET. This foundational work laid the groundwork for a new paradigm of communication. The initial vision for the Internet was to support a range of functions, including file sharing, remote login, resource sharing, and collaboration among researchers. Over time, it adapted from an era of time-sharing systems to accommodate personal computers, client-server architectures, and peer-to-peer computing. Crucially, it was designed before the advent of Local 

Finally, we run the G-Eval-based completeness evaluation on the overall system flow.

In [24]:
# Define a test case on the overall agentic flow, actual_output is the final report content
system_test_case = LLMTestCase(input=workflow_message, actual_output=report_content)

# Use G-Eval metric
evaluate([system_test_case], [metric_completeness])
print(metric_completeness.score, metric_completeness.reason)



Metrics Summary

  - ✅ Completeness [GEval] (score: 1.0, threshold: 0.5, strict: False, evaluation model: gemini-2.5-pro, reason: The response comprehensively addresses all three parts of the input, providing distinct sections for the development of the internet, the development of the web, and the internet in the 21st century. Each section contains sufficient and relevant detail, fulfilling the request for a brief report completely., error: None)

For test case:

  - input: Search for the history of internet and write me a report on it. Briefly describe the history of the internet, including the development of the internet, the development of the web, and the development of the internet in the 21st century.
  - actual output: # The History of the Internet

The Internet's journey is a remarkable story of sustained research and development, evolving from its nascent stages to the pervasive global information infrastructure it is today. This evolution has been driven by technological i

None None
