Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs-website/docs/concepts/agents.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ Key capabilities include:
- **Human-in-the-loop**: Intercept tool calls for human review before execution. See [Human in the Loop](../pipeline-components/agents-1/human-in-the-loop.mdx).
- **Multi-agent systems**: Wrap an `Agent` as a `ComponentTool` to build coordinator/specialist architectures. See [Multi-Agent Systems](./agents/multi-agent-systems.mdx).
- **MCP server exposure**: Expose your agent as an MCP server using [Hayhooks](../development/hayhooks.mdx), making it callable from any MCP-compatible client such as Claude Desktop or Cursor.
- **Multimodal inputs**: Pass images alongside text using `ImageContent` in `ChatMessage` content parts, or return `ImageContent` from tools for dynamic image analysis. Requires a vision-capable model such as `gpt-5` or `gemini-2.5-flash`. See [Multimodal Inputs](../pipeline-components/agents-1/agent.mdx#multimodal-inputs).

Check out the [Agent](../pipeline-components/agents-1/agent.mdx) documentation, or the [example](#tool-calling-agent) below to get started.

Expand Down
105 changes: 105 additions & 0 deletions docs-website/docs/pipeline-components/agents-1/agent.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -334,6 +334,109 @@ See our [Streaming Support](../generators/guides-to-generators/choosing-the-righ
Give preference to `print_streaming_chunk` by default.
Write a custom callback only if you need a specific transport (for example, SSE/WebSocket) or custom UI formatting.

## Multimodal Inputs

Agents support multimodal inputs when paired with a vision-capable model such as `gpt-5` (OpenAI) or `gemini-2.5-flash` (Google).
Pass images alongside text by including `ImageContent` objects in the `content_parts` of a `ChatMessage`:

```python
from haystack.dataclasses import ChatMessage, ImageContent

image = ImageContent.from_url("https://example.com/chart.png")
result = agent.run(
messages=[
ChatMessage.from_user(content_parts=["What does this chart show?", image]),
],
)
```

Tools can also return `ImageContent` directly, letting the agent fetch and reason about images dynamically during its loop.
Two things are required: set `outputs_to_string={"raw_result": True}` so the `ToolInvoker` skips string conversion, and return a `list[ImageContent]` (the tool result type is `str | Sequence[TextContent | ImageContent]`).

The standard Chat Completions API doesn't support images in tool results — use `OpenAIResponsesChatGenerator` (OpenAI's Responses API) instead:

```python
from typing import Annotated
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIResponsesChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent
from haystack.tools import tool


@tool(outputs_to_string={"raw_result": True})
def fetch_image(
url: Annotated[str, "URL of the image to fetch and analyze"],
) -> list[ImageContent]:
"""Fetch an image from a URL so the agent can analyze its contents."""
return [ImageContent.from_url(url)]


agent = Agent(
chat_generator=OpenAIResponsesChatGenerator(model="gpt-5"),
tools=[fetch_image],
system_prompt="You are a helpful assistant that can fetch and analyze images from URLs.",
)

result = agent.run(
messages=[
ChatMessage.from_user(
"Fetch the image at https://picsum.photos/seed/haystack/640/480 and describe what you see.",
),
],
)
print(result["last_message"].text)
```

`ImageContent` can be created from a URL, a local file path, or a PDF page using the `PDFToImageContent` converter.

### In a pipeline

When an `Agent` sits inside a pipeline, use `ChatPromptBuilder` with its string template format and the `| templatize_part` filter to pass images as structured content parts:

```python
from haystack import Pipeline
from haystack.components.agents import Agent
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ImageContent

template = """
{% message role="user" %}
{{ question }}
{{ image | templatize_part }}
{% endmessage %}
"""

agent = Agent(
chat_generator=OpenAIChatGenerator(model="gpt-5"),
system_prompt="You are a helpful assistant that can analyze images.",
)
prompt_builder = ChatPromptBuilder(
template=template,
required_variables=["question", "image"],
)

pipeline = Pipeline()
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("agent", agent)
pipeline.connect("prompt_builder.prompt", "agent.messages")

# Download or provide your own chart image as "chart.png"
image = ImageContent.from_file_path("chart.png")
result = pipeline.run(
{
"prompt_builder": {"question": "What does this chart show?", "image": image},
},
)
print(result["agent"]["last_message"].text)
```

:::tip
See these cookbooks for complete multimodal agent examples:
- [Multimodal Agents](https://haystack.deepset.ai/cookbook/multimodal_intro#multimodal-agent) — image inputs and tool use with agents
- [Gemma Chat RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag) — vision model in a RAG pipeline
:::

## Multi-Agent Systems

You can wrap an `Agent` as a tool to build multi-agent systems where specialist agents handle focused subtasks and a coordinator agent plans and delegates.
Expand Down Expand Up @@ -363,3 +466,5 @@ Agents work with MCP in two directions:
🧑‍🍳 Cookbook:

- [Build a GitHub Issue Resolver Agent](https://haystack.deepset.ai/cookbook/github_issue_resolver_agent)
- [Multimodal Agents](https://haystack.deepset.ai/cookbook/multimodal_intro#multimodal-agent)
- [Gemma Chat RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag)
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ Key capabilities include:
- **Human-in-the-loop**: Intercept tool calls for human review before execution. See [Human in the Loop](../pipeline-components/agents-1/human-in-the-loop.mdx).
- **Multi-agent systems**: Wrap an `Agent` as a `ComponentTool` to build coordinator/specialist architectures. See [Multi-Agent Systems](./agents/multi-agent-systems.mdx).
- **MCP server exposure**: Expose your agent as an MCP server using [Hayhooks](../development/hayhooks.mdx), making it callable from any MCP-compatible client such as Claude Desktop or Cursor.
- **Multimodal inputs**: Pass images alongside text using `ImageContent` in `ChatMessage` content parts, or return `ImageContent` from tools for dynamic image analysis. Requires a vision-capable model such as `gpt-5` or `gemini-2.5-flash`. See [Multimodal Inputs](../pipeline-components/agents-1/agent.mdx#multimodal-inputs).

Check out the [Agent](../pipeline-components/agents-1/agent.mdx) documentation, or the [example](#tool-calling-agent) below to get started.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -334,6 +334,109 @@ See our [Streaming Support](../generators/guides-to-generators/choosing-the-righ
Give preference to `print_streaming_chunk` by default.
Write a custom callback only if you need a specific transport (for example, SSE/WebSocket) or custom UI formatting.

## Multimodal Inputs

Agents support multimodal inputs when paired with a vision-capable model such as `gpt-5` (OpenAI) or `gemini-2.5-flash` (Google).
Pass images alongside text by including `ImageContent` objects in the `content_parts` of a `ChatMessage`:

```python
from haystack.dataclasses import ChatMessage, ImageContent

image = ImageContent.from_url("https://example.com/chart.png")
result = agent.run(
messages=[
ChatMessage.from_user(content_parts=["What does this chart show?", image]),
],
)
```

Tools can also return `ImageContent` directly, letting the agent fetch and reason about images dynamically during its loop.
Two things are required: set `outputs_to_string={"raw_result": True}` so the `ToolInvoker` skips string conversion, and return a `list[ImageContent]` (the tool result type is `str | Sequence[TextContent | ImageContent]`).

The standard Chat Completions API doesn't support images in tool results — use `OpenAIResponsesChatGenerator` (OpenAI's Responses API) instead:

```python
from typing import Annotated
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIResponsesChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent
from haystack.tools import tool


@tool(outputs_to_string={"raw_result": True})
def fetch_image(
url: Annotated[str, "URL of the image to fetch and analyze"],
) -> list[ImageContent]:
"""Fetch an image from a URL so the agent can analyze its contents."""
return [ImageContent.from_url(url)]


agent = Agent(
chat_generator=OpenAIResponsesChatGenerator(model="gpt-5"),
tools=[fetch_image],
system_prompt="You are a helpful assistant that can fetch and analyze images from URLs.",
)

result = agent.run(
messages=[
ChatMessage.from_user(
"Fetch the image at https://picsum.photos/seed/haystack/640/480 and describe what you see.",
),
],
)
print(result["last_message"].text)
```

`ImageContent` can be created from a URL, a local file path, or a PDF page using the `PDFToImageContent` converter.

### In a pipeline

When an `Agent` sits inside a pipeline, use `ChatPromptBuilder` with its string template format and the `| templatize_part` filter to pass images as structured content parts:

```python
from haystack import Pipeline
from haystack.components.agents import Agent
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ImageContent

template = """
{% message role="user" %}
{{ question }}
{{ image | templatize_part }}
{% endmessage %}
"""

agent = Agent(
chat_generator=OpenAIChatGenerator(model="gpt-5"),
system_prompt="You are a helpful assistant that can analyze images.",
)
prompt_builder = ChatPromptBuilder(
template=template,
required_variables=["question", "image"],
)

pipeline = Pipeline()
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("agent", agent)
pipeline.connect("prompt_builder.prompt", "agent.messages")

# Download or provide your own chart image as "chart.png"
image = ImageContent.from_file_path("chart.png")
result = pipeline.run(
{
"prompt_builder": {"question": "What does this chart show?", "image": image},
},
)
print(result["agent"]["last_message"].text)
```

:::tip
See these cookbooks for complete multimodal agent examples:
- [Multimodal Agents](https://haystack.deepset.ai/cookbook/multimodal_intro#multimodal-agent) — image inputs and tool use with agents
- [Gemma Chat RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag) — vision model in a RAG pipeline
:::

## Multi-Agent Systems

You can wrap an `Agent` as a tool to build multi-agent systems where specialist agents handle focused subtasks and a coordinator agent plans and delegates.
Expand Down Expand Up @@ -363,3 +466,5 @@ Agents work with MCP in two directions:
🧑‍🍳 Cookbook:

- [Build a GitHub Issue Resolver Agent](https://haystack.deepset.ai/cookbook/github_issue_resolver_agent)
- [Multimodal Agents](https://haystack.deepset.ai/cookbook/multimodal_intro#multimodal-agent)
- [Gemma Chat RAG](https://haystack.deepset.ai/cookbook/gemma_chat_rag)
Loading