### Agent-Lab: Vision Document Agent

The objective of this notebook is to evaluate and adapt the implementation of a [Multi-modal Agent](https://python.langchain.com/docs/integrations/llms/ollama/#multi-modal) specialized in documents.

#### Preparation steps:

Before executing the notebook, perform the following preparation steps.

1. Start Docker containers: `docker compose up -d --build`

2. Verify that the application is up and running by opening `http://localhost:18000/docs` in a web browser.

3. In the project root directory, create a `.env` file with the following environment variables, set to the correct values:

    ```
    OLLAMA_ENDPOINT="http://localhost:11434"
    OLLAMA_MODEL_TAG="llama3.2-vision:latest"
    ```
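
The `.env` values from step 3 can be checked before running the cells below. The following is a minimal sketch, not part of the project: the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are illustrative assumptions, and only the two variable names come from the steps above.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the .env example in step 3.
REQUIRED_VARS = ("OLLAMA_ENDPOINT", "OLLAMA_MODEL_TAG")


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `required` that are unset or empty in `environ`.

    `environ` defaults to the current process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` after `load_dotenv()` returns an empty list when both variables are set, which makes a misconfigured `.env` visible before any agent is created.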

---

In [1]:
%%capture

import os
os.chdir('..')

from dotenv import load_dotenv
load_dotenv()

from notebooks import experiment_utils
from app.core.container import Container
from app.interface.api.messages.schema import MessageRequest

# start dependency injection container
container = Container()
container.init_resources()
container.wire(modules=[__name__])

In [2]:
# create agent
agent_id = experiment_utils.create_ollama_agent(
    agent_type="react_rag",
    llm_tag="gemma3:12b",
    ollama_endpoint=os.getenv("OLLAMA_ENDPOINT")
)

# create attachment
attachment_id = experiment_utils.create_attachment(
    file_path="tests/integration/vision_document_01.png",
    content_type="image/png"
)

In [4]:
# get agent instance
react_rag_agent = container.react_rag_agent()

# Create Graph
workflow = react_rag_agent.get_workflow_builder(agent_id)
experiment_utils.print_graph(workflow)

ReadTimeout: HTTPSConnectionPool(host='mermaid.ink', port=443): Read timed out. (read timeout=10)
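
The `ReadTimeout` above is raised by a request to the remote `mermaid.ink` rendering service, so it concerns drawing the graph rather than building it. One way to keep the notebook running when that service is slow is to fall back to the Mermaid source text, which needs no network call. The sketch below assumes `workflow` exposes LangGraph's `get_graph()` (true for a compiled graph, unverified for the object returned by `get_workflow_builder` here); `render_with_fallback` is a hypothetical helper, not part of `experiment_utils`.

```python
from typing import Callable, Tuple, Union


def render_with_fallback(
    render_png: Callable[[], bytes],
    render_text: Callable[[], str],
) -> Tuple[str, Union[bytes, str]]:
    """Try the PNG renderer first; on any error return the text rendering.

    Returns a (kind, payload) pair where kind is "png" or "text".
    """
    try:
        return "png", render_png()
    except Exception as exc:  # e.g. a ReadTimeout from the remote renderer
        print(f"PNG rendering failed ({type(exc).__name__}); using Mermaid text instead")
        return "text", render_text()


# Assumed usage with a compiled LangGraph graph (not executed here):
# graph = workflow.get_graph()
# kind, payload = render_with_fallback(graph.draw_mermaid_png, graph.draw_mermaid)
```

With this shape, a timeout prints a one-line notice and yields the Mermaid definition as plain text, which can be pasted into any Mermaid viewer.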

In [None]:
agent_config = {
    "configurable": {
        "thread_id": agent_id,
    },
    "recursion_limit": 30
}
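
In LangGraph, the `thread_id` under `configurable` keys the checkpointed conversation state, and `recursion_limit` caps the number of graph steps in a single run (the library default is 25). A small builder such as the hypothetical `build_agent_config` below keeps both settings explicit and validated. It is only a sketch, assuming the same dictionary shape as the cell above.

```python
from typing import Any, Dict


def build_agent_config(thread_id: str, recursion_limit: int = 30) -> Dict[str, Any]:
    """Build a run config with the same shape as `agent_config` above."""
    if not thread_id:
        raise ValueError("thread_id must be a non-empty string")
    if recursion_limit < 1:
        raise ValueError("recursion_limit must be a positive integer")
    return {
        "configurable": {"thread_id": thread_id},
        "recursion_limit": recursion_limit,
    }
```

For example, `build_agent_config(agent_id)` reproduces the dictionary from the cell above, while an empty `agent_id` fails fast instead of silently sharing state across runs.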

In [None]:
%%capture

# build the request, referencing the attachment created earlier
message = MessageRequest(
    message_role="human",
    message_content="Describe this image, generate data for study material. You can use up to three paragraphs to describe.",
    agent_id=agent_id,
    attachment_id=attachment_id
)

# convert the request into graph inputs and run the workflow
inputs = react_rag_agent.get_input_params(message)
result = workflow.invoke(inputs, agent_config)

In [None]:
print(result.keys())

In [None]:
print(result['generation'])