Memory: support multimodal content (files, images, audio, artifacts) #1392

@mattzcarey

Proposed by a user building serverless agents.

The current memory/context providers are mostly string-based, which is a good starting point. For multimodal agents, it would be useful to also remember files, images, audio, PDFs, generated artifacts, and their derived text.

Real-world agents work with more than text. An agent might:

  • Generate a chart and need to reference it later
  • Receive a PDF and extract key facts from it
  • Record audio notes and recall them in a future session
  • Produce code artifacts that should persist as part of the agent's memory

Today, these assets have to be stored and retrieved out-of-band, with only a text summary saved to memory.

It might make sense to store text and metadata in Durable Object SQLite, while larger binary assets live in R2. This would keep the current simple text-first API working unchanged, while allowing agents that need multimodal recall to opt in.
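
To make the split concrete, here is a rough sketch of the shape this could take. Everything in it is hypothetical: `MemoryEntry`, `BlobStore`, `MultimodalMemory`, `remember`, `rememberAsset`, and `recall` are illustrative names, not existing SDK APIs. The in-memory maps stand in for Durable Object SQLite rows and an R2 bucket, and the calls are synchronous only to keep the example short (a real R2 binding is asynchronous).

```typescript
// Hypothetical sketch: none of these names exist in the SDK today.

// One row of text + metadata (would live in Durable Object SQLite).
type MemoryEntry = {
  id: string;
  text: string;        // plain memory, or text derived from the asset
  mimeType?: string;   // present only when a binary asset is attached
  blobKey?: string;    // pointer to the asset in the blob store
  byteLength?: number;
};

// Minimal blob interface (an R2 bucket would sit behind this).
interface BlobStore {
  put(key: string, data: Uint8Array): void;
  get(key: string): Uint8Array | undefined;
}

// In-memory stand-in so the sketch runs without any platform bindings.
class InMemoryBlobStore implements BlobStore {
  private blobs = new Map<string, Uint8Array>();

  put(key: string, data: Uint8Array): void {
    this.blobs.set(key, data);
  }

  get(key: string): Uint8Array | undefined {
    return this.blobs.get(key);
  }
}

class MultimodalMemory {
  private entries = new Map<string, MemoryEntry>(); // stand-in for SQLite rows
  private nextId = 1;
  private blobs: BlobStore;

  constructor(blobs: BlobStore) {
    this.blobs = blobs;
  }

  // Text-first path: no blob is written and no blob metadata is set.
  remember(text: string): MemoryEntry {
    const entry: MemoryEntry = { id: String(this.nextId++), text };
    this.entries.set(entry.id, entry);
    return entry;
  }

  // Opt-in path: bytes go to the blob store, only a pointer goes in the row.
  rememberAsset(text: string, data: Uint8Array, mimeType: string): MemoryEntry {
    const id = String(this.nextId++);
    const blobKey = `memory/${id}`;
    this.blobs.put(blobKey, data);
    const entry: MemoryEntry = {
      id,
      text,
      mimeType,
      blobKey,
      byteLength: data.byteLength,
    };
    this.entries.set(id, entry);
    return entry;
  }

  // Returns the row, plus the asset bytes when the row points at one.
  recall(id: string): { entry: MemoryEntry; data?: Uint8Array } | undefined {
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    if (entry.blobKey === undefined) return { entry };
    return { entry, data: this.blobs.get(entry.blobKey) };
  }
}
```

The point of the sketch is that text-only callers never touch the blob store, so the existing behaviour is unaffected, while the row for an asset stays small enough to query and search like any other memory.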

(Replaces #1388, #1390)

Labels: enhancement (New feature or request), on the roadmap (Feature accepted and planned for implementation)
