Proposed by a user building serverless agents.
The current memory/context providers are mostly string-based, which is a good starting point. For multimodal agents, it would be useful to also remember files, images, audio, PDFs, generated artifacts, and their derived text.
Real-world agents work with more than text. An agent might:
- Generate a chart and need to reference it later
- Receive a PDF and extract key facts from it
- Record audio notes and recall them in a future session
- Produce code artifacts that should persist as part of the agent's memory
Today these have to be stored and retrieved out-of-band, with only a text summary saved to memory.
It might make sense to store text and metadata in Durable Object SQLite, while larger binary assets live in R2. This would keep the current simple text-first API working unchanged, while allowing agents that need multimodal recall to opt in.
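A rough sketch of how that split could look, to make the idea concrete. Every name here (`MemoryInput`, `MemoryRow`, `StoragePlan`, `planStorage`, the `memory/` key prefix) is hypothetical and not part of any existing API. The function only decides where each piece would go; actually writing the row to Durable Object SQLite and the bytes to R2 is left to the caller.

```typescript
// Hypothetical shapes for illustration only; not an existing API.
type MemoryInput =
  | { kind: "text"; text: string }
  | { kind: "asset"; mimeType: string; bytes: Uint8Array; derivedText?: string };

// One row per memory entry. In this sketch it would live in Durable Object SQLite.
interface MemoryRow {
  id: string;
  kind: "text" | "asset";
  text: string | null;       // the text itself, or text derived from an asset
  mimeType: string | null;   // only set for assets
  byteLength: number | null; // only set for assets
  blobKey: string | null;    // only set for assets; key of the object in the blob store
}

// What to write where: the row always, the blob only for assets.
interface StoragePlan {
  row: MemoryRow;
  blob: { key: string; bytes: Uint8Array } | null; // would go to R2
}

function planStorage(id: string, input: MemoryInput): StoragePlan {
  if (input.kind === "text") {
    // Text-first path: nothing leaves the row, so existing callers see no change.
    return {
      row: { id, kind: "text", text: input.text, mimeType: null, byteLength: null, blobKey: null },
      blob: null,
    };
  }
  // Opt-in multimodal path: metadata and derived text stay in the row,
  // the binary payload is addressed by key and stored separately.
  const key = `memory/${id}`; // assumed key scheme
  return {
    row: {
      id,
      kind: "asset",
      text: input.derivedText ?? null,
      mimeType: input.mimeType,
      byteLength: input.bytes.length,
      blobKey: key,
    },
    blob: { key, bytes: input.bytes },
  };
}

// Usage: a plain text memory and a PDF with extracted facts.
const note = planStorage("n1", { kind: "text", text: "User prefers dark mode" });
const pdf = planStorage("p1", {
  kind: "asset",
  mimeType: "application/pdf",
  bytes: new Uint8Array([0x25, 0x50, 0x44, 0x46]),
  derivedText: "Invoice total: 120 EUR",
});
```

Keeping the derived text in the row means text-based recall could still search over assets without fetching the binary, and the blob would only be read when the agent actually needs the original file.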
(Replaces #1388, #1390)