Feature Description
With the rise of vision-language models (GPT-4V, LLaVA, CogVLM), it would be valuable to have native multi-modal document support in Haystack pipelines.
Current Limitation
Currently, image content in PDFs/documents is lost during ingestion. Users need custom extractors to handle images alongside text.
Proposed Enhancement
- Multi-modal document parser that extracts text AND images
- Multi-modal embeddings (CLIP-style) for image chunks
- Multi-modal retriever that searches across text and image content
- VLM integration for answer generation from mixed context
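To make the retrieval piece concrete, here is a minimal sketch of how text and image chunks could live in one shared embedding space and be ranked by a single retriever. All names here (`MultiModalDocument`, `retrieve`) and the toy embeddings are purely illustrative, not existing Haystack API; in practice the vectors would come from a CLIP-style encoder.

```python
from dataclasses import dataclass, field
import math

@dataclass
class MultiModalDocument:
    content: str    # raw text, or a path/URI for an image chunk (hypothetical shape)
    modality: str   # "text" or "image"
    embedding: list = field(default_factory=list)  # vector in a shared text/image space

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_embedding, docs, top_k=2):
    """Rank text AND image chunks together by similarity in the shared space."""
    ranked = sorted(docs, key=lambda d: cosine(query_embedding, d.embedding),
                    reverse=True)
    return ranked[:top_k]

# Toy 3-d vectors stand in for real encoder output (e.g. CLIP's 512-d space).
docs = [
    MultiModalDocument("Install guide text", "text", [0.9, 0.1, 0.0]),
    MultiModalDocument("architecture_diagram.png", "image", [0.1, 0.9, 0.1]),
    MultiModalDocument("Changelog text", "text", [0.0, 0.2, 0.9]),
]

# A query embedding close to the diagram's vector retrieves the image chunk first.
results = retrieve([0.2, 0.95, 0.05], docs)
print([d.content for d in results])  # → ['architecture_diagram.png', 'Install guide text']
```

The key design point this sketch assumes is that image and text chunks share one vector space, so a single retriever can return mixed-modality context for the downstream VLM without separate indexes per modality.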
Use Case
Technical documentation with diagrams, medical records with scans, and financial reports with charts are all common enterprise use cases where image understanding is critical for accurate retrieval.
Would love to hear the team's thoughts on this direction!