Feature Description
With the rise of vision-language models (GPT-4V, LLaVA, CogVLM), it would be valuable to have native multi-modal document support in Haystack pipelines.
Current Limitation
Currently, image content in PDFs/documents is lost during ingestion. Users need custom extractors to handle images alongside text.
Proposed Enhancement
- Multi-modal document parser that extracts text AND images
- Multi-modal embeddings (CLIP-style) for image chunks
- Multi-modal retriever that searches across text and image content
- VLM integration for answer generation from mixed context
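To make the retrieval piece concrete, here is a minimal sketch of how text and image chunks could live in one shared embedding space and be ranked by a single retriever. All names here (`MultiModalDocument`, `retrieve`) and the toy embeddings are purely illustrative, not existing Haystack API; in practice the vectors would come from a CLIP-style encoder.

```python
from dataclasses import dataclass, field
import math

@dataclass
class MultiModalDocument:
    content: str    # raw text, or a path/URI for an image chunk (hypothetical shape)
    modality: str   # "text" or "image"
    embedding: list = field(default_factory=list)  # vector in a shared text/image space

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_embedding, docs, top_k=2):
    """Rank text AND image chunks together by similarity in the shared space."""
    ranked = sorted(docs, key=lambda d: cosine(query_embedding, d.embedding),
                    reverse=True)
    return ranked[:top_k]

# Toy 3-d vectors stand in for real encoder output (e.g. CLIP's 512-d space).
docs = [
    MultiModalDocument("Install guide text", "text", [0.9, 0.1, 0.0]),
    MultiModalDocument("architecture_diagram.png", "image", [0.1, 0.9, 0.1]),
    MultiModalDocument("Changelog text", "text", [0.0, 0.2, 0.9]),
]

# A query embedding close to the diagram's vector retrieves the image chunk first.
results = retrieve([0.2, 0.95, 0.05], docs)
print([d.content for d in results])  # → ['architecture_diagram.png', 'Install guide text']
```

The key design point this sketch assumes is that image and text chunks share one vector space, so a single retriever can return mixed-modality context for the downstream VLM without separate indexes per modality.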
Use Case
Technical documentation with diagrams, medical records with scans, and financial reports with charts are all common enterprise use cases where image understanding is critical for accurate retrieval.
Would love to hear the team's thoughts on this direction!