Add Flexible Conversion Parameters to PDF Converters #6896
Comments
Thanks @warichet for the detailed feature request! I think we should definitely implement this one, but as we're finalising 2.0.0 I'm not sure we can prioritise this feature at this very moment. I'm adding the "contributions wanted" label, in case someone wants to give it a try without waiting for us.
Thanks
@masci I'd like to work on this. I can create a draft PR soon for early feedback.
After #7361 and #7362, this can be done with a custom converter:

```python
from typing import Optional

from pypdf import PdfReader

from haystack import Document, default_from_dict, default_to_dict
from haystack.components.converters.pypdf import PyPDFToDocument


class ConverterWithPages:
    def __init__(self, start_page: Optional[int] = None, end_page: Optional[int] = None):
        self.start_page = start_page or 0
        self.end_page = end_page
        # end_page is inclusive. None means "slice through the last page";
        # an upper bound of -1 would silently drop the final page.
        self._upper_bound = end_page + 1 if end_page is not None else None

    def convert(self, reader: "PdfReader") -> Document:
        # Extract the selected pages and join them with form feeds.
        text_pages = []
        for page in reader.pages[self.start_page : self._upper_bound]:
            text_pages.append(page.extract_text())
        text = "\f".join(text_pages)
        return Document(content=text)

    def to_dict(self):
        """Serialize the converter to a dictionary."""
        return default_to_dict(self, start_page=self.start_page, end_page=self.end_page)

    @classmethod
    def from_dict(cls, data):
        """Deserialize the converter from a dictionary."""
        return default_from_dict(cls, data)


pypdf_converter = PyPDFToDocument(converter=ConverterWithPages(start_page=0, end_page=2))
res = pypdf_converter.run(sources=["/home/anakin87/apps/haystack/test/test_files/pdf/react_paper.pdf"])
print(res)
print(res["documents"][0].content)
```

The solution is not entirely straightforward, but not too complex either.
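For anyone adapting the page-range logic, here is a minimal sketch of how the bounds map onto Python slicing. It uses a plain list as a stand-in for `reader.pages`, and `select_pages` is a hypothetical helper, not part of Haystack or pypdf. It assumes `end_page` is meant to be inclusive and that leaving it unset should return everything up to the last page.

```python
from typing import List, Optional


def select_pages(
    pages: List[str],
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: return pages[start_page..end_page], end inclusive."""
    start = start_page or 0
    # None as the slice stop means "through the last element".
    stop = end_page + 1 if end_page is not None else None
    return pages[start:stop]


pages = ["p0", "p1", "p2", "p3"]  # stand-in for reader.pages

first_three = select_pages(pages, start_page=0, end_page=2)
all_pages = select_pages(pages)

# For comparison: a stop of -1 excludes the final element.
dropped_last = pages[0:-1]
```

The comparison at the end is the reason `None`, rather than `-1`, is the safer default for an open-ended upper bound.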
Feature Request
Currently, our library's PDF converters only support static, predefined conversion parameters. This limitation makes it difficult to adapt to varied use cases where dynamic parameters (like specific page numbers to convert) are necessary. I propose we add a feature that allows users to pass dynamic conversion parameters as a dictionary to our PDF converters.
Proposed Solution
Introduce a system for flexible conversion parameters for PDF converters. This can be achieved by accepting a conversion_params dictionary in the PyPDFToDocument constructor and passing these parameters to the converter via **kwargs.
Benefits
start_page, end_page, and other converter-specific options.
Example Usage
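A rough sketch of how the proposed `conversion_params` dictionary could be forwarded through `**kwargs`. Everything here is an assumption for illustration: `PageRangeConverter` is a hypothetical stand-in class, pages are plain strings instead of PDF pages, and none of this reflects the actual `PyPDFToDocument` signature.

```python
from typing import Any, Dict, List, Optional


class PageRangeConverter:
    """Hypothetical stand-in for a PDF converter; pages are plain strings."""

    def __init__(self, conversion_params: Optional[Dict[str, Any]] = None):
        # Copy the dict so later changes by the caller do not leak in.
        self.conversion_params = dict(conversion_params or {})

    def run(self, pages: List[str]) -> str:
        # Forward the stored parameters to the conversion step via **kwargs.
        return self._convert(pages, **self.conversion_params)

    @staticmethod
    def _convert(
        pages: List[str],
        start_page: int = 0,
        end_page: Optional[int] = None,
    ) -> str:
        # end_page is treated as inclusive; None means "through the last page".
        stop = None if end_page is None else end_page + 1
        return "\f".join(pages[start_page:stop])


pages = ["page 0", "page 1", "page 2", "page 3"]
converter = PageRangeConverter(conversion_params={"start_page": 1, "end_page": 2})
result = converter.run(pages)
```

One consequence of this design is that an unknown key in `conversion_params` raises a `TypeError` only when `run` is called, so validating the dictionary in the constructor might be worth considering.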