Add Flexible Conversion Parameters to PDF Converters #6896

warichet · 2024-02-02T14:58:56Z

Feature Request

Currently, our library's PDF converters only support static, predefined conversion parameters. This limitation makes it difficult to adapt to varied use cases where dynamic parameters (like specific page numbers to convert) are necessary. I propose we add a feature that allows users to pass dynamic conversion parameters as a dictionary to our PDF converters.

Proposed Solution

Introduce a system for flexible conversion parameters for PDF converters. This can be achieved by accepting a conversion_params dictionary in the PyPDFToDocument constructor and passing these parameters to the converter via **kwargs.
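
The forwarding idea can be sketched without Haystack or pypdf. Everything below is hypothetical and only illustrates the pattern: `FakeReader` stands in for `PdfReader`, and `PageRangeConverter` and `Wrapper` are made-up names, not existing classes.

```python
from typing import Any, Dict, List, Optional


class FakeReader:
    """Stand-in for pypdf's PdfReader; `pages` is just a list of strings here."""

    def __init__(self, pages: List[str]):
        self.pages = pages


class PageRangeConverter:
    """Hypothetical converter that reads optional settings from **kwargs."""

    def convert(self, reader: FakeReader, **kwargs: Any) -> str:
        start_page = kwargs.get("start_page", 0)
        end_page = kwargs.get("end_page", len(reader.pages) - 1)  # inclusive
        return "\n".join(reader.pages[start_page : end_page + 1])


class Wrapper:
    """Hypothetical component: stores the dict once, forwards it on every call."""

    def __init__(self, converter: PageRangeConverter, conversion_params: Optional[Dict[str, Any]] = None):
        self.converter = converter
        self.conversion_params = conversion_params or {}

    def run(self, reader: FakeReader) -> str:
        # Unpack the stored dict into keyword arguments for the converter
        return self.converter.convert(reader, **self.conversion_params)


reader = FakeReader(["page 0", "page 1", "page 2", "page 3"])
default_result = Wrapper(PageRangeConverter()).run(reader)
ranged_result = Wrapper(PageRangeConverter(), {"start_page": 1, "end_page": 2}).run(reader)
print(ranged_result)
```

A converter that ignores `**kwargs` keeps working unchanged, which is the compatibility point of the proposal.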

Benefits

Increased Flexibility: Allows users to specify dynamic conversion parameters, such as start_page, end_page, and other converter-specific options.
Ease of Extension: Adding new conversion parameters becomes trivial, without needing modifications in method or class signatures.
Compatibility: Maintains compatibility with existing converters that do not require additional parameters.

Example Usage

conversion_params = {"start_page": 1, "end_page": 10}
converter = PyPDFToDocument(converter_name="custom", conversion_params=conversion_params)

`class CustomConverter:
    """
    Custom converter that extracts the text from the pages of a PdfReader object and returns a Document,
    taking into account extra parameters such as start_page and end_page.
    """
    def convert(self, reader: "PdfReader", **kwargs) -> Document:
        # Read start_page and end_page from kwargs, falling back to defaults
        start_page = kwargs.get('start_page', 0)
        end_page = kwargs.get('end_page', len(reader.pages) - 1)

        text_with_pages = ""
        page_starts = []
        current_length = 0

        # If end_page is -1 or past the last page, process through the end of the document
        if end_page == -1 or end_page >= len(reader.pages):
            end_page = len(reader.pages) - 1

        for page_num, page in enumerate(reader.pages[start_page:end_page + 1], start=start_page):
            page_text = page.extract_text()
            if page_text:
                # Record where this page starts, unless it is the first page with text
                if current_length > 0:
                    page_starts.append(current_length)
                text_with_pages += f"{page_text}\n"
                current_length = len(text_with_pages)

        # Add the title and subject from the PDF metadata, if available
        title = self.add_title(reader)
        subject = self.add_subject(reader)

        return Document(content=text_with_pages, meta={"page_starts": page_starts, 'title': title, 'subject': subject})

    def add_title(self, reader: "PdfReader"):
        # Extract the title from the PDF metadata (metadata can be None)
        metadata = reader.metadata or {}
        title = metadata.get('/Title', '')
        return title

    def add_subject(self, reader: "PdfReader"):
        # Extract the subject from the PDF metadata (metadata can be None)
        metadata = reader.metadata or {}
        subject = metadata.get('/Subject', '')
        return subject`
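
The `page_starts` offsets recorded by the converter above can be used to cut the concatenated text back into pages. A minimal sketch, assuming offsets are built the same way (one entry for every page after the first page with text); `split_by_page_starts` is a hypothetical helper, not part of Haystack:

```python
from typing import List


def split_by_page_starts(content: str, page_starts: List[int]) -> List[str]:
    """Split content at the recorded offsets; the first page implicitly starts at 0."""
    bounds = [0, *page_starts, len(content)]
    return [content[a:b] for a, b in zip(bounds, bounds[1:])]


# Three pages joined with a trailing newline each, as in the converter above
content = "first\nsecond\nthird\n"
page_starts = [6, 13]
pages = split_by_page_starts(content, page_starts)
print(pages)
```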

`# This registry is used to store converters names and instances.
# It can be used to register custom converters.
CONVERTERS_REGISTRY: Dict[str, PyPDFConverter] = {"default": DefaultConverter(), "custom": CustomConverter()}


@component
class PyPDFToDocument:
    """
    Converts PDF files to Document objects.
    It uses a converter that follows the PyPDFConverter protocol to perform the conversion.
    A default text extraction converter is used if no custom converter is provided.

    Usage example:
    ```python
    from haystack.components.converters.pypdf import PyPDFToDocument

    converter = PyPDFToDocument()
    results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the PDF file.'
    ```
    """

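    def __init__(self, converter_name: str = "default", conversion_params: Optional[Dict[str, Any]] = None):
        """
        Initializes the PyPDFToDocument component with an optional custom converter.
        :param converter_name: A converter name that is registered in the CONVERTERS_REGISTRY.
            Defaults to 'default'.
        :param conversion_params: Optional keyword arguments forwarded to the converter's `convert` method.
            Defaults to `None`, which is treated as an empty dictionary.
        """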
        pypdf_import.check()

        try:
            converter = CONVERTERS_REGISTRY[converter_name]
        except KeyError as e:
            msg = (
                f"Invalid converter_name: {converter_name}.\n Available converters: {list(CONVERTERS_REGISTRY.keys())}"
            )
            raise ValueError(msg) from e
        self.converter_name = converter_name
        self._converter: PyPDFConverter = converter
        self.conversion_params = conversion_params or {}

    def to_dict(self):
        # do not serialize the _converter instance
        return default_to_dict(self, converter_name=self.converter_name, conversion_params=self.conversion_params)

    @component.output_types(documents=List[Document])
    def run(self, sources: List[Union[str, Path, ByteStream]], meta: Optional[List[Dict[str, Any]]] = None):
        """
        Converts a list of PDF sources into Document objects using the configured converter.

        :param sources: A list of PDF data sources, which can be file paths or ByteStream objects.
        :param meta: Optional metadata to attach to the Documents.
          This value can be either a list of dictionaries or a single dictionary.
          If it's a single dictionary, its content is added to the metadata of all produced Documents.
          If it's a list, the length of the list must match the number of sources, because the two lists will be zipped.
          Defaults to `None`.
        :return: A dictionary containing a list of Document objects under the 'documents' key.
        """
        documents = []
        meta_list = normalize_metadata(meta, sources_count=len(sources))

        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read %s. Skipping it. Error: %s", source, e)
                continue
            try:
                pdf_reader = PdfReader(io.BytesIO(bytestream.data))
                document = self._converter.convert(pdf_reader, **self.conversion_params)

            except Exception as e:
                logger.warning("Could not read %s and convert it to Document, skipping. %s", source, e)
                continue

            merged_metadata = {**bytestream.meta, **metadata, **document.meta}
            document.meta = merged_metadata
            documents.append(document)
        return {"documents": documents}
`
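
One detail of `run` worth spelling out is the order of the metadata merge: with dictionary unpacking, later dictionaries win on key collisions, so values set by the converter override user-supplied metadata, which in turn override the source's own metadata. A small sketch with made-up values (`merge_meta` is a hypothetical helper that mirrors the merge line above):

```python
from typing import Any, Dict


def merge_meta(
    source_meta: Dict[str, Any], user_meta: Dict[str, Any], converter_meta: Dict[str, Any]
) -> Dict[str, Any]:
    # Later unpacked dictionaries overwrite earlier ones on key collisions
    return {**source_meta, **user_meta, **converter_meta}


merged = merge_meta(
    {"file_path": "sample.pdf", "title": "from source"},
    {"date_added": "2024-02-02", "title": "from user"},
    {"page_starts": [120, 260], "title": "from converter"},
)
print(merged["title"])
```

If user metadata should take precedence over converter output instead, the unpacking order would need to change.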


masci · 2024-02-05T08:12:17Z

Thanks @warichet for the detailed feature request!

I think we should definitely implement this one, but as we're finalising 2.0.0 I'm not sure we can prioritise this feature at this very moment. I'm adding the "contributions wanted" label, in case someone wants to give it a try without waiting for us.

warichet · 2024-02-05T09:15:14Z

Thanks.
If you need help, don't hesitate to ask.

mohitlal31 · 2024-02-27T12:21:32Z

@masci I'd like to work on this. I can create a draft PR soon for early feedback.

anakin87 · 2024-05-08T11:09:23Z

After #7361 and #7362, a custom converter that allows specifying a start page and an end page can be defined as follows:

from typing import Optional

from pypdf import PdfReader
from haystack import Document, default_from_dict, default_to_dict
from haystack.components.converters.pypdf import PyPDFToDocument

class ConverterWithPages:
    def __init__(self, start_page: Optional[int] = None, end_page: Optional[int] = None):
        self.start_page = start_page or 0
        self.end_page = end_page

        # end_page is inclusive; None (not -1) keeps the slice open through the last page
        self._upper_bound = end_page + 1 if end_page is not None else None

    def convert(self, reader: "PdfReader") -> Document:
        text_pages = []
        for page in reader.pages[self.start_page : self._upper_bound]:
            text_pages.append(page.extract_text())
        text = "\f".join(text_pages)
        return Document(content=text)

    def to_dict(self):
        """Serialize the converter to a dictionary."""
        return default_to_dict(self, start_page=self.start_page, end_page=self.end_page)

    @classmethod
    def from_dict(cls, data):
        """Deserialize the converter from a dictionary."""
        return default_from_dict(cls, data)


pypdf_converter = PyPDFToDocument(converter=ConverterWithPages(start_page=0, end_page=2))
res = pypdf_converter.run(sources=["/home/anakin87/apps/haystack/test/test_files/pdf/react_paper.pdf"])

print(res)
print(res["documents"][0].content)
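
A note on the page range when adapting this converter: `end_page` is inclusive while a Python slice stop is exclusive, and an open-ended range needs `None` as the stop, because a stop of `-1` would leave out the last page. A minimal sketch on a plain list (`page_slice` is a hypothetical helper, not part of Haystack or pypdf):

```python
from typing import List, Optional


def page_slice(start_page: Optional[int] = None, end_page: Optional[int] = None) -> slice:
    """Build a slice for an inclusive end_page; None means 'through the last page'."""
    start = start_page or 0
    stop = end_page + 1 if end_page is not None else None
    return slice(start, stop)


pages: List[str] = ["p0", "p1", "p2", "p3"]
first_three = pages[page_slice(0, 2)]  # pages 0 through 2, inclusive
open_ended = pages[page_slice(1)]      # page 1 through the end
minus_one = pages[1:-1]                # a stop of -1 drops the last page
print(first_three, open_ended, minus_one)
```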

The solution is not entirely straightforward, but not too complex either.
I am closing this issue for now. If more requests in this direction come in the future, we can improve or modify the component.

masci added the labels "Contributions wanted!" (Looking for external contributions), "2.x" (Related to Haystack v2.0), and "type:feature" (New feature or request) on Feb 5, 2024

This was referenced Mar 2, 2024

feat: Add Flexible Conversion Parameters to PDF Converters #7290

Closed

To be able to pass custom arguments to a custom File Converter method #7291

Closed

anakin87 mentioned this issue Mar 14, 2024

Improve PyPDFToDocument #7361

Closed

anakin87 closed this as completed May 8, 2024
