Add Flexible Conversion Parameters to PDF Converters #6896

Closed
warichet opened this issue Feb 2, 2024 · 4 comments
Labels
2.x Related to Haystack v2.0 Contributions wanted! Looking for external contributions type:feature New feature or request

Comments

@warichet

warichet commented Feb 2, 2024

Feature Request

Currently, our library's PDF converters only support static, predefined conversion parameters. This limitation makes it difficult to adapt to varied use cases where dynamic parameters (like specific page numbers to convert) are necessary. I propose we add a feature that allows users to pass dynamic conversion parameters as a dictionary to our PDF converters.

Proposed Solution

Introduce a system for flexible conversion parameters for PDF converters. This can be achieved by accepting a conversion_params dictionary in the PyPDFToDocument constructor and passing these parameters to the converter via **kwargs.

Benefits

  • Increased Flexibility: Allows users to specify dynamic conversion parameters, such as start_page, end_page, and other converter-specific options.
  • Ease of Extension: Adding new conversion parameters becomes trivial, without needing modifications in method or class signatures.
  • Compatibility: Maintains compatibility with existing converters that do not require additional parameters.
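
To make the compatibility point concrete, here is a minimal, library-free sketch of forwarding a parameter dictionary via `**kwargs`. The classes `LegacyConverter` and `RangeConverter` and the function `run_conversion` are hypothetical stand-ins, not Haystack APIs, and pages are modeled as plain strings. It suggests that an empty dictionary leaves existing converters untouched, whereas a converter has to accept `**kwargs` before it can receive extra options.

```python
from typing import Any, Dict, List, Optional


class LegacyConverter:
    """Hypothetical existing converter whose convert() takes no extra options."""

    def convert(self, pages: List[str]) -> str:
        return "\n".join(pages)


class RangeConverter:
    """Hypothetical converter reading optional start_page/end_page (inclusive, 0-based)."""

    def convert(self, pages: List[str], **kwargs: Any) -> str:
        start = kwargs.get("start_page", 0)
        end = kwargs.get("end_page", len(pages) - 1)
        return "\n".join(pages[start : end + 1])


def run_conversion(converter: Any, pages: List[str], conversion_params: Optional[Dict[str, Any]] = None) -> str:
    # An empty dict unpacks to no keyword arguments at all.
    return converter.convert(pages, **(conversion_params or {}))


pages = ["page 0", "page 1", "page 2", "page 3"]

legacy_text = run_conversion(LegacyConverter(), pages)
ranged_text = run_conversion(RangeConverter(), pages, {"start_page": 1, "end_page": 2})

# A converter without **kwargs rejects unknown options with a TypeError.
try:
    run_conversion(LegacyConverter(), pages, {"start_page": 1})
    legacy_accepts_params = True
except TypeError:
    legacy_accepts_params = False
```

Under these assumptions, compatibility holds only while `conversion_params` is empty for converters that do not declare `**kwargs`.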

Example Usage

conversion_params = {"start_page": 1, "end_page": 10}
converter = PyPDFToDocument(converter_name="custom", conversion_params=conversion_params)

`class CustomConverter:
    """
    Custom converter that extracts text from the pages of a PdfReader object and returns a Document,
    taking into account additional parameters such as start_page and end_page.
    """
    def convert(self, reader: "PdfReader", **kwargs) -> Document:
        # Read start_page and end_page from kwargs, falling back to defaults
        start_page = kwargs.get('start_page', 0)
        end_page = kwargs.get('end_page', len(reader.pages) - 1)

        text_with_pages = ""
        page_starts = []
        current_length = 0

        # If end_page is -1 or past the last page, process up to the end of the document
        if end_page == -1 or end_page >= len(reader.pages):
            end_page = len(reader.pages) - 1

        for page in reader.pages[start_page:end_page + 1]:
            page_text = page.extract_text()
            if page_text:
                # Record a page-start offset unless this is the first page with text
                if current_length > 0:
                    page_starts.append(current_length)
                text_with_pages += f"{page_text}\n"
                current_length = len(text_with_pages)

        # Add the title and subject from the PDF metadata, if available
        title = self.add_title(reader)
        subject = self.add_subject(reader)

        return Document(content=text_with_pages, meta={"page_starts": page_starts, 'title': title, 'subject': subject})

    def add_title(self, reader: "PdfReader"):
        # Extract the title from the PDF metadata (reader.metadata may be None)
        metadata = reader.metadata or {}
        title = metadata.get('/Title', '')
        return title

    def add_subject(self, reader: "PdfReader"):
        # Extract the subject from the PDF metadata (reader.metadata may be None)
        metadata = reader.metadata or {}
        subject = metadata.get('/Subject', '')
        return subject`

`# This registry is used to store converters names and instances.
# It can be used to register custom converters.
CONVERTERS_REGISTRY: Dict[str, PyPDFConverter] = {"default": DefaultConverter(), "custom": CustomConverter()}


@component
class PyPDFToDocument:
    """
    Converts PDF files to Document objects.
    It uses a converter that follows the PyPDFConverter protocol to perform the conversion.
    A default text extraction converter is used if no custom converter is provided.

    Usage example:
    ```python
    from haystack.components.converters.pypdf import PyPDFToDocument

    converter = PyPDFToDocument()
    results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the PDF file.'
    ```
    """

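    def __init__(self, converter_name: str = "default", conversion_params: Optional[Dict[str, Any]] = None):
        """
        Initializes the PyPDFToDocument component with an optional custom converter.
        :param converter_name: A converter name that is registered in the CONVERTERS_REGISTRY.
            Defaults to 'default'.
        :param conversion_params: Optional keyword arguments forwarded to the converter's `convert` method,
            for example `{"start_page": 1, "end_page": 10}`. Defaults to an empty dictionary.
        """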
        pypdf_import.check()

        try:
            converter = CONVERTERS_REGISTRY[converter_name]
        except KeyError as e:
            msg = (
                f"Invalid converter_name: {converter_name}.\n Available converters: {list(CONVERTERS_REGISTRY.keys())}"
            )
            raise ValueError(msg) from e
        self.converter_name = converter_name
        self._converter: PyPDFConverter = converter
        self.conversion_params = conversion_params or {}

    def to_dict(self):
        # do not serialize the _converter instance; keep conversion_params so they survive a round trip
        return default_to_dict(self, converter_name=self.converter_name, conversion_params=self.conversion_params)

    @component.output_types(documents=List[Document])
    def run(self, sources: List[Union[str, Path, ByteStream]], meta: Optional[List[Dict[str, Any]]] = None):
        """
        Converts a list of PDF sources into Document objects using the configured converter.

        :param sources: A list of PDF data sources, which can be file paths or ByteStream objects.
        :param meta: Optional metadata to attach to the Documents.
          This value can be either a list of dictionaries or a single dictionary.
          If it's a single dictionary, its content is added to the metadata of all produced Documents.
          If it's a list, the length of the list must match the number of sources, because the two lists will be zipped.
          Defaults to `None`.
        :return: A dictionary containing a list of Document objects under the 'documents' key.
        """
        documents = []
        meta_list = normalize_metadata(meta, sources_count=len(sources))

        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read %s. Skipping it. Error: %s", source, e)
                continue
            try:
                pdf_reader = PdfReader(io.BytesIO(bytestream.data))
                document = self._converter.convert(pdf_reader, **self.conversion_params)

            except Exception as e:
                logger.warning("Could not read %s and convert it to Document, skipping. %s", source, e)
                continue

            merged_metadata = {**bytestream.meta, **metadata, **document.meta}  # converter metadata overrides earlier keys
            document.meta = merged_metadata
            documents.append(document)
        return {"documents": documents}
`
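
The metadata merge near the end of `run` relies on dictionary unpacking, where later dictionaries override earlier ones when keys collide. Here is a small sketch with made-up values (the keys and values below are purely illustrative):

```python
# Illustrative values only; they stand in for bytestream.meta, the user-supplied
# meta entry, and the metadata returned by the converter.
bytestream_meta = {"file_path": "sample.pdf", "title": "from bytestream"}
user_meta = {"date_added": "2024-02-02", "title": "from user"}
document_meta = {"page_starts": [120, 480], "title": "from PDF metadata"}

# Later dictionaries win on duplicate keys, so converter metadata has the last word.
merged = {**bytestream_meta, **user_meta, **document_meta}
```

If user-supplied metadata should take precedence instead, the last two operands could be swapped.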
@masci masci added Contributions wanted! Looking for external contributions 2.x Related to Haystack v2.0 type:feature New feature or request labels Feb 5, 2024
@masci
Contributor

masci commented Feb 5, 2024

Thanks @warichet for the detailed feature request!

I think we should definitely implement this one, but as we're finalising 2.0.0 I'm not sure we can prioritise this feature at this very moment. I'm adding the "contributions wanted" label, in case someone wants to give it a try without waiting for us.

@warichet
Author

warichet commented Feb 5, 2024

Thanks!
If you need help, don't hesitate to ask.

@mohitlal31
Contributor

@masci I'd like to work on this. I can create a draft PR soon for early feedback.

@anakin87
Member

anakin87 commented May 8, 2024

After #7361 and #7362,
defining a custom converter that allows specifying a start page and an end page
can be done as follows:

from typing import Optional

from pypdf import PdfReader
from haystack import Document, default_from_dict, default_to_dict
from haystack.components.converters.pypdf import PyPDFToDocument

class ConverterWithPages:
    def __init__(self, start_page: Optional[int] = None, end_page: Optional[int] = None):
        self.start_page = start_page or 0
        self.end_page = end_page

        # Use None (not -1) as the open upper bound so the slice includes the last page
        self._upper_bound = end_page + 1 if end_page is not None else None

    def convert(self, reader: "PdfReader") -> Document:
        text_pages = []
        for page in reader.pages[self.start_page : self._upper_bound]:
            text_pages.append(page.extract_text())
        text = "\f".join(text_pages)
        return Document(content=text)

    def to_dict(self):
        """Serialize the converter to a dictionary."""
        return default_to_dict(self, start_page=self.start_page, end_page=self.end_page)

    @classmethod
    def from_dict(cls, data):
        """Deserialize the converter from a dictionary."""
        return default_from_dict(cls, data)


pypdf_converter = PyPDFToDocument(converter=ConverterWithPages(start_page=0, end_page=2))
res = pypdf_converter.run(sources=["/home/anakin87/apps/haystack/test/test_files/pdf/react_paper.pdf"])

print(res)
print(res["documents"][0].content)
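
A note on the slice bounds used when selecting a page range: in Python, an upper bound of `None` runs to the end of the sequence, while `-1` stops before the last element. The sketch below uses a plain list in place of `reader.pages`, and `select_pages` is a hypothetical helper, not part of Haystack or pypdf.

```python
from typing import List, Optional


def select_pages(pages: List[str], start_page: Optional[int] = None, end_page: Optional[int] = None) -> List[str]:
    """Return pages from start_page to end_page inclusive (0-based); None means open-ended."""
    start = start_page or 0
    upper_bound = end_page + 1 if end_page is not None else None
    return pages[start:upper_bound]


pages = ["p0", "p1", "p2", "p3"]

first_three = select_pages(pages, start_page=0, end_page=2)
from_second = select_pages(pages, start_page=1)
all_but_last = pages[0:-1]  # what an upper bound of -1 would produce
```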

The solution is not entirely straightforward, but not too complex either.
I am closing this issue for now. If more requests in this direction come in the future, we can improve or modify the component.

@anakin87 anakin87 closed this as completed May 8, 2024