feat: Add PyPDFToDocument component (2.0) #5850

Merged
merged 6 commits into from
Sep 21, 2023

Conversation

vblagoje (Member) commented Sep 20, 2023

Related issues

#5670

Why:

This PR introduces the PyPDFToDocument component to Haystack 2.0. The component converts PDF files into a list of Document objects, which can then be passed on to other components in a Haystack 2.0 pipeline.

What:

A new PyPDFToDocument component has been added to the file_converters package. This package serves as a collection of various file conversion utilities within Haystack.

How can it be used:

Here's a code snippet demonstrating how to use the new PyPDFToDocument component:

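from haystack.preview.components.file_converters.pypdf import PyPDFToDocument

# preview_samples_path is not defined in this snippet; it is assumed to be a
# pathlib.Path pointing at the directory that holds the sample files.
paths = [preview_samples_path / "pdf" / "react_paper.pdf"]
converter = PyPDFToDocument()
# run() returns a dict; the converted Documents are under the "documents" key.
output = converter.run(paths=paths)
docs = output["documents"]
# One input PDF yields one Document.
assert len(docs) == 1
assert "ReAct" in docs[0].text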

How did you test it:

Unit tests have been added to cover the new component. Additionally, manual tests were conducted, as described in the "How can it be used" section above.

Notes For Reviewer:

Please ensure that the new component and its unit tests are correctly implemented. Double-check to make sure everything aligns with the project's standards and that there are no unintended side effects.

@vblagoje added the "2.x Related to Haystack v2.0" label Sep 20, 2023
@vblagoje requested a review from a team as a code owner September 20, 2023 13:59
@vblagoje requested review from julian-risch and ZanSara and removed request for a team September 20, 2023 13:59
@vblagoje removed the request for review from julian-risch September 20, 2023 13:59
@github-actions bot added the "type:documentation Improvements on the docs" label Sep 20, 2023
@vblagoje requested a review from a team as a code owner September 20, 2023 14:37
@vblagoje requested review from dfokina and removed request for a team September 20, 2023 14:37

ZanSara (Contributor) left a comment

Great! 🚀

@ZanSara merged commit 92a6221 into main Sep 21, 2023
63 of 73 checks passed
@ZanSara deleted the pypdf branch September 21, 2023 09:52
@ZanSara mentioned this pull request Sep 28, 2023

vblagoje (Member, Author) commented Nov 8, 2023

@Timoeller @ZanSara Inspired by the #4467 thread, perhaps we could replace the default PyPDFToDocument implementation with the one in https://github.com/deepset-ai/haystack/tree/improved_pdf_to_document, which would allow full customization of the PDF-to-Document conversion without reinventing the wheel.
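
To make the customization idea a bit more concrete, here is a minimal sketch of one way a pluggable conversion step could look. It is not the code on the linked branch: `Document`, `PageReader`, `ConversionStrategy`, `JoinPages`, `FirstPageOnly`, `FakeReader`, and `PdfToDocumentSketch` are all hypothetical stand-ins written for illustration, and none of them are Haystack or pypdf APIs. The only point is that the converter delegates "reader in, Document out" to a swappable object.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Protocol


@dataclass
class Document:
    """Hypothetical stand-in for a Document; only the fields used here."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class PageReader(Protocol):
    """Hypothetical reader interface: returns the text of each page."""

    def page_texts(self) -> List[str]: ...


class ConversionStrategy(Protocol):
    """Hypothetical hook: turns one opened PDF into one Document."""

    def convert(self, reader: PageReader) -> Document: ...


class JoinPages:
    """Default strategy (assumed): concatenate all pages into one Document."""

    def __init__(self, separator: str = "\n") -> None:
        self.separator = separator

    def convert(self, reader: PageReader) -> Document:
        pages = reader.page_texts()
        return Document(
            text=self.separator.join(pages),
            metadata={"page_count": len(pages)},
        )


class FirstPageOnly:
    """Example custom strategy: keep only the first page's text."""

    def convert(self, reader: PageReader) -> Document:
        pages = reader.page_texts()
        return Document(text=pages[0] if pages else "")


class PdfToDocumentSketch:
    """Converter that delegates the actual conversion to a strategy."""

    def __init__(self, strategy: Optional[ConversionStrategy] = None) -> None:
        self.strategy = strategy if strategy is not None else JoinPages()

    def run(self, readers: List[PageReader]) -> Dict[str, List[Document]]:
        # One Document per input, mirroring the output shape in the PR snippet.
        return {"documents": [self.strategy.convert(r) for r in readers]}


class FakeReader:
    """In-memory reader so the sketch runs without a real PDF."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    def page_texts(self) -> List[str]:
        return list(self._pages)


reader = FakeReader(["ReAct: intro", "Method"])
default_docs = PdfToDocumentSketch().run([reader])["documents"]
custom_docs = PdfToDocumentSketch(FirstPageOnly()).run([reader])["documents"]
```

With this shape, a user who needs different behavior (per-page metadata, skipping pages, custom cleanup) would only write a new strategy class and pass it in, while the component itself stays unchanged.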
