Skip to content

feat: add public layout-base extraction support on PDFToTextConverter#3137

Merged
masci merged 7 commits intodeepset-ai:mainfrom
danielbichuetti:add_layout_parameter_to_pdf_converter
Sep 13, 2022
Merged

feat: add public layout-base extraction support on PDFToTextConverter#3137
masci merged 7 commits intodeepset-ai:mainfrom
danielbichuetti:add_layout_parameter_to_pdf_converter

Conversation

@danielbichuetti
Copy link
Contributor

@danielbichuetti danielbichuetti commented Sep 1, 2022

Related Issues

Proposed Changes:

Today, PDFToTextConverter has a private method which support turning on layout-based text extraction. The default option is stream ordered text extraction. Usually, this isn't an issue. However, some PDFs have a quite unfamiliar stream ordering, which is different from the physical layout.
This change implements a public parameter that makes possible to the user to choose between default stream-based extraction, or the layout-based extraction.

How did you test it?

Already existing PDFToTextConverter tests have been run again.
One specific scenario PDF file where the stream content is not the same as physical layout order has been added, and the test using the new parameter also has been included.

Notes for the reviewer

I have removed ancient comments, one doubled super().init (it happens at the start, and then it was being again called at the end, with useless effect)

Checklist

@danielbichuetti danielbichuetti requested review from a team as code owners September 1, 2022 16:18
@danielbichuetti danielbichuetti requested review from masci and removed request for a team September 1, 2022 16:18
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the additional cleaning :)

@danielbichuetti
Copy link
Contributor Author

The new documentation has been generated. CI is now green again.

@danielbichuetti
Copy link
Contributor Author

@masci Sorry to bother you, but I think this PR got lost in the Forgotten Lands. 😃

@masci
Copy link
Contributor

masci commented Sep 13, 2022

@danielbichuetti 🙈 apologies, merging now!

@masci masci merged commit df1f420 into deepset-ai:main Sep 13, 2022
@danielbichuetti danielbichuetti deleted the add_layout_parameter_to_pdf_converter branch September 13, 2022 15:00
brandenchan pushed a commit that referenced this pull request Sep 21, 2022
…#3137)

* feat(PDFToTextConverter): add option to get text in physical layout order

* test: add physical layout extraction test to PDFToTextConverter

* refactor: change layout parameter attribution places

* docs: manually trigger pre-commits

* docs: generate new docs to comply with pydoc-markdown style
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PDFToTextConverter is not forwarding layout parameter to private method

3 participants