Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BREAKING CHANGE: revert table extraction off by default for PDFs and images #3035

Merged
merged 9 commits into from
May 17, 2024

Conversation

MthwRobinson
Copy link
Contributor

@MthwRobinson MthwRobinson commented May 16, 2024

Summary

Closes #3021 . Turns table extraction for PDFs and images off by default. The default behavior originally changed in #2588 . The reason for reversion is that some users did not realize turning off table extraction was an option and experience long processing times for PDFs and images with the new default behavior.

MthwRobinson and others added 3 commits May 16, 2024 11:57
…images <- Ingest test fixtures update (#3036)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matt, just the one comment but I think it's worth reading and removing Excel types from the list.

Approving in advance, you can make whatever changes you think appropriate based on my remarks before you merge :)

unstructured/partition/auto.py Outdated Show resolved Hide resolved
…images <- Ingest test fixtures update (#3046)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
@MthwRobinson MthwRobinson added this pull request to the merge queue May 17, 2024
Merged via the queue into main with commit ec987dc May 17, 2024
42 checks passed
@MthwRobinson MthwRobinson deleted the 3021/default-table-off branch May 17, 2024 16:04
Comment on lines 143 to +144
headers: dict[str, str] = {},
skip_infer_table_types: list[str] = [],
skip_infer_table_types: list[str] = ["pdf", "jpg", "png", "heic"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should really never do this. Initializing mutable default parameters, such as skip_infer_table_types, with a mutable data type (e.g., list, dict, etc.) can lead to unexpected behavior. The default value is shared across all instances of the function call, meaning that any modification to this parameter within one function call will affect subsequent calls. This can introduce subtle bugs that are difficult to trace and debug. Instead, initialize the parameter with None and set the default value inside the function if it is None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion @MalteHB ! Opened #3063 to address.

github-merge-queue bot pushed a commit that referenced this pull request May 23, 2024
This PR aims to add backward compatibility for the deprecated
`pdf_infer_table_structure` parameter. A missing part of turning table
extraction for PDFs and Images off by default in
#3035, which was
turned on in #2588.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Turn table extraction off by default for PDFs and images
5 participants