-
Notifications
You must be signed in to change notification settings - Fork 755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BREAKING CHANGE: revert table extraction off by default for PDFs and images #3035
Conversation
…images <- Ingest test fixtures update (#3036) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Matt, just the one comment but I think it's worth reading and removing Excel types from the list.
Approving in advance, you can make whatever changes you think appropriate based on my remarks before you merge :)
…nstructured into 3021/default-table-off
…images <- Ingest test fixtures update (#3046) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
headers: dict[str, str] = {}, | ||
skip_infer_table_types: list[str] = [], | ||
skip_infer_table_types: list[str] = ["pdf", "jpg", "png", "heic"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should really never do this. Initializing mutable default parameters, such as skip_infer_table_types
, with a mutable data type (e.g., list, dict, etc.) can lead to unexpected behavior. The default value is shared across all instances of the function call, meaning that any modification to this parameter within one function call will affect subsequent calls. This can introduce subtle bugs that are difficult to trace and debug. Instead, initialize the parameter with None
and set the default value inside the function if it is None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR aims to add backward compatibility for the deprecated `pdf_infer_table_structure` parameter. A missing part of turning table extraction for PDFs and Images off by default in #3035, which was turned on in #2588. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Summary
Closes #3021 . Turns table extraction for PDFs and images off by default. The default behavior originally changed in #2588 . The reason for reversion is that some users did not realize turning off table extraction was an option and experience long processing times for PDFs and images with the new default behavior.