Feat: form parsing placeholders #3034

Merged 14 commits on May 16, 2024

Conversation

MillCheck
Contributor

Lays the groundwork for future form extraction: sets up the `FormKeysValues` element and its format, and adds an empty placeholder function call to the `partition_pdf_or_image` pipeline.
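
For readers unfamiliar with the placeholder pattern, here is a minimal sketch of what such a setup could look like. It is not the code from this PR: the `FormKeysValues` class below is a hypothetical stand-in that only borrows the name from the description, and `run_form_extraction`, `partition_with_forms`, and the `extract_forms` flag are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FormKeysValues:
    """Hypothetical stand-in for the element type named in the description."""

    text: str = ""
    # Each entry would describe one key/value pair found in a form (assumed shape).
    key_value_pairs: List[Dict[str, Any]] = field(default_factory=list)


def run_form_extraction(filename: str) -> List[FormKeysValues]:
    """Placeholder: intentionally returns nothing until extraction exists."""
    return []


def partition_with_forms(filename: str, extract_forms: bool = False) -> List[Any]:
    """Illustrative pipeline step that calls the placeholder when asked to."""
    elements: List[Any] = []  # results of the regular partitioning would go here
    if extract_forms:
        elements.extend(run_form_extraction(filename))
    return elements


result = partition_with_forms("form.pdf", extract_forms=True)
```

Because the placeholder returns an empty list, wiring it into a pipeline changes no output today, while fixing the call site and the element shape for a later implementation.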

sentry-io bot commented May 16, 2024

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: unstructured/documents/elements.py

| Function | Unhandled Issue |
| --- | --- |
| to_dict | **ValueError: operands could not be broadcast together with shapes (768,) (0,)** ... Event Count: 8 |


@MillCheck
Contributor Author

I did not touch that and I refuse to use this PR to fix that.

Contributor

@MthwRobinson MthwRobinson left a comment

Looks good! A couple of small comments.

Resolved review threads:
- example-docs/fake_form_element/form.json (outdated)
- unstructured/documents/elements.py
- unstructured/documents/form_utils.py (outdated)
- unstructured/documents/form_utils.py (outdated)
- unstructured/partition/pdf.py

@MthwRobinson
Contributor

> I did not touch that and I refuse to use this PR to fix that.

Concur, not sure why that alert appeared.

@MillCheck
Contributor Author

> > I did not touch that and I refuse to use this PR to fix that.
>
> Concur, not sure why that alert appeared.

The error itself is plausible: many NumPy, pandas, and similar objects override the `==` operator with an element-wise, broadcasting comparison, and one of the previous pull requests likely allowed such a situation to arise. Now any change to `elements.py` will trigger this notification ;)
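
To illustrate the mechanism, here is a small sketch that does not depend on NumPy. `BroadcastingVector` and `try_compare` are hypothetical names invented for this example; the class only mimics how an array-like type can make `==` element-wise and raise on incompatible shapes. The lengths 768 and 0 are taken from the alert, and nothing here is code from `elements.py`.

```python
class BroadcastingVector:
    """Hypothetical stand-in for an array type whose `==` is element-wise."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        if not isinstance(other, BroadcastingVector):
            return NotImplemented
        n, m = len(self.values), len(other.values)
        # Broadcasting rule for 1-D shapes: lengths must match, or one must be 1.
        if n != m and n != 1 and m != 1:
            raise ValueError(
                f"operands could not be broadcast together with shapes ({n},) ({m},)"
            )
        left = self.values * m if n == 1 else self.values
        right = other.values * n if m == 1 else other.values
        # The result is a list of per-element booleans, not a single bool.
        return [a == b for a, b in zip(left, right)]


def try_compare(a, b):
    """Return the comparison result, or the error message if `==` raises."""
    try:
        return a == b
    except ValueError as exc:
        return str(exc)


full = BroadcastingVector([0.0] * 768)  # same length as the first shape in the alert
empty = BroadcastingVector([])          # same length as the second shape

elementwise = try_compare(BroadcastingVector([1, 2]), BroadcastingVector([1, 3]))
mismatch = try_compare(full, empty)
print(elementwise)
print(mismatch)
```

Any code that assumes `a == b` yields a plain `bool` can therefore fail once a field holds such an object, which fits the kind of situation described above.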

Contributor

@MthwRobinson MthwRobinson left a comment

LGTM pending a minor changelog nit!

Resolved review thread: CHANGELOG.md (outdated)

@MillCheck MillCheck enabled auto-merge May 16, 2024 13:53
@MillCheck MillCheck added this pull request to the merge queue May 16, 2024
Merged via the queue into main with commit e6ada05 May 16, 2024
42 checks passed
@MillCheck MillCheck deleted the feat/form-parsing-placeholders branch May 16, 2024 14:53
github-merge-queue bot pushed a commit that referenced this pull request May 17, 2024
### Summary

Closes #3034 and re-enables ARM64 in the docker build and publish job.
This was taken out in #3039 because we've only built `libreoffice` for
AMD64 and not ARM64. If Chainguard publishes an `apk` for `libreoffice`,
we can support a Chainguard image for both architectures. The smoke test
now differs between the two architectures, to reflect differences in the
directory structure.

### Testing

Build and publish ran successfully for ARM64 (job
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/9129712470/job/25104907497))
and AMD64 (job
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/9129712470/job/25104907826)).