Contributor

@cragwolfe cragwolfe commented Nov 20, 2023

Executive Summary

Eyeballing or saving the HTML of a Table element (stored in the metadata.text_as_html field) takes some manual effort. This script provides a quick way to do so, given an unstructured .json file that adheres to the usual schema (i.e., the one returned by the Unstructured API).
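To make the data layout concrete, here is a minimal Python sketch of the kind of extraction described above. It is not the code of u-tables-inspect.sh (whose source is not shown in this conversation); it only assumes that the .json file is a list of element objects, each with a "type" key and a "metadata" object that may hold "text_as_html". The function names `extract_table_html` and `save_tables`, and the `table-<n>.html` output naming, are hypothetical.

```python
import json
from pathlib import Path


def extract_table_html(elements):
    """Return the text_as_html string of every Table element, in document order."""
    tables = []
    for element in elements:
        if element.get("type") != "Table":
            continue
        html = element.get("metadata", {}).get("text_as_html")
        if html:  # skip Table elements that lack the field
            tables.append(html)
    return tables


def save_tables(json_path, out_dir):
    """Write each table's HTML to out_dir/table-<n>.html and return the paths.

    Assumption: json_path holds a JSON list of elements in the usual schema.
    The output file naming is purely illustrative.
    """
    elements = json.loads(Path(json_path).read_text(encoding="utf-8"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, html in enumerate(extract_table_html(elements), start=1):
        path = out / f"table-{i}.html"
        path.write_text(html, encoding="utf-8")
        paths.append(path)
    return paths
```

The saved files can then be opened in a browser to eyeball each table's structure.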

Testing Instructions

Get some unstructured output that includes a table, e.g., from this file:

124_PDFsam_Basel III - Finalising post-crisis reforms.pdf

./unstructured-get-json.sh --tables --hi-res \
  124_PDFsam_Basel\ III\ -\ Finalising\ post-crisis\ reforms.pdf

Then use the following script to view the structure and content of the tables (note that the output file was copied to the clipboard by the prior command):

./u-tables-inspect.sh \
"<snip>/tmp/unst-outputs/124_PDFsam_Basel III - Finalising post-crisis reforms.pdf-hi-res.json"

@cragwolfe
Contributor Author

*CI cancelled after the shellcheck check passed -- no point in burning CI minutes for an irrelevant change.

Contributor

@christinestraub christinestraub left a comment

LGTM! This looks like a very useful script for table visualization. 🎉

Screenshot_1
Screenshot_2

@cragwolfe cragwolfe merged commit d7456ab into main Nov 22, 2023
@cragwolfe cragwolfe deleted the crag/add-tables-inspect-script branch November 22, 2023 06:18