feat/write_elements #269

Closed
Matthieu-Tinycoaching opened this issue Feb 23, 2023 · 7 comments · Fixed by #273
Labels
enhancement New feature or request

Comments

@Matthieu-Tinycoaching

Hi,

Is there a way to write List[Element] data to a file and load it back later, to avoid partitioning the data each time?

@Matthieu-Tinycoaching Matthieu-Tinycoaching added the enhancement New feature or request label Feb 23, 2023
@MthwRobinson
Contributor

@Matthieu-Tinycoaching - Check out convert_to_isd and isd_to_elements in our docs. You can serialize to and deserialize from JSON like this:

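import json

from unstructured.staging.base import convert_to_isd, isd_to_elements

# `elements` is the list produced by partitioning; write it once, reload it later.
with open("elements.json", "w") as f:
    json.dump(convert_to_isd(elements), f)

with open("elements.json", "r") as f:
    elements = isd_to_elements(json.load(f))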

Would that meet your needs? We'd also be happy to include an elements_to_json and elements_from_json to wrap that.
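As a rough illustration of what such wrappers could look like, here is a minimal sketch. The names save_elements_json and load_elements_json are hypothetical and are not the library's API. The converters are passed in as arguments so the sketch runs on its own; in practice you would presumably pass convert_to_isd and isd_to_elements.

```python
import json
from typing import Any, Callable, Dict, List


def save_elements_json(
    elements: List[Any],
    filename: str,
    to_isd: Callable[[List[Any]], List[Dict[str, Any]]],
) -> None:
    """Hypothetical helper: convert elements to dicts and write them as JSON."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(to_isd(elements), f, indent=2)


def load_elements_json(
    filename: str,
    from_isd: Callable[[List[Dict[str, Any]]], List[Any]],
) -> List[Any]:
    """Hypothetical helper: read JSON dicts and convert them back to elements."""
    with open(filename, "r", encoding="utf-8") as f:
        return from_isd(json.load(f))
```

Usage would then be along the lines of save_elements_json(elements, "elements.json", convert_to_isd) followed by load_elements_json("elements.json", isd_to_elements).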

@Matthieu-Tinycoaching
Author

Hi @MthwRobinson thanks for the tip! It seems to do the job.

However, when I try this on the example file layout-parser-paper.pdf, the FigureCaption and Text elements seem to be lost when reloading from the JSON file.

Counting the types of elements in the document right after parsing gives:
Counter({<class 'unstructured.documents.elements.ListItem'>: 223, <class 'unstructured.documents.elements.NarrativeText'>: 69, <class 'unstructured.documents.elements.Title'>: 15, <class 'unstructured.documents.elements.FigureCaption'>: 6, <class 'unstructured.documents.elements.Text'>: 2})

Doing the same count after reloading from the JSON file gives:
Counter({<class 'unstructured.documents.elements.ListItem'>: 223, <class 'unstructured.documents.elements.NarrativeText'>: 69, <class 'unstructured.documents.elements.Title'>: 15})

Any idea?
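For anyone who wants to run the same comparison, a check like the one described above can be sketched as follows. The helper names type_counts and lost_types are hypothetical; the code only relies on the standard library and works on any list of objects.

```python
from collections import Counter
from typing import Iterable


def type_counts(elements: Iterable[object]) -> Counter:
    """Count elements by class name."""
    return Counter(type(el).__name__ for el in elements)


def lost_types(before: Iterable[object], after: Iterable[object]) -> Counter:
    """Return how many elements of each type are missing after a round trip.

    Counter subtraction keeps only positive counts, so types that survived
    intact do not appear in the result.
    """
    return type_counts(before) - type_counts(after)
```

With the counts reported above, lost_types(elements, reloaded) would contain FigureCaption: 6 and Text: 2.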

@MthwRobinson
Contributor

Looks like we need to add handling for those element types in elements_to_isd. As of now, I think the element metadata gets lost when you read it in from JSON too. I'll update the issue description to reflect that. We'll be able to get a fix in for that shortly. Thanks for flagging!

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/staging/base.py#L26-L33
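To illustrate how element types can go missing on a round trip, here is a self-contained sketch of one plausible failure mode: a deserializer that maps a "type" string to a class and skips anything it does not recognize. This is an assumption for illustration only and does not reproduce the code at the linked lines. The classes and the dicts_to_elements function are stand-ins, not the library's definitions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Type


@dataclass
class Text:  # stand-in for a base element class
    text: str


class Title(Text): ...
class FigureCaption(Text): ...


def dicts_to_elements(
    records: List[Dict[str, Any]],
    type_map: Dict[str, Type[Text]],
    fallback: Optional[Type[Text]] = None,
) -> List[Text]:
    """Rebuild elements from dicts, looking up each class by its type name."""
    elements: List[Text] = []
    for record in records:
        cls = type_map.get(record["type"], fallback)
        if cls is None:
            # No mapping and no fallback: the element is silently dropped.
            continue
        elements.append(cls(text=record["text"]))
    return elements
```

In this sketch, a type_map that lacks "FigureCaption" drops those records entirely, while adding the missing entry (or a fallback class) keeps them.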

@MthwRobinson
Contributor

Added #270 to capture the missing elements. Will add some serialization/deserialization helper functions while we're in there.

@MthwRobinson
Contributor

@Matthieu-Tinycoaching - There's a PR up to address the issue you flagged. It also adds helper functions for saving to and loading from JSON.

@Matthieu-Tinycoaching
Author

@MthwRobinson nice!
Do you know when the new release with this fix will be out?

@MthwRobinson
Contributor

MthwRobinson commented Feb 23, 2023

Right now! Just released the update in 0.4.15.
