-
Notifications
You must be signed in to change notification settings - Fork 614
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat/Move the category field to Element #3056
feat/Move the category field to Element #3056
Conversation
assert ( | ||
elements[7].metadata.parent_id is None | ||
), "CheckBox should be None, as it's not a Text based element" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what it started failing because of this change, and didn't have time to dig deeply why moving a field up a hierarchy, causes a change in some other field. I know removing test is not great approach 😓 . But, if the #3053 is merged (so checkboxes are deprecated), that in my mind would justify removing of tests of checkboxes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah agree regarding the text boxes. What was the output of the failure when you ran the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And perhaps comment that out with a note, and then if we move forward with #3053 we can remove the test there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hubert-rutkowski85 I think you'll find the explanation for this behavior here:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common.py#L230-L234 where this quirk in the original implementation was relied upon :)
Pretty sure a change like this would fix it up:
@@ 233
- if not element_category:
+ if element_category == "UncategorizedText":
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MthwRobinson well the error was
FAILED test_unstructured/partition/test_common.py::test_set_element_hierarchy - AssertionError: CheckBox should be None, as it's not a Text based element
assert '26b389dd-3b5d-453e-92f5-a320bb68ba00' is None
+ where '26b389dd-3b5d-453e-92f5-a320bb68ba00' = <unstructured.documents.elements.ElementMetadata object at 0x7fad8d886ec0>.parent_id
+ where <unstructured.documents.elements.ElementMetadata object at 0x7fad8d886ec0> = <unstructured.documents.elements.CheckBox object at 0x7fad8d8854e0>.metadata
Will try out Steve's suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scanny indeed it fixes that case, but unfortunately it fails with 2 other:
> assert elements[3].metadata.parent_id == elements[1].id, "Text should be child of Title"
E AssertionError: Text should be child of Title
E assert None == '4987ecfd-05bd-43a7-8fec-e46d3ecc8d4e'
E + where None = <unstructured.documents.elements.ElementMetadata object at 0x7f02dbdc9420>.parent_id
E + where <unstructured.documents.elements.ElementMetadata object at 0x7f02dbdc9420> = <unstructured.documents.elements.Text object at 0x7f02dbdc8520>.metadata
E + and '4987ecfd-05bd-43a7-8fec-e46d3ecc8d4e' = <unstructured.documents.elements.Title object at 0x7f02dbdca950>.id
and
> assert elements[11].metadata.parent_id == elements[8].id, "Text should be child of Title 2"
E AssertionError: Text should be child of Title 2
E assert None == '74ad86f7-562a-4386-b24c-e8bb72359f7a'
E + where None = <unstructured.documents.elements.ElementMetadata object at 0x7f01b84bfdf0>.parent_id
E + where <unstructured.documents.elements.ElementMetadata object at 0x7f01b84bfdf0> = <unstructured.documents.elements.Text object at 0x7f01b84bfcd0>.metadata
E + and '74ad86f7-562a-4386-b24c-e8bb72359f7a' = <unstructured.documents.elements.Title object at 0x7f01b84be200>.id
I propose to not change the set_element_hierarchy
logic, as it looks like there are some more intricacies, and that would be fixing for a case which soon will be not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that seems reasonable to me. Could you spin off a follow on issue to clean that up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm how about doing that in #3053 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple small comments, but otherwise good if CI passes.
assert ( | ||
elements[7].metadata.parent_id is None | ||
), "CheckBox should be None, as it's not a Text based element" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah agree regarding the text boxes. What was the output of the failure when you ran the test?
assert ( | ||
elements[7].metadata.parent_id is None | ||
), "CheckBox should be None, as it's not a Text based element" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And perhaps comment that out with a note, and then if we move forward with #3053 we can remove the test there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
# Conflicts: # CHANGELOG.md # unstructured/__version__.py
It's pretty basic change, just literally moved the category field to Element class. Can't think of other changes that are needed here, because I think pretty much everything expected the category to be directly in elements list.
For local testing, IDE's and linters should see difference in that
category
is now in Element.