Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat/Move the category field to Element #3056

Merged
merged 11 commits into from
May 23, 2024

Conversation

hubert-rutkowski85
Copy link
Contributor

@hubert-rutkowski85 hubert-rutkowski85 commented May 20, 2024

It's pretty basic change, just literally moved the category field to Element class. Can't think of other changes that are needed here, because I think pretty much everything expected the category to be directly in elements list.

For local testing, IDE's and linters should see difference in that category is now in Element.

@hubert-rutkowski85 hubert-rutkowski85 linked an issue May 20, 2024 that may be closed by this pull request
@Unstructured-IO Unstructured-IO deleted a comment from sentry-io bot May 20, 2024
Comment on lines -418 to -420
assert (
elements[7].metadata.parent_id is None
), "CheckBox should be None, as it's not a Text based element"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what it started failing because of this change, and didn't have time to dig deeply why moving a field up a hierarchy, causes a change in some other field. I know removing test is not great approach 😓 . But, if the #3053 is merged (so checkboxes are deprecated), that in my mind would justify removing of tests of checkboxes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agree regarding the text boxes. What was the output of the failure when you ran the test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And perhaps comment that out with a note, and then if we move forward with #3053 we can remove the test there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hubert-rutkowski85 I think you'll find the explanation for this behavior here:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common.py#L230-L234 where this quirk in the original implementation was relied upon :)

Pretty sure a change like this would fix it up:

@@ 233
- if not element_category:
+ if element_category == "UncategorizedText":

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MthwRobinson well the error was

FAILED test_unstructured/partition/test_common.py::test_set_element_hierarchy - AssertionError: CheckBox should be None, as it's not a Text based element
assert '26b389dd-3b5d-453e-92f5-a320bb68ba00' is None
 +  where '26b389dd-3b5d-453e-92f5-a320bb68ba00' = <unstructured.documents.elements.ElementMetadata object at 0x7fad8d886ec0>.parent_id
 +    where <unstructured.documents.elements.ElementMetadata object at 0x7fad8d886ec0> = <unstructured.documents.elements.CheckBox object at 0x7fad8d8854e0>.metadata

Will try out Steve's suggestion

Copy link
Contributor Author

@hubert-rutkowski85 hubert-rutkowski85 May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scanny indeed it fixes that case, but unfortunately it fails with 2 other:

>       assert elements[3].metadata.parent_id == elements[1].id, "Text should be child of Title"
E       AssertionError: Text should be child of Title
E       assert None == '4987ecfd-05bd-43a7-8fec-e46d3ecc8d4e'
E        +  where None = <unstructured.documents.elements.ElementMetadata object at 0x7f02dbdc9420>.parent_id
E        +    where <unstructured.documents.elements.ElementMetadata object at 0x7f02dbdc9420> = <unstructured.documents.elements.Text object at 0x7f02dbdc8520>.metadata
E        +  and   '4987ecfd-05bd-43a7-8fec-e46d3ecc8d4e' = <unstructured.documents.elements.Title object at 0x7f02dbdca950>.id

and

> assert elements[11].metadata.parent_id == elements[8].id, "Text should be child of Title 2"
E       AssertionError: Text should be child of Title 2
E       assert None == '74ad86f7-562a-4386-b24c-e8bb72359f7a'
E        +  where None = <unstructured.documents.elements.ElementMetadata object at 0x7f01b84bfdf0>.parent_id
E        +    where <unstructured.documents.elements.ElementMetadata object at 0x7f01b84bfdf0> = <unstructured.documents.elements.Text object at 0x7f01b84bfcd0>.metadata
E        +  and   '74ad86f7-562a-4386-b24c-e8bb72359f7a' = <unstructured.documents.elements.Title object at 0x7f01b84be200>.id

I propose to not change the set_element_hierarchy logic, as it looks like there are some more intricacies, and that would be fixing for a case which soon will be not needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that seems reasonable to me. Could you spin off a follow on issue to clean that up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm how about doing that in #3053 ?

@hubert-rutkowski85 hubert-rutkowski85 marked this pull request as ready for review May 21, 2024 16:09
Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple small comments, but otherwise good if CI passes.

Comment on lines -418 to -420
assert (
elements[7].metadata.parent_id is None
), "CheckBox should be None, as it's not a Text based element"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agree regarding the text boxes. What was the output of the failure when you ran the test?

CHANGELOG.md Show resolved Hide resolved
Comment on lines -418 to -420
assert (
elements[7].metadata.parent_id is None
), "CheckBox should be None, as it's not a Text based element"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And perhaps comment that out with a note, and then if we move forward with #3053 we can remove the test there.

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Merged via the queue into main with commit b8d894f May 23, 2024
46 checks passed
@hubert-rutkowski85 hubert-rutkowski85 deleted the 3055-featmove-the-category-field-to-element branch May 23, 2024 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat/Move the category field to Element
3 participants