Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save table prediction in cells format #2892

Merged
merged 21 commits into from
Apr 25, 2024
Merged

Save table prediction in cells format #2892

merged 21 commits into from
Apr 25, 2024

Conversation

plutasnyy
Copy link
Contributor

This pull request allows to return predictions in raw cell representation from table transformer. It will be later used to save prediction in a cells format for simpler metrics calculation.

This PR has to be merged, after Unstructured-IO/unstructured-inference#335

@plutasnyy plutasnyy self-assigned this Apr 15, 2024
@@ -8,9 +8,11 @@
# unstructured.documents.elements.Image
from PIL import Image as PILImage
from PIL import ImageSequence
from unstructured_inference.models.tables import cells_to_html
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in unstructured we use a decorator on functions to note if this function requires additional extras to run. And unstructured_inference is one of those extras. As a result we don't import from unstructured_inference at module level normally but import only where it is needed and decorate the function with @requires_dependencies("unstructured_inference"). You can find examples in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the explanation!


def test_simple_table_cell_parsing_from_table_transformer_when_missing_input():
table_transformer_cell = {"row_nums": [], "column_nums": [], "cell text": "text"}
with pytest.raises(ValueError):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add a expected error message pattern matching here as well to make this test more explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added


def test_simple_table_cell_parsing_from_table_transformer_when_missing_row_nums():
cell = {"row_nums": [], "column_nums": [1], "cell text": "text"}
with pytest.raises(ValueError) as exception_info:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with pytest.raises(ValueError) as exception_info:
with pytest.raises(ValueError, match=f'Cell {str(cell)} has missing values under "row_nums" key'):

Comment on lines 6 to 7
def test_simple_table_cell_parsing_from_table_transformer_when_expected_input():
table_transformer_cell = {"row_nums": [3, 2, 1], "column_nums": [6, 7], "cell text": "text"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets use parametrize and add two more test cases:

  • row nums has only 1 value
  • column nums has only 1 value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 39 to 42
width = len(column_nums)
height = len(row_nums)

return cls(x=x, y=y, w=width, h=height, content=tatr_table_cell.get("cell text", ""))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can combine 36-42 as one line

Suggested change
width = len(column_nums)
height = len(row_nums)
return cls(x=x, y=y, w=width, h=height, content=tatr_table_cell.get("cell text", ""))
width = len(column_nums)
height = len(row_nums)
return cls(x=min(column_nums), y=min(row_nums), w=len(column_nums), h=len(row_nums), content=tatr_table_cell.get("cell text", ""))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, done

Comment on lines +6 to +9
x: int
y: int
w: int
h: int
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets use full names here to make the code read more clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have used 'w' and 'h' intentially to match deckard format as is currently used in mini-holistic. Then, for example, same code can be used during evaluation

Copy link
Collaborator

@badGarnet badGarnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix changelog; other than that it looks good

@plutasnyy plutasnyy added this pull request to the merge queue Apr 25, 2024
Merged via the queue into main with commit df1f7bc Apr 25, 2024
42 checks passed
@plutasnyy plutasnyy deleted the feat/table-as-cells branch April 25, 2024 11:54
github-merge-queue bot pushed a commit that referenced this pull request May 7, 2024
This pull request add metrics that are calculated based on
table_as_cells instead of text_as_html. This change is required for
comprehensive metrics calculation, as previously every colspan or
rowspan predicted was considered to be an incorrect predicted (even if
it was correct prediction)

This change has to be merged after
#2892 which
introduces table_as_cells field
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants