Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix surrounding context extraction in ParsrConverter #2162

Merged
merged 7 commits into from
Feb 24, 2022

Conversation

bogdankostic
Copy link
Contributor

@bogdankostic bogdankostic commented Feb 10, 2022

This PR fixes the extraction of surrounding contexts of tables in the ParsrConverter. Previously, the surrounding context was always empty if remove_page_headers or remove_page_footers was set to False. Also, preceding_context_len and following_context_len was not taken into account.

Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the changes and our Parsr implementation now returns the context.

It does not return sentences but whole lines, so if a single line consists of many sentences a lot of text is returned.
Please open another issue for this to keep for later. I don't think we should work on this now.

But I would like to see in both Parsr and Azure some tests for the additional metadata extraction, especially since we are using external tools that might change return types without our notice. Can you please add some tests for the metadata extraction?

@bogdankostic bogdankostic requested review from Timoeller and removed request for tholor February 17, 2022 21:06
@bogdankostic
Copy link
Contributor Author

I added the tests and created issue #2217. Furthermore, I unified ParsrConverter's and AzureConverter's behavior. (ParsrConverter was previously returning whole paragraphs instead of lines as surrounding context.)

Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

@bogdankostic bogdankostic merged commit b03e9f5 into master Feb 24, 2022
@bogdankostic bogdankostic deleted the fix_parsr_surrounding_context branch February 24, 2022 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

preceeding and following context not working for parsr
2 participants