Bugfix/egork/all row ids out of range #192

Closed
wants to merge 17 commits into from

egork520
Contributor

@egork520 egork520 commented Jan 6, 2023

Fixing issue #191

@egork520 egork520 requested a review from kyleclo January 6, 2023 19:18
@kyleclo
Collaborator

kyleclo commented Jan 6, 2023

can you add a test case that captures what this PR is supposed to fix?

@egork520
Contributor Author

> can you add a test case that captures what this PR is supposed to fix?

Hello @kyleclo, I think it is ready now.

@egork520
Contributor Author

@kyleclo I've added 2 more test cases, one with an empty page followed by a non-empty page and one with a non-empty page followed by an empty page. Please take a look.

@@ -30,6 +30,32 @@ def test_parse(self):
        for keyword in ["Field", "Task", "SOTA", "Base", "Frozen", "Finetune", "NER"]:
            assert keyword in doc.symbols[:100]

    def test_parse_empty_page(self):
Collaborator

something is weird w/ these 3 new tests. if there's an empty page, why is len(doc.pages) == 0 and not len(doc.pages) == 1?

Contributor Author

In this test I have one page, which is empty. Because of the checks in place, a page that has no tokens is not recorded. Do you think we should still have a page count > 0 in the case of an empty page?

@@ -214,11 +214,11 @@ def parse(self, input_pdf_path: str) -> Document:
 all_row_ids.extend(
     [i + last_row_id + 1 for i in line_ids_of_fine_tokens]
 )
-last_row_id = all_row_ids[-1]
+last_row_id = all_row_ids[-1] if all_word_ids else -1
Collaborator

why is this necessary?

Contributor Author

On empty pages, all_word_ids is going to be empty. I check for that before indexing with [-1], which raises an IndexError on an empty list.
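
For illustration, here is a minimal sketch of the failure mode described above. It is not the parser's actual code: the helper name `accumulate_row_ids` and the input shape (one list of 0-based line ids per page) are assumptions made for the example. The sketch also guards on the list it actually indexes, `all_row_ids`, rather than on `all_word_ids` as the diff does.

```python
from typing import List


def accumulate_row_ids(pages: List[List[int]]) -> List[int]:
    """Offset per-page line ids into document-wide row ids.

    Hypothetical helper for illustration only. `pages` holds, for each
    page, the 0-based line id of every token on that page.
    """
    all_row_ids: List[int] = []
    last_row_id = -1
    for line_ids in pages:
        # Shift this page's line ids so they continue after the previous page.
        all_row_ids.extend(i + last_row_id + 1 for i in line_ids)
        # Guard the lookup: indexing an empty list with [-1] raises
        # IndexError, which is what an empty first page would trigger.
        last_row_id = all_row_ids[-1] if all_row_ids else -1
    return all_row_ids
```

With this guard, an empty page anywhere in the document contributes no row ids, and the numbering on later pages continues from the last non-empty page.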

@egork520 egork520 closed this Apr 21, 2023