-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix/egork/all row ids out of range #192
Conversation
can you add a test case that captures what this PR is supposed to fix? |
…row_ids_out_of_range
…row_ids_out_of_range
…row_ids_out_of_range
Hello @kyleclo I think it is ready now |
@kyleclo I've added 2 more test cases with empty-non-empy and non-empty-empty pages, please take a look. |
@@ -30,6 +30,32 @@ def test_parse(self): | |||
for keyword in ["Field", "Task", "SOTA", "Base", "Frozen", "Finetune", "NER"]: | |||
assert keyword in doc.symbols[:100] | |||
|
|||
def test_parse_empty_page(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something is weird w/ these 3 new tests. if there's an empty page, why is len(doc.pages) == 0
and not len(doc.pages) == 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this test the I have one page which is empty. Due to the checks done, if the page doesn't have tokens it is not getting recorded. Do you think we still should have page count > 0 in the case of an empty page?
@@ -214,11 +214,11 @@ def parse(self, input_pdf_path: str) -> Document: | |||
all_row_ids.extend( | |||
[i + last_row_id + 1 for i in line_ids_of_fine_tokens] | |||
) | |||
last_row_id = all_row_ids[-1] | |||
last_row_id = all_row_ids[-1] if all_word_ids else -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One the empty pages all_word_ids is going to be empty. I am doing a check before referencing element -1, which raises an error in case of an empty list.
Fixing issue #191