pdfplumber.parser all_row_ids is out of the range in the case of empty page (no words on the page) #191

Closed
egork520 opened this issue Jan 6, 2023 · 1 comment
egork520 commented Jan 6, 2023

Hello @kyleclo, I found an issue: all_row_ids[-1] (and likewise all_word_ids[-1]) is referenced even when no words are detected on the page, which raises an IndexError on the empty list. I could try to fix it by first checking whether the list is empty, but if you know a better fix, please let me know.

Here is a screenshot of the page:
Screen Shot 2023-01-06 at 10 53 07 AM

And the paper:

f87f9a26543e03c985867d0dbff1b900ecb6e46d.pdf

Here is the stack trace:

`File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/parsers/pdfplumber_parser.py:170, in PDFPlumberParser.parse(self, input_pdf_path)
    166 all_tokens.extend(fine_tokens)
    167 all_row_ids.extend(
    168     [i + last_row_id + 1 for i in line_ids_of_fine_tokens]
    169 )
--> 170 last_row_id = all_row_ids[-1]
    171 all_word_ids.extend(
    172     [i + last_word_id + 1 for i in word_ids_of_fine_tokens]
    173 )
    174 last_word_id = all_word_ids[-1]

IndexError: list index out of range
`
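The empty-list check mentioned above could look roughly like the sketch below. This is only an illustration, not the code from the actual fix: the helper name extend_with_offset, the initial value of -1 for last_row_id, and the stand-in page data are all assumptions made for the example. Only all_row_ids, last_row_id and line_ids_of_fine_tokens come from the stack trace.

```python
from typing import List


def extend_with_offset(all_ids: List[int], new_ids: List[int], last_id: int) -> int:
    """Append new_ids shifted by last_id + 1 and return the updated last id.

    Hypothetical helper, not part of mmda. When new_ids is empty (a page
    with no words), nothing is appended and last_id is returned unchanged,
    so all_ids[-1] is never evaluated on an empty list.
    """
    if not new_ids:
        return last_id
    all_ids.extend(i + last_id + 1 for i in new_ids)
    return all_ids[-1]


# Stand-in per-page line ids; the middle page has no words at all.
pages = [[0, 0, 1], [], [0, 1]]

all_row_ids: List[int] = []
last_row_id = -1  # assumed starting value for this sketch
for line_ids_of_fine_tokens in pages:
    last_row_id = extend_with_offset(
        all_row_ids, line_ids_of_fine_tokens, last_row_id
    )
```

The same guard would apply to all_word_ids and last_word_id, since line 174 of the trace indexes all_word_ids[-1] in the same way.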

egork520 commented Jan 11, 2023

Link to the fix: PR
