pdfplumber.parser all_row_ids is out of the range in the case of empty page (no words on the page) #191

egork520 · 2023-01-06T18:58:07Z

Hello @kyleclo, I found an issue: when no words are detected on a page, the parser indexes `all_word_ids[-1]` (and `all_row_ids[-1]`, which is the line that fails in the trace below) on an empty list. I could try to fix it by first checking whether the list is empty, but if you know a better fix, please let me know.

Here is a screenshot of the page:

And the paper:

f87f9a26543e03c985867d0dbff1b900ecb6e46d.pdf

Here is the stack trace:

```
File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/parsers/pdfplumber_parser.py:170, in PDFPlumberParser.parse(self, input_pdf_path)
    166 all_tokens.extend(fine_tokens)
    167 all_row_ids.extend(
    168     [i + last_row_id + 1 for i in line_ids_of_fine_tokens]
    169 )
--> 170 last_row_id = all_row_ids[-1]
    171 all_word_ids.extend(
    172     [i + last_word_id + 1 for i in word_ids_of_fine_tokens]
    173 )
    174 last_word_id = all_word_ids[-1]

IndexError: list index out of range
```
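For illustration, here is a minimal sketch of the "check whether the list is empty" idea, not the actual patch. The helper name `extend_with_offset` and the starting value `-1` for the running id are assumptions made for this example; only the offset expression `i + last_id + 1` is taken from the trace above.

```python
from typing import List


def extend_with_offset(all_ids: List[int], page_ids: List[int], last_id: int) -> int:
    """Append page-local ids shifted past last_id and return the new last id.

    Hypothetical helper: mirrors the pattern in the trace, but only reads
    all_ids[-1] when the page actually contributed ids.
    """
    all_ids.extend(i + last_id + 1 for i in page_ids)
    if page_ids:
        # Safe: all_ids is guaranteed non-empty here.
        last_id = all_ids[-1]
    # For an empty page, last_id is returned unchanged.
    return last_id


# Assumed initial state: no ids yet, running id starts at -1.
all_row_ids: List[int] = []
last_row_id = -1

# An empty first page no longer raises IndexError.
last_row_id = extend_with_offset(all_row_ids, [], last_row_id)

# A later page with three tokens on two lines is offset as before.
last_row_id = extend_with_offset(all_row_ids, [0, 0, 1], last_row_id)
print(all_row_ids, last_row_id)  # [0, 0, 1] 1
```

The same guard would apply to `all_word_ids` and `last_word_id`. Checking `page_ids` rather than `all_ids` keeps the running id untouched for empty pages in any position, not just the first one.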

egork520 · 2023-01-11T17:44:52Z

Link to the fix: PR

egork520 added a commit that referenced this issue Jan 6, 2023

Adding a check if all_rows_ids is not empty. Link to the issue #191

25014a9

egork520 mentioned this issue Jan 6, 2023

Bugfix/egork/all row ids out of range #192

Closed

egork520 added a commit that referenced this issue Jan 6, 2023

Adding a check if token_dicts is not empty. Link to the issue #191

75674d7

jaronjaron assigned egork520 Jan 9, 2023

jaronjaron closed this as completed Mar 6, 2023
