Parsing specific PDF in 1.0.21 - RangeError: index out of range (works in 1.0.20) #205

Laykou · 2022-01-19T21:59:55Z

When we try to parse this PDF, rose_production_split_pages.pdf (file was removed), we get the following error:

 RangeError:
       index out of range
      # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:364:in `pos='
     # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:364:in `_parse_'
     # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:79:in `parse'
     # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/pdf_public.rb:98:in `initialize'
     # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/api.rb:40:in `new'
     # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/api.rb:40:in `parse'

How we call it:

CombinePDF.parse(blob.download, allow_optional_content: true).pages

This happens on versions 1.0.21 and 1.0.22, but not on 1.0.20.

We now want to move to Ruby 3.1, which requires the matrix fix included in 1.0.22, but we cannot upgrade because this PDF fails to parse.
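
Until the parser itself handles such files, a caller-side guard can at least keep one failing upload from aborting the whole job. The following is only a minimal sketch: `parse_pages_or_nil` is a hypothetical helper, not part of combine_pdf, and it does not recover any pages from the failing file. It only turns the RangeError into a warning and a nil result.

```ruby
# Hypothetical wrapper (not part of combine_pdf): runs the given parsing block
# and converts a RangeError into a warning plus a nil result.
def parse_pages_or_nil(label)
  yield
rescue RangeError => e
  warn "Skipping #{label}: #{e.class}: #{e.message}"
  nil
end

# Usage, assuming the combine_pdf gem is loaded and `blob` is the upload
# from the call shown above:
#
#   pages = parse_pages_or_nil('rose_production_split_pages.pdf') do
#     CombinePDF.parse(blob.download, allow_optional_content: true).pages
#   end
```

The caller still has to decide what a nil result means, for example rejecting the upload or retrying with 1.0.20.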

Laykou · 2022-01-19T22:06:32Z

@boazsegev For some reason, the fix in b966e70 broke it.

boazsegev · 2022-01-21T23:23:44Z

Hi @Laykou

Thank you for opening this issue.

Please note my comments: here for issue #185 and here for issue #191.

I usually prefer lax parsers that allow formatting errors to be ignored when possible. However, issue #185 showed that a specific type of error cannot be safely ignored, which required that the parser become more strict.

I strongly suspect, from the description of the issue, that the specific PDF file is malformed.

Testing the PDF at https://www.datalogics.com/products/pdf-tools/pdf-checker/ fails: the checker doesn't even recognize the file as a PDF, let alone list its errors.

I have been authoring and maintaining this gem by myself for over 7 years and have been looking for a new maintainer for over 2 years. The community is enjoying my work, but not really contributing, so... 🤷🏼‍♂️ ... please forgive me for not investing more time and effort to solve this issue.

Kindly,
Bo.

DimaSamodurov · 2022-02-16T00:29:44Z

Hi @boazsegev,
It appears that the stream's Length property can be incorrect in more cases than those where the 'endstream' keyword appears within the content. Either way, preferring one method of advancing the scanner position over the other leads to issues.
Many of these issues are acceptable to end users, provided the result looks fine. For example, swallowing the "index out of range" error would fix parsing of the attached file, so it could be combined and the work could proceed.
Can we swallow the "index out of range" error and display a warning in this case? Would such a PR make sense?
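
To make the proposal concrete, here is a rough sketch of the idea using only Ruby's standard StringScanner. It is not the gem's actual parser code: `read_stream_body` and its arguments are hypothetical names chosen for illustration. It trusts the declared Length when it fits inside the data, and otherwise prints a warning and falls back to searching for the 'endstream' keyword.

```ruby
require 'strscan'

# Hypothetical helper (not part of combine_pdf). Returns the raw stream body
# and leaves the scanner positioned right after it, or returns nil when the
# body cannot be delimited at all.
def read_stream_body(scanner, declared_length)
  start = scanner.pos
  begin
    # StringScanner#pos= raises RangeError when the target is past the end
    # of the data, which is the error shown in the stack trace.
    scanner.pos = start + declared_length
  rescue RangeError
    warn "Stream Length #{declared_length} is out of range; searching for 'endstream' instead"
    scanner.pos = start
    # skip_until returns nil (and does not move) when the keyword is missing.
    return nil unless scanner.skip_until(/endstream/)
    # Step back so the scanner sits just before the keyword.
    scanner.pos -= 'endstream'.bytesize
  end
  scanner.string.byteslice(start, scanner.pos - start)
end

data = "stream\nhello\nendstream"
scanner = StringScanner.new(data)
scanner.pos = 7                          # just after "stream\n"
body = read_stream_body(scanner, 9999)   # deliberately wrong Length
```

The fallback inherits the problem from issue #185: if the content itself contains 'endstream', the search stops too early. That is why the sketch warns rather than failing silently.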

Laykou · 2024-05-13T14:41:54Z

Do you think this could be fixed in a newer version?

julitrows · 2024-05-14T10:14:50Z

Getting index out of range (RangeError) on a user uploaded PDF in version 1.0.26 as well.

mtwzim · 2024-05-23T20:39:22Z

Some pull requests have been created that could possibly solve this problem, but so far they have not been merged, and the problem still occurs almost a year after the PRs were submitted.

#209
#215

Can you take a look at them?

funkypierre mentioned this issue Apr 5, 2022:
Swallow "index out of range" error #209 (Open)

This was referenced Apr 22, 2022:
Swallow 'index out of range' error #211 (Closed)
Swallow 'index out of range' error decisely/combine_pdf#1 (Open)

stiaannel mentioned this issue Mar 20, 2023:
1.0.21 causes previously-consumable PDFs to fail now with RangeError #191 (Open)

This was referenced Jun 22, 2023:
Fix 'pos=': index out of range (RangeError) #229 (Closed)
Fix 'pos=': index out of range (RangeError) moser-inc/combine_pdf#1 (Merged)
