Parsing specific PDF in 1.0.21 - RangeError: index out of range (works in 1.0.20) #205

Open
Laykou opened this issue Jan 19, 2022 · 6 comments

Comments

@Laykou

Laykou commented Jan 19, 2022

When trying to parse this PDF, rose_production_split_pages.pdf (file was removed), we get the following error:

RangeError:
  index out of range
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:364:in `pos='
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:364:in `_parse_'
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/parser.rb:79:in `parse'
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/pdf_public.rb:98:in `initialize'
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/api.rb:40:in `new'
  # /Users/laykou/.rvm/gems/ruby-3.1.0/gems/combine_pdf-1.0.22/lib/combine_pdf/api.rb:40:in `parse'

How we call it:

CombinePDF.parse(blob.download, allow_optional_content: true).pages

This happens on versions 1.0.21 and 1.0.22, but not on 1.0.20.

We now want to move to Ruby 3.1 and need the matrix fix that is in 1.0.22, but we cannot upgrade because of this failing PDF.
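
Until the parser itself tolerates such files, one caller-side option is to rescue the RangeError around the parse call. The following is only a minimal sketch: `parse_pages_safely` and its `parser:` argument are hypothetical names introduced here for illustration and are not part of combine_pdf. It does not recover any pages from the failing file; it only keeps one bad upload from aborting the surrounding job.

```ruby
# Hypothetical helper (not part of combine_pdf): runs the given parser
# callable and turns the RangeError from this issue into a warning.
def parse_pages_safely(data, parser:)
  parser.call(data)
rescue RangeError => e
  # Reported on stderr; the caller receives nil instead of an exception.
  warn "PDF could not be parsed (#{e.class}: #{e.message})"
  nil
end

# Intended usage with the call from this issue (requires the combine_pdf gem):
#
#   pages = parse_pages_safely(
#     blob.download,
#     parser: ->(d) { CombinePDF.parse(d, allow_optional_content: true).pages }
#   )
#   # pages is nil when the file triggers "index out of range"
```

Passing the parser as a callable keeps the sketch independent of the gem version, so the same wrapper can be used while comparing 1.0.20 and 1.0.22.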

@Laykou
Author

Laykou commented Jan 19, 2022

@boazsegev For some reason this fix b966e70 broke it

@boazsegev
Owner

Hi @Laykou

Thank you for opening this issue.

Please note my comments: here for issue #185 and here for issue #191.

I usually prefer lax parsers that allow formatting errors to be ignored when possible. However, issue #185 showed that a specific type of error cannot be safely ignored, which required that the parser become more strict.

I strongly suspect, from the description of the issue, that the specific PDF file is malformed.

Testing the PDF @ https://www.datalogics.com/products/pdf-tools/pdf-checker/ fails ... the testing suite doesn't even recognize the file as a PDF, not to mention listing the errors.

I have been authoring and maintaining this gem by myself for over 7 years and have been looking for a new maintainer for over 2 years. The community is enjoying my work, but not really contributing, so... 🤷🏼‍♂️ ... please forgive me for not investing more time and effort to solve this issue.

Kindly,
Bo.

@DimaSamodurov

DimaSamodurov commented Feb 16, 2022

Hi @boazsegev,
It appears that the Length property of a stream can be incorrect in more cases than just when the 'endstream' keyword appears within the content. In any case, preferring either way of advancing the scanner position over the other leads to issues.
Many of these issues are acceptable to end users, provided the result looks fine. For example, swallowing the "index out of range" error would fix the parsing of the attached file, so it could then be combined and the work could proceed.
Could we swallow the "index out of range" error and display a warning in this case? Would such a PR make sense?
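
To make the proposal concrete, here is a rough sketch of the idea, assuming the parser advances a `StringScanner` by the declared Length (the stack trace above ends in `pos=`, which raises RangeError when the target lies past the end of the data). This is not the gem's actual parser code; `read_stream_body` and its arguments are hypothetical. It trusts Length only when the `endstream` keyword follows, and otherwise warns and falls back to searching for the keyword.

```ruby
require 'strscan'

# Hypothetical helper, not combine_pdf code. Expects the scanner to be
# positioned at the first byte of the stream data.
def read_stream_body(scanner, declared_length)
  start = scanner.pos
  target = start + declared_length
  # Only move the scanner when the target is inside the data, so that
  # StringScanner#pos= cannot raise "index out of range".
  if declared_length >= 0 && target <= scanner.string.bytesize
    scanner.pos = target
    # Trust Length only if 'endstream' follows (optional whitespace first).
    if scanner.skip(/\s*endstream/)
      return scanner.string.byteslice(start, declared_length)
    end
    scanner.pos = start
  end
  warn "stream Length #{declared_length} looks wrong; searching for 'endstream'"
  body = scanner.scan_until(/endstream/)
  return nil unless body # no keyword at all: give up on this stream
  # Drop the keyword and the end-of-line marker that precedes it.
  body.delete_suffix('endstream').sub(/\r?\n\z/, '')
end
```

The fallback branch is exactly the lax behaviour that issue #185 is said to have shown to be unsafe when `endstream` occurs inside the stream content, so a real PR would have to decide when that risk is acceptable.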

@Laykou
Author

Laykou commented May 13, 2024

Do you think this could be fixed in a newer version?

@julitrows

Getting index out of range (RangeError) on a user uploaded PDF in version 1.0.26 as well.

@mtwzim

mtwzim commented May 23, 2024

Some pull requests have been created that could possibly solve this problem, but so far they have not been merged, and the problem still occurs almost a year after the PRs were submitted.

#209
#215

Can you take a look at them?
