parsing of PaperPort PDFs fails #36

vitstradal · 2015-11-18T12:02:52Z

Hi,

I have a PDF created by PaperPort 12 (probably http://www.nuance.com/imaging/paperport/paperport-upgrade-to-12.asp, but I'm not sure), and combine_pdf creates a document with empty pages.

How to reproduce:

Download the PDF: http://vitas.matfyz.cz/tmp/couldnt-connect-refrence.pdf (it contains 2 pages, each a scanned image of a paper document), then run:

require "combine_pdf"

# Load the PaperPort file, stamp page numbers, and write the result.
pdf = CombinePDF.load("./couldnt-connect-refrence.pdf")
pdf.number_pages
pdf.save "out.pdf"

output is:

$ ruby com.rb 
couldn't connect a reference!!! could be a null or removed (empty) object, Silent error!!!
 Object raising issue: {:is_reference_only=>true, :indirect_generation_number=>0, :indirect_reference_id=>8, :referenced_object=>nil}
couldn't connect a reference!!! could be a null or removed (empty) object, Silent error!!!
 Object raising issue: {:is_reference_only=>true, :indirect_generation_number=>0, :indirect_reference_id=>7, :referenced_object=>nil}
couldn't connect a reference!!! could be a null or removed (empty) object, Silent error!!!
 Object raising issue: {:is_reference_only=>true, :indirect_generation_number=>0, :indirect_reference_id=>11, :referenced_object=>nil}
couldn't connect a reference!!! could be a null or removed (empty) object, Silent error!!!
 Object raising issue: {:is_reference_only=>true, :indirect_generation_number=>0, :indirect_reference_id=>10, :referenced_object=>nil}

more info:

$ pdfinfo couldnt-connect-refrence.pdf 
Title:          
Subject:        
Keywords:       
Author:         
Creator:        PaperPort 12
Producer:       PaperPort 12
CreationDate:   Fri Nov 13 13:42:58 2015
ModDate:        Fri Nov 13 13:42:58 2015
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          2
Encrypted:      no
Page size:      612 x 841.68 pts
Page rot:       0
File size:      466188 bytes
Optimized:      no
PDF version:    1.3

$ pdfinfo out.pdf 
Title:          
Subject:        
Keywords:       
Author:         
Creator:        PaperPort 12
Producer:       Ruby CombinePDF 0.2.11 Library
CreationDate:   Wed Nov 18 12:57:23 2015
ModDate:        Wed Nov 18 12:57:23 2015
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          2
Encrypted:      no
Page size:      612 x 841.68 pts
Page rot:       0
File size:      240627 bytes
Optimized:      no
PDF version:    1.3


boazsegev · 2015-11-18T15:34:32Z

Hi vitstradal,

Thank you for reporting this issue!

I have a busy day ahead of me, but I will try to look into it as soon as possible.

boazsegev · 2015-11-19T04:29:37Z

Hi Vitstradal,

I found the issue...

PaperPort has a bug that places PDF data inside a PDF comment.

PDF comments start with a "%" sign and end with an EOL marker ("\r" or "\n"). PaperPort omitted the EOL marker, so critical data ended up inside the comment.

I wrote a work-around that parses the comment's data and attempts to salvage the misplaced critical information.

This workaround assumes that a comment never contains parsable PDF data at the very end of its line, which is an unsafe assumption. Hence, if I get reports that this workaround breaks valid PDF files with comments, I might remove it!
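To illustrate the idea, here is a minimal sketch of the difference between a strict comment stripper and a lenient one that salvages an object header glued to the end of a comment. This is not CombinePDF's actual parser code: the method names, the regular expressions and the `BROKEN` sample are all assumptions made up for this example, and a real parser would also have to avoid touching "%" bytes inside strings and streams.

```ruby
# Hypothetical sketch, not CombinePDF's real implementation.

# Strict reading: everything from "%" to the end of the line is a comment.
def strip_comments_strict(data)
  data.gsub(/%[^\r\n]*/, "")
end

# Lenient reading: if the comment ends with something that looks like the
# start of an indirect object ("<id> <generation> obj"), keep that tail.
def strip_comments_lenient(data)
  data.gsub(/%[^\r\n]*/) do |comment|
    salvaged = comment[/\d+ \d+ obj\b.*\z/]
    salvaged ? "\n#{salvaged}" : ""
  end
end

# Assumed sample: the EOL between the comment and object 8 is missing.
BROKEN = "%some comment8 0 obj\n<< /Type /XObject >>\nendobj\n"

puts strip_comments_strict(BROKEN).include?("8 0 obj")
puts strip_comments_lenient(BROKEN).include?("8 0 obj")
```

The strict version loses the `8 0 obj` header together with the comment, while the lenient one keeps it. The same heuristic would also "salvage" a legitimate comment that happens to end in text shaped like an object header, which is exactly the unsafe assumption described above.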

I'm running some tests and I will release an updated version shortly.

boazsegev · 2015-11-19T05:14:27Z

I released the updated version (v. 0.2.12). It works on my computer... please let me know if it's working for you.

vitstradal · 2015-11-20T10:19:49Z

> if I get reports that this workaround breaks valid PDF files with comments, I might remove it!

I understand. PaperPort is buggy, but what do you expect for €50 :-)

Anyway, it is odd that an ordinary PDF viewer (evince, in my case) can parse it.

Thank you very much.

boazsegev · 2015-11-20T16:37:42Z

I'm happy this works for you :-)

As for an ordinary PDF viewer being able to read the file:

At the end of the PDF file there is something called an X-Ref (cross-reference) table. This table tells the PDF viewer the byte offset of each object within the file.

Normally, PDF viewers follow the X-Ref table and jump straight to each object using its byte offset (even if the data starts inside a comment line).
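A rough sketch of that lookup, to show why a missing EOL does not bother a viewer. Everything here is an assumption for illustration: `build_sample` and `xref_offsets` are made-up helpers, the sample file has a single classic X-Ref section, and real viewers handle many more cases (incremental updates, cross-reference streams, damaged tables).

```ruby
# Hypothetical sketch of an X-Ref based lookup; not how CombinePDF reads files.

# Builds a tiny PDF-like string in which object 1 starts right after a
# comment, with no EOL in between.
def build_sample
  body = String.new("%PDF-1.3\n% a comment with no EOL")
  offset = body.bytesize                       # byte offset of object 1
  body << "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
  xref_at = body.bytesize                      # byte offset of the table
  body << "xref\n0 2\n"
  body << "0000000000 65535 f \n"              # entry 0 is always free
  body << format("%010d 00000 n \n", offset)   # entry 1 is in use
  body << "trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n#{xref_at}\n%%EOF\n"
  body
end

# Returns { object id => byte offset } for the in-use entries of the table.
def xref_offsets(pdf)
  xref_at = pdf[/startxref\s+(\d+)\s+%%EOF/, 1].to_i
  lines = pdf.byteslice(xref_at, pdf.bytesize).lines
  first_id, count = lines[1].split.map(&:to_i) # lines[0] is "xref"
  offsets = {}
  lines[2, count].each_with_index do |entry, i|
    offset, _generation, kind = entry.split
    offsets[first_id + i] = offset.to_i if kind == "n"
  end
  offsets
end

pdf = build_sample
start = xref_offsets(pdf)[1]
puts pdf.byteslice(start, 7)
```

Because the reader jumps directly to the recorded offset, it never looks at the comment text that precedes the object on the same line.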

But CombinePDF works a little differently: it reads the whole file from top to bottom, building a tree of objects as well as a list of objects, so that when you take a page out of the PDF, the fonts and resources are automatically attached to that page (no need to search for them in the X-Ref table).
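For comparison, a simplified sketch of a top-to-bottom reader and of how a swallowed object header turns into a reference that cannot be connected. The constants, method names and `SAMPLE` text are assumptions invented for this example; CombinePDF's real parser is considerably more involved.

```ruby
# Hypothetical sketch of a sequential reader; not CombinePDF's real code.
OBJECT    = /(\d+) (\d+) obj\b(.*?)endobj/m
REFERENCE = /(\d+) (\d+) R\b/

# Collects every "<id> <gen> obj ... endobj" block, keyed by [id, gen].
def scan_objects(text)
  text.scan(OBJECT).to_h { |id, gen, body| [[id.to_i, gen.to_i], body.strip] }
end

# Lists references ("<id> <gen> R") that point at no collected object.
def dangling_references(objects)
  objects.values
         .flat_map { |body| body.scan(REFERENCE) }
         .map { |id, gen| [id.to_i, gen.to_i] }
         .uniq
         .reject { |key| objects.key?(key) }
end

# Assumed sample: the header of object 8 sits at the end of a comment line.
SAMPLE = <<~PDF
  1 0 obj
  << /Type /Page /Contents 8 0 R >>
  endobj
  % stray comment8 0 obj
  << /Length 0 >>
  endobj
PDF

without_comments = SAMPLE.gsub(/%[^\r\n]*/, "")
p dangling_references(scan_objects(without_comments))
```

Once the comment is dropped, object 8 is never registered, so the page's reference to it has nothing to connect to, which resembles the "couldn't connect a reference" warnings in the report.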

When I saw PaperPort's PDF, I thought about rewriting the parser at some point so that it follows the X-Ref table the way viewers do (I believe I can still build a tree this way), but it takes more work and I'm not sure how the performance might be affected... Maybe another time ;-)

boazsegev added the bug label Nov 18, 2015

boazsegev closed this as completed Nov 19, 2015

boazsegev pushed a commit that referenced this issue Nov 19, 2015

improve the comment review (issue #36)

ad714c1

boazsegev changed the title ~~couldn't connect a reference!!!~~ parsing of PaperPort PDFs fails Nov 19, 2015
