Adobe Reader cannot open files created with pdfPig, and when opening it on chrome, a transparent png is corrupted #395

Closed
mjolivet-lucca opened this issue Dec 3, 2021 · 11 comments

Comments

@mjolivet-lucca
Contributor

mjolivet-lucca commented Dec 3, 2021

Hello,

I use PdfPig to open multi-page PDFs, split them, and rebuild them.

Here is a sample of the code:

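    // Open the source PDF; dispose the stream and document when done
    using var stream = File.OpenRead(pdfPath);
    using var sourceDocument = PdfDocument.Open(stream);
    var documentBuilder = new PdfDocumentBuilder();
    // Copy page 1 of the source into the new document (page numbers are 1-based)
    documentBuilder.AddPage(sourceDocument, 1);
    byte[] newDoc = documentBuilder.Build();
    // then create a FileStream from the newDoc byte[]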

It usually works fine, but some files cannot be opened in Adobe Reader after the split operation:

[image: acrobat_error]

Translation:

"There was an error processing a page. There was a problem reading this document (110)."

When I open the split file in Chrome, it works, with one issue: the PDF contains a transparent PNG, and in the split file the image appears corrupted:

Original image:

[image: edisource_ok]

Image in the split file:

[image: edisource_error]

I can't send you the file since it contains sensitive data (Edit: I have since managed to create another file that reproduces the behaviour; it can be found at the end), but I may have a clue.
I saw this issue in another PDF library:
parallax/jsPDF#862

They had something similar happening, and it seems the cause was something missing in the handling of PNG predictors:

jsPDF fix:

    getPredictorFromCompression = function getPredictorFromCompression(compression) {
        var predictor;
        switch (compression) {
            case jsPDFAPI.image_compression.FAST:
                predictor = 11;
                break;
            case jsPDFAPI.image_compression.MEDIUM:
                predictor = 13;
                break;
            case jsPDFAPI.image_compression.SLOW:
                predictor = 14;
                break;
            default:
                predictor = 12;
                break;
        }
        return predictor;
    },

At first I could not recreate another PDF with the same issue. Update: a file that reproduces the issue is attached at the end of this comment.

Here is some information about the file, as reported by PDF Architect:

  • Application: PDFCreator Free 4.2.0
  • PDF created by: GPL Ghostscript 9.52
  • PDF version: 1.4

If I manage to create a PDF without sensitive information that still shows the issue, I'll upload it.

If you need any other information, feel free to ask me.

EDIT

I managed to create a file that has the same issue:

TEST_SOURCE_FILE.pdf

I used the same PNG (copied from the source PDF), pasted it into a Word document, and used PDFCreator to create this new PDF.

When I run it through the code above, the output file is corrupted:

CORRUPTED_FILE.pdf

I also saw a thread about PNG profiles causing issues, but I don't know whether that applies here:
https://legacy.imagemagick.org/discourse-server/viewtopic.php?t=32930

@mjolivet-lucca
Contributor Author

I tried something new: extracting one page from the corrupted file to see what happens. This throws an exception.

Code used:

    var pdfPath = @"C:\TEST_SOURCE_FILE.pdf";
    var stream = File.OpenRead(pdfPath);
    var sourceDocument = PdfDocument.Open(stream);
    var documentBuilder = new PdfDocumentBuilder();
    documentBuilder.AddPage(sourceDocument, 1);
    var newDoc = documentBuilder.Build();
    var targetDocument = PdfDocument.Open(newDoc);
    var targetDocumentBuilder = new PdfDocumentBuilder();
    targetDocumentBuilder.AddPage(targetDocument, 1);

Exception:

UglyToad.PdfPig.Core.PdfDocumentFormatException: Could not locate object with reference: 3 0 despite a full document search.
   at UglyToad.PdfPig.Tokenization.Scanner.PdfTokenScanner.BruteForceFileToFindReference(IndirectReference reference) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Tokenization\Scanner\PdfTokenScanner.cs:line 748
   at UglyToad.PdfPig.Tokenization.Scanner.PdfTokenScanner.Get(IndirectReference reference) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Tokenization\Scanner\PdfTokenScanner.cs:line 709
   at UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get[T](IndirectReference reference, IPdfTokenScanner scanner) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Parser\Parts\DirectObjectFinder.cs:line 48
   at UglyToad.PdfPig.Writer.WriterUtil.CopyToken(IPdfStreamWriter writer, IToken tokenToCopy, IPdfTokenScanner tokenScanner, IDictionary`2 referencesFromDocument, Dictionary`2 callstack) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Writer\WriterUtil.cs:line 129
   at UglyToad.PdfPig.Writer.WriterUtil.CopyToken(IPdfStreamWriter writer, IToken tokenToCopy, IPdfTokenScanner tokenScanner, IDictionary`2 referencesFromDocument, Dictionary`2 callstack) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Writer\WriterUtil.cs:line 92
   at UglyToad.PdfPig.Writer.PdfDocumentBuilder.<AddPage>g__CopyResourceDict|34_0(IToken token, Dictionary`2 destinationDict, <>c__DisplayClass34_0& ) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Writer\PdfDocumentBuilder.cs:line 473
   at UglyToad.PdfPig.Writer.PdfDocumentBuilder.AddPage(PdfDocument document, Int32 pageNumber) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Writer\PdfDocumentBuilder.cs:line 394

The exception is thrown on the last line of the code in this comment.

@draperk

draperk commented Dec 14, 2021

This exactly describes the problem I am having as well.
Thanks!

@plaisted
Contributor

I'll try to take a look at this soon. Hopefully it won't be too hard to determine the issue with the example PDFs provided.

@mjolivet-lucca
Contributor Author

Thanks for the heads-up, @plaisted! 😄

@plaisted
Contributor

It appears this is related to string encoding and the way PdfPig parses strings. PDFs can store raw byte data in a string (colorspace data in this case), and I think PdfPig is not handling that raw byte data properly. PdfPig internally converts the byte data to a C# string and then uses the C# string when serializing it again. At a minimum this changes the encoding, but I suspect it may be corrupting the data as well.

I don't think there's a fix without modifying the way PdfPig handles strings, which would take some thought.
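
To illustrate the kind of corruption described above, here is a minimal sketch showing that raw bytes do not always survive a bytes -> C# string -> bytes round trip. It is not PdfPig code: the class name `RoundTripDemo` and its methods are hypothetical, and the choice of UTF-8 as the lossy encoding is an assumption made for illustration, not a statement about which encoding PdfPig used.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of PdfPig.
public static class RoundTripDemo
{
    // Decode raw bytes into a C# string, then encode the string back to bytes.
    public static byte[] RoundTrip(byte[] raw, Encoding encoding)
    {
        string asText = encoding.GetString(raw);
        return encoding.GetBytes(asText);
    }

    // True only if the bytes come back unchanged after the round trip.
    public static bool SurvivesRoundTrip(byte[] raw, Encoding encoding)
    {
        return raw.SequenceEqual(RoundTrip(raw, encoding));
    }
}
```

With arbitrary binary data such as `{ 0x00, 0x7F, 0x80, 0xC3, 0xFF }`, `SurvivesRoundTrip(raw, Encoding.UTF8)` returns false, because invalid UTF-8 sequences are replaced with U+FFFD during decoding. The same call with `Encoding.GetEncoding("iso-8859-1")` returns true, since that encoding maps every byte to exactly one character. Keeping such data as a `byte[]` avoids the question entirely.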

@mjolivet-lucca
Contributor Author

Is there a way to detect when raw byte data is stored in the PDF and, in that case, store it as a byte[] instead of a string?

@plaisted
Contributor

Conceptually yes, but string handling in PDFs is pretty complicated and I don't want to break existing functionality. Additionally, I don't think PdfPig handles string encoding correctly overall: it treats non-Unicode strings as ISO-8859 encoded. Per the PDF spec they actually use PDFDocEncoding, which is similar but not identical to ISO-8859. It looks like there is some code in there where a PDFDocEncoding implementation was started but never completed.

I'll try to look a little more later this week. I may be able to fix the current issue and leave the ISO-8859 inconsistency as is for now.
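
As a rough illustration of the PDFDocEncoding difference mentioned above, the sketch below decodes bytes using a small override table on top of a Latin-1 style byte-to-character mapping. Assumptions: "ISO-8859" is taken to mean ISO-8859-1, the class `PdfDocEncodingDemo` is hypothetical and not PdfPig's API, and the table lists only three of the code points where the two encodings differ, so it is deliberately incomplete.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch, not PdfPig's implementation.
public static class PdfDocEncodingDemo
{
    // Partial table: a few bytes that PDFDocEncoding maps differently
    // from ISO-8859-1. A full implementation would need the complete table
    // from the PDF specification.
    private static readonly Dictionary<byte, char> Overrides = new Dictionary<byte, char>
    {
        { 0x80, '\u2022' }, // bullet
        { 0x83, '\u2026' }, // horizontal ellipsis
        { 0xA0, '\u20AC' }, // euro sign
    };

    // Decode byte by byte: use the override if one exists, otherwise fall
    // back to the Latin-1 style mapping (byte value == code point).
    public static string DecodeSketch(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            sb.Append(Overrides.TryGetValue(b, out char c) ? c : (char)b);
        }
        return sb.ToString();
    }
}
```

For example, the byte `0xA0` decodes to a non-breaking space under ISO-8859-1 but to the euro sign under PDFDocEncoding, so text strings decoded with the wrong table can silently show the wrong characters.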

@plaisted
Contributor

I think there is a fix in #401 if you want to test it. I copied a page from TEST_SOURCE_FILE into a new PDF and it was no longer corrupt.

@mjolivet-lucca
Contributor Author

I just tested it and it works with all the files I used. Thanks a lot!

@mjolivet-lucca
Contributor Author

I don't have the rights to do it, but you can link PR #401 to this issue.

@mjolivet-lucca
Contributor Author

PR #401 has just been merged to master, so I am closing this issue.
