Adobe Reader cannot open files created with pdfPig, and when opening it on chrome, a transparent png is corrupted #395

mjolivet-lucca · 2021-12-03T14:21:26Z

Hello,

I use PdfPig to open multi-page PDFs, split them, and recreate them.

Here is a sample of the code:

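    // Requires: using System.IO; using UglyToad.PdfPig; using UglyToad.PdfPig.Writer;
    using var stream = File.OpenRead(pdfPath);
    using var sourceDocument = PdfDocument.Open(stream);
    var documentBuilder = new PdfDocumentBuilder();
    // Copy page 1 of the source document into the new document.
    documentBuilder.AddPage(sourceDocument, 1);
    byte[] newDoc = documentBuilder.Build();
    // then create fileStream from newDoc byte[]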

It usually works fine, but some of our files can no longer be opened with Adobe Reader after the split operation. Adobe Reader shows an error instead.

Translation of the error message:

"There was an error processing a page. There was a problem reading this document (110)."

When I open the same file in Chrome, it opens, but with one issue: the PDF contains a transparent PNG, and in the split file that image appears to be corrupted:

Original image

Split file image

I can't send you the file since it contains sensitive data (Edit: I have since managed to create another file that reproduces the behaviour; it can be found at the end), but I may have a clue.
I saw this issue in another PDF library:
parallax/jsPDF#862

They had a similar problem, and it seems the cause was something missing in the handling of the PNG predictors:

jsPDF fix

    // Excerpt from the jsPDF fix: maps a compression level to a PNG predictor value.
    var getPredictorFromCompression = function getPredictorFromCompression(compression) {
        var predictor;
        switch (compression) {
            case jsPDFAPI.image_compression.FAST:
                predictor = 11;
                break;
            case jsPDFAPI.image_compression.MEDIUM:
                predictor = 13;
                break;
            case jsPDFAPI.image_compression.SLOW:
                predictor = 14;
                break;
            default:
                predictor = 12;
                break;
        }
        return predictor;
    };

~~I did not manage to recreate another PDF with the same issue.~~ => There is a file that reproduces the issue at the end of this comment.

Here is some information about the file, as reported by PDF Architect:

Application : PDFCreator Free 4.2.0
PDF created by : GPL Ghostscript 9.52
PDF Version : 1.4

If I manage to create a PDF that has the issue but contains no sensitive information, I'll upload it.

If you need any other information, feel free to ask me.

EDIT

I managed to create a file that has the same issue:

TEST_SOURCE_FILE.pdf

I used the same PNG (copied from the source PDF), pasted it into a Word document, and used PDFCreator to create this new PDF.

When I run the code above on it, the output file is corrupted:

CORRUPTED_FILE.pdf

I also saw a thread about PNG profiles, which can cause issues too, but I don't know whether that applies here:
https://legacy.imagemagick.org/discourse-server/viewtopic.php?t=32930

mjolivet-lucca · 2021-12-06T13:28:11Z

I tried something new: extracting one page from the "corrupted file" to see what happens. This throws an exception.

Code used:

    var pdfPath = @"C:\TEST_SOURCE_FILE.pdf";
    var stream = File.OpenRead(pdfPath);
    var sourceDocument = PdfDocument.Open(stream);
    var documentBuilder = new PdfDocumentBuilder();
    documentBuilder.AddPage(sourceDocument, 1);
    var newDoc = documentBuilder.Build();
    var targetDocument = PdfDocument.Open(newDoc);
    var targetDocumentBuilder = new PdfDocumentBuilder();
    targetDocumentBuilder.AddPage(targetDocument, 1);

Exception:

UglyToad.PdfPig.Core.PdfDocumentFormatException: Could not locate object with reference: 3 0 despite a full document search.
   at UglyToad.PdfPig.Tokenization.Scanner.PdfTokenScanner.BruteForceFileToFindReference(IndirectReference reference) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Tokenization\Scanner\PdfTokenScanner.cs:line 748
   at UglyToad.PdfPig.Tokenization.Scanner.PdfTokenScanner.Get(IndirectReference reference) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Tokenization\Scanner\PdfTokenScanner.cs:line 709
   at UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get[T](IndirectReference reference, IPdfTokenScanner scanner) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Parser\Parts\DirectObjectFinder.cs:line 48
   at UglyToad.PdfPig.Writer.WriterUtil.CopyToken(IPdfStreamWriter writer, IToken tokenToCopy, IPdfTokenScanner tokenScanner, IDictionary`2 referencesFromDocument, Dictionary`2 callstack) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Writer\WriterUtil.cs:line 129
   at UglyToad.PdfPig.Writer.WriterUtil.CopyToken(IPdfStreamWriter writer, IToken tokenToCopy, IPdfTokenScanner tokenScanner, IDictionary`2 referencesFromDocument, Dictionary`2 callstack) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Writer\WriterUtil.cs:line 92
   at UglyToad.PdfPig.Writer.PdfDocumentBuilder.<AddPage>g__CopyResourceDict|34_0(IToken token, Dictionary`2 destinationDict, <>c__DisplayClass34_0& ) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Writer\PdfDocumentBuilder.cs:line 473
   at UglyToad.PdfPig.Writer.PdfDocumentBuilder.AddPage(PdfDocument document, Int32 pageNumber) in C:\Sites\PdfPig\src\UglyToad.PdfPig\Writer\PdfDocumentBuilder.cs:line 394

The exception is thrown on the last line of code written in this comment.

draperk · 2021-12-14T16:07:58Z

This exactly describes the problem I am having as well.
Thanks!

plaisted · 2021-12-20T18:11:41Z

I'll try to take a look at this soon. Hopefully it won't be too hard to determine the issue with the example PDFs provided.

mjolivet-lucca · 2021-12-20T19:13:30Z

Thanks for the heads up @plaisted ! 😄

plaisted · 2021-12-20T21:35:55Z

It appears this is related to string encoding and the way PdfPig parses strings. PDFs can store raw byte data in a string (colorspace data in this case), and I think PdfPig is not handling that raw byte data properly. PdfPig internally converts the byte data to a C# string and then uses that C# string when serializing it again. At a minimum this changes the encoding, but I suspect it may be corrupting the data as well.

I don't think there's a fix without modifying the way PdfPig handles strings, which would take some thought.
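
To illustrate the general failure mode described above, here is a minimal sketch of what can happen when raw bytes are decoded to a C# string and encoded again. It is not PdfPig's code: the class `RawByteRoundTrip` and its methods are hypothetical names used only for this example, and the encodings are chosen for illustration rather than taken from the library.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, for illustration only (not part of PdfPig).
public static class RawByteRoundTrip
{
    // Decode the bytes to a C# string with the given encoding, then encode the string back to bytes.
    public static byte[] ThroughString(byte[] raw, Encoding encoding)
    {
        return encoding.GetBytes(encoding.GetString(raw));
    }

    // True only when the round trip returns exactly the original bytes.
    public static bool Survives(byte[] raw, Encoding encoding)
    {
        return raw.SequenceEqual(ThroughString(raw, encoding));
    }
}
```

For the bytes `{ 0x00, 0x41, 0x80, 0xFF }`, `RawByteRoundTrip.Survives(raw, Encoding.UTF8)` should be false, because 0x80 and 0xFF are not valid UTF-8 on their own and get replaced during decoding. With `Encoding.GetEncoding("iso-8859-1")` the same call should be true, since that encoding maps every byte to exactly one character. Binary data embedded in a PDF string is only safe if every step between parsing and writing preserves the bytes in this way.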

mjolivet-lucca · 2021-12-21T09:47:55Z

Is there a way to detect when raw byte data is stored in the PDF and, in that case, keep it as a byte[] instead of a string?

plaisted · 2021-12-21T19:46:22Z

Conceptually yes, but string handling in PDFs is pretty complicated and I don't want to break existing functionality. Additionally, I don't think PdfPig is handling string encoding correctly overall: it treats non-Unicode strings as ISO-8859 encoded. By the PDF spec they actually use PDFDocEncoding, which is similar but not identical to ISO-8859. It looks like there may be some code in there where an implementation of PDFDocEncoding was started but never completed.

I'll try to look a little more later this week. I may be able to fix the current issue and leave the ISO-8859 inconsistency as is for now.
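
As a rough illustration of the difference mentioned above, the sketch below decodes a few bytes both ways. It assumes "ISO-8859" means ISO-8859-1, and the class `PdfDocEncodingSketch` is a hypothetical name, not a PdfPig type. The override table is deliberately partial: it lists only a handful of code points where PDFDocEncoding differs, and every other byte falls back to its ISO-8859-1 value, which is not correct for the whole encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical, partial sketch for illustration only (not part of PdfPig).
public static class PdfDocEncodingSketch
{
    // A few bytes that PDFDocEncoding maps differently from ISO-8859-1.
    // This table is intentionally incomplete.
    private static readonly Dictionary<byte, char> Overrides = new Dictionary<byte, char>
    {
        { 0x80, '\u2022' }, // bullet
        { 0x81, '\u2020' }, // dagger
        { 0x82, '\u2021' }, // double dagger
        { 0x83, '\u2026' }, // ellipsis
        { 0x84, '\u2014' }, // em dash
        { 0x85, '\u2013' }, // en dash
        { 0xA0, '\u20AC' }, // euro sign
    };

    // Decode using the partial override table; bytes without an override
    // fall back to their ISO-8859-1 code point (a simplification).
    public static string Decode(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length);
        foreach (var b in bytes)
        {
            sb.Append(Overrides.TryGetValue(b, out var c) ? c : (char)b);
        }
        return sb.ToString();
    }

    // Decode the same bytes as ISO-8859-1 for comparison.
    public static string DecodeLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }
}
```

With the bytes `{ 0x41, 0x80 }`, `Decode` gives "A" followed by a bullet (U+2022), while `DecodeLatin1` gives "A" followed by the control character U+0080. Text strings that use these bytes would therefore be read differently under the two encodings.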

plaisted · 2021-12-22T02:23:58Z

I think there is a fix in #401 if you want to test it. I copied a page from TEST_SOURCE_FILE to a new PDF and the result was no longer corrupt.

mjolivet-lucca · 2021-12-22T09:53:47Z

I just tested it and it works with all the files I used. Thanks a lot!

mjolivet-lucca · 2021-12-22T10:14:15Z

I don't have the rights to do it, but you can link PR #401 to this issue.

mjolivet-lucca · 2021-12-30T13:15:10Z

PR #401 has just been merged to master, so I am closing this issue.

plaisted mentioned this issue Dec 22, 2021

adjust string serialization to handle raw byte data properly #401

Merged

mjolivet-lucca closed this as completed Dec 30, 2021
