Error while process PDF #781

Open
szanati opened this issue Mar 17, 2016 · 10 comments

Comments

Member

szanati commented Mar 17, 2016

I received the following error on a package with a PDF file:

error while processing 1(sip-files/09-06-2013.pdf): bad status
http://transform.fda.fcla.edu/transform/pdf_norm?location=file:/var/daitss/data/work/ETAL9VQ5Q_V6OA41/files/original/1/data: 500
/opt/pdfapilot/pdfaPilot /var/daitss/data/work/ETAL9VQ5Q_V6OA41/files/original/1/data --fontfolder=/usr/share/fonts/msttcorefonts/ --onlypdfa --substitute --outputfile=/var/daitss/tmp/d20160317-22104-1k0gniu/data/transformed.pdf --report=XML,IFNOPDFA,PATH=/var/daitss/tmp/d20160317-22104-1k0gniu/pdfapilot_report.xml failed, output: Input /var/daitss/data/work/ETAL9VQ5Q_V6OA41/files/original/1/data
Pages 32
PDFA Regular
Progress 100 %
Summary Corrections 0
Summary Errors 0
Summary Warnings 0
Summary Infos 0
Duration 00:05
Error 1010 The PDF file may be corrupt (unable to open PDF file).
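
Note that every summary counter in this output is zero, yet the run still ends with a fatal `Error 1010` line. A minimal sketch of telling the two apart when reading the console output, assuming the line layout shown above (keyword, whitespace, value); `parse_pdfapilot_output` is a hypothetical helper, not part of DAITSS or pdfaPilot:

```python
import re

def parse_pdfapilot_output(text):
    """Split pdfaPilot console output into summary counters and fatal errors.

    Assumes the line layout seen in this issue; this is not an official format.
    """
    summary = {}
    fatal = []
    for line in text.splitlines():
        line = line.strip()
        # "Summary Errors 0" style counters
        m = re.match(r"^Summary\s+(\w+)\s+(\d+)$", line)
        if m:
            summary[m.group(1)] = int(m.group(2))
            continue
        # "Error 1010 <message>" is the fatal line; per-check lines that
        # start with "Errors <count>" do not match because of the \s+.
        m = re.match(r"^Error\s+(\d+)\s+(.*)$", line)
        if m:
            fatal.append((int(m.group(1)), m.group(2)))
    return summary, fatal

sample = """Pages 32
PDFA Regular
Progress 100 %
Summary Corrections 0
Summary Errors 0
Summary Warnings 0
Summary Infos 0
Duration 00:05
Error 1010 The PDF file may be corrupt (unable to open PDF file)."""

summary, fatal = parse_pdfapilot_output(sample)
print(summary)
print(fatal)
```

A caller that only looked at `summary["Errors"]` would treat this file as clean, so the fatal list has to be checked separately.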

Member Author

szanati commented Mar 17, 2016

I ran the PDF through the GUI description service, which reported it as well-formed and valid with a successful event outcome. I tried the package on Ripple and it received the same error. On Ripple I also edited daitss-config.yml: under transform_service I changed "skip_undefined" from false to true, and the package then went through the PDF steps. It did not archive because of another issue on Ripple involving squid, which will be handled next week.

Member Author

szanati commented Mar 17, 2016

The package on production is in the stashspace named Github_781, in the directory /var/daitss/data/stash/Github_781/ETAL9VQ5Q_V6OA41. On Ripple it is in the workspace /var/daitss/data/work/ENF28E4YI_X7LTMP, and the original package is in /var/daitss/ops/stephen/AA00038892_00002.

Member

cchou commented Aug 31, 2016

This package fails during PDF to PDF/A conversion with pdfaPilot. We would need to submit an issue ticket to the pdfaPilot vendor.

Alternatively, you can try to get this package ingested by turning off pdfa normalization.

Member

cchou commented Sep 30, 2016

Member

lydiam commented Apr 25, 2017

Email from Carol:

Response from callas. It looks like you can fix those PDFs with pdfaPilot, though I am not sure how you want to pursue it, since it seems this means the SIPs will be changed.

-Carol
---------- Forwarded message ----------
From: callas software support 3rdlevelsupport@callassoftware.com
Date: Fri, Apr 21, 2017 at 8:21 AM
Subject: Re: Problems with many PDF files using PDFaPilot
To: "cchoufl@gmail.com" cchoufl@gmail.com

Hello Carol,

as David has already mentioned the cases have underlying issues, however, in both cases the PDF structure seems to be corrupt. Acrobat is still able to display the file, however the more thorough analysis with the PDF/A validator/converter fails. We will further investigate to make sure that this assumption is correct.

There is, however, already a known workaround for that problem: Both files can actually be converted when they are first converted to PostScript and back to PDF. You can do so by using ./pdfaPilot --redistill on command line.

Would that work for you as a - at least temporary - solution?

Best regards,
Dietrich
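
The workaround described in this reply could be wired in as a fallback: run the normal conversion first and retry with `--redistill` only if it fails. A minimal sketch, assuming the flags from the command quoted in this issue plus the `--redistill` switch named above. How `--redistill` combines with the other options, and that a zero exit status means success, are assumptions; `build_command`, `convert_with_fallback`, and the `run` callable are hypothetical names:

```python
import subprocess

PDFAPILOT = "/opt/pdfapilot/pdfaPilot"  # install path as used in this issue

def build_command(src, dst, report, redistill=False):
    """Build the pdfaPilot argument list used by the transform step."""
    cmd = [
        PDFAPILOT, src,
        "--fontfolder=/usr/share/fonts/msttcorefonts/",
        "--onlypdfa", "--substitute",
        f"--outputfile={dst}",
        f"--report=XML,IFNOPDFA,PATH={report}",
    ]
    if redistill:
        # Assumption: the switch can simply be appended to the usual flags.
        cmd.append("--redistill")
    return cmd

def convert_with_fallback(src, dst, report, run=None):
    """Return (succeeded, used_redistill).

    `run` takes an argument list and returns an exit status; it defaults to
    subprocess.run. Assumption: exit status 0 means the conversion succeeded.
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd).returncode
    if run(build_command(src, dst, report)) == 0:
        return True, False
    ok = run(build_command(src, dst, report, redistill=True)) == 0
    return ok, True
```

Returning `used_redistill` keeps a record of which files went through the PostScript round trip, which matters here because that path rewrites the PDF rather than only normalizing it.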

--------------- Original Message ---------------
From: callas software support team [support@callassoftware.com]
Sent: 19.04.2017 21:15
To: cchoufl@gmail.com; d.seggern@callassoftware.com
Subject: Re: Problems with many PDF files using PDFaPilot

Hi Carol,

I've reproduced the problem for both files. The underlying cause appears to be different for both files, they will be looked at by development to determine what is causing this and whether anything can be done about it.

I'll keep you posted!
David.

--------------- Original Message ---------------
From: carol chou [cchoufl@gmail.com]
Sent: 19/04/2017 7:50
To: d.seggern@callassoftware.com
Subject: Re: Problems with many PDF files using PDFaPilot

Hi Dietrich,

Our sys admin has installed the new version of pdfaPilot. Some of the problem files can now be converted, but the following two still give errors during the conversion:

http://www.fcla.edu/daitss-test/files/00004-04-2009.pdf

Progress 100 %
Errors 16660 Device process color used but no PDF/A OutputIntent
Errors 114 Font not embedded (and text rendering mode not 3)
Errors 24 Annotation has no Flags entry
Errors 24 Annotation not set to print
Errors 6280 CharSet missing for Type 1 font
Summary Corrections 72
Summary Errors 23102
Summary Warnings 0
Summary Infos 0
Duration 00:54
Error 1000 Unknown error (unknown exception)

http://www.fcla.edu/daitss-test/files/09-06-2013.pdf
[cchou@ripple GH_781]$ /opt/pdfapilot-6.2.256/pdfaPilot 09-06-2013.pdf --fontfolder=/usr/share/fonts/msttcorefonts/ --onlypdfa --substitute --outputfile=09-06-2013-o.pdf --report=XML,IFNOPDFA,PATH=report.xml

Serialization This pdfaPilot instance is running with a Coldspare or Developer license and may only be used in production as a temporary replacement for a full license on another computer.
Input /home/cchou/pdfaError/GH_781/09-06-2013.pdf
Pages 32
PDFA Regular
Progress 100 %
Summary Corrections 0
Summary Errors 0
Summary Warnings 0
Summary Infos 0
Duration 00:01
Error 1010 The PDF file may be corrupt (unable to open PDF file).

Here is the pdfapilot version the sys admin has installed for us.
callas pdfaPilot CLI 6.2.256 (x64)

2000-2016 callas software gmbh

Can you take a look again and provide us some solutions?

Thanks,

-Carol

On Mon, Oct 10, 2016 at 5:09 AM, Dietrich von Seggern <d.seggern@callassoftware.com> wrote:
Hi Carol,

what version of pdfaPilot are you using?

I was not able to reproduce any issues with the current release (callas pdfaPilot CLI 6.0.245 (x64)) on a Mac. The reason may be either the font situation or the version.

Best regards,
Dietrich

--
Dietrich von Seggern | Managing Director
callas software GmbH | Schönhauser Allee 6/7 | 10119 Berlin | Germany
Tel +49.30.44390310 | Fax +49.30.4416402 | www.callassoftware.com
Amtsgericht Charlottenburg, HRB 59615 | Geschäftsführung: Olaf Drümmer, Ulrich Frotscher, Dietrich von Seggern

Meet us at:

callas VIP Event, Berlin: November 7 - 8 (+ 9)
https://en.xing-events.com/vip2016.html

PDF Day Australia, Sydney: November 25
https://en.xing-events.com/PDFday-Australia.html

On 9 Oct 2016, at 03:35, carol chou <cchoufl@gmail.com> wrote:

Hi Mr. Seggern,

I am working with Florida Virtual Campus, which has been using pdfaPilot to convert the PDFs in their archive to PDF/A. Recently we have run into pdfaPilot errors with some of the PDFs in the archive. Can you please see if this is something that pdfaPilot can fix? The PDFs can be downloaded at

http://www.fcla.edu/daitss-test/files/SCV20100314.pdf

http://www.fcla.edu/daitss-test/files/00004-04-2009.pdf

http://www.fcla.edu/daitss-test/files/09-06-2013.pdf

FYI, I am enclosing the pdfaPilot error at the end of my email too.

Thanks,

Carol

--
David van Driessche
Mail: david.van.driessche@fourpees.com
Cell: +32 473 89 44 46
Skype: david-van-driessche

Four Pees
Nijverheidskaai 14
9040 Sint-Amandsberg, Belgium

www.fourpees.com

Member

lydiam commented Apr 25, 2017

Do we still have the original SIPs? We may need to fix the PDFs in the original SIPs (in consultation with their owners) and resubmit and abort the stashed SIPs with corrupt files. We'll need to discuss this.

@cchou cchou mentioned this issue Apr 27, 2017
Member

lydiam commented May 2, 2017

This is worth emailing UF about, since they seem to have done multiple submissions of 3 different package names. They may need to authorize that we 'abort' some of the duplicates, and then we'll have fewer problem packages to deal with. Determine if we still have the SIPs. If we do, we should experiment with correcting one of the problem PDFs with pdfaPilot by converting it to PostScript and back to PDF (the --redistill workaround callas suggested). Based on the results of this investigation, decide how to proceed.

Member

lydiam commented May 30, 2017

I did some validation of the PDFs remaining in the DAITSS Github_781 stashspace using description.fcla.edu. The results:

  • /var/daitss/data/stash/Github_781/E3EAYISFT_87ROR0/files/original/1 is a well-formed and valid PDF file. (The 3-Heights PDF online validator tool confirmed this)

  • /var/daitss/data/stash/Github_781/E5PJXWAWA_MUHKTZ/files/original/1 is not well-formed; the anomaly is "Invalid object definition". What does this mean exactly? I looked this up in the JHOVE error messages (http://wiki.opf-labs.org/display/Documents/JHOVE+issues+and+error+messages#JHOVEissuesanderrormessages-%22Invalidobjectdefinition%22), which don't really give details. I tried the 3-Heights PDF validator online tool (https://www.pdf-online.com) and got the following errors:

    File data
    Compliance pdf1.2
    Result Document does not conform to PDF/A.
    Details Validating file "data" for conformance level pdf1.2
    The 'xref' keyword was not found or the xref table is malformed.
    The file trailer dictionary is missing or invalid.
    Error in Flate stream: data error.
    The operator has an invalid number of operands.
    The "Length" key of the stream object is wrong.
    The "Length" key of the stream object is wrong.
    The operator has an invalid number of operands.
    A path start operator was missing.
    The content stream contains an invalid operator.
    The "Length" key of the stream object is wrong.
    The "Length" key of the stream object is wrong.
    The operator has an invalid number of operands.
    Error in Flate stream: data error.
    An end text operator is missing.
    The content stream contains an invalid operator.
    The "Length" key of the stream object is wrong.
    Error in Flate stream: data error.
    The operator has an invalid number of operands.
    The document does not conform to the requested standard.
    The file format (header, trailer, objects, xref, streams) is corrupted.
    Done.

  • /var/daitss/data/stash/Github_781/EAKU060NA_VE67MN/files/original/1: the description service declares it well-formed and valid. The 3-Heights validator, however, gives the following error messages:

    File data
    Compliance pdf1.5
    Result Document does not conform to PDF/A.
    Details
    Validating file "data" for conformance level pdf1.5
    Error in Flate stream: data error.
    Error in Flate stream: stream error.
    The operator has an invalid number of operands.
    An end text operator is missing.
    The document does not conform to the requested standard.
    The document's meta data is either missing or inconsistent or corrupt.
    Done.

  • /var/daitss/data/stash/Github_781/ER9GZB3KZ_D6YG8G/files/original/1: the description service declares it well-formed and valid. The 3-Heights validator indicates that "Document validated successfully".
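
Several of the 3-Heights messages for E5PJXWAWA_MUHKTZ point at the file skeleton (xref table, trailer) rather than at page content. A crude byte-level sketch of what those parts look like, for a quick first look at a suspect file. It is not a validator: it cannot detect wrong stream "Length" values or Flate errors, and `quick_structure_check` is a hypothetical helper name:

```python
def quick_structure_check(data: bytes):
    """Return a list of skeleton problems found by a crude byte-level scan.

    Looks only for the markers a classic PDF file is built around: the
    %PDF- header, a trailer dictionary (or an /XRef stream in newer files),
    and the startxref pointer and %%EOF marker near the end of the file.
    """
    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    tail = data[-2048:]  # the end-of-file markers sit close to the end
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker")
    if b"startxref" not in tail:
        problems.append("missing startxref")
    if b"trailer" not in data and b"/XRef" not in data:
        problems.append("no trailer dictionary or xref stream")
    return problems

good = (b"%PDF-1.2\n1 0 obj\n<<>>\nendobj\nxref\n0 1\n"
        b"0000000000 65535 f \ntrailer\n<< /Size 1 >>\nstartxref\n30\n%%EOF\n")
truncated = b"%PDF-1.2\n1 0 obj\n<<>>\nendobj\n"

print(quick_structure_check(good))
print(quick_structure_check(truncated))
```

A file that passes this scan can still be corrupt, as the stream-level errors reported for EAKU060NA_VE67MN show.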

So it appears that the valid and well-formed PDFs may archive if the PDF/A Pilot is turned off. UF may need to recreate the other two.
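
The reasoning behind that conclusion can be written down as a small triage over the results above: a package goes on the "ingest with normalization off" list only if both validators accept its PDF. A minimal sketch using the four stashed packages; the rule itself is my reading of the results, and `triage` is a hypothetical helper:

```python
# Validator outcomes as reported in this comment (True = accepted).
results = {
    "E3EAYISFT_87ROR0": {"description_service": True, "three_heights": True},
    "E5PJXWAWA_MUHKTZ": {"description_service": False, "three_heights": False},
    "EAKU060NA_VE67MN": {"description_service": True, "three_heights": False},
    "ER9GZB3KZ_D6YG8G": {"description_service": True, "three_heights": True},
}

def triage(results):
    """Split packages into (ingest candidates, packages to recreate)."""
    ingest, recreate = [], []
    for package_id, checks in sorted(results.items()):
        # Require agreement from every validator before trusting the file.
        if all(checks.values()):
            ingest.append(package_id)
        else:
            recreate.append(package_id)
    return ingest, recreate

ingest, recreate = triage(results)
print("ingest with PDF/A normalization off:", ingest)
print("ask UF to recreate:", recreate)
```

Requiring both validators to agree is the stricter choice: EAKU060NA_VE67MN passes the description service but lands on the recreate list because 3-Heights rejects it.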

Carol - can you confirm my conclusions?

Member

lydiam commented May 30, 2017

I attempted to obtain details about the validity of the 4 remaining PDFs from Adobe Acrobat 9's Preflight feature but didn't have much success.

Member Author

szanati commented Jun 13, 2017

The original packages for this issue: AA00038892_00002, AA00047064_00008, and UF00098620_00421 are in: /var/daitss/ops/exceptions/tickets/GitHub_781 on darchive.
