PDF version numbers based on deprecated mechanism #114
The other week a colleague sent me an unusual PDF that starts with the following header bytes:
Needless to say, there is no such thing as "PDF 1.8"; closer inspection showed that, apart from the erroneous version number in the header, it was just an ordinary PDF 1.7. I threw this file at the latest version of DROID; as it turns out, DROID fails to identify it at all: it won't even say the file is a PDF.
As a test I changed the header line in a hex editor to this:
After this change the file was correctly identified.
I also ran the faulty file through Unix File and Apache Tika. Both tools correctly identified it as
A glance at the PRONOM signature for PDF 1.7 (link here: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1016&strPageToDisplay=signatures) shows that PRONOM/DROID uses the header to identify the PDF version. However, use of the header for defining the version has been deprecated since PDF 1.4! See e.g. the spec of PDF 1.7 at the link below:
So as per the spec, the version number in the header does not necessarily correspond to the actual version. To reliably establish the version number of a PDF, the Version entry in the document catalog (which is located via the Root entry of the trailer) should be used, if present. This means that the way PRONOM/DROID currently identifies specific PDF versions gives no guarantee whatever of returning the actual version!
Don't see an easy solution for this, since to read that version info one needs to follow the trailer to the catalog, which means completely parsing the PDF, and I think that is way beyond what a tool like DROID is (or should be) capable of.
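To make the mismatch concrete, here is a rough Python sketch that compares the header version with any `/Version` entry found by scanning the raw bytes. It is only a heuristic under stated assumptions, not a proper parser: it does not follow the trailer to the catalog, it misses catalogs stored in compressed object streams, and it may match a `/Version` key in an unrelated dictionary. The function name `guess_pdf_version` and the 1024-byte header window are illustrative choices. The "later version wins" rule reflects my reading of the PDF 1.7 spec.

```python
import re

# Header such as "%PDF-1.4"; only the first 1024 bytes are searched (assumption).
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")
# Catalog-style override such as "/Version /1.6" or "/Version/1.6".
CATALOG_RE = re.compile(rb"/Version\s*/(\d)\.(\d)")


def _fmt(version):
    """Format a (major, minor) tuple as 'major.minor', or pass None through."""
    return None if version is None else f"{version[0]}.{version[1]}"


def guess_pdf_version(data: bytes):
    """Return (header, catalog, effective) version strings; each may be None."""
    m = HEADER_RE.search(data[:1024])
    header = (int(m.group(1)), int(m.group(2))) if m else None

    # Raw byte scan: heuristic only, see the caveats in the text above.
    found = [(int(a), int(b)) for a, b in CATALOG_RE.findall(data)]
    catalog = max(found) if found else None

    # The later of the two values is taken as the effective version.
    known = [v for v in (header, catalog) if v is not None]
    effective = max(known) if known else None
    return _fmt(header), _fmt(catalog), _fmt(effective)
```

For a file whose header says 1.4 but whose catalog carries `/Version /1.6`, this reports the two values separately, which is exactly the information a header-only signature cannot provide.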
As for the faulty "PDF 1.8" file: even though the version number in the header is beyond the range that is allowed by the PDF spec, it's still a bit worrying that it isn't even detected as PDF at all! A possible solution would be to define a generic PDF entry + corresponding signature, where the first byte sequence omits the character that is used for the version number. E.g:
This would then need to be given lower priority than the more specific PDF PUIDs (for what they're worth, see above comments).
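The priority idea can be sketched in a few lines of Python. This is not how DROID or PRONOM are implemented: the byte sequences below are simplified stand-ins for the real signatures, the names `SPECIFIC`, `GENERIC` and `identify` are hypothetical, and the sketch assumes the header sits at the very start of the file.

```python
# Version-specific entries (illustrative, not the actual PRONOM byte sequences).
SPECIFIC = {f"1.{minor}": f"%PDF-1.{minor}".encode("ascii") for minor in range(8)}

# Hypothetical generic entry: same prefix with the version digit omitted.
GENERIC = b"%PDF-1."


def identify(data: bytes) -> str:
    """Try the version-specific signatures first, then the generic fallback."""
    for version, magic in SPECIFIC.items():
        if data.startswith(magic):
            return f"PDF {version}"
    # Lower priority: only reached when no specific signature matched.
    if data.startswith(GENERIC):
        return "PDF (version unknown)"
    return "unknown"
```

With this ordering, a file starting with `%PDF-1.7` is still reported as PDF 1.7, while one starting with `%PDF-1.8` falls through to the generic entry instead of going unidentified.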
As I cannot share the original PDF, I created a file that replicates the problem; see the attachment:
The header isn't deprecated. In fact, it's required. The spec allows the
So we still have the problem of identifying the version number reliably
The spec enumerates the values "%PDF-1.0" through "%PDF-1.7" as the
I just checked the JHOVE source code, and it looks just for "%PDF-1.".
On 9/26/16 6:04 AM, Johan van der Knijff wrote:
Gary McGath, Freelance Writer and Software Developer
Perhaps "deprecated" isn't the correct word here (as it's still required), but the fact remains that on its own the value in the header cannot be relied upon to reflect the true version of a PDF.
The most obvious case I can think of is that of PDFs that were incrementally updated. E.g. it is possible that a PDF started its life as PDF 1.5, and was then updated in a more recent version of Acrobat to 1.6. The addition of incremental updates was also the reason for the change starting with PDF 1.4. There's a pretty good explanation of incremental updates here:
I suppose such files are pretty rare in most archive/library settings, but I've never seen any data on this. A more serious side effect might be that some software vendors may simply not bother to update the version that is written to the header to the actual version, since the spec says anything goes as long as it is in the 0-7 range ...
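One cheap way to get a first impression of how common incrementally updated files are in a collection would be to count end-of-file markers, since each incremental update appends a new cross-reference section and trailer that ends with `%%EOF`. The Python sketch below is a heuristic only: linearized PDFs can also contain more than one marker, and the byte string may occur inside stream data. The function names are illustrative.

```python
def count_eof_markers(data: bytes) -> int:
    """Count occurrences of the %%EOF marker anywhere in the file."""
    return data.count(b"%%EOF")


def maybe_incrementally_updated(data: bytes) -> bool:
    """More than one marker hints at an incremental update (or linearization)."""
    return count_eof_markers(data) > 1
```

Files flagged this way are the ones where the header version is most likely to lag behind the actual version, so they would be the natural candidates for a closer look with a full parser.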
Yep, that looks like the most sensible option to me as well.