Validate PDF/A files #285

Open
benjamingeer opened this issue Mar 5, 2019 · 7 comments

@benjamingeer
Contributor

This is the only PDF/A validation library I found that looks serious and well-maintained:

https://verapdf.org/home/

@benjamingeer
Contributor Author

@mrivoal @loicjaouen @gfoo Lukas says we're not going to do this. You have to validate your own PDFs before putting them into Sipi. You could use https://verapdf.org/home/ for that.

@mrivoal

mrivoal commented Jul 5, 2019

I guess we could do that ourselves for now. But at some point (and I guess sooner rather than later), when a user can upload a PDF to Sipi using Salsah or KUIRL, their PDF should be validated somehow, shouldn't it?

So, the idea is that Sipi only hosts validated PDF/A?

@benjamingeer
Contributor Author

If I understood correctly, @lrosenth said each project is responsible for ensuring that its data is valid before upload/import. Perhaps a GUI could handle validation before submitting the file to Sipi.
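
A minimal sketch of what such a pre-upload check could look like, assuming the veraPDF command-line tool is installed and on the PATH. The function names `is_compliant` and `validate_before_upload` are hypothetical. The `--flavour` option and the `isCompliant` attribute on `validationReport` elements are assumptions about the veraPDF CLI and its default XML report, so check them against the installed version before relying on this.

```python
import subprocess
import xml.etree.ElementTree as ET


def is_compliant(report_xml: str) -> bool:
    """Return True only if the report contains at least one validationReport
    element and every one of them has isCompliant="true".

    Assumption: the report marks the result with an `isCompliant` attribute.
    """
    try:
        root = ET.fromstring(report_xml)
    except ET.ParseError:
        # Empty or malformed output is treated as "not validated".
        return False
    reports = [
        el for el in root.iter()
        if el.tag.split("}")[-1] == "validationReport"  # ignore any namespace
    ]
    return bool(reports) and all(el.get("isCompliant") == "true" for el in reports)


def validate_before_upload(pdf_path: str, flavour: str = "1b") -> bool:
    """Run veraPDF on one file and report whether it passed.

    Assumptions: the executable is called `verapdf`, it accepts `--flavour`,
    and it writes an XML report to standard output by default.
    Raises FileNotFoundError if the executable is missing.
    """
    result = subprocess.run(
        ["verapdf", "--flavour", flavour, pdf_path],
        capture_output=True,
        text=True,
        check=False,  # a failed validation should not raise here
    )
    return is_compliant(result.stdout)
```

A GUI or import script would call `validate_before_upload(path)` and only send the file to Sipi when it returns `True`.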

@lrosenth
Collaborator

lrosenth commented Jul 5, 2019 via email

@subotic
Contributor

subotic commented Jul 5, 2019

I think that Sipi needs to do this. In the context of long-term preservation, this is a must-have. But as @lrosenth said, it will take a bit longer to actually implement.

@benjamingeer benjamingeer reopened this Jul 5, 2019
@mrivoal

mrivoal commented Jul 5, 2019

Ok, fine. We will do our own PDF validation for now.

But at some point, Sipi should definitely check, during upload, every file format we are going to accept.

@mrivoal

mrivoal commented Oct 21, 2019

@lrosenth: a question regarding the PDF formats we are willing to store and preserve through Sipi.

  • Are we only going to accept PDF/A, or will we also accept other well-formed and validated versions of PDF? I am asking because other archives (such as the CINES, in France) accept other versions of PDF, provided the files are validated against the declared format.

  • If we only accept PDF/A, what is the preferred version of PDF/A we should accept? The CINES seems to prefer PDF/A-1a for archiving, but what do you think?

(As we are going to convert some 1200 PDF files for Lumières.Lausanne, let's choose the right version/format!)


Here are the comments from the CINES on PDF/A versions:

| Format | Comment |
| --- | --- |
| PDF/A-1a | Based on PDF 1.4 but more restrictive: no external dependencies, embedded fonts, no transparency, mandatory XMP metadata. This is the preferred archival format, although it is difficult to generate. |
| PDF/A-1b | Based on PDF 1.4; less demanding than 1a, logical document structure not mandatory. A good archival format if PDF/A-1a is too complicated to generate. |
| PDF/A-2a | Based on PDF 1.7; PDF/A files can be embedded, logical structure mandatory. |
| PDF/A-2u | PDF suited to accessibility. |
| PDF/A-2b | Based on PDF 1.7; identical to PDF/A-2a but without mandatory logical structure. |
| PDF/A-3a | Based on PDF 1.7; files of any format can be embedded, logical structure mandatory. Accessibility-oriented format. |
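
Since validation is always against the format a file declares, here is a rough sketch of how to read that declaration before converting or validating. A PDF/A file states its part and conformance level in its XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. The function name `declared_pdfa_flavour` is hypothetical, and the sketch assumes the XMP packet is stored uncompressed in the file. It only reports what the file claims to be; it does not validate anything.

```python
import re
from typing import Optional

# XMP properties may be serialized as elements (<pdfaid:part>1</pdfaid:part>)
# or as attributes (pdfaid:part="1"); both forms are matched here.
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([ABUabu])")


def declared_pdfa_flavour(data: bytes) -> Optional[str]:
    """Return the declared PDF/A flavour such as "1a" or "2u", or None if the
    bytes are not a PDF or carry no PDF/A identification."""
    if not data.startswith(b"%PDF-"):
        return None
    part = _PART.search(data)
    if part is None:
        return None
    conformance = _CONFORMANCE.search(data)
    level = conformance.group(1).decode("ascii").lower() if conformance else ""
    return part.group(1).decode("ascii") + level
```

For a file on disk this would be called as `declared_pdfa_flavour(open(path, "rb").read())`, and the result could then be passed to a validator as the flavour to check against.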

@subotic subotic added this to the Backlog milestone Feb 7, 2020