Problem: checksum verification happens twice #918

Closed
jrwdunham opened this issue Feb 9, 2018 · 9 comments

Comments

@jrwdunham
Contributor

jrwdunham commented Feb 9, 2018

Goal: improve the performance of Archivematica in processing large transfers (many files and/or large files).

Context: AM's global checksum_type setting can be set via the GUI at administration/general/ where the options are MD5, SHA-1, SHA-256, and SHA-512. On a standard AM install, the default value appears to be SHA-256.

Overview: relevant micro-services in order of occurrence

  1. Checksums are calculated near the start of Transfer and stored in the db.
  2. Checksums are verified (re-calculated and compared to db values) near the end of ingest.
  3. Bag is created, near the end of ingest, which involves calculating checksums and storing them in manifest-<ALGORITHM>.txt.
  4. Bag is verified, near the end of ingest, which involves re-calculating checksums and confirming they match what is in manifest-<ALGORITHM>.txt.

Details

"Assign checksums and file sizes to objects" (Transfer)

  • A.k.a. updateSizeAndChecksum_v0.0 or archivematicaUpdateSizeAndChecksum.py.
  • Creates a new gearman worker task for each file which:
    • calculates the checksum of each file and stores it in the Files table in the database; and
    • uses the checksum algorithm specified in the checksum_type-named row in the DashboardSettings table (which defaults to 'sha256').
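
The per-file work described above can be sketched roughly as follows. This is a minimal illustration, not the code in archivematicaUpdateSizeAndChecksum.py: the function name `file_checksum`, the `SUPPORTED_ALGORITHMS` tuple, and the chunk size are assumptions made for the example, and the database write is left out.

```python
import hashlib

# Assumption: lowercase hashlib names for the four options offered by the
# checksum_type setting (MD5, SHA-1, SHA-256, SHA-512).
SUPPORTED_ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def file_checksum(path, checksum_type="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at ``path``.

    The file is read in chunks so that large files are never loaded
    into memory all at once.
    """
    if checksum_type not in SUPPORTED_ALGORITHMS:
        raise ValueError("Unsupported checksum type: %s" % checksum_type)
    digest = hashlib.new(checksum_type)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Every later re-calculation (steps 2 and 4 in the overview) repeats this full read of each file, which is where the cost for large transfers comes from.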

"Verify checksums generated on ingest" (Ingest)

  • A.k.a. verifyPREMISChecksums_v0.0 or verifyPREMISChecksums.py.
  • Creates a new gearman worker task for each file which:
    • fetches the file's checksum (and type/algorithm) from the db;
    • re-calculates the file's checksum (with the given algorithm);
    • verifies that the db checksum and the just-calculated one match;
    • creates a 'fixity check' type event in the database to document that the checksum of the file made early on in transfer has not changed by the end of ingest.

"Prepare AIP" (Ingest)

  • A.k.a. bagit_v0.0 or archivematicaBagWithEmptyDirectories.py.
  • Calls bag create which "creates a bag from supplied files/directories, completes the bag, and then writes in a specified format."
  • The checksum algorithm passed to bag is taken from the checksum_type-named row in the DashboardSettings table.
  • 'sha512' is the default algorithm if there is no checksum_type-named row in DashboardSettings table.

"Verify AIP" (Ingest)

  • A.k.a. verifyAIP_v0.0 or verifyAIP.py.
  • Micro-service uses the bag (BagIt) CLI and calls bag verifypayloadmanifests, which re-calculates checksums for all files in the AIP/bag and verifies that they match what is documented in manifest-<ALGORITHM>.txt, e.g., manifest-sha256.txt.
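
Conceptually, payload-manifest verification amounts to the following. This is a rough Python sketch of the idea, not what the bag CLI does internally: `parse_manifest` and `verify_payload` are hypothetical names, and the sketch assumes each manifest line holds a checksum, whitespace, and a path relative to the bag root.

```python
import hashlib
import os


def parse_manifest(manifest_path):
    """Map each payload path listed in a manifest to its checksum."""
    checksums = {}
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            if line.strip():
                # Split on the first run of whitespace only, so paths
                # containing spaces stay intact.
                checksum, relative_path = line.strip().split(None, 1)
                checksums[relative_path] = checksum.lower()
    return checksums


def verify_payload(bag_dir, algorithm="sha256"):
    """Return the payload paths that are unreadable or fail verification."""
    manifest_path = os.path.join(bag_dir, "manifest-%s.txt" % algorithm)
    failures = []
    for relative_path, expected in parse_manifest(manifest_path).items():
        digest = hashlib.new(algorithm)
        try:
            with open(os.path.join(bag_dir, relative_path), "rb") as payload:
                for chunk in iter(lambda: payload.read(1024 * 1024), b""):
                    digest.update(chunk)
        except OSError:
            # A file listed in the manifest but missing on disk is a failure.
            failures.append(relative_path)
            continue
        if digest.hexdigest() != expected:
            failures.append(relative_path)
    return failures
```

An empty list from `verify_payload` means every file listed in the manifest was found and matched.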

Proposed Solution

  1. Remove the "Verify checksums generated on ingest" micro-service.
  2. Enhance "Verify AIP" to bulk query the db for Transfer-generated checksums and then verify that they match what is documented in the bag-generated manifest-<ALGORITHM>.txt.
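
A sketch of the comparison in (2), under stated assumptions: the bulk query is taken to have already produced a dict mapping bag-relative paths to checksums, and `compare_with_manifest` is a hypothetical name. In practice the paths stored in the db would probably need to be normalized to the bag's data/ layout first; that step is not shown.

```python
def compare_with_manifest(db_checksums, manifest_lines):
    """Compare Transfer-time checksums against bag manifest entries.

    ``db_checksums`` maps bag-relative paths to the checksums stored at
    the start of Transfer; ``manifest_lines`` is an iterable of lines
    from manifest-<ALGORITHM>.txt. No file is re-read from disk.
    Returns a list of human-readable problems (empty if all match).
    """
    manifest = {}
    for line in manifest_lines:
        if line.strip():
            checksum, path = line.strip().split(None, 1)
            manifest[path] = checksum.lower()

    problems = []
    for path, db_checksum in sorted(db_checksums.items()):
        if path not in manifest:
            problems.append("%s: missing from manifest" % path)
        elif manifest[path] != db_checksum.lower():
            problems.append("%s: checksum mismatch" % path)
    return problems
```

Because this only compares strings already computed elsewhere, it replaces one full read of every file with a single query and a dictionary lookup per file. It also assumes the bag was created with the same algorithm that was used at Transfer.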

Problem with Proposed Solution

The "Verify checksums generated on ingest" micro-service creates events in the db that must end up in the AIP METS. However, the "Verify AIP" micro-service necessarily occurs after the "Generate METS.xml document" micro-service, so any events created by "Verify AIP" would come too late to be written to the METS.

Solution 1 (preferred) to the Problem with Proposed Solution

  1. Continue to do (1) and (2) of the proposed solution above.
  2. Have "Verify AIP" create an AIP-level "fixity check" PREMIS:EVENT that it can pass to the Storage Service, which will document this verification (fixity check of the AIP as a whole) in the pointer file.
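
For illustration only, an AIP-level event of this kind might be built along these lines. The PREMIS namespace/version, the "Pass"/"Fail" outcome strings, and the function name `build_fixity_event` are all assumptions for the sketch; how the event is handed to the Storage Service is not shown.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumption: PREMIS 3 namespace; the pointer file may use another version.
PREMIS_NS = "http://www.loc.gov/premis/v3"


def build_fixity_event(passed, note, when=None):
    """Build a minimal premis:event element for an AIP-level fixity check."""
    ET.register_namespace("premis", PREMIS_NS)

    def sub(parent, tag, text=None):
        element = ET.SubElement(parent, "{%s}%s" % (PREMIS_NS, tag))
        if text is not None:
            element.text = text
        return element

    event = ET.Element("{%s}event" % PREMIS_NS)
    identifier = sub(event, "eventIdentifier")
    sub(identifier, "eventIdentifierType", "UUID")
    sub(identifier, "eventIdentifierValue", str(uuid.uuid4()))
    sub(event, "eventType", "fixity check")
    sub(event, "eventDateTime", (when or datetime.now(timezone.utc)).isoformat())
    outcome = sub(event, "eventOutcomeInformation")
    # Hypothetical outcome vocabulary: "Pass" or "Fail".
    sub(outcome, "eventOutcome", "Pass" if passed else "Fail")
    detail = sub(outcome, "eventOutcomeDetail")
    sub(detail, "eventOutcomeDetailNote", note)
    return event
```

Calling `ET.tostring(build_fixity_event(True, "Bag verified"), encoding="unicode")` serializes the element. One event like this per AIP stands in for the per-file events that the removed micro-service used to create.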

Problems with this:

  1. Uncompressed AIPs have no pointer files. Thus uncompressed AIPs would have no AIP-level fixity check events unless we also begin creating pointer files for uncompressed AIPs. That change is motivated not only by this issue but also by the fact that encrypted uncompressed AIPs and replicated uncompressed AIPs need pointer files.

Solution 2 to the Problem with Proposed Solution

Instead of removing the "Verify checksums generated on ingest" micro-service, convert it to a once-per-unit type micro-service which simply optimistically creates the 'fixity check' PREMIS events in the database and specifies bag as the tool used. Then, if "Verify AIP" ultimately fails, the AIP as a whole has failed so the inaccurate METS is not an issue.

Problems with this:

  1. It will result in the same number of 'fixity check' PREMIS:EVENTs in the METS file as there are currently. It would be good to reduce this number, if possible, so that the METS file is not so huge for large transfers and, correspondingly, so that the time needed to write the METS file is decreased.
  2. It seems like a bad idea to "optimistically" create PREMIS:EVENTs that are later invalidated and document them in a METS file. (Though we could invalidate them in the database post hoc, if necessary.)
@jrwdunham jrwdunham self-assigned this Feb 9, 2018
@jrwdunham jrwdunham added Request: discussion The path towards resolving the issue is unclear and opinion is sought from other community members. Status: refining The issue needs additional details to ensure that requirements are clear. Waffle label. Columbia University Library CUL: phase 2 labels Feb 9, 2018
@jrwdunham jrwdunham added this to the 1.8.0 milestone Feb 9, 2018
@evelynPM
Contributor

Solution 1 looks OK to me. The only problem is that there would be no fixity check event at all for uncompressed AIPs, which don't have pointer files. So the solution is not optimal for uncompressed AIPs, but in my opinion a PREMIS fixity check event is not essential metadata for long-term preservation.

@barmintor

If a PREMIS event describing checksum verification at the bag level is planned for an upcoming release, is there a ticket for that milestone that could be referenced in #918? It seems like an important part of understanding the context for removing the file level events.

@jrwdunham
Contributor Author

The need for pointer files for uncompressed AIPs is described at artefactual/archivematica-storage-service#324.

@jrwdunham jrwdunham added Status: ready The issue is sufficiently described/scoped to be picked up by a developer. Waffle label. and removed Status: refining The issue needs additional details to ensure that requirements are clear. Waffle label. labels Feb 22, 2018
@jrwdunham jrwdunham added Status: in progress Issue that is currently being worked on. Waffle label. and removed Status: ready The issue is sufficiently described/scoped to be picked up by a developer. Waffle label. Request: discussion The path towards resolving the issue is unclear and opinion is sought from other community members. labels Mar 22, 2018
@jrwdunham jrwdunham added Status: review The issue has been merged and is ready for review. Waffle label. and removed Status: in progress Issue that is currently being worked on. Waffle label. labels Mar 23, 2018
@nickwilkinson nickwilkinson removed this from the 1.8.0 milestone Apr 24, 2018
@sromkey
Contributor

sromkey commented Jun 29, 2018

Columbia: Phase 2

@sromkey sromkey added this to the 1.8.0 milestone Aug 22, 2018
@sallain
Member

sallain commented Oct 3, 2018

For QA, I looked at three things:

  1. Remove the "Verify checksums generated on ingest" micro-service.
  2. Enhance "Verify AIP" to bulk query the db for Transfer-generated checksums and then verify that they match what is documented in the bag-generated manifest-<ALGORITHM>.txt.
  3. Have "Verify AIP" create an AIP-level "fixity check" PREMIS:EVENT that it can pass to the Storage Service, which will document this verification (fixity check of the AIP as a whole) in the pointer file.

They all looked good, I think. I checked on all the platforms we support - Ubuntu xenial, bionic, CentOS, and rpms.

  • Microservice: Verify checksums generated on ingest no longer occurs.
  • The Verify AIP task output contains the message All checksums (count=5) generated at start of transfer match those generated by BagIt (bag).
  • The pointer file for a compressed AIP contains a single fixity check event for the bag, which passed.

@jrwdunham is there anything else that needs to be tested?

@sallain
Member

sallain commented Oct 9, 2018

@ross-spencer redirecting my comment above to you! Feel free to pass on as appropriate. I think I'm good, just want to confirm.

@ross-spencer
Contributor

@sallain this looks good to me. The only thing I'd ask on top of these checks is what a failure or negative result looks like. To find out, I modified a SIP while it was still in the backlog (via the command line) and ended up with what looks like the right result:

image

You might want to recreate this for your own satisfaction, otherwise, it looks verified to me!

@sallain
Member

sallain commented Oct 9, 2018

Yay thanks @ross-spencer! That error looks right to me too!

@sallain sallain removed the Status: review The issue has been merged and is ready for review. Waffle label. label Oct 9, 2018
@sallain
Member

sallain commented Oct 23, 2018

Added to release notes; I don't think this needs to be documented anywhere.

8 participants