Can a mailbag be used to package multiple versions of email, such as for weeding or redactions? #7

gwiedeman opened this issue Jun 29, 2021 · 1 comment



gwiedeman commented Jun 29, 2021

Feedback from the working meeting also showed us that archival workflows will likely require email messages to be weeded and/or redacted. How would an archivist package, say, an EML file with the full content of an email along with a redacted PDF version for access?

This use case made us realize that the packages Nicholas Garza and Andrea Lee might envision, which would include weeded and redacted emails, are very different from the package that this draft of the Mailbag specification implicitly promises. A package that includes an unredacted EML and a redacted PDF is a good example of an Archival Information Package (AIP) from the OAIS model. It can be useful for an AIP to include multiple versions of materials, both preservation and access copies.

Although a mailbag is a BagIt bag, and a BagIt bag can also be used for an AIP, the two serve very different functions. Right now, a mailbag aims to package multiple representations of an email message to ensure preservation, not multiple versions as an AIP might. The different representations within a mailbag must contain the same content to ensure that all parts of an email are preserved.

For example, think about the use case of an EML export with embedded images hosted on external servers. If Andrea Lee needs to preserve that embedded image and redact text from that EML file, she may need multiple mailbags. If she tries to package just the original EML along with a redacted PDF, neither format will include a complete copy of the message: the EML lacks the image, and the PDF lacks the redacted text. If the external server hosting the image file goes offline, the PDF will still display the image, but the EML file will not. So if, in the future, Andrea Lee is able to release the complete unredacted email, it would be very difficult for her to generate another complete PDF access file, since the unredacted EML file does not contain the image.
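To make the external-image problem concrete, here is a minimal sketch of how someone might check whether an EML file depends on remotely hosted images. It uses only Python's standard `email` module; the function name `find_remote_images` and the simple regular expression are assumptions for illustration, not part of mailbag, and a real HTML parser would be more robust.

```python
import re
from email import message_from_string, policy

# Naive pattern for <img> tags whose src is an http(s) URL (illustrative only).
REMOTE_IMG = re.compile(r'<img[^>]+src=["\'](https?://[^"\']+)["\']', re.IGNORECASE)


def find_remote_images(eml_text):
    """Return URLs of images that the HTML body loads from external servers."""
    msg = message_from_string(eml_text, policy=policy.default)
    urls = []
    for part in msg.walk():
        # Only HTML parts can reference remote images this way.
        if part.get_content_type() == "text/html":
            urls.extend(REMOTE_IMG.findall(part.get_content()))
    return urls


sample_eml = (
    "From: a@example.org\n"
    "To: b@example.org\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    '<html><body><p>Hi</p><img src="https://example.org/logo.png"></body></html>\n'
)
remote = find_remote_images(sample_eml)
```

Any URL returned here points to content that lives outside the EML file, so the EML alone would not be a complete copy of the message if that server went offline.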

So if we work to include redactions in mailbags, we’ll have to manage both multiple representations and multiple versions of email messages. One approach might be to append suffixes to filenames, like “_preservation” or “_original.” If the message has a redacted version, it could include a “_redacted” or similar suffix. If we would prefer not to change the original filenames, we could specify different directories that would contain “preservation” and “redacted” versions. We may also need to include a pathway to generate derivatives, like PDFs or WARCs, from either the original or redacted versions.
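As a rough illustration of the suffix idea, the sketch below adds and strips a version suffix ahead of the file extension. The suffix strings come from the paragraph above; the helper names are hypothetical, and none of this is defined by the Mailbag specification.

```python
from pathlib import PurePosixPath

# Suffixes floated in this discussion; hypothetical, not part of any spec.
VERSION_SUFFIXES = ("_preservation", "_original", "_redacted")


def add_version_suffix(filename, suffix):
    """Insert a version suffix before the extension: msg1.eml -> msg1_redacted.eml."""
    if suffix not in VERSION_SUFFIXES:
        raise ValueError(f"unknown version suffix: {suffix}")
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


def split_version_suffix(filename):
    """Return (filename without the version suffix, suffix), or (filename, '') if none."""
    path = PurePosixPath(filename)
    for suffix in VERSION_SUFFIXES:
        if path.stem.endswith(suffix):
            base = path.stem[: -len(suffix)]
            return str(path.with_name(f"{base}{path.suffix}")), suffix
    return filename, ""
```

Note that this approach changes the original filenames, which is exactly the trade-off that separate directories would avoid.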

We expect that most redaction use cases would not redact all messages, only the messages containing risky information. If only a portion of messages have a redacted version, it could be difficult for Nicholas Garza to extract a set of “access” messages to share with users. The “access” versions for some messages may be in a redacted directory, while others may be in the preservation/original directory. If mailbag instead uses filename suffixes to delineate versions, then they could all be in the same directory, but Nicholas would have to filter out the “_preservation” messages whenever “_redacted” messages exist. Andrea Lee might write a script to automate this, but it still adds cost and the potential for error.
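A script like the one mentioned above could look roughly like this. It is only a sketch, assuming every file sits in one directory and carries either a “_preservation” or a “_redacted” suffix before its extension; the function name and suffix constants are hypothetical.

```python
from pathlib import PurePosixPath

REDACTED = "_redacted"          # hypothetical suffix
PRESERVATION = "_preservation"  # hypothetical suffix


def select_access_files(filenames):
    """Pick one access file per message and format, preferring the redacted version."""
    chosen = {}
    for name in filenames:
        path = PurePosixPath(name)
        stem = path.stem
        if stem.endswith(REDACTED):
            key = (stem[: -len(REDACTED)], path.suffix)
            chosen[key] = name  # a redacted version always wins
        elif stem.endswith(PRESERVATION):
            key = (stem[: -len(PRESERVATION)], path.suffix)
            chosen.setdefault(key, name)  # used only if no redacted version exists
    return sorted(chosen.values())


access = select_access_files(
    ["msg1_preservation.eml", "msg1_redacted.eml", "msg2_preservation.eml"]
)
```

Even a short script like this has to be written, tested, and maintained locally, and a mistake in the selection logic could expose an unredacted message.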

Including redactions would add complexity to the specification, making it harder to implement and maintain. Still, the bigger concern may be that including redactions in mailbags could limit the redaction workflows available to archivists. If Andrea Lee or Nicholas Garza has different local requirements for redactions than we expect, they may not be able to use the redaction workflow we include within mailbags, which would greatly undermine the utility of the specification.

A better approach might be to use multiple mailbags for redactions. For the EML and redacted PDF use case, Andrea Lee can create one mailbag with an unredacted EML and unredacted PDF, and a second mailbag with the redacted EML and PDF. She can choose to manage the different mailbags by appending suffixes or just keeping them in separate directories. It may be a good idea to keep redacted copies completely separate from the original, rather than packaged in the same mailbag, to reduce the risk that the original is exposed.

For the use case with many EMLs, some requiring redactions, the unredacted mailbag can contain all unredacted EMLs and PDFs. The redacted mailbag could contain redacted EMLs and PDFs, as well as the originals when redaction was not required. This would make the second mailbag useful for access. Alternatively, if Andrea Lee were concerned about disk space, she could include in the second mailbag only the messages that required redaction. So, using multiple mailbags actually gives archivists the flexibility to make most of these decisions themselves.
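The two options for the second mailbag can be sketched as a small planning function. The message identifiers, the `"redacted"`/`"original"` labels, and the `include_unredacted` flag are all hypothetical names used for illustration; this does not call any mailbag tooling.

```python
def plan_redacted_mailbag(all_messages, redacted_ids, include_unredacted=True):
    """List which version of each message would go into the second mailbag.

    include_unredacted=True  -> complete access set (originals fill the gaps)
    include_unredacted=False -> only messages that needed redaction (saves disk space)
    """
    plan = []
    for message_id in all_messages:
        if message_id in redacted_ids:
            plan.append((message_id, "redacted"))
        elif include_unredacted:
            plan.append((message_id, "original"))
    return plan


full_access = plan_redacted_mailbag(["m1", "m2", "m3"], {"m2"})
space_saving = plan_redacted_mailbag(["m1", "m2", "m3"], {"m2"}, include_unredacted=False)
```

With the first option the second mailbag can be handed to users as-is; with the second, producing an access set still requires combining both mailbags.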

The feedback we received was really helpful in making us realize two things: we had implicitly assumed that mailbags would contain multiple representations of email but not multiple versions, and while a mailbag can be part of an AIP, it is not an AIP in itself. If we choose not to include multiple versions, we can rework the specification so the assumptions we're making are clear and explicit.

We’re not as confident as we would like here, so please let us know what you think!


gwiedeman commented Jul 9, 2021

gwiedeman transferred this issue from UAlbanyArchives/mailbagit Jun 15, 2022