Use of original filenames #4

Closed
gwiedeman opened this issue Jun 29, 2021 · 4 comments

Comments

@gwiedeman
Contributor

gwiedeman commented Jun 29, 2021

In drafting the specification, we discovered multiple places where there is the potential for cross-platform filename issues. This is one downside to relying on filesystems for structure, as not all strings are valid file or directory names.

During the packaging of a mailbag, all messages are assigned a new Mailbag-Message-ID that must be filename-safe. These IDs can be UUIDs or simply sequential numbers, and they are used as filenames for new derivative files, such as when PDFs are created from an MBOX file. Unfortunately, we don’t feel that we can use the Message-ID field that’s usually included in emails, as it may not be filesystem-safe.

The issue is that users may be packaging mailbags from EML files, or even from legacy use cases with PDF files, and these files already have filenames for individual messages. While they should be coming from the same filesystem and thus be safe to use for derivative files as well, comments from the working meeting suggested that it may be simpler to use Mailbag-Message-ID even in these cases. Interestingly, commenters pointed out that even the filenames of EML files were not originally created by the user sending the messages. Still, we think Nicholas Garza and Gary Richardon in particular would be alarmed if these filenames were overridden during the creation of a mailbag. So we plan to keep the original filenames whenever possible.

Perhaps a bigger issue is attachments. We plan to keep the original filenames for attachments, but files embedded within MBOX or EML files may not have been created in the same filesystem being used to package a mailbag. We still think it’s important to keep the original names here, but for cases when an attachment filename is invalid, it is now required to rename the file using the Mailbag-Message-ID and document the changes in an original_filenames.txt file.
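
To make the fallback rule above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the validity check is a conservative guess based mostly on Windows naming rules, the `<Mailbag-Message-ID>-<index>` naming pattern is hypothetical, and the tab-separated change lines are just one possible layout for original_filenames.txt, not something the spec defines.

```python
import re

# Assumption: characters and device names rejected by common filesystems
# (mostly Windows rules); the spec does not define this list.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_safe_filename(name: str) -> bool:
    """Return True if name looks usable on common filesystems (a conservative guess)."""
    if not name or name in {".", ".."} or len(name.encode("utf-8")) > 255:
        return False
    if INVALID_CHARS.search(name) or name[-1] in " .":
        return False
    stem = name.split(".")[0].upper()
    return stem not in RESERVED_NAMES


def rename_attachments(message_id: str, originals: list[str]):
    """Keep safe names, replace unsafe ones, and collect the changes to document."""
    names, changes = [], []
    for index, original in enumerate(originals, start=1):
        if is_safe_filename(original):
            new_name = original
        else:
            # Hypothetical pattern: <Mailbag-Message-ID>-<attachment index>
            new_name = f"{message_id}-{index}"
            # Hypothetical line layout for original_filenames.txt
            changes.append(f"{new_name}\t{original}")
        names.append(new_name)
    return names, changes
```

Calling `rename_attachments("42", ["report.pdf", "a/b:c?.doc"])` keeps the first name and replaces only the second, so the original names survive wherever the packaging filesystem can hold them.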

@nkrabben

nkrabben commented Jul 21, 2021

We've found that any type of file system metadata that we would like to retain, including file names, is far easier to preserve when copied to a separate text document like the proposed original_filenames.txt file.

Is the original-filenames.txt in the proposed spec yet? I'm curious whether it would be better to have a complete listing of all attachments, with optional original filenames, like an attachments.csv file with columns for mailbag-message-id, original-filename, mailbag-attachment-id.
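
As a rough illustration of that idea, the sketch below writes such a listing with Python's standard `csv` module. The three column names come from the suggestion above; the ID values, the sample rows, and the choice to leave original-filename empty when no name is available are assumptions.

```python
import csv
import io

# Column names taken from the suggestion above.
COLUMNS = ["mailbag-message-id", "original-filename", "mailbag-attachment-id"]


def write_attachments_csv(stream, rows):
    """Write one row per attachment; original-filename may be left empty."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical sample data: the second attachment has no recoverable name.
buffer = io.StringIO()
write_attachments_csv(buffer, [
    {"mailbag-message-id": "1", "original-filename": "agenda.docx",
     "mailbag-attachment-id": "1-1"},
    {"mailbag-message-id": "1", "original-filename": "",
     "mailbag-attachment-id": "1-2"},
])
print(buffer.getvalue())
```

Using the `csv` module rather than joining strings by hand means original filenames containing commas or quotes are escaped correctly.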

@gwiedeman
Contributor Author

gwiedeman commented Apr 15, 2022

Thanks for the feedback! I finally got to ask the advisory board about this, and while we think keeping original attachment filenames when possible is important to our user personas, we agree that the current original_filenames.txt is insufficient. There are also some helpful comments in the 0.3 spec that I'm cross-linking. We've also discovered rare cases where we got a file but were unable to extract a filename, which this would help handle.

We're now planning to replace original-filenames.txt with attachments.csv before a release, similar to what @nkrabben suggested. We're thinking of requiring an attachments.csv for all messages. This way, implementations can either preserve the original names or not, and mailbag consumers can safely rely on the CSV either way. They should all be tiny files, so it shouldn't be much overhead. A CSV should add the necessary structure that original-filenames.txt was lacking, and we plan to apply the same rules as mailbag.csv. Cases where we are unable to get a filename can be better documented in this CSV as well.
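
From the consumer side, relying on the CSV either way could look something like the sketch below. The column names follow the earlier suggestion in this thread and are not final, and treating an empty original-filename as "no filename could be extracted" is an assumption.

```python
import csv
import io


def load_original_names(stream):
    """Map each attachment ID to its original filename, or None if none was recorded."""
    reader = csv.DictReader(stream)
    return {
        row["mailbag-attachment-id"]: (row["original-filename"] or None)
        for row in reader
    }


# Hypothetical attachments.csv content for a single message.
sample = io.StringIO(
    "mailbag-message-id,original-filename,mailbag-attachment-id\n"
    "1,agenda.docx,1-1\n"
    "1,,1-2\n"
)
names = load_original_names(sample)
```

A consumer written this way never needs to inspect the packaged filenames themselves, so it behaves the same whether or not the implementation preserved the original names on disk.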

@gwiedeman
Contributor Author

gwiedeman commented Apr 15, 2022

If possible, it might be nice to generalize the columns to something like "original-attachment" and "packaged-attachment" or similar, as that might allow for migrated attachments during packaging as per this comment.

@gwiedeman
Contributor Author

gwiedeman commented May 25, 2022

Addressed in the mailbagit tool by UAlbanyArchives/mailbagit#187. attachments.csv is planned for the next release of the spec.

@gwiedeman gwiedeman transferred this issue from UAlbanyArchives/mailbagit Jun 15, 2022