
Use of mailbag.csv #6

Open
gwiedeman opened this issue Jun 29, 2021 · 3 comments

Comments

Contributor

@gwiedeman gwiedeman commented Jun 29, 2021

The choice of a CSV tag file to serialize message-level information was also questioned during the working meeting. CSVs invite errors because they can be written with a variety of delimiters and dialects. Large row counts can also cause problems, since different tools have limits, often around 1 million rows. We had some useful discussions about using JSON or another serialization that avoids these issues, but concluded that CSVs are more useful for the Nicholas Garza, Teresa Burns, and Gary Richardson personas, who are likely to be more comfortable opening and reading a CSV file in spreadsheet software than a JSON file. A suggestion from the working meeting was to break the CSV into multiple files after a certain number of rows, much like WARC files, so we decided to split the file after 100,000 rows.
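To make the splitting idea concrete, here is a minimal Python sketch of writing rows into a new CSV file every 100,000 rows. The function name, the `mailbag-1.csv`, `mailbag-2.csv` naming scheme, and the choice to repeat the header row in every file are assumptions made for illustration; none of them is defined in this thread.

```python
import csv
from pathlib import Path

ROWS_PER_FILE = 100_000  # split threshold discussed above


def write_split_csv(rows, header, out_dir, rows_per_file=ROWS_PER_FILE):
    """Write rows to numbered CSV files, starting a new file every rows_per_file rows.

    The file names (mailbag-1.csv, mailbag-2.csv, ...) are a hypothetical
    naming scheme; repeating the header in each file is also an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    writer = None
    try:
        for index, row in enumerate(rows):
            if index % rows_per_file == 0:
                # Close the previous file (if any) and open the next one.
                if handle is not None:
                    handle.close()
                path = out_dir / f"mailbag-{len(paths) + 1}.csv"
                handle = path.open("w", newline="", encoding="utf-8")
                # The default "excel" dialect fixes the delimiter to a comma.
                writer = csv.writer(handle)
                writer.writerow(header)
                paths.append(path)
            writer.writerow(row)
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Pinning a single dialect and encoding when writing, as this sketch does, is one way to limit the delimiter and dialect variation mentioned above.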

We also discussed how the specification’s requirement of a separate mailbag.csv tag file is one of the few major costs of meeting the specification compared with a generic BagIt bag. In reconsidering this, we realized that the CSV file was required because it points to where messages are within the payload directory and also acts as a lookup between the Message-ID and the filename-safe Mailbag-Message-ID fields. We had originally required message header information in mailbag.csv as well, but we’ve decided that this should be optional. Feedback from the working meeting also suggested including a column for attachments, so we added an integer field for the number of attachments.
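As a rough illustration of the lookup role described above, the following Python sketch reads a small CSV and maps each Message-ID to its Mailbag-Message-ID, payload location, and attachment count. The `Message-ID` and `Mailbag-Message-ID` column names come from this discussion; the `Message-Path` and `Attachments` column names, the sample rows, and the `build_lookup` helper are hypothetical.

```python
import csv
import io

# Hypothetical sample data; only the Message-ID and Mailbag-Message-ID
# column names are taken from the discussion above.
SAMPLE = """Mailbag-Message-ID,Message-ID,Message-Path,Attachments
1,<abc123@example.org>,Inbox,2
2,<def456@example.org>,Inbox/Projects,0
"""


def build_lookup(csv_file):
    """Map each Message-ID to its filename-safe ID, payload path, and attachment count."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        lookup[row["Message-ID"]] = {
            "mailbag_message_id": row["Mailbag-Message-ID"],
            "path": row["Message-Path"],
            # The attachments field is an integer count, so parse it as one.
            "attachments": int(row["Attachments"]),
        }
    return lookup


lookup = build_lookup(io.StringIO(SAMPLE))
print(lookup["<def456@example.org>"]["mailbag_message_id"])
```

Because header fields are optional in this design, a reader like this would only need to rely on the ID, location, and attachment columns.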


@jamiepb jamiepb commented Jul 7, 2021

Is there a way, in the mailbag.csv file or elsewhere, to indicate a one-to-many relationship among derivatives, for example when a single PST file or a few PST files are converted into EML?

Contributor Author

@gwiedeman gwiedeman commented Jul 9, 2021

Thank you for your comment. Currently no, and the challenge of documenting this type of relationship is one of the main reasons the Advisory Board was hesitant about including multiple email accounts per #2. That said, multiple PSTs would not necessarily mean multiple accounts, so we definitely need to discuss this more. I could see multiple exports from the same account over time being a common use case.


@jamiepb jamiepb commented Jul 12, 2021

Currently, Office 365's email export tool cuts PST files off at around 10 GB. While that's liable to change over time, in my experience our recent email account exports have been 1-3 PST files, and they will continue to grow. Allowing for email accounts that comprise multiple PSTs will make the specification more widely applicable and scalable, whether that is handled in mailbag.csv, in the subfolder structure, or left to the user to document elsewhere.
