Bagit is unhelpfully agnostic about line endings. It supports both LF and CRLF and does not contain a standard place to document which is used. It seems very problematic to try to detect line endings. Bagit-python only supports LF line endings, so any bag created with Bagit-python will have LF line endings. We discussed using a custom field in bag-info.txt for this, but decided that since Bagit-python mandates LF, we can require it in the specification.
The main problem is that CSV files most commonly use CRLF line endings as required by RFC4180.
Thus, currently mailbag mandates UTF-8 for all tag files, but requires CRLF line endings for mailbag.csv, and LF line endings for all other tag files. ¯\(ツ)/¯
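As a rough illustration of that split, here is a minimal Python sketch that writes one LF tag file and a CRLF mailbag.csv. The helper name `write_tag_files`, the field values, and the CSV column names are made up for the example; this is not how bagit-python or mailbag actually write their files.

```python
import csv
import os


def write_tag_files(bag_dir):
    """Hypothetical helper: write an LF tag file and a CRLF mailbag.csv."""
    info_path = os.path.join(bag_dir, "bag-info.txt")
    csv_path = os.path.join(bag_dir, "mailbag.csv")

    # newline="\n" turns off platform newline translation, so the file
    # gets LF line endings even on Windows.
    with open(info_path, "w", encoding="utf-8", newline="\n") as f:
        f.write("Source-Organization: Example Org\n")
        f.write("Bagging-Date: 2021-01-01\n")

    # newline="" lets the csv module control line endings itself; its
    # default dialect terminates rows with CRLF, matching RFC 4180.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Message-ID", "Subject"])  # illustrative columns
        writer.writerow(["<1@example.org>", "Hello"])

    return info_path, csv_path
```

Passing `newline` explicitly in both calls is the important part: without it, Python translates `\n` to the platform default on write, which would produce CRLF tag files on Windows.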
Thanks for your comment and sorry for taking so long to address this. I think only requiring LF for the defined tag files is a good idea, and we'll make that change before a release.
Having experienced fun encoding issues, I also love the idea of requiring more encoding information/portability generally, but I think we have to follow bagit and bagit-python. Looking into it briefly, it seems like bagit-python writes tag files with encoding='utf-8', which does not include the byte-order mark? My non-expert instinct is to be agnostic like bagit, as there doesn't appear to be a consensus on whether it should be included for utf-8. Definitely open to more expert opinions though.
My preference is to discourage the byte-order mark, but that comes from a naive coding point-of-view where I generally have to invoke extra arguments to handle the BOM. If there are good reasons to allow the BOM (and compatibility with tools that include a BOM by default seems like a good one), agnosticism seems good. I'd love to hear an argument for requiring the BOM, mostly so I can better understand why it's useful.
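For anyone following along, this is roughly what the "extra arguments" look like in Python. It is only a sketch of standard-library codec behavior, and the sample field is made up: the `utf-8` codec never writes a BOM and does not strip one on read, while `utf-8-sig` writes one and strips it if present.

```python
import codecs

text = "Bag-Size: 1 MB\n"  # made-up sample content

plain = text.encode("utf-8")         # no BOM written
with_bom = text.encode("utf-8-sig")  # BOM (EF BB BF) prepended

# The only difference between the two encodings is the three-byte prefix.
assert with_bom == codecs.BOM_UTF8 + plain

# Decoding BOM-prefixed bytes as plain utf-8 leaves U+FEFF in the string,
# which then ends up glued to the first field name.
leaked = with_bom.decode("utf-8")

# utf-8-sig strips the BOM when present and is harmless when it is absent.
stripped = with_bom.decode("utf-8-sig")
also_fine = plain.decode("utf-8-sig")
```

So a reader that wants to tolerate both cases can decode with `utf-8-sig`, but a reader that uses plain `utf-8` will silently get a stray `\ufeff` at the start of the first line.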
This has been changed in the draft 1.0 release to require LF for tag files defined by bagit and CRLF for mailbag.csv. We considered recommending LF for other tag files, but that wouldn't make sense for other CSV files, for example, so it's left agnostic.
Looking into the BOM more, I agree that it would be better if there were no BOMs in UTF-8 tag files. Since we're requiring utf-8, it should be fair to say tag files SHOULD NOT include a BOM.
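To make the rules from this thread concrete, here is a hedged sketch of what a check could look like. `check_tag_file` and `BAGIT_TAG_FILES` are hypothetical names, not part of bagit-python or mailbag, and the list of bagit-defined tag files is my reading of the BagIt spec. A BOM is reported as a warning rather than an error because it is only a SHOULD NOT.

```python
import codecs
import re

# Assumption: tag files defined by the BagIt spec; manifests and tag
# manifests are matched by filename prefix below.
BAGIT_TAG_FILES = {"bagit.txt", "bag-info.txt", "fetch.txt"}


def check_tag_file(name, data):
    """Return (errors, warnings) for one tag file's raw bytes."""
    errors, warnings = [], []

    # SHOULD NOT: a BOM is flagged but does not fail the check.
    if data.startswith(codecs.BOM_UTF8):
        warnings.append("UTF-8 BOM present (SHOULD NOT)")
        data = data[len(codecs.BOM_UTF8):]

    # All tag files must be valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        errors.append("not valid UTF-8")

    is_bagit = name in BAGIT_TAG_FILES or name.startswith(
        ("manifest-", "tagmanifest-")
    )
    if is_bagit and b"\r" in data:
        errors.append("CR found; LF line endings required")
    elif name == "mailbag.csv" and re.search(rb"(?<!\r)\n|\r(?!\n)", data):
        # Limitation: a bare LF inside a quoted CSV field is also flagged.
        errors.append("bare LF or CR found; CRLF line endings required")

    # Any other tag file is left agnostic about line endings.
    return errors, warnings
```

For example, `check_tag_file("bag-info.txt", b"A: b\r\n")` reports a line-ending error, while a file such as `notes.csv` with LF endings passes because only mailbag.csv and the bagit-defined files are constrained.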