feat(encoding/csv): handle CSV byte-order marks #3143

lionel-rowe · 2023-01-28T09:22:12Z

Adds a bom option to CSV StringifyOptions, which prepends U+FEFF (BOM/byte-order mark) to the output
Automatically strips leading BOM when parsing CSVs, even if trimLeadingSpace option is not set

This is largely for compatibility with MS Excel's handling of CSV files containing Unicode text:

When opening a CSV file with no BOM, Excel assumes the encoding is some ANSI encoding (Windows-1252?), rather than anything sensible like UTF-8. As a result, a UTF-8 file containing 文,字\r\n is displayed as æ–‡ å—, rather than the expected 文字.
When opening a CSV file with a UTF-8 BOM, Excel correctly treats it as UTF-8.
When using "Save As", you're given multiple CSV options:
- The first is "CSV UTF-8", which adds a BOM.
- The second, a ways down the list, is simply called "CSV". This doesn't add a BOM and instead transforms all non-Windows-1252 characters into question marks, permanently corrupting the file without warning.
- There are also options for Macintosh and MS-DOS, which I haven't looked into what they do as they seem pretty niche.

For comparison, Google Sheets:

When opening a UTF-8 CSV file with no BOM, Google Sheets correctly treats it as UTF-8.
When opening a CSV file with a UTF-8 BOM, Google Sheets correctly treats it as UTF-8. However, when re-saving, it omits the BOM, which limits interoperability with Excel.

The current behavior of std/encoding/csv is to treat the BOM as a literal character when reading and always omit it when writing. This PR makes the new default behavior the same as what Google Sheets does (at least for UTF-8 files) — read either with or without BOM, but still always omit BOM when writing. Alternatively, by supplying the new bom: true option to stringify, a BOM is also prepended when writing.

For completeness, some other possible behaviors could be:

Only truncate BOM on parsing if a specific option is supplied (trimLeadingSpace or a dedicated trimBom option). This seems like a bad idea, as certain valid UTF-8 CSVs generated by Excel would cause a parser error (e.g. ${BYTE_ORDER_MARK}"a""b"\r\n), while others would contain invalid data in the first cell (e.g. /\D/.test(parse(`${BYTE_ORDER_MARK}123\r\n`)[0][0]) == true).
Default bom in StringifyOptions to true. Advantage would be ensuring round-trip compatibility with Excel given default options. However, that would be a breaking change, plus it seems uncalled for to choose a default solely due to Excel's bad behavior.
Automatically detect whether to prepend a BOM based on whether the input contains non-Windows-1252 characters. Significant additional complexity, potential performance cost, plus hard-to-reason-about "magic" behavior.

kt3k

LGTM. Thanks!

feat(encoding/csv): handle CSV byte-order marks

fabaf4c

lionel-rowe requested a review from kt3k as a code owner January 28, 2023 09:22

deno-deploy bot deployed to Preview January 28, 2023 09:22 View deployment

kt3k approved these changes Jan 30, 2023

View reviewed changes

kt3k merged commit 9943f5e into denoland:main Feb 3, 2023

kt3k mentioned this pull request Feb 6, 2023

csv: rename bom to byteOrderMark #3165

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(encoding/csv): handle CSV byte-order marks #3143

feat(encoding/csv): handle CSV byte-order marks #3143

lionel-rowe commented Jan 28, 2023 •

edited

Loading

kt3k left a comment

feat(encoding/csv): handle CSV byte-order marks #3143

feat(encoding/csv): handle CSV byte-order marks #3143

Conversation

lionel-rowe commented Jan 28, 2023 • edited Loading

kt3k left a comment

Choose a reason for hiding this comment

lionel-rowe commented Jan 28, 2023 •

edited

Loading