Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(encoding/csv): handle CSV byte-order marks #3143

Merged
merged 1 commit into from
Feb 3, 2023

Conversation

lionel-rowe
Copy link
Contributor

@lionel-rowe lionel-rowe commented Jan 28, 2023

  • Adds a bom option to CSV StringifyOptions, which prepends U+FEFF (BOM/byte-order mark) to the output
  • Automatically strips leading BOM when parsing CSVs, even if trimLeadingSpace option is not set

This is largely for compatibility with MS Excel's handling of CSV files containing Unicode text:

  • When opening a CSV file with no BOM, Excel assumes the encoding is some ANSI encoding (Windows-1252?), rather than anything sensible like UTF-8. As a result, a UTF-8 file containing 文,字\r\n is displayed as æ–‡ å­—, rather than the expected 文 字.
  • When opening a CSV file with a UTF-8 BOM, Excel correctly treats it as UTF-8.
  • When using "Save As", you're given multiple CSV options:
    • The first is "CSV UTF-8", which adds a BOM.
    • The second, a ways down the list, is simply called "CSV". This doesn't add a BOM and instead transforms all non-Windows-1252 characters into question marks, permanently corrupting the file without warning.
    • There are also options for Macintosh and MS-DOS, which I haven't looked into what they do as they seem pretty niche.

For comparison, Google Sheets:

  • When opening a UTF-8 CSV file with no BOM, Google Sheets correctly treats it as UTF-8.
  • When opening a CSV file with a UTF-8 BOM, Google Sheets correctly treats it as UTF-8. However, when re-saving, it omits the BOM, which limits interoperability with Excel.

The current behavior of std/encoding/csv is to treat the BOM as a literal character when reading and always omit it when writing. This PR makes the new default behavior the same as what Google Sheets does (at least for UTF-8 files) — read either with or without BOM, but still always omit BOM when writing. Alternatively, by supplying the new bom: true option to stringify, a BOM is also prepended when writing.

For completeness, some other possible behaviors could be:

  • Only truncate BOM on parsing if a specific option is supplied (trimLeadingSpace or a dedicated trimBom option). This seems like a bad idea, as certain valid UTF-8 CSVs generated by Excel would cause a parser error (e.g. ${BYTE_ORDER_MARK}"a""b"\r\n), while others would contain invalid data in the first cell (e.g. /\D/.test(parse(`${BYTE_ORDER_MARK}123\r\n`)[0][0]) == true).
  • Default bom in StringifyOptions to true. Advantage would be ensuring round-trip compatibility with Excel given default options. However, that would be a breaking change, plus it seems uncalled for to choose a default solely due to Excel's bad behavior.
  • Automatically detect whether to prepend a BOM based on whether the input contains non-Windows-1252 characters. Significant additional complexity, potential performance cost, plus hard-to-reason-about "magic" behavior.

Copy link
Member

@kt3k kt3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@kt3k kt3k merged commit 9943f5e into denoland:main Feb 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants