UTF-8 encoding not compatible with RFC4180; not clear #204

Closed
scraperdragon opened this Issue Jul 3, 2015 · 4 comments

@scraperdragon

RFC4180 states that the acceptable bytes of a CSV field are

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
(also COMMA / CR / LF / 2DQUOTE if the field is quoted)

Any UTF-8 encoded string which is not equivalent to its (7-bit) ASCII encoding will not match this pattern. In addition, it is not clear whether UTF-8 encoded strings which contain control characters other than CR/LF are permissible or not.
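
To make the mismatch concrete, here is a minimal sketch that applies the TEXTDATA rule quoted above to the raw bytes of an unquoted field. The function name `is_rfc4180_unquoted_field` is made up for illustration; it is not part of any spec or library.

```python
import re

# RFC 4180 TEXTDATA: %x20-21 / %x23-2B / %x2D-7E, applied byte by byte.
# An unquoted field is zero or more TEXTDATA bytes.
TEXTDATA_FIELD = re.compile(rb"[\x20-\x21\x23-\x2B\x2D-\x7E]*")


def is_rfc4180_unquoted_field(raw: bytes) -> bool:
    """Return True if every byte of an unquoted field is RFC 4180 TEXTDATA."""
    return TEXTDATA_FIELD.fullmatch(raw) is not None


if __name__ == "__main__":
    # Plain ASCII passes; the UTF-8 bytes of a non-ASCII character and a tab do not.
    for sample in ("cafe", "café", "a\tb"):
        print(repr(sample), is_rfc4180_unquoted_field(sample.encode("utf-8")))
```

Any byte at or above 0x80, which is every byte of a multi-byte UTF-8 sequence, falls outside the three ranges, so the field is rejected.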

It should be clearly documented that this change from the RFC has occurred. I would expect TEXTDATA to be redefined as something like "a valid UTF-8 code unit that is not COMMA, CR, LF, or DQUOTE"

This would also permit the use of control characters; i.e. characters 0-31 and 127, excluding CR and LF. Via the RFC, it is currently forbidden to use tabs in fields, for example.
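
One possible reading of the redefinition suggested above, sketched in code. This is an assumption about what the rule could mean, not spec wording: it treats the field as strictly decoded UTF-8 and checks it character by character, and the name `is_relaxed_unquoted_field` is hypothetical.

```python
# Characters that would still need quoting under the suggested redefinition.
FORBIDDEN = {",", "\r", "\n", '"'}


def is_relaxed_unquoted_field(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 with no COMMA, CR, LF, or DQUOTE."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects invalid UTF-8
    except UnicodeDecodeError:
        return False
    return not any(ch in FORBIDDEN for ch in text)


if __name__ == "__main__":
    # Non-ASCII text and tabs are accepted; a bare comma is still rejected.
    print(is_relaxed_unquoted_field("café\tbar".encode("utf-8")))
    print(is_relaxed_unquoted_field(b"a,b"))
```

Under this reading, tabs and the other control characters (0-31 and 127, apart from CR and LF) are accepted in unquoted fields.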

It is worth noting that the RFC also mandates:

  • commas as separators
  • using CRLF as a line ending

These are also potential areas where the dataprotocols spec permits the file to violate the RFC.

This is mostly nargling over details, mind.

@rufuspollock
Contributor

@scraperdragon good point. Would you like to submit a pull request or do you have specific wording changes you would recommend?

@pwalsh
Member
pwalsh commented Mar 7, 2016

@danfowler @rgrp can we move to clarify this? UTF-8 encoding and alternate separators are so important to the application of CSV and the data protocol specs in general that we should be explicit about how we diverge from RFC4180.

@rufuspollock rufuspollock self-assigned this Mar 7, 2016
@rufuspollock
Contributor

@pwalsh @scraperdragon what do we think of changing the start of the CSV section to the following:

CSV files included in a Tabular Data Package MUST conform to [RFC 4180
"Common Format and MIME Type for Comma-Separated Values (CSV) Files"][rfc4180]
subject to the following exceptions and additions.

Exceptions:

-   Files MUST be encoded as UTF-8 (the RFC requires 7-bit ASCII) 
-   The standard line terminator character can be LF or CRLF (the RFC allows CRLF only)
-   Files MAY (but SHOULD NOT) deviate from standard CSV in terms of delimiters
    (e.g. tab rather than ","), quote characters and other parameters (see below)
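
As a rough illustration of what a consumer has to tolerate under these exceptions, here is a sketch using Python's standard `csv` module. The helper `read_rows` and its `delimiter` parameter are hypothetical names chosen for this example; nothing here is mandated by the spec.

```python
import csv
import io


def read_rows(raw: bytes, delimiter: str = ",") -> list:
    """Parse CSV bytes that are UTF-8 encoded and use LF or CRLF line endings."""
    text = raw.decode("utf-8")  # exception 1: UTF-8 rather than 7-bit ASCII
    # newline="" disables newline translation so the csv module sees the
    # original LF or CRLF terminators and handles them itself (exception 2).
    buffer = io.StringIO(text, newline="")
    # exception 3: a non-standard delimiter such as tab can be passed in.
    return list(csv.reader(buffer, delimiter=delimiter))


if __name__ == "__main__":
    print(read_rows("name,city\r\nZoë,Kraków\r\n".encode("utf-8")))
    print(read_rows("name\tcity\nZoë\tKraków\n".encode("utf-8"), delimiter="\t"))
```

Both calls should yield the same two rows, one from a CRLF, comma-separated file and one from an LF, tab-separated file.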
@pwalsh
Member
pwalsh commented Mar 7, 2016

@rgrp LGTM

@rufuspollock rufuspollock added a commit that closed this issue Mar 7, 2016
@rufuspollock rufuspollock [tdp][s]: no substantive change but clarify deviations from CSV RFC -…
… fixes #204.

* spec allows for CSV files that are not "strict" CSV as per CSV RFC (e.g. tab instead of "," as separator)
eca7610