UTF-8 encoding not compatible with RFC4180; not clear #204

Closed
scraperdragon opened this Issue Jul 3, 2015 · 4 comments

@scraperdragon

scraperdragon commented Jul 3, 2015

RFC4180 states that the acceptable bytes of a CSV field are

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
(also COMMA / CR / LF / 2DQUOTE if the field is quoted)

Any UTF-8 encoded string which is not equivalent to its (7-bit) ASCII encoding will not match this pattern. In addition, it is not clear whether UTF-8 encoded strings which contain control characters other than CR/LF are permissible or not.

It should be clearly documented that this change from the RFC has occurred. I would expect TEXTDATA to be redefined as something like "a valid UTF-8 code unit that is not COMMA, CR, LF, or DQUOTE"

This would also permit the use of control characters; i.e. characters 0-31 and 127, excluding CR and LF. Via the RFC, it is currently forbidden to use tabs in fields, for example.
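To make the mismatch concrete, here is a minimal sketch that checks the UTF-8 bytes of an unquoted field against the TEXTDATA rule quoted above. The helper name `is_rfc4180_textdata` is made up for illustration and is not part of any spec or library.

```python
import re

# RFC 4180 ABNF: TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
# (printable ASCII without DQUOTE %x22 and COMMA %x2C)
TEXTDATA = re.compile(rb"[\x20-\x21\x23-\x2B\x2D-\x7E]*")


def is_rfc4180_textdata(field: str) -> bool:
    """Return True if every UTF-8 byte of an unquoted field matches TEXTDATA."""
    return TEXTDATA.fullmatch(field.encode("utf-8")) is not None


print(is_rfc4180_textdata("plain ascii"))  # every byte is in range
print(is_rfc4180_textdata("café"))         # 'é' encodes to bytes above 0x7E
print(is_rfc4180_textdata("a\tb"))         # TAB (0x09) is below 0x20
```

Under a strict reading of the grammar, the second and third fields are rejected, which is the gap described above.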

It is also worth noting that the RFC also mandates:

  • commas as separators
  • using CRLF as a line ending

These are also potential areas where the dataprotocols spec permits the file to violate the RFC.

this is mostly nargling over details, mind.

@rufuspollock

Contributor

rufuspollock commented Jul 6, 2015

@scraperdragon good point. Would you like to submit a pull request or do you have specific wording changes you would recommend?

@pwalsh

Member

pwalsh commented Mar 7, 2016

@danfowler @rgrp can we move to clarify this? UTF-8 encoding and alternate separators are so important to the application of CSV and the data protocol specs in general, that we should be explicit in how we diverge from RFC4180.

@rufuspollock rufuspollock self-assigned this Mar 7, 2016

@rufuspollock

Contributor

rufuspollock commented Mar 7, 2016

@pwalsh @scraperdragon what do we think of changing the start of the CSV section to the following:

CSV files included in a Tabular Data Package MUST conform to [RFC 4180
"Common Format and MIME Type for Comma-Separated Values (CSV) Files"][rfc4180]
subject to the following exceptions and additions.

Exceptions:

-   Files MUST be encoded as UTF-8 (the RFC requires 7-bit ASCII) 
-   The standard line terminator character can be LF or CRLF (the RFC allows CRLF only)
-   Files MAY (but SHOULD NOT) deviate from standard CSV in terms of delimiters
    (e.g. tab rather than ","), quote characters and other parameters (see below)
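
As a rough illustration of what a reader following these exceptions might look like, here is a sketch using Python's standard `csv` module. The function name `read_rows` and its `delimiter` parameter are hypothetical and not taken from the spec; the sketch only shows strict UTF-8 decoding, acceptance of both LF and CRLF, and an optional non-comma delimiter.

```python
import csv
import io


def read_rows(raw: bytes, delimiter: str = ","):
    """Parse CSV bytes as UTF-8, accepting LF or CRLF row terminators."""
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError.
    text = raw.decode("utf-8")
    # newline="" leaves line endings untranslated so the csv module
    # can handle both CRLF and LF itself.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


crlf = "name,city\r\nZoë,Köln\r\n".encode("utf-8")
lf = "name,city\nZoë,Köln\n".encode("utf-8")
print(read_rows(crlf) == read_rows(lf))
```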
@pwalsh

Member

pwalsh commented Mar 7, 2016

@rgrp LGTM
