UTF-8 encoding not compatible with RFC4180; not clear #204

scraperdragon · 2015-07-03T14:58:29Z

RFC4180 states that the acceptable bytes of a CSV field are

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
(also COMMA / CR / LF / 2DQUOTE if the field is quoted)

Any UTF-8 encoded string which is not equivalent to its (7-bit) ASCII encoding will not match this pattern. In addition, it is not clear whether UTF-8 encoded strings which contain control characters other than CR/LF are permissible or not.
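
To make the mismatch concrete, here is a minimal sketch that checks the bytes of an unquoted field against the TEXTDATA ranges quoted above. The function name `is_rfc4180_unquoted_field` is made up for illustration and is not part of any spec or library.

```python
import re

# TEXTDATA as quoted from RFC 4180: %x20-21 / %x23-2B / %x2D-7E
TEXTDATA = re.compile(rb"[\x20-\x21\x23-\x2B\x2D-\x7E]*")


def is_rfc4180_unquoted_field(raw: bytes) -> bool:
    """Return True if every byte of an unquoted field is TEXTDATA."""
    return TEXTDATA.fullmatch(raw) is not None


print(is_rfc4180_unquoted_field("cafe".encode("utf-8")))  # True: plain ASCII
print(is_rfc4180_unquoted_field("café".encode("utf-8")))  # False: é encodes as 0xC3 0xA9
print(is_rfc4180_unquoted_field("a\tb".encode("utf-8")))  # False: TAB is 0x09
```

Any byte at or above 0x80 falls outside the ranges, so every multi-byte UTF-8 sequence is rejected.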

It should be clearly documented that this change from the RFC has occurred. I would expect TEXTDATA to be redefined as something like "a valid UTF-8 code unit that is not COMMA, CR, LF, or DQUOTE"

This would also permit the use of control characters; i.e. characters 0-31 and 127, excluding CR and LF. Via the RFC, it is currently forbidden to use tabs in fields, for example.
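
A rough sketch of what the relaxed rule could look like in practice, assuming "valid UTF-8" means the field decodes strictly and contains none of the four excluded characters. The name `is_relaxed_textdata` is hypothetical and only illustrates the suggested wording.

```python
# Characters that stay forbidden in an unquoted field under the suggested rule.
FORBIDDEN = {",", "\r", "\n", '"'}


def is_relaxed_textdata(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 with no COMMA, CR, LF, or DQUOTE."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return not any(ch in FORBIDDEN for ch in text)


print(is_relaxed_textdata("café".encode("utf-8")))  # True: non-ASCII is now allowed
print(is_relaxed_textdata(b"a\tb"))                 # True: TAB is no longer excluded
print(is_relaxed_textdata(b"a,b"))                  # False: COMMA still needs quoting
```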

It is also worth noting that the RFC mandates:

-   commas as separators
-   CRLF as the line ending

These are also potential areas where the dataprotocols spec permits a file to violate the RFC.

this is mostly nargling over details, mind.

rufuspollock · 2015-07-06T11:05:08Z

@scraperdragon good point. Would you like to submit a pull request or do you have specific wording changes you would recommend?

pwalsh · 2016-03-07T06:21:49Z

@danfowler @rgrp can we move to clarify this? UTF-8 encoding and alternate separators are so important to the application of CSV and the data protocol specs in general, that we should be explicit in how we diverge from RFC4180.

rufuspollock · 2016-03-07T13:08:58Z

@pwalsh @scraperdragon what do we think of changing the start of the CSV section to the following:

CSV files included in a Tabular Data Package MUST conform to [RFC 4180
"Common Format and MIME Type for Comma-Separated Values (CSV) Files"][rfc4180]
subject to the following exceptions and additions.

Exceptions:

-   Files MUST be encoded as UTF-8 (the RFC requires 7-bit ASCII) 
-   The standard line terminator character can be LF or CRLF (the RFC allows CRLF only)
-   Files MAY (but SHOULD NOT) deviate from standard CSV in terms of delimiters
    (e.g. tab rather than ","), quote characters and other parameters (see below)
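
For illustration only, a reader that tolerates these exceptions might look something like the sketch below, using Python's standard `csv` module. The helper name `read_rows` and its defaults are assumptions, not something the spec defines.

```python
import csv
import io


def read_rows(raw: bytes, delimiter: str = ",", quotechar: str = '"') -> list:
    """Decode raw bytes as UTF-8 and parse them as CSV rows."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    # newline="" leaves line endings untouched so the csv module can
    # handle both LF and CRLF terminators itself.
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer, delimiter=delimiter, quotechar=quotechar))


lf_file = "name,city\ncafé,Zürich\n".encode("utf-8")
crlf_file = "name,city\r\ncafé,Zürich\r\n".encode("utf-8")
tab_file = "name\tcity\ncafé\tZürich\n".encode("utf-8")

print(read_rows(lf_file) == read_rows(crlf_file))  # True
print(read_rows(tab_file, delimiter="\t"))         # [['name', 'city'], ['café', 'Zürich']]
```

The delimiter and quote character are passed in explicitly because the proposed wording lets a file deviate from the defaults.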

pwalsh · 2016-03-07T13:30:41Z

@rgrp LGTM

rufuspollock self-assigned this Mar 7, 2016

rufuspollock added the Ready for PR label Mar 7, 2016

rufuspollock closed this as completed in eca7610 Mar 7, 2016
