
UTF-8 encoding not compatible with RFC4180; not clear #204

scraperdragon opened this Issue Jul 3, 2015 · 4 comments



scraperdragon commented Jul 3, 2015

RFC4180 states that the acceptable bytes of a CSV field are

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
(also COMMA / CR / LF / 2DQUOTE if the field is quoted)

Any UTF-8 encoded string which is not equivalent to its (7-bit) ASCII encoding will not match this pattern. In addition, it is not clear whether UTF-8 encoded strings which contain control characters other than CR/LF are permissible or not.

It should be clearly documented that this change from the RFC has occurred. I would expect TEXTDATA to be redefined as something like "a valid UTF-8 code unit that is not COMMA, CR, LF, or DQUOTE"

This would also permit the use of control characters, i.e. characters 0-31 and 127, excluding CR and LF. Under the RFC as written, it is currently forbidden to use tabs in fields, for example.
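To make the difference concrete, here is a minimal sketch of the two rules as byte-level checks. The first pattern is the TEXTDATA rule quoted above; the second is only one possible reading of the relaxed definition suggested in this comment (any bytes that decode as UTF-8 and contain no COMMA, CR, LF, or DQUOTE). The function names are hypothetical and not part of any spec or library.

```python
import re

# RFC 4180 ABNF: TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
RFC4180_TEXTDATA = re.compile(rb"[\x20-\x21\x23-\x2B\x2D-\x7E]*")

# Assumed relaxed rule: exclude only COMMA, CR, LF and DQUOTE.
RELAXED_EXCLUDED = re.compile(rb'[,\r\n"]')


def is_rfc4180_textdata(field: bytes) -> bool:
    """True if every byte of an unquoted field is allowed by RFC 4180."""
    return RFC4180_TEXTDATA.fullmatch(field) is not None


def is_relaxed_textdata(field: bytes) -> bool:
    """True if the field is valid UTF-8 and has no COMMA, CR, LF or DQUOTE."""
    try:
        field.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return RELAXED_EXCLUDED.search(field) is None
```

Under this sketch, `"café".encode("utf-8")` and `b"a\tb"` both fail the RFC check but pass the relaxed one, which is the gap described above.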

It is also worth noting that the RFC mandates:

  • commas as separators
  • using CRLF as a line ending

These are also potential areas where the dataprotocols spec permits the file to violate the RFC.

This is mostly nargling over details, mind.




rufuspollock commented Jul 6, 2015

@scraperdragon good point. Would you like to submit a pull request or do you have specific wording changes you would recommend?




pwalsh commented Mar 7, 2016

@danfowler @rgrp can we move to clarify this? UTF-8 encoding and alternate separators are so important to the application of CSV and the data protocol specs in general, that we should be explicit in how we diverge from RFC4180.

rufuspollock self-assigned this Mar 7, 2016




rufuspollock commented Mar 7, 2016

@pwalsh @scraperdragon what do we think of changing the start of the CSV section to the following:

CSV files included in a Tabular Data Package MUST conform to [RFC 4180
"Common Format and MIME Type for Comma-Separated Values (CSV) Files"][rfc4180]
subject to the following exceptions and additions.


-   Files MUST be encoded as UTF-8 (the RFC requires 7-bit ASCII) 
-   The standard line terminator character can be LF or CRLF (the RFC allows CRLF only)
-   Files MAY (but SHOULD NOT) deviate from standard CSV in terms of delimiters
    (e.g. tab rather than ","), quote characters and other parameters (see below)
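As a rough illustration of what a reader following these exceptions might look like, the sketch below uses Python's standard `csv` module. It decodes strictly as UTF-8, accepts either LF or CRLF line endings, and takes the delimiter as a parameter. The function name `read_rows` is hypothetical, and this is not how any particular Tabular Data Package library does it.

```python
import csv
import io


def read_rows(raw: bytes, delimiter: str = ","):
    """Parse CSV bytes that are UTF-8 encoded with LF or CRLF line endings."""
    # Strict decoding: raises UnicodeDecodeError if the bytes are not UTF-8.
    text = raw.decode("utf-8")
    # newline="" leaves line endings untouched so the csv module can
    # handle both LF and CRLF itself.
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer, delimiter=delimiter))
```

For example, `read_rows("name,city\r\nZoë,Zürich\r\n".encode("utf-8"))` and the same input with bare LF endings yield the same rows, and passing `delimiter="\t"` covers the tab-separated deviation mentioned in the last bullet.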



pwalsh commented Mar 7, 2016

@rgrp LGTM
