UTF-8 encoding not compatible with RFC4180; not clear #204
Comments
@scraperdragon good point. Would you like to submit a pull request or do you have specific wording changes you would recommend?
@danfowler @rgrp can we move to clarify this? UTF-8 encoding and alternate separators are so important to the application of CSV, and to the data protocol specs in general, that we should be explicit about how we diverge from RFC4180.
@pwalsh @scraperdragon what do we think of changing the start of the CSV section to the following:
@rgrp LGTM
RFC4180 states that the acceptable bytes of a CSV field are TEXTDATA = %x20-21 / %x23-2B / %x2D-7E (also COMMA / CR / LF / 2DQUOTE if the field is quoted).
Any UTF-8 encoded string which is not equivalent to its (7-bit) ASCII encoding will not match this pattern. In addition, it is not clear whether UTF-8 encoded strings which contain control characters other than CR/LF are permissible or not.
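To make the mismatch concrete, here is a minimal Python sketch that checks the UTF-8 bytes of an unquoted field against the TEXTDATA rule quoted above. The helper name `is_rfc4180_textdata` is hypothetical and is not part of the RFC, the dataprotocols spec, or any library.

```python
import re

# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E, as quoted from RFC4180 above.
# An unquoted field is treated here as zero or more TEXTDATA bytes.
TEXTDATA_FIELD = re.compile(rb"[\x20-\x21\x23-\x2B\x2D-\x7E]*")


def is_rfc4180_textdata(field: str) -> bool:
    """Hypothetical helper: True if every UTF-8 byte of `field` is TEXTDATA."""
    return TEXTDATA_FIELD.fullmatch(field.encode("utf-8")) is not None


print(is_rfc4180_textdata("cafe"))   # plain ASCII text
print(is_rfc4180_textdata("café"))   # "é" encodes to bytes above 0x7E
print(is_rfc4180_textdata("a\tb"))   # tab is 0x09, below 0x20
```

Under this reading, only the first call passes; the accented field and the tab-containing field both fall outside the byte ranges the RFC lists.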
It should be clearly documented that this change from the RFC has occurred. I would expect TEXTDATA to be redefined as something like "a valid UTF-8 code unit that is not COMMA, CR, LF, or DQUOTE"
This would also permit the use of control characters, i.e. characters 0-31 and 127, excluding CR and LF. Under the RFC as written, tabs are currently forbidden in fields, for example.
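As a rough illustration of the relaxed definition suggested above, the following Python sketch accepts any field that decodes as valid UTF-8 and contains no COMMA, CR, LF, or DQUOTE. The function name `is_relaxed_textdata` and the exact rule are assumptions made for this example, not wording from either spec.

```python
# Characters excluded by the suggested wording: COMMA, CR, LF, DQUOTE.
EXCLUDED = {",", "\r", "\n", '"'}


def is_relaxed_textdata(field: bytes) -> bool:
    """Hypothetical check: valid UTF-8 with no COMMA, CR, LF, or DQUOTE."""
    try:
        text = field.decode("utf-8")  # raises on invalid UTF-8 byte sequences
    except UnicodeDecodeError:
        return False
    return not any(ch in EXCLUDED for ch in text)


print(is_relaxed_textdata("café".encode("utf-8")))  # non-ASCII text
print(is_relaxed_textdata(b"a\tb"))                  # tab, a control character
print(is_relaxed_textdata(b"a,b"))                   # contains COMMA
```

With this rule, non-ASCII text and tabs are accepted in unquoted fields, while the four structural characters are still rejected.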
It is also worth noting that the RFC mandates:
These are also potential areas where the dataprotocols spec permits the file to violate the RFC.
this is mostly nargling over details, mind.