Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
UTF-8 encoding not compatible with RFC4180; not clear #204
RFC4180 states that the acceptable bytes of a CSV field are
Any UTF-8 encoded string which is not equivalent to its (7-bit) ASCII encoding will not match this pattern. In addition, it is not clear whether UTF-8 encoded strings which contain control characters other than CR/LF are permissable or not.
It should be clearly documented that this change from the RFC has occurred. I would expect TEXTDATA to be redefined as something like "a valid UTF-8 code unit that is not COMMA, CR, LF, or DQUOTE"
This would also permit the use of control characters; i.e. characters 0-31 and 127, excluding CR and LF. Via the RFC, it is currently forbidden to use tabs in fields, for example.
It is also worth noting that the RFC also mandates:
this is mostly nargling over details, mind.