Skip to content

CSV roundtrip and RFC 4180 compliance #263

@DiegoPino

Description

@DiegoPino

What?

This is a larger issue and i believe it is time to tackle it from the ground

CSV is a very permissive standard, and depending on who generated a CSV, how it was generated and the content of the Cells (specially when dealing with JSON encoded objects/arrays in a cell, being consistent, from reading, writing, appending is key

The key for a sane future processing is RFC 4180.

This implies, most importantly dealing with character escaping (or said differently, no escaping but double quoting) to avoid a CSV cell from a different row permeating into a different row. We see this specially happening when encoding/decoding from external sources via scripts but also from Excel and Google Sheets and even internally between VBO CSV exports and AMI set imports

Chances are you never had a problem, chances you had.

PHP 8.3

PHP has also reflected on this and has made a very clear statement (that also goes against its defaults pre 8.4 defaults)

Warning
In the input stream, the enclosure character can always be escaped by doubling it inside a quoted string, resulting in a single enclosure character in the parsed result. The escape character works differently: If a sequence of escape and enclosure characters appear in the input, both characters will be present in the parsed result. So for the default parameters, a CVS line like "a""b","c"d" will have the fields parsed as a"b and c"d, respectively.
Warning
As of PHP 8.4.0, depending on the default value of escape is deprecated. It needs to be provided explicitly either positionally or by the use of named arguments.

How to act on this

  • Remove any default escaping mechanism in AMI, for reading/writing CSVs.
  • Ensure any JSON encoded values in cell have double quotes HEX encoded (not needed internally but needed if opening a CSV in Excel)
  • Add a validation/pre-processing (even if lengthy) step to any user provided CSVs to check IF escaping is present (what breaks stuff most likely is \") and only use ONCE escaping for reading if that happens, but re-write into unescaped ones once done
  • Probably add a global switch/compatibility option for Older generated AMI sets OR a drush/update hook that does a pre-processing (can take long)
  • Document all this clearly
  • Generat a very wide set of demo CSVs to prove this. From Excel, from OpenOffice, Apple Numbers, Google Sheets, Google Sheets API (remove) with single values (not JSON encoded) that hold \" (one example is a Windows Path "c:\myfolder\myfile\" and JSON encoded with double and single quoted strings like My "best year" so far (a fallacy))

@alliomeria as discussed today

Metadata

Metadata

Labels

CSV ProcessingThings we do here and there to keep the tabulated goddesses happyWorking Group's 💜Imagined, curated and loved by the Working Group

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions