-
Notifications
You must be signed in to change notification settings - Fork 4
Description
What?
This is a larger issue and i believe it is time to tackle it from the ground
CSV is a very permissive standard, and depending on who generated a CSV, how it was generated and the content of the Cells (specially when dealing with JSON encoded objects/arrays in a cell, being consistent, from reading, writing, appending is key
The key for a sane future processing is RFC 4180.
This implies, most importantly dealing with character escaping (or said differently, no escaping but double quoting) to avoid a CSV cell from a different row permeating into a different row. We see this specially happening when encoding/decoding from external sources via scripts but also from Excel and Google Sheets and even internally between VBO CSV exports and AMI set imports
Chances are you never had a problem, chances you had.
PHP 8.3
PHP has also reflected on this and has made a very clear statement (that also goes against its defaults pre 8.4 defaults)
Warning
In the input stream, the enclosure character can always be escaped by doubling it inside a quoted string, resulting in a single enclosure character in the parsed result. The escape character works differently: If a sequence of escape and enclosure characters appear in the input, both characters will be present in the parsed result. So for the default parameters, a CVS line like "a""b","c"d" will have the fields parsed as a"b and c"d, respectively.
Warning
As of PHP 8.4.0, depending on the default value of escape is deprecated. It needs to be provided explicitly either positionally or by the use of named arguments.
How to act on this
- Remove any default escaping mechanism in AMI, for reading/writing CSVs.
- Ensure any JSON encoded values in cell have double quotes HEX encoded (not needed internally but needed if opening a CSV in Excel)
- Add a validation/pre-processing (even if lengthy) step to any user provided CSVs to check IF escaping is present (what breaks stuff most likely is
\") and only use ONCE escaping for reading if that happens, but re-write into unescaped ones once done - Probably add a global switch/compatibility option for Older generated AMI sets OR a drush/update hook that does a pre-processing (can take long)
- Document all this clearly
- Generat a very wide set of demo CSVs to prove this. From Excel, from OpenOffice, Apple Numbers, Google Sheets, Google Sheets API (remove) with single values (not JSON encoded) that hold
\"(one example is a Windows Path"c:\myfolder\myfile\"and JSON encoded with double and single quoted strings likeMy "best year" so far(a fallacy))
@alliomeria as discussed today