
Transform1: CSV validation #28

Closed
kcoyle opened this issue May 27, 2021 · 3 comments
Comments


kcoyle commented May 27, 2021

Check that the CSV is valid (e.g., that each row has the same number of values). This will probably be done by the built-in module that reads CSV.

Message(s): errors here may cause the program to stop with a message, which should be output to the user.

Is there anything else we should check for at this point?


tombaker commented Jun 2, 2021

@kcoyle I looked for modules or algorithms to test whether a CSV conforms to RFC 4180, and it seems that RFC 4180 is treated only as a guideline.

There are some Python modules for integrity-checking CSVs (and for converting them into JSON or YAML). I'm looking at the Frictionless Framework, which outputs verbose metadata and debugging information about missing values and the like, but I do not see extensive checking along these lines as necessarily the role of the DCTAP script (or, if it is, perhaps only via a dedicated subcommand that simply runs Frictionless and prints its output).

Rather, I think the DCTAP script should parse a CSV and display it in ways that help a user spot its errors.

With regard to the RFC 4180 requirement that all rows have the same number of columns, for example, the standard Python module for parsing CSVs does not complain about header rows or data rows that are too short or too long, but simply handles them.

  • If a data row is shorter than the header row, it is padded out with None values.
  • If the header row is one column shorter than the longest data row, an empty string is added as the final header.
  • If the header row is more than one column shorter than the longest data row, one empty-string header is added, followed by one header called None. The None column accumulates all additional value columns in a list.
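For reference, the length-mismatch handling of Python's built-in csv.DictReader can be seen in a short sketch (a minimal demonstration, not part of the DCTAP code; the exact artifacts depend on how the module is invoked):

```python
import csv
import io

# A header of two columns; one data row too long, one too short.
text = "propertyID,valueConstraint\na,b,extra\nc\n"

rows = list(csv.DictReader(io.StringIO(text)))

# Extra values in a long row are collected in a list under the key None
# (the "restkey"); missing values in a short row are filled in with None
# (the "restval").
print(rows[0])  # {'propertyID': 'a', 'valueConstraint': 'b', None: ['extra']}
print(rows[1])  # {'propertyID': 'c', 'valueConstraint': None}
```

Either way, the parser accommodates the mismatch rather than raising an error.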

I have created some tests that confirm this behavior, and I think it is enough for our purposes. The question, as I see it, is what sort of warning message could help a user creating or debugging a DCTAP instance when an empty-string header or a None header is encountered.
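One possible shape for such a warning (a hypothetical helper, not anything the script currently does): scan the parsed rows for the tell-tale '' and None artifacts and report them, without rejecting the file.

```python
import csv
import io

def header_warnings(reader):
    """Return warning strings for header/row length mismatches.

    Hypothetical helper: flags the '' and None artifacts that Python's
    csv module produces for mismatched rows, without rejecting the file.
    """
    warnings = []
    if "" in (reader.fieldnames or []):
        warnings.append("Header row contains an empty column name.")
    for num, row in enumerate(reader, start=2):  # data rows start at line 2
        if None in row:
            warnings.append(f"Row {num} has more values than headers.")
        if None in row.values():
            warnings.append(f"Row {num} has fewer values than headers.")
    return warnings

text = "propertyID,valueConstraint\na,b,extra\nc\n"
for w in header_warnings(csv.DictReader(io.StringIO(text))):
    print(w)
```

This keeps the tolerant parsing behavior while still giving the user something actionable to look at.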

My function currently reads a file using the csv module and exits with an error message if 'propertyID' is not found among the headers. No doubt the program could be made more flexible and robust by testing against files with unreadable characters or non-CSV filetypes such as JPEG, or by supporting the parsing of other tabular encodings such as XLSX. But I'm inclined to have the program tolerate missing or non-standard headers - perhaps flagging them, but not rejecting them or exiting entirely.
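A minimal sketch of that required-header check, assuming the function takes any iterable of CSV lines (the names here are illustrative, not the actual DCTAP code):

```python
import csv
import sys

REQUIRED_HEADER = "propertyID"

def read_tap(csvfile):
    """Parse a DCTAP CSV, exiting if the required header is missing.

    Illustrative sketch: tolerates extra or non-standard headers but
    insists on 'propertyID', as described above.
    """
    reader = csv.DictReader(csvfile)
    if REQUIRED_HEADER not in (reader.fieldnames or []):
        sys.exit(f"Valid DCTAP CSV must have a '{REQUIRED_HEADER}' header.")
    return list(reader)
```

Everything else passes through untouched, so unrecognized headers survive parsing and can be flagged later rather than causing an early exit.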


kcoyle commented Jun 2, 2021

@tombaker Thanks. That's fine if the standard modules handle it; we'll follow their lead. I'll close this unless anyone thinks we should do more at this point. Remember, this is a first pass, so we want to do what is minimally functional.


kcoyle commented Jun 28, 2021

Closing because we decided to use what standard modules do, so there are no special tests to add.

@kcoyle kcoyle closed this as completed Jun 28, 2021