Transform1: CSV validation #28
Comments
@kcoyle I looked for modules or algorithms that test whether a CSV conforms to RFC 4180, and it seems that RFC 4180 is treated only as a set of guidelines. There are some Python modules for checking the integrity of CSVs (and for converting them into JSON or YAML). I'm looking at the Frictionless Framework, which outputs verbose metadata and debugging information about missing values and the like. However, I do not see it as necessarily the role of the DCTAP script to provide extensive checking along these lines (or if it does, perhaps only by a dedicated subcommand that simply runs [...]). Rather, I think the DCTAP script should parse a CSV and display it in ways that help a user spot its errors. With regard to the RFC 4180 requirement that all rows have the same number of columns, for example, the standard Python module for parsing CSVs does not complain about header rows or data rows that are too short or too long, but simply handles them.
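For illustration, here is a minimal sketch of how the standard `csv` module treats ragged rows, assuming the file is read with `csv.DictReader` and its default `restkey` and `restval` settings. The sample data and the `read_rows` helper are hypothetical and not taken from the DCTAP script.

```python
import csv
import io

# Hypothetical sample: three headers, one short data row, one long data row.
SAMPLE = (
    "shapeID,propertyID,valueShape\n"
    ":a,dct:title\n"
    ":b,dct:creator,:c,extra\n"
)

def read_rows(text):
    """Parse CSV text with csv.DictReader and return the rows as a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(SAMPLE)
for row in rows:
    # Short rows are padded with None (the default restval);
    # surplus values are collected in a list under the key None (the default restkey).
    print(row)
```

No exception is raised in either case: the short row simply gets `None` for `valueShape`, and the long row gains an extra `None` key holding `['extra']`.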
I have created some tests that confirm this behavior, and I think it is enough for our purposes. The question, as I see it, is what sort of warning message would help a user creating or debugging a DCTAP instance when an empty-string header or a None header is encountered. My function currently reads a file using the csv module and exits with an error message if 'propertyID' is not found among the headers. No doubt the program could be made more flexible and robust by testing it against files with unreadable characters or non-CSV file types like JPEG, or by supporting other tabular formats such as XLSX, but I'm inclined to have the program tolerate missing or non-standard headers: perhaps flag them, but not reject them or exit entirely.
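The approach described above could be sketched roughly as follows. This is not the actual DCTAP function: the name `check_headers`, the wording of the messages, and the choice to return warnings as a list are all assumptions made for illustration. Only the hard requirement on 'propertyID' comes from the discussion.

```python
import csv
import io
import sys

def check_headers(text):
    """Exit if 'propertyID' is missing; otherwise return a list of warnings.

    Hypothetical helper: flags empty-string headers and rows with surplus
    values (which csv.DictReader stores under the key None) without rejecting them.
    """
    reader = csv.DictReader(io.StringIO(text))
    headers = reader.fieldnames or []  # fieldnames is None for an empty file
    if "propertyID" not in headers:
        sys.exit("Error: no 'propertyID' column found among the headers.")

    warnings = []
    for position, header in enumerate(headers, start=1):
        if header.strip() == "":
            warnings.append(f"Warning: column {position} has an empty header.")
    for row_number, row in enumerate(reader, start=1):
        if None in row:
            warnings.append(
                f"Warning: data row {row_number} has more values than headers."
            )
    return warnings

# Usage with hypothetical data: one empty header, one over-long row.
for message in check_headers("propertyID,,note\ndct:title,x,y,z\n"):
    print(message)
```

Returning the warnings rather than printing them inside the function leaves it to the caller to decide how, or whether, to show them to the user.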
@tombaker Thanks. That's fine if the normal modules handle it. We'll follow their lead. I'll close this unless anyone thinks we should do more at this point. Remember, this is a first pass, so we want to do what is minimally functional.
Closing because we decided to use what standard modules do, so there are no special tests to add.
Check that the CSV is valid (e.g. each row has the same number of values, etc.). (This will probably be done by the built-in module that reads CSV.)
Message(s): errors here may result in the program stopping with a message. The message should be output to the user.
Is there anything else we should check for at this point?
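If an explicit check of row lengths were ever wanted, one possible sketch uses `csv.reader` and compares each row against the header. The function name `find_ragged_rows` and the shape of its return value are hypothetical. Note that `csv.reader` returns an empty list for a blank line, so blank lines would be reported as well.

```python
import csv
import io

def find_ragged_rows(text):
    """Return (line_number, actual_count, expected_count) for each row whose
    number of values differs from the number of headers."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:  # empty input: nothing to compare
        return []
    expected = len(header)
    problems = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num counts the lines read from the source so far.
            problems.append((reader.line_num, len(row), expected))
    return problems

# Usage with hypothetical data: line 3 is short, line 4 is long.
print(find_ragged_rows("a,b,c\n1,2,3\n1,2\n1,2,3,4\n"))
```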