
Transform1: CSV validation #28

Closed
kcoyle opened this issue May 27, 2021 · 3 comments
Comments


kcoyle commented May 27, 2021

Check that the CSV is valid (e.g., that each row has the same number of values). This will probably be done by the built-in module that reads CSV.

Message(s): errors here may cause the program to stop with a message, which should be output to the user.

Is there anything else we should check for at this point?


tombaker commented Jun 2, 2021

@kcoyle I looked for modules or algorithms to test whether a CSV conforms to RFC 4180, and it seems that RFC 4180 is treated only as a guideline.

There are some Python modules for integrity-checking CSVs (and for converting them into JSON or YAML). I'm looking at the Frictionless Framework, which outputs verbose metadata and debugging information about missing values and the like, but I do not see extensive checking along these lines as necessarily the role of the DCTAP script (or, if it is, perhaps only via a dedicated subcommand that simply runs Frictionless and prints its output).

Rather, I think the DCTAP script should parse a CSV and display it in ways that help a user spot its errors.

With regard to the RFC 4180 requirement that all rows have the same number of columns, for example, the standard Python module for parsing CSVs does not complain about header rows or data rows that are too short or too long, but simply handles them.

  • If a data row is shorter than the header row, it is padded out with None values.
  • If the header row is one column shorter than the longest data row, an empty string is added as the final header.
  • If the header row is more than one column shorter than the longest data row, one empty-string header is added, followed by one header called None. The None column accumulates all additional value columns in a list.
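For reference, the length-mismatch handling of Python's built-in csv.DictReader can be seen in a short sketch (a minimal demonstration, not part of the DCTAP code; the exact artifacts depend on how the module is invoked):

```python
import csv
import io

# A header of two columns; one data row too long, one too short.
text = "propertyID,valueConstraint\na,b,extra\nc\n"

rows = list(csv.DictReader(io.StringIO(text)))

# Extra values in a long row are collected in a list under the key None
# (the "restkey"); missing values in a short row are filled in with None
# (the "restval").
print(rows[0])  # {'propertyID': 'a', 'valueConstraint': 'b', None: ['extra']}
print(rows[1])  # {'propertyID': 'c', 'valueConstraint': None}
```

Either way, the parser accommodates the mismatch rather than raising an error.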

I have created some tests that confirm this behavior, and I think it is enough for our purposes. The question, as I see it, is what sort of warning message could help a user creating or debugging a DCTAP instance when an empty-string header or a None header is encountered.
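One possible shape for such a warning (a hypothetical helper, not anything the script currently does): scan the parsed rows for the tell-tale '' and None artifacts and report them, without rejecting the file.

```python
import csv
import io

def header_warnings(reader):
    """Return warning strings for header/row length mismatches.

    Hypothetical helper: flags the '' and None artifacts that Python's
    csv module produces for mismatched rows, without rejecting the file.
    """
    warnings = []
    if "" in (reader.fieldnames or []):
        warnings.append("Header row contains an empty column name.")
    for num, row in enumerate(reader, start=2):  # data rows start at line 2
        if None in row:
            warnings.append(f"Row {num} has more values than headers.")
        if None in row.values():
            warnings.append(f"Row {num} has fewer values than headers.")
    return warnings

text = "propertyID,valueConstraint\na,b,extra\nc\n"
for w in header_warnings(csv.DictReader(io.StringIO(text))):
    print(w)
```

This keeps the tolerant parsing behavior while still giving the user something actionable to look at.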

My function currently reads a file using the csv module and exits with an error message if 'propertyID' is not found among the headers. No doubt the program could be made more flexible and robust by testing against files with unreadable characters or non-CSV filetypes such as JPEG, or by supporting the parsing of other tabular encodings such as XLSX. But I'm inclined to have the program tolerate missing or non-standard headers - perhaps flagging them, but not rejecting them or exiting entirely.
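A minimal sketch of that required-header check, assuming the function takes any iterable of CSV lines (the names here are illustrative, not the actual DCTAP code):

```python
import csv
import sys

REQUIRED_HEADER = "propertyID"

def read_tap(csvfile):
    """Parse a DCTAP CSV, exiting if the required header is missing.

    Illustrative sketch: tolerates extra or non-standard headers but
    insists on 'propertyID', as described above.
    """
    reader = csv.DictReader(csvfile)
    if REQUIRED_HEADER not in (reader.fieldnames or []):
        sys.exit(f"Valid DCTAP CSV must have a '{REQUIRED_HEADER}' header.")
    return list(reader)
```

Everything else passes through untouched, so unrecognized headers survive parsing and can be flagged later rather than causing an early exit.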


kcoyle commented Jun 2, 2021

@tombaker Thanks. That's fine if the standard modules handle it; we'll follow their lead. I'll close this unless anyone thinks we should do more at this point. Remember, this is a first pass, so we want to do what is minimally functional.


kcoyle commented Jun 28, 2021

Closing because we decided to use what standard modules do, so there are no special tests to add.

@kcoyle kcoyle closed this as completed Jun 28, 2021