Support for Non-CSV Metadata / Front Matter / Comments in CSV Files #31
Comments
The file didn't upload correctly, so here are the first hundred or so lines as an example. You can download data like this from NASA's CDAWeb.
+1 for this. Some known Java implementations supporting comments in CSV:
Arguably it would be better to move the comments into an appropriate official CSV schema (such comments are not allowed in the CSV RFC). That said, it is quite common for CSV processing libraries to have a way of saying "skip/ignore x lines". Another way (which might map quite well to the underlying libraries) would be to have
Or whatever numeric value would be appropriate
I think for simple comments this could make sense to allow something like
I like this idea. I think that should be part of the tool rather than the Schema though. So in our CSV Validator tool, we could add a command line arg like:
I do think we have to be careful about not overloading the CSV Schema. It works pretty well at the moment because it does one thing and does it quite well. Certainly there is some scope for expansion, but some of that could be in the tool rather than the Schema.
That makes a lot of sense. It also makes a nice separation between the tasks of validating the front matter and validating the CSV data. One issue is metadata that the CSV validator cannot skip without the help of the external tool. For instance, in the CSVY format, the lines of the YAML part are not marked with a comment character, and the YAML part isn't a fixed number of lines, so I'm not sure how the CSV validator would know which lines to skip without the help of the YAML validator. I managed to write my own validator program for some custom CSV formats. What I did was have the part that validates the front matter return the line numbers corresponding to the front matter. The part that validates the CSV then takes those line numbers as input and skips them. Perhaps there are other approaches?
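The two-stage approach described above might look roughly like the following sketch. It is not the validator mentioned in the comment: the function names, the `expected_columns` check, and the sample text are all hypothetical, and the front matter is assumed to be CSVY-like, opening and closing with a line containing only `---`.

```python
import csv


def validate_front_matter(lines):
    """Return the set of 0-based line numbers occupied by the front matter.

    Assumes a CSVY-like layout (an assumption of this sketch): the front
    matter opens and closes with a line containing only '---'.
    """
    if not lines or lines[0].strip() != "---":
        return set()  # no front matter at all
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # A real tool would hand lines[1:i] to a YAML validator here.
            return set(range(i + 1))
    raise ValueError("front matter is never closed")


def validate_csv(lines, skip, expected_columns):
    """Check the column count of every line whose number is not in `skip`."""
    kept = [line for n, line in enumerate(lines) if n not in skip]
    return all(len(row) == expected_columns for row in csv.reader(kept))


# Hypothetical sample file: four front matter lines, then three CSV lines.
text = "---\ninstrument: hypothetical\nunits: nT\n---\ntime,bx\n0,1.5\n1,1.7\n"
lines = text.splitlines()
skip = validate_front_matter(lines)
ok = validate_csv(lines, skip, expected_columns=2)
```

The CSV stage never needs to understand the front matter; it only receives the line numbers to skip.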
Sorry for the long message, I guess I've been thinking a lot about CSVs...
This issue is to suggest support for CSV files which contain non-CSV metadata or front matter at the top of the file, as well as to raise the issue of comments within CSV files.
Although CSV files that begin with non-CSV metadata are beyond the type described in RFC 4180, they are quite common. Non-CSV data is typically used to include metadata about the data in the file, such as the equipment and parameters that went into an experiment.
I work with earth science data, where including multi-line front matter in the file is quite common. I've attached a sample file from NASA as an example.
Supporting these kinds of files fully could entail a number of smaller changes, each of which might be considered independently. However, I've created one issue for the topic to try to unify discussion, at least at the initial stages.
Standards and Common Practices
There does not seem to be a widely accepted standard for such files. I've run across a few attempts at defining a standard, but they don't seem to have caught on widely:
https://csvy.org/ (looks more mature, though I don't think many libraries for CSV interaction support it)
https://github.com/csvspecs (looks to be work-in-progress)
As for common practices, I can speak to the spaces I'm familiar with, which are (mostly Python-based) tools for data processing used in the sciences and in data science.
The Pandas library supports specifying a comment character (e.g. '#') that denotes either whole-line or end-of-line comments:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#comments-and-empty-lines
Pandas is widely used, so this gives me the idea that at least some people use these types of comments.
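As a small illustration of the Pandas behaviour, the sketch below reads a made-up file (the contents are hypothetical, not the NASA sample) with `comment="#"`. Whole-line comments are skipped before the header is read, and the end-of-line comment is cut off at the `#`.

```python
import io

import pandas as pd

# Hypothetical file contents, for illustration only.
text = (
    "# instrument: hypothetical magnetometer\n"
    "# units: nT, km/s\n"
    "time,label\n"
    "1,quiet\n"
    "2,storm# flagged by operator\n"
)

# Everything from '#' to the end of a line is discarded while parsing.
df = pd.read_csv(io.StringIO(text), comment="#")
```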
The NASA Space Physics Data Facility (https://cdaweb.gsfc.nasa.gov/) uses the '#' comment character and formatting of the file I attached. The website allows you to download any of the measurements in their database in this format. But it also has several other export options, including a "normal" CSV with the metadata in a separate JSON file, as well as the raw data (in netCDF, which isn't a type of CSV at all). So perhaps they expect that people who are going to do lots of analysis will use the "normal" CSV files. This is to say that, while I think CSV Schema should support CSV files with metadata, I imagine some people would argue that real-world data collection should not be done using them.
Support within CSV Schema
As for the schema:
Ignoring Comments / Metadata
@adamretter suggested adding directives to ignore leading lines when validating CSV files (text is modified from his):
@IgnoreLeadingLines '#', which would simply ignore all lines from line 0 that start with a '#' character up until the first line that does not start with that character.
@IgnoreCommentLines '#', which would just ignore any line which starts with a '#' character.
@IgnoreLeadingLinesMatching "regular expression"
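To make the intended semantics of the three proposed directives concrete, here is a rough sketch in Python. The directives are only a proposal, so this is one reading of them rather than an existing implementation; the function names are hypothetical, and anchoring the regular expression at the start of each line is an assumption.

```python
import re


def ignore_leading_lines(lines, prefix):
    """@IgnoreLeadingLines: drop the initial run of lines starting with prefix."""
    start = 0
    while start < len(lines) and lines[start].startswith(prefix):
        start += 1
    return lines[start:]


def ignore_comment_lines(lines, prefix):
    """@IgnoreCommentLines: drop every line starting with prefix, anywhere."""
    return [line for line in lines if not line.startswith(prefix)]


def ignore_leading_lines_matching(lines, pattern):
    """@IgnoreLeadingLinesMatching: drop the initial run of lines matching a regex.

    Assumption: the pattern is matched from the start of each line.
    """
    regex = re.compile(pattern)
    start = 0
    while start < len(lines) and regex.match(lines[start]):
        start += 1
    return lines[start:]


# Hypothetical input: two leading comment lines and one comment inside the data.
lines = ["# source: hypothetical", "# units: nT", "time,bx", "0,1.5", "# gap", "1,1.7"]
leading_only = ignore_leading_lines(lines, "#")
all_comments = ignore_comment_lines(lines, "#")
by_regex = ignore_leading_lines_matching(lines, r"#")
```

With this input, only the second function removes the `# gap` line in the middle of the data.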
I think it would be useful to be able to ignore the leading lines, and I like these directives. The difference between @IgnoreLeadingLines and @IgnoreCommentLines is helpful, since I could see situations that call for one but not the other.
Validating Comments / Metadata
I think there also should be a way to validate the contents of the non-CSV lines, as well as the CSV data itself. But I'm not sure if this is something the CSV Schema itself should support, or if this would be better handled by a more general system that supports files with multiple parts (and might make use of CSV schema to describe the CSV part). I'm not sure whether such a system exists.
On the other hand, there definitely are CSV files like this out there, so one argument is that the CSV Schema should be able to describe them.
If this is something the CSV Schema might support, it would be helpful to have multiple options:
What seems ideal for the purpose of validating files with metadata is a way to say "this kind of header isn't CSV, but needs to be validated with X", where X is some external schema / tool. For instance, I might pass the metadata to a JSON validator or compare it with a YAML schema.
I think it would be ideal to be able to specify the type of non-CSV data in a flexible way that does not require the CSV Schema to maintain a list of supported metadata types. This would also be useful for people (such as myself) who have CSV files with metadata that is not in any standard format, but that they nonetheless may wish to use.
It would also be helpful to do what can be done to reduce the work for those implementing the language. Someone who is creating a CSV validator may have to explicitly include support for various metadata types, but hopefully this could be as simple as piping the data to existing JSON/YAML/whatever validators in their language, rather than expecting them to include their own support for each metadata type. I'm not versed enough in this area to give detailed recommendations, but it's a point to consider.
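One way an implementer could keep this cheap is a small registry that maps a metadata type name to an existing validator. The sketch below is only an illustration of that idea: `VALIDATORS`, `check_json`, and `validate_metadata` are hypothetical names, not part of CSV Schema or any tool, and only JSON is wired up because it ships with Python's standard library.

```python
import json


def check_json(text):
    """Raise ValueError if text is not well-formed JSON."""
    json.loads(text)  # json.JSONDecodeError is a subclass of ValueError


# Hypothetical registry: metadata type name -> callable that raises ValueError.
VALIDATORS = {"json": check_json}


def validate_metadata(kind, text, validators=VALIDATORS):
    """Return True if the external validator registered for `kind` accepts `text`."""
    try:
        validator = validators[kind]
    except KeyError:
        raise LookupError(f"no validator registered for {kind!r}") from None
    try:
        validator(text)
    except ValueError:
        return False
    return True


good = validate_metadata("json", '{"units": "nT"}')
bad = validate_metadata("json", "{units")
```

Supporting another metadata type would then mean registering one more callable, for example a wrapper around a YAML library, without changing the CSV part of the validator.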
Other thoughts
Another issue to consider is end-of-line comments that occur in the data. I'm not sure how many people have files like this, but as I mentioned above, Pandas includes support for these comments. There's also the possibility of inline comments (between data elements), but that seems really far-fetched (I don't know why someone would try to create a CSV file like that).
Yet another issue is leading lines that are not marked with a comment character at all (the only way to tell is to look at where the data starts). I happen to have some unfortunately formatted files like this. In fact, if people were to adopt the CSVY standard (first link above), this would be a problem: the YAML header in CSVY can be any length, and its lines are not marked by a comment character at the beginning. (The YAML block is closed by a "---" line, which YAML uses as a document marker.)
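For unmarked leading lines, "looking where the data starts" could be approximated by a column-count heuristic, sketched below. This is a guess at one workable approach rather than anything a tool currently does: `find_data_start` and the sample lines are hypothetical, and the expected number of columns is assumed to be known in advance (for example from the schema).

```python
import csv


def find_data_start(lines, expected_columns):
    """Return the index of the first line from which every remaining line
    parses as a CSV record with expected_columns fields, or None."""
    start = None
    for i, line in enumerate(lines):
        # Parse each line on its own so indices stay aligned with line numbers.
        fields = next(csv.reader([line]), [])
        if len(fields) == expected_columns:
            if start is None:
                start = i
        else:
            start = None  # a non-matching line restarts the search
    return start


# Hypothetical file with three unmarked leading lines.
lines = [
    "Hypothetical station log",
    "recorded 2020-01-01, calibrated",
    "",
    "time,bx,by",
    "0,1.5,0.2",
    "1,1.7,0.3",
]
data_start = find_data_start(lines, expected_columns=3)
```

The heuristic is fragile: a metadata line directly above the data that happens to contain the expected number of commas would be taken as data, which is one reason explicit markers or an external front matter validator are more reliable.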