Support "missing" values #97
Comments
Interesting one. I think this is definitely useful; my only question is whether it is useful enough to get into JSON Table Schema. My worry is just about trying to keep the spec mean and lean. I also wonder about naming: we could have @ldodds would this be useful at your end too, do you think? |
This piece of metadata is used in ETL, in both directions, and in analytics (where NULLs are avoided, as they might cause problems). EDIT: I agree with keeping it mean and lean, but in higher-level data processing (analytics/data mining) this property is as important as the datatype is in ETL. In fact, the storage datatype (int, string, ...) is almost unimportant in higher-level data processing.... |
I second some sort of For example, in the Currency Codes datapackage the 4th field (Numeric Code) has a In MATLAB, parsing this column as an integer fails. I suppose other languages may treat any non-numeric value as a missing value, but having a specification of missing values makes more sense for validation purposes. An alternative would be to specify in the RFC that numeric fields use a common missing value. This option seems less flexible. |
@KrisKusano I feel you've identified a bug with the currency code datapackage. Could you open an issue there? However, the general point stands. @Stiivi / @KrisKusano I'd be happy to see this go into JSON Table Schema if you were willing to finalize the proposal for review. |
@rgrp, good idea. I opened an issue in the currency code repo. I would propose two changes to the JSON Table Schema Spec:
I am rather new to this organization, so I am not sure how to submit this proposal for review. Should I send a message (similar to above) to the data-protocols list? |
@KrisKusano I would also add to item 2 a short sentence about how to handle the missing value, for better understanding:
I would also drop the "or array" and keep just "value"; otherwise it would be difficult to conform to, not to mention the potential performance hits. We really don't need multiple empty values, just one designated value. If there are values that might represent multiple empty values in a dataset, then that dataset should be cleaned first and its description should not contain such field metadata. See another example of such metadata used in processing: COPY FROM in PostgreSQL where such value is represented by |
I can see how the missing value is useful, but I'm not sure it belongs in the JSON Table Schema spec. @Stiivi you note that it's useful in higher-level data processing to map a value into a different form. This makes me think it's not part of the data schema and is more about how consuming apps might want to process the data. I could see some apps, for example, mapping zeros to null or something. From the work we've done on using this to describe CSVs, an empty string in a cell is a null value. For JSON-based data I'd expect the idiomatic way to handle a missing value would be to have no key at all. So there could be some variation depending on what is being described? Just my 2p worth :) |
@ldodds isn't the purpose of the table schema spec to help consuming apps consume the data correctly? Or is it just for humans? I think it should be the former; otherwise we don't need the spec to require a JSON description, and plain rich text would be much better for humans. The example you gave, "an empty string in a cell is a null value", is one of the classic examples of why you need a special, explicit missing-value designator. How do you distinguish between an empty string (the user entered nothing) and a missing value (the value was never populated by some system)? When I'm writing an ETL I want to know how to handle such situations, since it is important information. I would say it is even more important at the level of early data processing than format validation. Format validation can only tell you whether the data is OK or not, and so whether you should continue or not. You encounter a wrong format and you either choke and refuse to continue, or spit the data into a separate collection of erroneous records. Neither is helpful and both require human intervention later. "Missing value" handling is non-blocking and automatically reduces errors across the whole ETL process. For an ETL designer/developer, "format" metadata is useless most of the time, or at least not helpful. It is information that is of interest to a data quality person. Just to give you a comparison with another piece of metadata... |
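To illustrate the "non-blocking" point above, here is a minimal Python sketch. It assumes a field descriptor supplies a collection of strings that mean "missing"; the names `cast_integer`, `cast_column`, and `missing_values` are hypothetical and are not taken from the spec or from any library.

```python
from typing import Iterable, List, Optional


def cast_integer(value: str, missing_values: Iterable[str]) -> Optional[int]:
    """Cast a raw cell to int, mapping designated missing markers to None."""
    if value in missing_values:
        return None  # missing value: keep going instead of failing
    return int(value)  # still raises ValueError on genuinely malformed input


def cast_column(values: Iterable[str],
                missing_values: Iterable[str] = ("",)) -> List[Optional[int]]:
    """Cast every cell in a column, treating missing markers as None."""
    markers = tuple(missing_values)
    return [cast_integer(v, markers) for v in values]


# Hypothetical numeric-code column with one empty cell.
column = ["840", "", "978"]
parsed = cast_column(column, missing_values=("",))
```

Without the marker check, `int("")` raises `ValueError` and the load stops on the first empty cell; with it, the empty cell becomes `None` and genuinely malformed values are still reported.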
OK, we're about to implement something here because it is really important for consuming data in many cases, e.g. in R. It will probably look like:
See pandas info on this: http://pandas.pydata.org/pandas-docs/stable/missing_data.html Note that pandas "missing values" also includes NaN and infinity. (update: not any more: "Note Prior to version v0.10.0 inf and -inf were also considered to be “null” in computations. This is no longer the case by default; use the mode.use_inf_as_null option to recover it.") @Stiivi do you have any recommendations based on your experience here? |
We already have strings that can be considered as So, would it not be cleaner to add I understand that does not exactly cover the full use case for
|
@pwalsh hmmm the type |
Jumping in this discussion, I also vote for a |
For now we have a list of values considered to be
We could just allow this list to be extended, like:
Implementation will be simple for current |
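A rough Python sketch of the "extend the list" idea. The default values below are placeholders chosen for illustration only; they are not the list referred to in this thread, and `DEFAULT_NULL_STRINGS`, `effective_missing_values`, and `is_missing` are hypothetical names.

```python
from typing import FrozenSet, Iterable

# Placeholder defaults for illustration; NOT the actual list from the spec.
DEFAULT_NULL_STRINGS: FrozenSet[str] = frozenset({"", "null"})


def effective_missing_values(extra: Iterable[str] = ()) -> FrozenSet[str]:
    """Combine the built-in null strings with schema-supplied extras."""
    return DEFAULT_NULL_STRINGS | frozenset(extra)


def is_missing(value: str, extra: Iterable[str] = ()) -> bool:
    """True if the raw string should be treated as a missing value."""
    return value in effective_missing_values(extra)
```

With this shape, a schema that adds nothing keeps the existing behaviour, and a schema that lists extra markers only widens the set.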
OK, let me summarize the state of play:
This is going to go in asap ... |
OK here is the proposed new text. @Stiivi @pwalsh @akariv @KrisKusano
|
@rgrp I think it's a little too complicated and still somewhat ambiguous. First of all, it's important to state that the Then we simply state that
Put this way, it's just adding a parameter for the already existing |
@akariv good feedback. I should say that I think at this point this is going in, given the feedback so far: it is a commonly mentioned item and a useful clarification. Re the definition "An empty string is considered to be a missing value." in the current definition of required: that would be moved out and we'd reference |
OK. I guess that as long as it's clear that |
As part of the missingValues example I suggest including """", to highlight the otherwise confusing case where there is an explicit ,"", rather than an implicit ,, empty string in the CSV:
|
@rgrp do you want to make a PR on this? |
@Butterwell good point. And it made me think a bit. To check: would not "" in a CSV become the simple empty string after parsing? If so, would it not be equivalent to ""? I guess that raises a question of when you apply the missing value test (after or before parsing the source file). In general, I imagine these tests being applied after loading from source file into whatever language or system you are using. (I note that in the sample text I talk about comparison before parsing - I think strictly what I meant was |
Sure. After parsing. And so the onus is on the source implementation to parse ,"", and ,, separately if that is information that needs to be preserved. In that case I was suggesting, make ,"", be """" and ,, be "", but there is no particular reason for that to be the encoding at parse time. |
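As a quick check of the "after parsing" point, this small Python sketch uses the standard library `csv` module with default settings. It only shows how that one parser behaves; other parsers may preserve the distinction.

```python
import csv
import io

# One row with an explicit quoted empty string, one with an implicit empty cell.
explicit = 'a,"",b\n'
implicit = 'a,,b\n'

row_explicit = next(csv.reader(io.StringIO(explicit)))
row_implicit = next(csv.reader(io.StringIO(implicit)))

# With the default reader both cells arrive as the same empty string,
# so a missing-value test applied after parsing cannot tell them apart.
same = row_explicit == row_implicit
```

So if the difference between ,"", and ,, carries information, the source implementation has to capture it at parse time, as noted above.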
Comment by spec editor: we focus on addressing the second of the two concerns raised by @Stiivi - that is the problem of knowing what values indicate NULL.
Fields should have a way to specify missing values, if relevant. It is good practice, mostly in reporting/data mining, to have a piece of metadata that denotes which values are considered "missing" in the dataset. This might be used in two ways:
One way is for tools to know what value to use if the data source value is empty. For example, use 0 (zero) instead of an empty string. The SQL equivalent would be:
COALESCE(amount, 0)
The other use is to tell the analytical/data mining software which values are considered empty (NULLs). For example, a string field containing the string "NULL" or "(empty)" will be converted to a real NULL value. The SQL equivalent would be:
NULLIF(name, '(missing value)')
or
CASE name WHEN '(missing value)' THEN NULL ELSE name END
EDIT: Added SQL examples.
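For readers who do not think in SQL, the two uses can be sketched in Python roughly as follows. The helper names `coalesce` and `null_if` are hypothetical. Note one assumption: this `coalesce` also treats an empty string as empty, which matches the description above but is looser than SQL's COALESCE, which only replaces NULL.

```python
from typing import List, Optional


def coalesce(value: Optional[str], default: str) -> str:
    """Use `default` when the source value is empty (None or '')."""
    return default if value is None or value == "" else value


def null_if(value: str, missing_value: str) -> Optional[str]:
    """Turn the designated missing marker into a real null (None)."""
    return None if value == missing_value else value


amounts = ["10", "", "25"]
names = ["Alice", "(missing value)", "Bob"]

# Use 1: fill empty source values with a designated value.
filled: List[str] = [coalesce(a, "0") for a in amounts]

# Use 2: convert the designated marker into a real null.
cleaned: List[Optional[str]] = [null_if(n, "(missing value)") for n in names]
```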