Support "missing" values #97

Closed
Stiivi opened this Issue Feb 4, 2014 · 22 comments

Projects

None yet

9 participants

@Stiivi
Stiivi commented Feb 4, 2014 edited

Comment by spec editor: we focus on addressing the second of the two concerns raised by @Stiivi - that is the problem of knowing what values indicate NULL.

Fields should have a way how to specify the missing values, if relevant. It is good practice mostly in reporting/data mining to have a piece of metadata that denotes which values are considered as "missing" in the dataset. This might be used in two ways:

One way is for tools to know what value to use if the data source value is empty. For example use 0 (zero) instead of an empty string:

"fields"= [
{
    "name": "amount",
    "missing_value": 0
}
]

SQL equivalent would be: COALESCE(amount, 0)

Other use is to tell the analytical/datamining software which values are considered as empty (NULLs). For example a string field with a content string "NULL" or "(empty)" will be converted to real NULL value.

"fields": [
{
    "name": "name",
    "missing_value": "(missing value)"
}
]

SQL equivalent would be: NULLIF(name, '(missing value)') or CASE name WHEN '(missing value)' THEN NULL ELSE name END

EDIT: Added SQL examples.

@rufuspollock
Contributor

Interesting one. I think this is definitely useful my only question is whether it is useful enough to get into json table schema - my worry is just about trying to keep mean and lean.

I also wonder about naming: we could have default rather than missingValue i guess.

@ldodds would this be useful at your end too do you think?

@Stiivi
Stiivi commented Feb 14, 2014

missing_value is more descriptive in this case. For example for the second case, where you are replacing a real value that marks value as missing with NULL/None the default makes no sense. Also this term is more used in the ETL/DWH world for this purpose.

This piece of metadata is used in ETL – both ways, and in analytics (where NULLs are avoided, as they might cause problems).

EDIT: I agree with keeping it mean and lean, but this property is in higher level data processing (analytics/data mining) as important as datatype in ETL. In fact, storage datatype (int, string, ...) in higher level data processing is almost not important at all....

@KrisKusano

I second some sort of missing_value field descriptor (whatever the name).

For example, in the Currency Codes datapackage the 4th field (Numeric Code) has a type='integer'. The field contains values of Nil, (HTML char &#8212), and N.A..

In MATLAB, parsing this column as an integer fails. I suppose other languages may treat any non-numeric value as a missing value, but having a specification of missing values makes more sense for validation purposes.

An alternative would be to specify in the RFC that numeric fields use a common missing value. This option seems less flexible.

@rufuspollock
Contributor

@KrisKusano I feel you've identified a bug with the currency code datapackage - could you open an issue there.

However, general point stands.

@Stiivi / @KrisKusano I'd be happy to see this go in to JSON Table Schema if you were willing to finalize the proposal for review.

@KrisKusano KrisKusano referenced this issue in datasets/currency-codes Jul 6, 2014
Closed

Integer "Numeric Code" Field has Invalid Values #3

@KrisKusano

@rgrp, good idea I opened an issue in the currency code repo.

I would propose two changes to the JSON Table Schema Spec:

  1. The JSON Table Schema spec has a field type, which are based on a JSON spec. The dataprotocols spec should match the JSON spec that number and integer fields MUST be a number (i.e. cannot be a character). I also propose specifying that missing numbers be represented as empty CSV cells (i.e. ,,), which is not in the JSON spec. Here is a draft mod:

    • number: a number including floating point numbers. Value MUST be a number.
    • integer. an integer. Value MUST be an integer, no floating point numbers are allowed.
  2. The string missing value issue originally raised by @Stiivi, however, is not resolved by 1. A possible resolution is to add an optional key to the Field Constrains field descriptor:

    missingValue - A value or array that represent missing values for the field.
    Default missing values are an empty cell (i.e. ,, for a CSV data file) or empty
    string (i.e. "") for string fields. Missing values other than the defaults should
    be specified in this constraint.

I am rather new to this organization, so I am not sure how to submit this proposal for review. Should I send a message (similar to above) to the data-protocols list?

@Stiivi
Stiivi commented Jul 21, 2014

@KrisKusano I would add to the 2. also a short sentence about how to handle the missing value for better understanding:

If a missing value is encountered it should be converted into the NULL or equivalent value.

I would also drop the "or array" and keep just "value", otherwise it would be difficult to conform to, not mentioning potential performance hits. We really don't need multiple empty values, just one designated. If there are values that might represent multiple empty values in a dataset, then that dataset should be cleaned first and it's description should not contain such field metadata.

See another example of such metadata used in processing: COPY FROM in PostgreSQL where such value is represented by \N by default, configurable with the NULL keyword.

@ldodds
Contributor
ldodds commented Jul 22, 2014

I can see how the missing value is useful, but I'm not sure it belongs in the JSON Table Schema spec.

@Stiivi you note that its useful in higher level data processing to map a value into a different form. This makes me think its not part of the data schema and more about how consuming apps might want to process the data. I could see some apps, for example mapping zeros to null or something.

From the work we've done on using this to describe CSVs, an empty string in a cell is a null value. For JSON based data I'd expect the idiomatic way to handle a missing value would be to have no key at all. So there could be some variation depending on what is being described?

Just my 2p worth :)

@Stiivi
Stiivi commented Jul 22, 2014

@ldodds isn't the purpose of the table schema spec to help consuming apps to consume the data correctly? Or is it just for humans? I think it should be the first one, otherwise we don't need the spec to require some JSON description, plain rich text would be much better for humans.

The example you gave – "empty cell string in a cell is a null value" is one of classic examples, why you need a special explicit missing value designator. How do you distinguish between an empty string – user entered nothing, and missing value – value didn't even happened to be populated from some system?

When I'm writing an ETL I want to know how to handle such situations, since NULL and "" (empty string) have different meaning. It has implications ranging from data mining through detection of ETL issues to issues with the input systems.

It is important information to know. And I would say, even more important at the level of early data processing than format validation. Format validation can only tell you whether the data is OK or not and will tell you, whether you should continue or not. You encounter wrong format and you either choke and refuse to continue or spit the data into a separate erroneous data collection. Neither is helpful and both require human intervention later. "Missing value" handling is non-blocking and kind of automatically error reducing the whole ETL process. For an ETL designer/developer, "format" metadata is useless in most of the time. Or at least it is not helpful. It is an information that is of interest for a data quality person. Just to give you a comparison with another piece of metadata...

@rufuspollock
Contributor
rufuspollock commented Feb 29, 2016 edited

OK, we're about to implement something here because it is really important for consuming data in many cases e.g. R.

Probably look like:

missingValue: "value" or an array a values ["x", "y", "z"]

See pandas info on this: http://pandas.pydata.org/pandas-docs/stable/missing_data.html

Note that pandas "missing values" also includes NaN and infinity. (update: not any more "Note Prior to version v0.10.0 inf and -inf were also considered to be “null” in computations. This is no longer the case by default; use the mode.use_inf_as_null option to recover it.")

@Stiivi do you have any recommendations based on your experience here.

@pwalsh
Member
pwalsh commented Feb 29, 2016

@rgrp

We already have strings that can be considered as null: http://dataprotocols.org/json-table-schema/#field-types-and-formats

So, would it not be cleaner to add nullable as a property (nullable: true || false), and then, IF a field is nullable, check it is of declared type or null.

I understand that does not exactly cover the full use case for missing_amount as declared in the original post, and this is intentional: we have two distinct use cases here IMHO:

  1. nullable fields
  2. default values
@rufuspollock
Contributor

@pwalsh hmmm the type null is, i assumed, for a specific field being entirely null (a bit unusual but possible). I also need to parse the distinction between missing values and simply "".

@akariv
akariv commented Mar 9, 2016

Jumping in this discussion, I also vote for a nullable property (or an opposite required property) which indicates if the value is allowed to be empty.
Essentially it determines the behaviour in cases where the value doesn't validate AND it is also empty-ish (usually I'd consider a string made of only whitespaces to be empty). If the field is required, then we would raise an exception. If it is not, then it would be loaded as a null value.
This is relevant to all datatypes - for number, integer, boolean, date etc an empty string is an invalid value which might be considered as error or just as a missing value.
For string an empty string is either a valid value or an invalid value (in case there are other constraints on the value, such as minLength etc.)

@pwalsh pwalsh referenced this issue in frictionlessdata/jsontableschema-py Mar 28, 2016
Merged

[#53] fixed null value cast #56

@roll
Member
roll commented Mar 29, 2016

required + nullable/missing + etc - it starts to be messy. Doesn't it? Missing will be another term to understand, define and implement.

For now we have list of values considered to be not a value (empty string is discussed here now - frictionlessdata/jsontableschema-py#56):

'null', 'none', 'nil', 'nan', '-', ''

We could just allow to extend this list like:

extra_null_values: ['N/A', 'empty']

Implementation will be simple for current jsontableschema-py for example.

@rufuspollock
Contributor

OK, let me summarize the state of play:

  • We will implement a missingValue property on all fields. Exact form tbd but likely as per #97 (comment)
  • nullable will not be implemented as duplicate of required (though we will need to note this in semantics for parsers that required=true => no null values)

This is going to go in asap ...

@rufuspollock rufuspollock self-assigned this Apr 19, 2016
@rufuspollock rufuspollock referenced this issue in ropenscilabs/datapkg May 4, 2016
Open

Open questions #3

@rufuspollock
Contributor

OK here is the proposed new text. @Stiivi @pwalsh @akariv @KrisKusano

By "missing" we simply mean null or "not present for whatever reason". Many
datasets arrive with missing data values, either because a value was not
collected or it never existed. 

Missing values may be indicated simply by the value being empty in other cases
a special value may have been used e.g. `-`, `NaN`, `0`, `-9999` etc.

The `missingValue` property provides a way to indicate that these values should
be interpreted as equivalent to null.

Formally, strings indicating missing data values for a field `MAY` be provided
for a field using the `missingValue` property.

If present, the `missingValue` MUST be a single string or an array of strings,
for example:

    "missingValue": ""
    "missingValue": "-"
    "missingValue": ["Nan", "-"]

**Note**: `missingValue` are always strings rather than being the data type of
the particular field. This allows for comparison prior to parsing and for
fields to have missing value which are not of their type, for example a
`number` field to have missing values indicated by `-`.

The default value of `missingValue` for a non-string type field is the empty
string `""`. For string type fields there is no default for `missingValue` (for
string fields the empty string `""` is a valid value and need not indicate
null).

**Processing**: if a missing value is encountered it SHOULD be converted into
the NULL or equivalent value.
@akariv
akariv commented May 19, 2016

@rgrp I think it's a little too complicated and still somewhat ambiguous.

First of all, it's important to state that the missingValues is only relevant in cases where required==false (as if the value is required, it can't be missing anyway).

Then we simply state that missingValues override what is considered to be a "missing value" in the context of the required constraint. See the language there:

required – A boolean value which indicates whether a field must have a value in every row of the table. An empty string is considered to be a missing value.

When put this way, then it's just adding a parameter for the already existing required logic - and not creating a separate processing path.

@rufuspollock
Contributor
rufuspollock commented May 19, 2016 edited

@akariv good feedback.

I should say I think at this point this is going given the feedback so far - it is a common item mentioned and useful clarification.

Re the definition of: "An empty string is considered to be a missing value." in current def of required. That would be moved out and we'd reference missingValue.

@akariv
akariv commented May 19, 2016

OK. I guess that as long as it's clear that required and missingValues are tightly coupled, it doesn't really matter who refers to whom.

@Butterwell

As part of the missingValues example I suggest including """", to highlight the otherwise confusing case where there is a explicit ,"", rather than implicit ,, empty string in the csv:

"missingValue": ""
"missingValue": "-"
"missingValue": ["Nan", "-"]
"missingValue": "\"\""
@pwalsh
Member
pwalsh commented Jul 12, 2016

@rgrp do you want to make a PR on this?

@roll roll added backlog current and removed backlog current labels Aug 8, 2016
@rufuspollock
Contributor
rufuspollock commented Aug 11, 2016 edited

@Butterwell good point. And it made me thing a bit. To check: would not "" in a CSV become the simple empty string after parsing? If so, would it not be equivalent to ""?

I guess that raises a question of when you apply the missing value test (after or before parsing the source file). In general, I imagine these tests being applied after loading from source file into whatever language or system you are using.

(I note that in the sample text I talk about comparison before parsing - i think strictly what I meant was casting not parsing and will update appropriately).

@rufuspollock rufuspollock removed the backlog label Aug 11, 2016
@Butterwell

Sure. After parsing. And so the onus is on the source implementation to parse ,"", and ,, separately if that is information that needs to be preserved. In that case I was suggesting, make ,"", be """" and ,, be "", but there is no particular reason for that to be the encoding at parse time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment