Should -Inf be interpreted as a NULL value #6

peterdesmet · 2021-04-07T15:17:38Z

See:

https://github.com/enram/vpts/blob/c726842e60515d33fdf0f97343858a40f603b20c/table-schemas/vol2bird.json#L192

Note: if we don't do this, it will lead to type validation errors.

niconoe · 2021-06-11T07:50:35Z

I'm hijacking this issue to push the question a bit further: vol2bird's output contains several different "not a number" values in columns such as ff, dbz, ...

For example:

-inf (negative infinity?)
nodata (actual data value -1000, but metadata says -1000 == nodata)
undetect (actual data value -999.0, but metadata says -999.0 == undetect)

For the vpts "standard", do we want to keep all the nuances and have three different values? Or is it "good enough" to normalize them to a single one?

Any opinion? @peterdesmet, @CeciliaNilsson709, @adokter?

peterdesmet · 2021-06-11T08:56:09Z

For implementation, we have two options:

1. Nuances are lost in the data: vptstools transforms them to empty values when it writes the CSV.
2. Nuances are lost when reading: keep the nuances in the data, but document that any reader should interpret them as empty values. This can be done in missingValues.
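To make the second option (keep the nuances in the file, declare them in missingValues) concrete, here is a minimal sketch of how a reader could apply it. It uses only the Python standard library instead of a real Frictionless reader. The schema fragment, the `read_rows` helper and the marker strings "nodata" and "undetect" are hypothetical illustrations, not values agreed on in this thread.

```python
import csv
import io

# Hypothetical schema fragment: the marker strings are illustrative only.
SCHEMA = {
    "fields": [
        {"name": "height", "type": "integer"},
        {"name": "dbz", "type": "number"},
    ],
    "missingValues": ["", "nodata", "undetect"],
}


def read_rows(text, schema):
    """Parse CSV text, turning every string listed in missingValues into None."""
    missing = set(schema["missingValues"])
    casts = {"integer": int, "number": float}
    types = {field["name"]: casts[field["type"]] for field in schema["fields"]}
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            name: None if value in missing else types[name](value)
            for name, value in record.items()
        })
    return rows


# The file still carries the nuances; the reader collapses them to None.
CSV_TEXT = "height,dbz\n200,3.5\n400,nodata\n600,undetect\n800,\n"
rows = read_rows(CSV_TEXT, SCHEMA)
```

With this approach the distinction survives in the CSV itself, so a consumer who cares about it can still read the raw strings, while a schema-driven reader sees a single empty value.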

I'm in favour of 2. @adokter what nuances should we allow?

niconoe · 2021-06-11T09:36:03Z

Since we are discussing the standard here, I think the right question is: from a data consumer standpoint, how useful/important is it to distinguish between those different kinds of non-data:

If it's a very niche usage: let's keep things simple and keep a single value (null?). In the subsequent implementation, converters will convert all three to this null value (data loss), and consumers will have no way to distinguish them.
If it's an important nuance: let's express them as three different things in the data. Consumers can then choose whether to distinguish those values for their use case. The standard would work for more use cases; the drawback is that it becomes slightly more complex.

Another way to think of it is: "when we later start building VPTS files from other sources (CAJUN?): will those nuances still make sense or not?"

adokter · 2021-06-11T17:52:01Z

For users it is important to distinguish between 0, -Inf, nodata and undetect (bioRad plotting functions already use this information, so this is not niche usage).

Instead of giving options to users, I feel it's important to force / strongly encourage users to be aware of the distinction between zero, nodata and undetect, as it's a potential source of confusion and incorrect interpretation of the data.

For NaN there is currently a double usage: both for undetect, and for cases where the algorithm failed to calculate a quantity, e.g. due to incomplete or insufficient data. So you could split those two (but that requires a change to vol2bird)

Re: how to code things in CSV: -Inf is the equivalent of zero for logarithmic quantities like DBZH and dbz, so you could get rid of it by storing these quantities on a linear scale, where it becomes zero (although that is quite uncommon). There is no NULL value currently in use, so you could code NaN as NULL if that makes things easier.
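The relation between -Inf and zero can be illustrated with a small sketch. It assumes the usual decibel convention (value in dB = 10 · log10 of the linear value); the function names `db_to_linear` and `linear_to_db` are made up for this example and are not part of vol2bird or bioRad.

```python
import math


def db_to_linear(value_db):
    """Convert decibels to a linear value; -inf dB maps to exactly 0.0."""
    return 10.0 ** (value_db / 10.0)


def linear_to_db(value):
    """Convert a non-negative linear value to decibels; 0 maps to -inf."""
    if value < 0:
        raise ValueError("linear value must be non-negative")
    if value == 0:
        # math.log10(0) raises ValueError, so zero is handled explicitly.
        return -math.inf
    return 10.0 * math.log10(value)
```

Under that assumption, -Inf on the logarithmic scale and 0 on the linear scale carry the same information: a real measurement of zero, not a missing value.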

adokter · 2021-06-11T21:02:30Z

So, -Inf should not be interpreted as NULL, but as zero expressed on a logarithmic scale.

niconoe · 2021-06-15T11:58:41Z

Thanks @adokter: I will check exactly how best to do it in CSV/Frictionless, but I definitely want to keep the distinction between 0, -inf, nodata and undetect (I might come back to you if I need more detail about those).

About "For NaN there is currently a double usage: both for undetect, and for cases where the algorithm failed to calculate a quantity, e.g. due to incomplete or insufficient data. So you could split those two (but that requires a change to vol2bird)": are you referring to the CSV (console) output of vol2bird? My current implementation of vph5_to_vpts doesn't use it, but rather the ODIM vertical profiles in h5 format. At first glance I saw no NaN values there. So if I'm not mistaken, we already have everything we need, without requiring a change to vol2bird. Is that correct?

adokter · 2021-06-15T16:26:32Z

You're right, I was referring to the CSV output. In the hdf5 profiles, nodata and undetect are each coded with an explicit value that is stored as an attribute.
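As a rough sketch of what that means for a converter: once the nodata and undetect values have been read from the file's attributes, each raw value can be labelled without ambiguity. The `classify` function below is hypothetical (it is not the vph5_to_vpts code), and the example values -1000 and -999.0 are the ones quoted earlier in this thread.

```python
import math


def classify(raw, nodata, undetect):
    """Label one raw profile value; nodata/undetect come from the file's metadata."""
    if math.isnan(raw):
        return "nan"
    if raw == nodata:
        return "nodata"
    if raw == undetect:
        return "undetect"
    if math.isinf(raw) and raw < 0:
        return "-inf"
    # Anything else is a regular measurement, including a genuine 0.
    return "value"


# Example with the values mentioned above: nodata == -1000, undetect == -999.0
labels = [classify(v, -1000, -999.0) for v in (-1000, -999.0, -math.inf, 0.0, 12.5)]
```

How these labels are then written to CSV is a separate question.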

niconoe · 2021-08-13T08:00:05Z

If I'm not mistaken, the conclusion/decision here is that the standard should distinguish between -inf, nodata and undetect for different variables.

For clarity, I suggest closing this issue and discussing how to achieve that in a different one.

niconoe added the "question" label Jun 11, 2021

niconoe closed this as completed Aug 13, 2021

niconoe mentioned this issue Aug 13, 2021

how to express -inf, nodata and undetect in a tabular data package #23

Closed
