Should -Inf be interpreted as a NULL value #6

Closed
peterdesmet opened this issue Apr 7, 2021 · 8 comments
Labels
question Further information is requested

Comments

@peterdesmet
Member

See:

https://github.com/enram/vpts/blob/c726842e60515d33fdf0f97343858a40f603b20c/table-schemas/vol2bird.json#L192

Note: if we don't do this, it will lead to type validation errors.

@niconoe niconoe added the question Further information is requested label Jun 11, 2021
@niconoe
Collaborator

niconoe commented Jun 11, 2021

I'm hijacking this issue to push the question a bit further: in vol2bird's output, we find several different "not a number" values for columns such as ff, dbz, ...

For example:

  • -inf (negative infinity?)
  • nodata (actual data value -1000, but metadata says -1000 == nodata)
  • undetect (actual data value -999.0, but metadata says -999.0 == undetect)

In the case of the vpts "standard", do we want to keep all the nuances and have 3 different values? Or is it "good enough" to normalize them to a single one?

Any opinion? @peterdesmet, @CeciliaNilsson709, @adokter?

@peterdesmet
Member Author

For implementation, we have two options:

  1. Nuances are lost in data: transform nuances to empty values when vptstools writes to CSV
  2. Nuances are lost when reading: keep the nuances in the data, but document them as to be interpreted as empty values by any reader. This can be done in missingValues.

I'm in favour of 2. @adokter what nuances should we allow?

@niconoe
Collaborator

niconoe commented Jun 11, 2021

Since we are discussing the standard here, I think the right question is: from a data consumer standpoint, how useful/important is it to distinguish between those different kinds of non-data:

  • If it's a very niche usage: let's keep things simple and keep only a single value (null?). Subsequent implementation: converters will convert all three to this null value (data loss), and consumers will have no way to distinguish those values.
  • If that's an important nuance, let's express them as 3 different things in the data: consumers will then be able to choose if they want to distinguish those values or not for their use-case. The standard would work for more use-cases, the drawback is that it makes it slightly more complex.

Another way to think of it is: "when we later start building VPTS files from other sources (CAJUN?): will those nuances still make sense or not?"

@adokter
Collaborator

adokter commented Jun 11, 2021

For users it is important to distinguish between 0, -Inf, nodata and undetect (and bioRad plotting functions already use this info, not niche usage).

Instead of giving options to users, I feel it's important to force / strongly encourage users to be aware of the distinction between zero, nodata and undetect, as it's a potential source of confusion and incorrect interpretation of the data.

For NaN there is currently a double usage: both for undetect, and for cases where the algorithm failed to calculate a quantity, e.g. due to incomplete or insufficient data. So you could split those two (but that requires a change to vol2bird)

Re: how to code things in csv: -Inf is the equivalent of zero for logarithmic quantities like DBZH and dbz, so you could get rid of it by storing these as zero on a linear scale (although that is quite uncommon). There is no NULL value currently in use, so you could code NaN as NULL if that makes things easier

@adokter
Collaborator

adokter commented Jun 11, 2021

So, -Inf should not be interpreted as Null, but as 0 on a logarithmic scale
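To illustrate why -Inf is "0 on a logarithmic scale", here is a small sketch assuming the usual decibel convention `dB = 10 * log10(linear)`. The function names are hypothetical, and whether every logarithmic vol2bird quantity follows exactly this convention is an assumption, not something stated in this thread.

```python
import math


def db_to_linear(value_db):
    """Convert a decibel value to linear scale; -inf dB maps to 0.0."""
    return 10.0 ** (value_db / 10.0)


def linear_to_db(value_linear):
    """Convert a non-negative linear value to decibels; 0.0 maps to -inf."""
    if value_linear == 0:
        # log10(0) is undefined in Python's math module, so return -inf explicitly.
        return -math.inf
    return 10.0 * math.log10(value_linear)
```

Under this convention `db_to_linear(-math.inf)` is `0.0`, an actual measurement of "nothing", which is different from a missing value.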

@niconoe
Collaborator

niconoe commented Jun 15, 2021

Thanks @adokter: I will check exactly how to do it best in CSV/frictionless, but I definitely want to keep the distinction between 0, -inf, nodata and undetect (I might come back to you if I need more precision about those).

About "For NaN there is currently a double usage: both for undetect, and for cases where the algorithm failed to calculate a quantity, e.g. due to incomplete or insufficient data. So you could split those two (but that requires a change to vol2bird)": are you referring to the CSV (console) output of vol2bird? My current implementation of vph5_to_vpts doesn't use it; it reads the ODIM vertical profiles in h5 format instead. At first glance I saw no NaN values there. So if I'm not mistaken, we have everything we need already, without requiring a change to vol2bird. Is that correct?

@adokter
Collaborator

adokter commented Jun 15, 2021

You're right, I was referring to the CSV output - in the hdf5 profiles the nodata and undetect are coded with an explicit value that is stored as an attribute.
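As a rough sketch of how a converter could keep those categories apart, assuming the nodata and undetect codes have already been read from the dataset attributes. The defaults below reuse the example values mentioned earlier in this thread (-1000 and -999.0); the function name `classify` and the returned labels are hypothetical.

```python
# Example codes from earlier in the thread; in practice they would come from
# the attributes stored alongside each dataset.
NODATA = -1000.0
UNDETECT = -999.0


def classify(raw, nodata=NODATA, undetect=UNDETECT):
    """Return a (kind, value) pair so that the nuances stay distinguishable."""
    if raw == nodata:
        return ("nodata", None)
    if raw == undetect:
        return ("undetect", None)
    # Everything else, including 0 and -inf, is a real measurement.
    return ("value", raw)
```

For example, `classify(-1000.0)` gives `("nodata", None)`, while `classify(0.0)` and `classify(float("-inf"))` are both reported as ordinary values.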

@niconoe
Collaborator

niconoe commented Aug 13, 2021

If I'm not mistaken, the conclusion/decision here is that the standard should distinguish between -inf, nodata and undetect for different variables.

For clarity, I suggest closing this issue and discussing how to achieve that in a different one.


3 participants