Filters applied to variable length data #2554
Replies: 24 comments 18 replies
-
Thanks for organizing this! My suggestion was make is so there would be a warning to a user by changing libdispatch/dfilter.c#L137 to instead be a This would allow existing code to continue working with a warning rather than breaking existing code with an error that is a bit cryptic and introducing this as a regression. If marking the filters optional instead would have the same effect, I think that's reasonable as well, though I think the suggestion above is narrower (i.e. the filters can still be mandatory for fixed length types, but wouldn't be applied for variable-length datatypes where they aren't supported). |
Beta Was this translation helpful? Give feedback.
-
One we decide on the desired action, the fix is straight forward. |
Beta Was this translation helpful? Give feedback.
-
I'll also mention that this breaks nccopy for existing files (and likely many other things besides nccopy). Given the attached (valid) netCDF4 file, nccopy errors out:
Using netCDF library version 4.9.2-development of Nov 11 2022 16:38:23 $ For users of things like nccopy or other utlities rather than direct users of the library, what are they supposed to do with this? |
Beta Was this translation helpful? Give feedback.
-
This is a tough decision. I would prefer if the netCDF layer implements the "Optional" behavior of not calling the filters on variables that do not meet the filter criteria, and does so silently. This includes the Quantize filters. I empathize most with the users who try to compress or copy (as above) data only to be rejected for reasons that may be unclear or confusing to them. I think most such users would prefer that the library compress the variables it can, and not fail on those it can't. |
Beta Was this translation helpful? Give feedback.
-
The normal convention of the netCDF API would be to return an error when you try to do anything that would not work. So trying to set a filter on a VLEN variable would not work, and the user would get an error if they tried. If this is not the case in the current codebase that is a bug and should be fixed. However, if there is a significant amount of user code this would break, the rules could be bent. But is there? Most users do not use variable length data types. Warnings are worthless and should be avoided. Who reads them? Where do they appear on HPC systems, when the code is running on 10000 processors? Just scream your warnings into the void. There are a few cases where we silently change the settings of a variable, for example if the user sets deflate, we turn on chunking. But I have found those cases always caused confusion. They are documented, but who studies the documentation? So I would avoid any more cases of changing settings silently, such as ignoring the user setting a filter on a var because it will not work. Let the user decide what they heck they want, and if we can't give it to them, return an error. If they want to compress a VLEN, then return an error and they can decide whether it is worth proceeding without compression, or else changing to a fixed length data structure. (And they might very often choose to do that.) |
Beta Was this translation helpful? Give feedback.
-
@sval-dev you make many excellent points. Perhaps at this point it is best to just include this on the list of things we silently do but wish we did not (kind of like paying sales tax or jogging). |
Beta Was this translation helpful? Give feedback.
-
I ran into this again today: Code was like the following (from python netCDF4 built on new netCDF4-C library):
Now I can work around by wrapping this in an if statement and not making zlib True when the dtype is a str, but having to fix code that was working isn't all that fun (and it's not even my code) and the "fix" isn't particularly useful. There is a related issue at Unidata/netcdf4-python#1205 where we can also make the fix (and the maintainer has agreed in principle to accept such a fix), but it would be a lot nicer to fix this here upstream. Is there any concrete work I can do to move this forward? |
Beta Was this translation helpful? Give feedback.
-
Just thought I'd check back in here. Anything I can do to help close this out? |
Beta Was this translation helpful? Give feedback.
-
Let me make sure I understand the proposal. I believe it is this:
So, the questions I have are:
|
Beta Was this translation helpful? Give feedback.
-
My votes would be:
|
Beta Was this translation helpful? Give feedback.
-
To be clear, the data is not bogus, it's just not compressed. We would never be writing corrupt data, and corrupt data wasn't written by previous versions of netCDF-4, it just failed to compress variable length fields. The previous behavior was to include the compression filters in the metadata, but applying them to variable length data was a silent no-op. The only issue that we are now throwing an error where before we were doing nothing and silently not compressing. This error breaks previously working code for what seems to be more or less no reason. Examples include nccopy itself, as shown above. If you have questions on the previous comments, please do ask them. |
Beta Was this translation helpful? Give feedback.
-
My $0.02 as a user:
|
Beta Was this translation helpful? Give feedback.
-
This is not correct. The old netcdf-c code was compressing the set of pointers to the variable length data. |
Beta Was this translation helpful? Give feedback.
-
OK, Please publish here the exact proposal you are making. Indicate what the behavior should be in the cases |
Beta Was this translation helpful? Give feedback.
-
See the discussion for #2189. |
Beta Was this translation helpful? Give feedback.
-
Lets finish this. I talked with Ward. The proposal on the table is this:
Anyone have violent objection? |
Beta Was this translation helpful? Give feedback.
-
Just thought I'd check back in on this. Anything I can do to help move this forward? |
Beta Was this translation helpful? Give feedback.
-
|
Beta Was this translation helpful? Give feedback.
-
So I went searching in the HDF5 code (H5Z.c specifically).
|
Beta Was this translation helpful? Give feedback.
-
So, we can implement that exact same behavior by making all nc_def_var_filter calls create optional filters. |
Beta Was this translation helpful? Give feedback.
-
But note that behavior (#2554 (comment)) is equivalent to the PR I put up originally. So we have two choices:
The remaining problem is when we are given a dataset that has a filter on a var-len variable and it is not optional. |
Beta Was this translation helpful? Give feedback.
-
So the proposal is as follows:
Correct? |
Beta Was this translation helpful? Give feedback.
-
One additional comment. If a vlen variable has a non-optional filter, then it would be desirable to see the metadata |
Beta Was this translation helpful? Give feedback.
-
The discussion started here has been moved to this discussion.
In PR #2231, we explicitly disallowed
defining filters on variable length data.
This was consistent with this comment in the HDF5 code (H5Z.c), which states:
This issue arose because of the addition of filters to NCZarr. We decided to
follow the above comment. However, note that for both netcdf-4(HDF5) and NCZarr,
we currently mark all defined filters as Mandatory (as opposed to Optional).
So as noted in the comment, this causes an error if the type is variable length.
Note that the standard filters in HDF5 -- e.g. deflate and szip -- are marked
as Optional. This means that if one attempts to compress variable-length data
using one of the HDF5 built-in compressors, it will succeed in the sense that the compressor will not be applied to the data. Thus, no data corruption will occur, but neither will the data be compressed.
We have the option of marking all filters as Optional, in which case, we will
suppress them and they will not appear in the meta-data of the created netcdf
dataset. This will occur without any warning to the user.
I need some guidance about what to do. Leave it as is (error) or make it Optional.
This
Beta Was this translation helpful? Give feedback.
All reactions