enable compact storage for netcdf-4 vars #1570

Merged
merged 14 commits into from
Dec 19, 2019
1 change: 1 addition & 0 deletions include/nc4internal.h
@@ -205,6 +205,7 @@ typedef struct NC_VAR_INFO
void *fill_value;
size_t *chunksizes;
nc_bool_t contiguous; /**< True if variable is stored contiguously in HDF5 file */
nc_bool_t compact; /**< True if variable is in compact storage in HDF5 file */
Contributor

I'm not sure if this matters in C, but if this were C++, it would save space in the struct (due to alignment concerns) to have all the nc_bool_t together instead of having the int parallel_access in between. I haven't measured whether it makes any difference in C.

Contributor

It looks like the nc_bool_t size is 4 bytes, which is the same size as an int, so intermingling int and nc_bool_t doesn't change the size of the struct in this case. If nc_bool_t were changed to the stdbool-defined bool, the size of the struct would drop by 40 bytes (something for the future...)

Contributor Author

nc_bool_t is just an int. If it ever changes, then whoever makes the change will be in charge of making the correct change. I don't generally code based on this kind of thinking. I can only code correctly, and hope that future netCDF programmers will do the same. ;-)
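
The layout point in this thread can be checked with a small sketch. The struct and type names below are hypothetical stand-ins, not the real NC_VAR_INFO members; the only assumption taken from the thread is that nc_bool_t is an int. Reordering int-sized flags should not change the struct size, while stdbool flags can shrink when grouped, depending on the platform's padding rules.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for nc_bool_t, which the thread says is an int. */
typedef int sketch_bool_t;

/* int-sized flags with an int member in between. */
struct int_flags_mixed {
    sketch_bool_t contiguous;
    sketch_bool_t compact;
    int parallel_access;
    sketch_bool_t dimscale;
};

/* Same members, flags grouped after the int. */
struct int_flags_grouped {
    int parallel_access;
    sketch_bool_t contiguous;
    sketch_bool_t compact;
    sketch_bool_t dimscale;
};

/* stdbool flags with an int member in between: padding is likely
 * needed before and after parallel_access. */
struct bool_flags_mixed {
    bool contiguous;
    bool compact;
    int parallel_access;
    bool dimscale;
};

/* stdbool flags grouped: the flags can share one padded tail. */
struct bool_flags_grouped {
    int parallel_access;
    bool contiguous;
    bool compact;
    bool dimscale;
};
```

Comparing `sizeof` on the two int-flag structs and on the two bool-flag structs shows whether grouping matters on a given compiler; the exact byte counts are ABI-dependent, so none are claimed here.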

int parallel_access; /**< Type of parallel access for I/O on variable (collective or independent) */
nc_bool_t dimscale; /**< True if var is a dimscale */
nc_bool_t *dimscale_attached; /**< Array of flags that are true if dimscale is attached for that dim index */
1 change: 1 addition & 0 deletions include/netcdf.h
@@ -293,6 +293,7 @@ NOTE: The NC_MAX_DIMS, NC_MAX_ATTRS, and NC_MAX_VARS limits
/**@{*/
#define NC_CHUNKED 0
#define NC_CONTIGUOUS 1
#define NC_COMPACT 2
/**@}*/

/** In HDF5 files you can set check-summing for each variable.
22 changes: 12 additions & 10 deletions libdispatch/dvar.c
@@ -467,20 +467,21 @@ nc_def_var_fletcher32(int ncid, int varid, int fletcher32)
Note that this does not work for scalar variables. Only non-scalar
variables can have chunking.

@param[in] ncid NetCDF ID, from a previous call to nc_open or
@param ncid NetCDF ID, from a previous call to nc_open or
nc_create.

@param[in] varid Variable ID.
@param varid Variable ID.

@param[in] storage If ::NC_CONTIGUOUS, then contiguous storage is used
for this variable. Variables with one or more unlimited dimensions
cannot use contiguous storage. If contiguous storage is turned on, the
chunksizes parameter is ignored. If ::NC_CHUNKED, then chunked storage
is used for this variable. Chunk sizes may be specified with the
chunksizes parameter or default sizes will be used if that parameter
is NULL.
@param storage If ::NC_CONTIGUOUS or ::NC_COMPACT, then contiguous
or compact storage is used for this variable. Variables with one or
more unlimited dimensions cannot use contiguous or compact
storage. If contiguous or compact storage is turned on, the
chunksizes parameter is ignored. If ::NC_CHUNKED, then chunked
storage is used for this variable. Chunk sizes may be specified
with the chunksizes parameter or default sizes will be used if that
parameter is NULL.

@param[in] chunksizesp A pointer to an array list of chunk sizes. The
@param chunksizesp A pointer to an array list of chunk sizes. The
array must have one chunksize for each dimension of the variable. If
::NC_CONTIGUOUS storage is set, then the chunksizes parameter is
ignored.
@@ -539,6 +540,7 @@ nc_def_var_fletcher32(int ncid, int varid, int fletcher32)
if (chunksize[d] != chunksize_in[d]) ERR;
if (storage_in != NC_CHUNKED) ERR;
@endcode
@author Ed Hartnett, Dennis Heimbigner
*/
int
nc_def_var_chunking(int ncid, int varid, int storage,
4 changes: 3 additions & 1 deletion libhdf5/hdf5open.c
@@ -1088,8 +1088,10 @@ static int get_chunking_info(hid_t propid, NC_VAR_INFO_T *var)
for (d = 0; d < var->ndims; d++)
var->chunksizes[d] = chunksize[d];
}
else if (layout == H5D_CONTIGUOUS || layout == H5D_COMPACT)
else if (layout == H5D_CONTIGUOUS)
var->contiguous = NC_TRUE;
else if (layout == H5D_COMPACT)
var->compact = NC_TRUE;

return NC_NOERR;
}
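
The effect of this hunk can be sketched in isolation. Before the change, a compact layout read from the file was recorded with the contiguous flag; afterwards each layout sets its own flag, so an inquiry can report it back accurately. Everything below is a hypothetical stand-in (the `sketch_` and `SKETCH_` names are not the HDF5 or netCDF identifiers); only the 0/1/2 storage values mirror the netcdf.h hunk above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the HDF5 layout values. */
typedef enum {
    SKETCH_LAYOUT_COMPACT,
    SKETCH_LAYOUT_CONTIGUOUS,
    SKETCH_LAYOUT_CHUNKED
} sketch_layout_t;

/* Mirrors the NC_CHUNKED / NC_CONTIGUOUS / NC_COMPACT values. */
enum {
    SKETCH_STORAGE_CHUNKED = 0,
    SKETCH_STORAGE_CONTIGUOUS = 1,
    SKETCH_STORAGE_COMPACT = 2
};

/* Only the two storage flags from the variable struct. */
typedef struct {
    bool contiguous;
    bool compact;
} sketch_var_t;

/* Record the layout read from the file in separate flags. */
void sketch_set_flags(sketch_var_t *var, sketch_layout_t layout)
{
    var->contiguous = (layout == SKETCH_LAYOUT_CONTIGUOUS);
    var->compact = (layout == SKETCH_LAYOUT_COMPACT);
}

/* Report the storage setting; chunked is the case where neither flag is set. */
int sketch_storage(const sketch_var_t *var)
{
    if (var->contiguous)
        return SKETCH_STORAGE_CONTIGUOUS;
    if (var->compact)
        return SKETCH_STORAGE_COMPACT;
    return SKETCH_STORAGE_CHUNKED;
}
```

With two booleans, chunked storage is implied by both flags being false, which is why every later test for chunking has to check both of them.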
78 changes: 52 additions & 26 deletions libhdf5/hdf5var.c
@@ -22,6 +22,9 @@
* order. */
#define NC_TEMP_NAME "_netcdf4_temporary_variable_name_for_rename"

/** Number of bytes in 64 MB. */
#define SIXTY_FOUR_MB (67108864)

Contributor

The limit for a compact data set is 64 KiB, not 64 MiB.

Contributor Author

Thanks, I will fix.
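
A minimal sketch of what the corrected check might look like, assuming the 64 KiB limit quoted in the comment above. The macro and function names are hypothetical and this is not the code that was eventually merged; it only illustrates the size computation with the smaller constant and a guard against overflow in the product of dimension lengths.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 64 KiB, the compact-dataset limit quoted in the review comment. */
#define SKETCH_COMPACT_MAX_BYTES ((size_t)64 * 1024)

/* Return true if a variable with the given dimension lengths and
 * element size fits within the limit. A scalar (ndims == 0) occupies
 * one element. */
bool sketch_fits_in_compact(const size_t *dimlens, int ndims, size_t type_size)
{
    size_t nbytes = type_size;
    int d;

    for (d = 0; d < ndims; d++)
    {
        /* Guard the multiplication against size_t overflow. */
        if (dimlens[d] != 0 && nbytes > SIZE_MAX / dimlens[d])
            return false;
        nbytes *= dimlens[d];
    }
    return nbytes <= SKETCH_COMPACT_MAX_BYTES;
}
```

For example, a 100 x 100 array of 8-byte values needs 80,000 bytes: that passes a 64 MiB test but is rejected here.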

#ifdef LOGGING
/**
* Report the chunksizes selected for a variable.
@@ -707,41 +710,64 @@ nc_def_var_extra(int ncid, int varid, int *shuffle, int *deflate,
var->contiguous = NC_FALSE;
}

/* Does the user want a contiguous dataset? Not so fast! Make sure
* that there are no unlimited dimensions, and no filters in use
* for this data. */
if (contiguous && *contiguous)
/* Handle storage settings. */
if (contiguous)
{
if (var->deflate || var->fletcher32 || var->shuffle)
return NC_EINVAL;

for (d = 0; d < var->ndims; d++)
if (var->dim[d]->unlimited)
/* Does the user want a contiguous or compact dataset? Not so
* fast! Make sure that there are no unlimited dimensions, and
* no filters in use for this data. */
if (*contiguous)
{
if (var->deflate || var->fletcher32 || var->shuffle)
return NC_EINVAL;
var->contiguous = NC_TRUE;
}

/* Chunksizes anyone? */
if (contiguous && *contiguous == NC_CHUNKED)
{
var->contiguous = NC_FALSE;
for (d = 0; d < var->ndims; d++)
if (var->dim[d]->unlimited)
return NC_EINVAL;
}

/* If the user provided chunksizes, check that they are not too
* big, and that their total size of chunk is less than 4 GB. */
if (chunksizes)
/* Handle chunked storage settings. */
if (*contiguous == NC_CHUNKED)
{
var->contiguous = NC_FALSE;

if ((retval = check_chunksizes(grp, var, chunksizes)))
return retval;
/* If the user provided chunksizes, check that they are not too
* big, and that their total size of chunk is less than 4 GB. */
if (chunksizes)
{
/* Check the chunksizes for validity. */
if ((retval = check_chunksizes(grp, var, chunksizes)))
return retval;

/* Ensure chunksize is smaller than dimension size */
for (d = 0; d < var->ndims; d++)
if(!var->dim[d]->unlimited && var->dim[d]->len > 0 && chunksizes[d] > var->dim[d]->len)
return NC_EBADCHUNK;
/* Ensure chunksize is smaller than dimension size */
for (d = 0; d < var->ndims; d++)
if (!var->dim[d]->unlimited && var->dim[d]->len > 0 &&
chunksizes[d] > var->dim[d]->len)
return NC_EBADCHUNK;

/* Set the chunksizes for this variable. */
/* Set the chunksizes for this variable. */
for (d = 0; d < var->ndims; d++)
var->chunksizes[d] = chunksizes[d];
}
}
else if (*contiguous == NC_CONTIGUOUS)
{
var->contiguous = NC_TRUE;
}
else if (*contiguous == NC_COMPACT)
{
size_t ndata = 1;

/* Ensure that total var is < 64 MB. */
Member

Another 'KB not MB'; I can fix it downstream easily enough.

Contributor Author

Good catch! I will fix...

for (d = 0; d < var->ndims; d++)
var->chunksizes[d] = chunksizes[d];
ndata *= var->dim[d]->len;

/* Ensure var is small enough to fit in compact storage. */
if (ndata * var->type_info->size > SIXTY_FOUR_MB)
return NC_EINVAL;
Contributor

Should this error out if the dataset is too large, or should it fall back to NC_CONTIGUOUS? In CGNS, we fall back instead of erroring out; I'm not sure which is the better behavior.

Contributor Author

In the netCDF API, when you ask for something specifically, and you can't have it, we give you an error.

Contributor

That makes sense. In CGNS the library was in control and not the client. Would NC_EVARSIZE be more appropriate? I guess it's not a violation of a format constraint, but rather a violation of the HDF5 compact variable constraint... but it doesn't seem like an invalid argument either...

Contributor Author

Yes, NC_EVARSIZE is a much better choice. I will change to that.
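
The two behaviors discussed in this thread can be contrasted in a short sketch. All names below are hypothetical, including the error value, which is a placeholder rather than the real netCDF code; the 64 KiB limit is the figure quoted earlier in the review. The strict policy corresponds to the behavior the author describes for the netCDF API, and the fallback policy to the behavior the reviewer describes for CGNS.

```c
#include <stddef.h>

#define SK_COMPACT_MAX_BYTES ((size_t)64 * 1024)
#define SK_NOERR 0
#define SK_ESIZE (-1) /* placeholder for a size-specific error code */

enum sk_storage { SK_CONTIGUOUS = 1, SK_COMPACT = 2 };
enum sk_policy { SK_STRICT, SK_FALLBACK };

/* Resolve a request for compact storage for a variable of nbytes.
 * SK_STRICT: report an error when the data does not fit.
 * SK_FALLBACK: silently use contiguous storage instead.
 * On error, *storage_out is left untouched. */
int sk_resolve_compact(size_t nbytes, enum sk_policy policy,
                       enum sk_storage *storage_out)
{
    if (nbytes <= SK_COMPACT_MAX_BYTES)
    {
        *storage_out = SK_COMPACT;
        return SK_NOERR;
    }
    if (policy == SK_FALLBACK)
    {
        *storage_out = SK_CONTIGUOUS;
        return SK_NOERR;
    }
    return SK_ESIZE;
}
```

The strict form keeps the caller informed: a request that cannot be honored is reported rather than quietly replaced, at the cost of the caller having to handle the error.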


var->contiguous = NC_FALSE;
var->compact = NC_TRUE;
}
}

12 changes: 10 additions & 2 deletions libhdf5/nc4hdf.c
@@ -994,11 +994,17 @@ var_create_dataset(NC_GRP_INFO_T *grp, NC_VAR_INFO_T *var, nc_bool_t write_dimid
}
}

/* Set the var storage to contiguous, compact, or chunked. */
if (var->contiguous)
{
if (H5Pset_layout(plistid, H5D_CONTIGUOUS) < 0)
BAIL(NC_EHDFERR);
}
else if (var->compact)
{
if (H5Pset_layout(plistid, H5D_COMPACT) < 0)
BAIL(NC_EHDFERR);
}
else
{
if (H5Pset_chunk(plistid, var->ndims, chunksize) < 0)
@@ -1106,9 +1112,11 @@ nc4_adjust_var_cache(NC_GRP_INFO_T *grp, NC_VAR_INFO_T *var)
int d;
int retval;

/* Nothing to be done. */
if (var->contiguous)
/* Nothing to be done for contiguous or compact data. */
if (var->contiguous || var->compact)
return NC_NOERR;

/* No cache adjusting for parallel builds. */
#ifdef USE_PARALLEL4
return NC_NOERR;
#endif
16 changes: 13 additions & 3 deletions libsrc4/nc4var.c
@@ -187,16 +187,26 @@ NC4_inq_var_all(int ncid, int varid, char *name, nc_type *xtypep,
if (nattsp)
*nattsp = ncindexcount(var->att);

/* Chunking stuff. */
if (!var->contiguous && chunksizesp)
/* Did the user want the chunksizes? */
if (!var->contiguous && !var->compact && chunksizesp)
{
for (d = 0; d < var->ndims; d++)
{
chunksizesp[d] = var->chunksizes[d];
LOG((4, "chunksizesp[%d]=%d", d, chunksizesp[d]));
}
}

/* Did the user inquire about the storage? */
if (contiguousp)
*contiguousp = var->contiguous ? NC_CONTIGUOUS : NC_CHUNKED;
{
if (var->contiguous)
*contiguousp = NC_CONTIGUOUS;
else if (var->compact)
*contiguousp = NC_COMPACT;
else
*contiguousp = NC_CHUNKED;
}

/* Filter stuff. */
if (deflatep)