
Possible regression in dictionary training in 1.4.0 #1599

Closed
FrancescAlted opened this issue Apr 24, 2019 · 4 comments

@FrancescAlted

After upgrading zstd in c-blosc2 to 1.4.0 (from 1.3.4), I've got this regression:

$ tests/test_dict_schunk

[blocksize: 1 KB] cratio w/o dict: 3.5x (compr @ 278.1 MB/s, decompr @ 3481.8 MB/s)
.[blocksize: 1 KB] cratio with dict: 17.3x (compr @ 825.3 MB/s, decompr @ 2966.4 MB/s)
.[blocksize: 4 KB] cratio w/o dict: 13.2x (compr @ 799.1 MB/s, decompr @ 5123.3 MB/s)
.[blocksize: 4 KB] cratio with dict: 54.1x (compr @ 1497.9 MB/s, decompr @ 4754.9 MB/s)
.[blocksize: 32 KB] cratio w/o dict: 86.0x (compr @ 1129.6 MB/s, decompr @ 5740.5 MB/s)
.[blocksize: 32 KB] cratio with dict: 225.3x (compr @ 1797.9 MB/s, decompr @ 6201.0 MB/s)
.[blocksize: 256 KB] cratio w/o dict: 286.1x (compr @ 896.3 MB/s, decompr @ 6036.4 MB/s)
Error in ZDICT_trainFromBuffer(): 'Error (generic)'.  Giving up.
.F (ERROR: incorrect nchunks value)
	Tests run: 8

The fact that the ZDICT_trainFromBuffer() error message is just 'generic' does not help much in figuring out what's going on.

When going back to commit Blosc/c-blosc2@7ee5507 (i.e., prior to the 1.4.0 update), the output of the test above is fine:

$ tests/test_dict_schunk

[blocksize: 1 KB] cratio w/o dict: 3.5x (compr @ 312.0 MB/s, decompr @ 3426.8 MB/s)
.[blocksize: 1 KB] cratio with dict: 22.3x (compr @ 350.6 MB/s, decompr @ 3565.2 MB/s)
.[blocksize: 4 KB] cratio w/o dict: 13.2x (compr @ 860.1 MB/s, decompr @ 5747.8 MB/s)
.[blocksize: 4 KB] cratio with dict: 73.6x (compr @ 1100.6 MB/s, decompr @ 5600.6 MB/s)
.[blocksize: 32 KB] cratio w/o dict: 76.9x (compr @ 1141.4 MB/s, decompr @ 6316.1 MB/s)
.[blocksize: 32 KB] cratio with dict: 273.5x (compr @ 1812.6 MB/s, decompr @ 5307.8 MB/s)
.[blocksize: 256 KB] cratio w/o dict: 284.8x (compr @ 994.6 MB/s, decompr @ 5787.4 MB/s)
.[blocksize: 256 KB] cratio with dict: 353.9x (compr @ 2623.7 MB/s, decompr @ 6588.8 MB/s)
.[blocksize: automatic] cratio w/o dict: 284.8x (compr @ 1056.7 MB/s, decompr @ 5751.8 MB/s)
.[blocksize: automatic] cratio with dict: 391.5x (compr @ 2945.3 MB/s, decompr @ 6652.7 MB/s)
. ALL TESTS PASSED	Tests run: 10

You can have a look at how I do the dict training here.

To reproduce the issue, clone the c-blosc2 repo, cd into it, and do the usual:

$ mkdir build; cd build
$ cmake ..
$ make -j4
$ tests/test_dict_schunk

Thanks!

@FrancescAlted (Author)

I dug further here, and apparently ZDICT_trainFromBuffer() behaves well whenever the number of blocks passed is >= 8 (I was passing 4). This is certainly a change from 1.3.4, and I would suggest either documenting this as another note around https://github.com/facebook/zstd/blame/dev/lib/dictBuilder/zdict.h#L49 or, even better, emitting more meaningful error messages, like "number of blocks too small".

@FrancescAlted (Author)

As a side note, I am not sure why it is necessary to pass the samplesSizes parameter, since samplesBuffer is required to be contiguous. Wouldn't something like this be easier for the user:

size_t ZDICT_trainFromBufferSimple(void* dictBuffer, size_t dictBufferCapacity,
                       const void* samplesBuffer, const size_t samplesBufferSize);

Related to dictionary training, I have another, different use case: passing a discontiguous buffer to the training function. The scenario is Blosc having a data chunk split into blocks that are compressed independently; passing the initial bytes of every block would be great, so as to accumulate the redundancy among the blocks in the chunk dictionary. Supporting a discontiguous samplesBuffer would remove the need to create a buffer just to copy data, making dictionary creation faster. I am suggesting something like:

size_t ZDICT_trainFromDiscontiguousBuffer(void* dictBuffer, size_t dictBufferCapacity,
    const void** samplesBuffer,  // an array of pointers to the starting point of blocks
    const size_t* samplesSizes, unsigned nbSamples);

If you find this to be feasible, I can open a different ticket with the suggestion.

@terrelln terrelln self-assigned this Apr 25, 2019
@terrelln (Contributor)

The dictionary error codes could certainly be improved. I'll leave this issue open until I get a chance to improve the error codes, and document the minimum number of samples in the header.

  • If you either call one of the advanced functions with notificationLevel > 0, or, for debugging, just compile Blosc with cmake .. -DCMAKE_C_FLAGS="-DDEBUGLEVEL=2", zstd will print out that it needs more samples.
  • We need the sampleSizes because the algorithms try to detect redundancies between the files, not redundancies within a single file. We also split the dataset into a test set and a training set.
  • We require a single buffer because it is easier to work with. At this point we can't change the API of ZDICT_trainFromBuffer().
  • I don't want to clutter the API with 2 versions of each function, and all the algorithms require a contiguous buffer to work with, so it wouldn't be any more efficient.
  • Looking at Blosc/c-blosc2@4454ce2: cover.h is an internal header, but those prototypes are provided in zdict.h when using ZDICT_STATIC_LINKING_ONLY.

@FrancescAlted (Author) commented Apr 29, 2019

Thanks for the hints. I still don't see why you need sampleSizes, maybe because my use case is to pass a single sample of a data chunk in order to build a dictionary, and then use that dictionary to compress other chunks to come (Blosc2 stores the dictionary in the container, so that it can be used for decompression). But anyway, it is probably better to leave things as they are now :-)
