
Possible regression in dictionary training in 1.4.0 #1599

Closed
FrancescAlted opened this issue Apr 24, 2019 · 4 comments

@FrancescAlted

After upgrading zstd in c-blosc2 to 1.4.0 (from 1.3.4), I've got this regression:

$ tests/test_dict_schunk

[blocksize: 1 KB] cratio w/o dict: 3.5x (compr @ 278.1 MB/s, decompr @ 3481.8 MB/s)
.[blocksize: 1 KB] cratio with dict: 17.3x (compr @ 825.3 MB/s, decompr @ 2966.4 MB/s)
.[blocksize: 4 KB] cratio w/o dict: 13.2x (compr @ 799.1 MB/s, decompr @ 5123.3 MB/s)
.[blocksize: 4 KB] cratio with dict: 54.1x (compr @ 1497.9 MB/s, decompr @ 4754.9 MB/s)
.[blocksize: 32 KB] cratio w/o dict: 86.0x (compr @ 1129.6 MB/s, decompr @ 5740.5 MB/s)
.[blocksize: 32 KB] cratio with dict: 225.3x (compr @ 1797.9 MB/s, decompr @ 6201.0 MB/s)
.[blocksize: 256 KB] cratio w/o dict: 286.1x (compr @ 896.3 MB/s, decompr @ 6036.4 MB/s)
Error in ZDICT_trainFromBuffer(): 'Error (generic)'.  Giving up.
.F (ERROR: incorrect nchunks value)
	Tests run: 8

The fact that the ZDICT_trainFromBuffer() error message is just 'generic' does not help much in figuring out what's going on.

When going back to commit Blosc/c-blosc2@7ee5507 (i.e., prior to the 1.4.0 update), the output of the test above is fine:

$ tests/test_dict_schunk

[blocksize: 1 KB] cratio w/o dict: 3.5x (compr @ 312.0 MB/s, decompr @ 3426.8 MB/s)
.[blocksize: 1 KB] cratio with dict: 22.3x (compr @ 350.6 MB/s, decompr @ 3565.2 MB/s)
.[blocksize: 4 KB] cratio w/o dict: 13.2x (compr @ 860.1 MB/s, decompr @ 5747.8 MB/s)
.[blocksize: 4 KB] cratio with dict: 73.6x (compr @ 1100.6 MB/s, decompr @ 5600.6 MB/s)
.[blocksize: 32 KB] cratio w/o dict: 76.9x (compr @ 1141.4 MB/s, decompr @ 6316.1 MB/s)
.[blocksize: 32 KB] cratio with dict: 273.5x (compr @ 1812.6 MB/s, decompr @ 5307.8 MB/s)
.[blocksize: 256 KB] cratio w/o dict: 284.8x (compr @ 994.6 MB/s, decompr @ 5787.4 MB/s)
.[blocksize: 256 KB] cratio with dict: 353.9x (compr @ 2623.7 MB/s, decompr @ 6588.8 MB/s)
.[blocksize: automatic] cratio w/o dict: 284.8x (compr @ 1056.7 MB/s, decompr @ 5751.8 MB/s)
.[blocksize: automatic] cratio with dict: 391.5x (compr @ 2945.3 MB/s, decompr @ 6652.7 MB/s)
. ALL TESTS PASSED	Tests run: 10

You can have a look at how I do the dict training here.

To reproduce the issue, clone the c-blosc2 repo, cd into it, and do the usual:

$ mkdir build; cd build
$ cmake ..
$ make -j4
$ tests/test_dict_schunk

Thanks!

@FrancescAlted (Author)

I dug further here, and apparently ZDICT_trainFromBuffer() behaves well whenever the number of blocks passed is >= 8 (I was passing 4). This is certainly a change from 1.3.4, and I would suggest either documenting this as another note around https://github.com/facebook/zstd/blame/dev/lib/dictBuilder/zdict.h#L49 or, even better, emitting more meaningful error messages, like "number of blocks too small".

@FrancescAlted (Author)

As a side note, I am not sure why it is necessary to pass the samplesSizes parameter, since samplesBuffer is required to be contiguous. Wouldn't something like this be easier for the user:

size_t ZDICT_trainFromBufferSimple(void* dictBuffer, size_t dictBufferCapacity,
                       const void* samplesBuffer, const size_t samplesBufferSize);

Related to dictionary training, I have another, different use case: passing a discontiguous buffer to the training function. The scenario is Blosc having a data chunk split into blocks that are compressed independently; passing the initial bytes of every block would be great, so as to accumulate the redundancy among the blocks in the chunk dictionary. Supporting a discontiguous samplesBuffer would remove the need to create a buffer just to copy data, making dictionary creation faster. I am suggesting something like:

size_t ZDICT_trainFromDiscontiguousBuffer(void* dictBuffer, size_t dictBufferCapacity,
    const void** samplesBuffer,  // an array of pointers to the starting point of blocks
    const size_t* samplesSizes, unsigned nbSamples);

If you find this to be feasible, I can open a different ticket with the suggestion.

@terrelln terrelln self-assigned this Apr 25, 2019
@terrelln (Contributor)

The dictionary error codes could certainly be improved. I'll leave this issue open until I get a chance to improve the error codes, and document the minimum number of samples in the header.

  • If you either call one of the advanced functions with notificationLevel > 0, or, for debugging, just compile Blosc with cmake .. -DCMAKE_C_FLAGS="-DDEBUGLEVEL=2", zstd will print out that it needs more samples.
  • We need the sampleSizes because the algorithms try to detect redundancies between the files, not redundancies within a single file. We also split the dataset into a test set and a training set.
  • We require a single buffer because it is easier to work with. At this point we can't change the API of ZDICT_trainFromBuffer().
  • I don't want to clutter the API with 2 versions of each function, and all the algorithms require a contiguous buffer to work with, so it wouldn't be any more efficient.
  • Looking at Blosc/c-blosc2@4454ce2: cover.h is an internal header, but those prototypes are provided in zdict.h when using ZDICT_STATIC_LINKING_ONLY.

@FrancescAlted (Author) commented Apr 29, 2019

Thanks for the hints. I still don't see why you need sampleSizes, maybe because my use case is to pass a single sample of a data chunk in order to build a dictionary, and then use that dictionary to compress other chunks to come (Blosc2 stores the dictionary in the container, so that it can be used for decompression). But anyway, it is probably better to leave things as they are now :-)
