ZSTD_TrainDictionary runs even when the compression is set to kNoCompression for a given level #12409

Closed
kwadhwa18 opened this issue Mar 6, 2024 · 4 comments

Comments

@kwadhwa18
Contributor

ZSTD_TrainDictionary [link] runs during SstFileWriter::Finish even when the bottommost_compression option is set to kNoCompression. This reduces the throughput of SstFileWriter::Finish.

We construct RocksDB options that use ZSTD compression for levels 2 and above and kNoCompression for levels 0 and 1. We also set zstd_max_train_bytes to a positive value, which should only apply to the levels with ZSTD compression enabled. These options are used for the database and are also passed to SstFileWriter to create SST files that are later added to that database. Because BlockBasedTableBuilder::Finish [link] only checks that zstd_max_train_bytes is positive, it runs ZSTD_TrainDictionary even when it should not: SstFileWriter operates at the bottommost level, where compression is set to kNoCompression.

Expected behavior

If bottommost_compression or the compression_per_level entry for a level is set to kNoCompression, ZSTD_TrainDictionary should not run for that level.

Actual behavior

ZSTD_TrainDictionary also runs for a level that has kNoCompression set.
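The expected behavior can be sketched as a single predicate. This is only an illustration under stated assumptions: `CompressionKind`, `DictOptions`, and `ShouldTrainDictionary` are hypothetical stand-ins, not RocksDB's actual types or the eventual fix. Only the names `kNoCompression` and `zstd_max_train_bytes` come from this issue.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-level compression setting (not the RocksDB enum).
enum class CompressionKind { kNoCompression, kZSTD };

// Hypothetical stand-in holding only the field discussed in this issue.
struct DictOptions {
  uint32_t zstd_max_train_bytes = 0;  // 0 means no training budget is configured
};

// Returns true only when the level really compresses with ZSTD AND a
// training budget is configured. Checking the budget alone is the
// behavior this issue reports as a problem.
inline bool ShouldTrainDictionary(CompressionKind kind, const DictOptions& opts) {
  return kind == CompressionKind::kZSTD && opts.zstd_max_train_bytes > 0;
}
```

With a check of this shape, a level configured as kNoCompression would skip training even when the training budget is positive.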

@ajkr
Contributor

ajkr commented Mar 6, 2024

Another case: max_dict_bytes > 0 will build a dictionary even when the compression type is kNoCompression. These sound like good sanitizations to add. Would you be interested in adding some of them?

@kwadhwa18
Contributor Author

are you referring to

    for (size_t i = 0;
         i < kNumBlocksBuffered && compression_dict_samples.size() < kSampleBytes;
         ++i) {
      size_t copy_len = std::min(kSampleBytes - compression_dict_samples.size(),
                                 r->data_block_buffers[buffer_idx].size());
      compression_dict_samples.append(r->data_block_buffers[buffer_idx], 0,
                                      copy_len);
      compression_dict_sample_lens.emplace_back(copy_len);
      buffer_idx += kPrimeGeneratorRemainder;
      if (buffer_idx >= kNumBlocksBuffered) {
        buffer_idx -= kNumBlocksBuffered;
      }
    }
?
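For readers unfamiliar with that loop, here is a self-contained sketch of the same sampling pattern. It visits the buffered blocks in a strided order and copies bytes until a sample budget is reached. `DictSamples`, `CollectSamples`, and the `start_idx` and `stride` parameters are hypothetical names introduced for illustration. The snippet above uses constants (`kNumBlocksBuffered`, `kSampleBytes`, `kPrimeGeneratorRemainder`) whose definitions are not shown in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Concatenated sample bytes plus the length of each individual sample.
struct DictSamples {
  std::string data;
  std::vector<size_t> lens;
};

// Collects at most max_sample_bytes from blocks, starting at start_idx and
// advancing by stride (modulo the block count) after every sample.
// Assumption: stride is coprime with blocks.size(), so no block repeats.
inline DictSamples CollectSamples(const std::vector<std::string>& blocks,
                                  size_t max_sample_bytes, size_t start_idx,
                                  size_t stride) {
  DictSamples out;
  const size_t n = blocks.size();
  if (n == 0) return out;
  size_t idx = start_idx % n;
  const size_t step = stride % n;
  for (size_t i = 0; i < n && out.data.size() < max_sample_bytes; ++i) {
    // Take the whole block, or only what still fits in the budget.
    const size_t copy_len =
        std::min(max_sample_bytes - out.data.size(), blocks[idx].size());
    out.data.append(blocks[idx], 0, copy_len);
    out.lens.emplace_back(copy_len);
    idx += step;
    if (idx >= n) idx -= n;  // wrap around; idx and step are both < n
  }
  return out;
}
```

For example, with five 4-byte blocks, a 10-byte budget, `start_idx = 2`, and `stride = 3`, the sketch samples blocks 2, 0, and 3 in that order and truncates the last sample to 2 bytes.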

I can help with the sanitizations - is the level information available inside BlockBasedTableBuilder?

@ajkr
Contributor

ajkr commented Mar 8, 2024

> are you referring to
>
>     for (size_t i = 0;
>          i < kNumBlocksBuffered && compression_dict_samples.size() < kSampleBytes;
>          ++i) {
>       size_t copy_len = std::min(kSampleBytes - compression_dict_samples.size(),
>                                  r->data_block_buffers[buffer_idx].size());
>       compression_dict_samples.append(r->data_block_buffers[buffer_idx], 0,
>                                       copy_len);
>       compression_dict_sample_lens.emplace_back(copy_len);
>       buffer_idx += kPrimeGeneratorRemainder;
>       if (buffer_idx >= kNumBlocksBuffered) {
>         buffer_idx -= kNumBlocksBuffered;
>       }
>     }
>
> ?

Yes.

> I can help with the sanitizations - is the level information available inside BlockBasedTableBuilder?

Yes, BlockBasedTableBuilder::Rep has compression_type and compression_opts, which are the settings specific to the level for which the table is being built.

edit: Technically the answer to your question is no, but my point is BlockBasedTableBuilder::Rep has everything you need without it
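One way to picture a sanitization built on those two fields is the sketch below. It is a sketch under stated assumptions, not the change made in the linked fix. `SketchCompressionType`, `SketchCompressionOptions`, `SketchRep`, and `SanitizeDictionaryOptions` are hypothetical stand-ins. Only the member names `compression_type`, `compression_opts`, `max_dict_bytes`, and `zstd_max_train_bytes` are taken from this thread.

```cpp
#include <cstdint>

// Hypothetical stand-ins; the real RocksDB types have more members.
enum class SketchCompressionType { kNoCompression, kZSTD };

struct SketchCompressionOptions {
  uint32_t max_dict_bytes = 0;        // dictionary size limit; 0 disables dictionaries
  uint32_t zstd_max_train_bytes = 0;  // training budget; 0 disables training
};

// Mirrors only the two members of BlockBasedTableBuilder::Rep named above.
struct SketchRep {
  SketchCompressionType compression_type = SketchCompressionType::kNoCompression;
  SketchCompressionOptions compression_opts;
};

// Clears dictionary settings that cannot take effect for an uncompressed
// table, so later code that only checks "> 0" does no dictionary work.
inline void SanitizeDictionaryOptions(SketchRep& rep) {
  if (rep.compression_type == SketchCompressionType::kNoCompression) {
    rep.compression_opts.max_dict_bytes = 0;
    rep.compression_opts.zstd_max_train_bytes = 0;
  }
}
```

Zeroing the settings once, where the builder state is set up, keeps the existing positive-value checks unchanged. No level number is needed because the per-level settings are already resolved in these fields.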

@kwadhwa18
Contributor Author

I have attempted a fix in #12420. PTAL!
