SSTFileWriter don't report file size when enabled zstd dictionary training #11146

Open
yihuang opened this issue Jan 27, 2023 · 7 comments

@yihuang
Contributor

yihuang commented Jan 27, 2023

Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

I was trying to limit the file size when bulk loading with SSTFileWriter, and I found that FileSize() always returns 0 when zstd dictionary training is enabled.

Expected behavior

sstFileWriter.FileSize() should report current progress.

Actual behavior

sstFileWriter.FileSize() always returns 0 when zstd dictionary training is enabled.

Steps to reproduce the behavior

@cbi42
Member

cbi42 commented Feb 7, 2023

With zstd dictionary compression, there is a "buffered" stage in which data that has nominally been written is kept in memory, so that a compression dictionary can be generated from the buffered data. Right now this causes sstFileWriter.FileSize() to return 0 during the buffered stage, and you should see a non-zero file size once the buffered data is written to the file (e.g. after Finish()).

EDIT: this is a reasonable feature, but there is no plan to add support yet. A temporary workaround is to limit the buffer size (max_dict_buffer_bytes), but that can hurt the compression ratio.
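
The buffered stage described above can be illustrated with a small toy model. This is only a sketch and not RocksDB code: `ToyBufferedWriter`, its flush rule, and the assumption that a limit of 0 means "buffer until Finish()" are all hypothetical, and compression is ignored, so sizes are raw key plus value bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-in for a writer with a dictionary "buffered" stage.
// NOT RocksDB code: the class name and the flush rule are assumptions.
class ToyBufferedWriter {
 public:
  explicit ToyBufferedWriter(uint64_t buffer_limit)
      : buffer_limit_(buffer_limit) {}

  void Put(const std::string& key, const std::string& value) {
    const uint64_t entry_bytes = key.size() + value.size();
    if (!buffering_) {
      // Unbuffered stage: bytes go straight to the "file".
      file_bytes_ += entry_bytes;
      return;
    }
    buffered_bytes_ += entry_bytes;
    // Assumption: a limit of 0 means "no limit", so only Finish() flushes.
    if (buffer_limit_ != 0 && buffered_bytes_ >= buffer_limit_) {
      Flush();
    }
  }

  void Finish() {
    if (buffering_) Flush();
  }

  // Reports only bytes that have left the in-memory buffer.
  uint64_t FileSize() const { return file_bytes_; }

 private:
  void Flush() {
    file_bytes_ += buffered_bytes_;
    buffered_bytes_ = 0;
    buffering_ = false;
  }

  uint64_t buffer_limit_;
  uint64_t buffered_bytes_ = 0;
  uint64_t file_bytes_ = 0;
  bool buffering_ = true;
};
```

In this model a writer without a buffer limit reports a size of 0 for every Put() and only shows the full size after Finish(), which matches the behavior reported in this issue.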

@cbi42 cbi42 self-assigned this Feb 7, 2023
@yihuang
Contributor Author

yihuang commented Feb 7, 2023

I haven't dug into it, but I guess there's more to it. I use something like this to write the SST files:

# pseudocode
for key, value in input:
  if sstWriter.FileSize() > 128 * 1024 * 1024:  # 128 MB threshold
    sstWriter.Finish()
    sstWriter.Open(next_file)
  sstWriter.Put(key, value)

With zstd dictionary compression, it keeps generating 2 GB SST files and never rotates.

Should I just set max_dict_buffer_bytes to the target sst file size?
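
One possible way to make a rotation loop like the one above independent of FileSize() is to count the bytes handed to the writer on the caller's side. The sketch below is an assumption, not something proposed by the maintainers: `RotationCounter` and `PlanFiles` are hypothetical names, and the count is of uncompressed key plus value bytes, so the compressed file on disk would usually be smaller than the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: tracks uncompressed bytes added to the current file.
class RotationCounter {
 public:
  explicit RotationCounter(uint64_t target_bytes)
      : target_bytes_(target_bytes) {}

  // Returns true if the current file should be finished before this entry
  // is added. The entry is then counted against the new file.
  bool ShouldRotateBefore(const std::string& key, const std::string& value) {
    const uint64_t entry_bytes = key.size() + value.size();
    if (bytes_in_file_ > 0 && bytes_in_file_ + entry_bytes > target_bytes_) {
      bytes_in_file_ = entry_bytes;
      return true;
    }
    bytes_in_file_ += entry_bytes;
    return false;
  }

 private:
  uint64_t target_bytes_;
  uint64_t bytes_in_file_ = 0;
};

// Returns how many entries would land in each file for a given threshold.
std::vector<std::size_t> PlanFiles(
    const std::vector<std::pair<std::string, std::string>>& entries,
    uint64_t target_bytes) {
  RotationCounter counter(target_bytes);
  std::vector<std::size_t> files;
  for (const auto& entry : entries) {
    const bool rotate = counter.ShouldRotateBefore(entry.first, entry.second);
    if (files.empty() || rotate) {
      // A real loop would call Finish() and then Open() on the next file here.
      files.push_back(0);
    }
    ++files.back();
  }
  return files;
}
```

The trade-off is that the threshold applies to input bytes rather than to the on-disk size, so the resulting files would be smaller than the target by roughly the compression ratio.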

@yihuang
Contributor Author

yihuang commented Feb 7, 2023

Or maybe we should support setting the target file size in sst file writer? (https://github.com/facebook/rocksdb/blob/main/table/sst_file_writer.cc#L323)

@yihuang
Contributor Author

yihuang commented Feb 7, 2023

What do you think is the best practice to rotate sst files based on file size?

@cbi42
Member

cbi42 commented Feb 7, 2023

I think target file size works the same as setting max_dict_buffer_bytes:

if (tbo.target_file_size == 0) {
  buffer_limit = compression_opts.max_dict_buffer_bytes;
} else if (compression_opts.max_dict_buffer_bytes == 0) {
  buffer_limit = tbo.target_file_size;
} else {
  buffer_limit = std::min(tbo.target_file_size,
                          compression_opts.max_dict_buffer_bytes);
}

I don't see a way to specify the target file size yet, so maybe you can just try setting max_dict_buffer_bytes for now.
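
The branch quoted above can be restated as a standalone function to make the cases easier to check. This is a sketch only: `EffectiveBufferLimit` is a hypothetical name, and reading a result of 0 as "no limit" is an assumption about how the caller treats it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Mirrors the quoted branch: a value of 0 means "unset" for either input,
// and when both are set the smaller one wins.
uint64_t EffectiveBufferLimit(uint64_t target_file_size,
                              uint64_t max_dict_buffer_bytes) {
  if (target_file_size == 0) {
    return max_dict_buffer_bytes;
  }
  if (max_dict_buffer_bytes == 0) {
    return target_file_size;
  }
  return std::min(target_file_size, max_dict_buffer_bytes);
}
```

With no target file size available (the first branch), the limit is whatever max_dict_buffer_bytes is set to, which is why setting that option is the suggested workaround here.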

@yihuang
Contributor Author

yihuang commented Feb 10, 2023

Setting max_dict_buffer_bytes doesn't seem to work: the file still doesn't rotate even when its size is more than 1 GB, with a target file size of 128 MB.

@ajkr
Contributor

ajkr commented Feb 11, 2023

Can you share your compression options, including the bottommost compression options? I thought setting max_dict_buffer_bytes=128m would cause FileSize() to start reporting the real size once the amount of uncompressed data exceeded 128m.
