SSTFileWriter don't report file size when enabled zstd dictionary training #11146

yihuang · 2023-01-27T02:53:10Z

I was trying to limit the file size when bulk loading with SSTFileWriter, and I found that FileSize() always returns 0 when zstd dictionary training is enabled.

Expected behavior

sstFileWriter.FileSize() should report current progress.

Actual behavior

sstFileWriter.FileSize() always returns 0 when zstd dictionary training is enabled.

Steps to reproduce the behavior

cbi42 · 2023-02-07T04:44:05Z

With zstd dictionary compression, there is a "buffered" stage in which data that has nominally been written is still held in memory so that a compression dictionary can be generated from it. As a result, sstFileWriter.FileSize() currently returns 0 during the buffered stage, and you should see a non-zero file size once the buffered data is written to the file (e.g. after Finish()).

EDIT: this is a reasonable feature, but there is no plan to add support yet. A temporary workaround is to limit buffer size (max_dict_buffer_bytes), but it can hurt compression ratio.
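To make the buffered stage concrete, here is a toy model in Python. It is not RocksDB code: the class and its method names are hypothetical, sizes are uncompressed byte counts, and the assumption that data is flushed as soon as the buffer limit is reached (with 0 meaning no limit) is a simplification of the behavior described above.

```python
class BufferedWriterModel:
    """Toy model of the buffered stage (hypothetical, not the RocksDB API)."""

    def __init__(self, buffer_limit):
        # Stand-in for max_dict_buffer_bytes; 0 is treated as "no limit".
        self.buffer_limit = buffer_limit
        self.buffered = 0     # bytes held in memory for dictionary training
        self.written = 0      # bytes that have reached the file
        self.flushed = False  # True once the buffered stage is over

    def put(self, key, value):
        size = len(key) + len(value)
        if self.flushed:
            self.written += size
            return
        self.buffered += size
        if self.buffer_limit and self.buffered >= self.buffer_limit:
            self._flush()

    def _flush(self):
        # Leave the buffered stage: everything held in memory goes to the file.
        self.written += self.buffered
        self.buffered = 0
        self.flushed = True

    def finish(self):
        if not self.flushed:
            self._flush()

    def file_size(self):
        # Only bytes that reached the file are counted.
        return self.written


unlimited = BufferedWriterModel(buffer_limit=0)
limited = BufferedWriterModel(buffer_limit=1024)
for i in range(100):
    key, value = b"k%04d" % i, b"v" * 95  # 100 bytes per entry
    unlimited.put(key, value)
    limited.put(key, value)

print(unlimited.file_size(), limited.file_size())  # 0 10000
unlimited.finish()
print(unlimited.file_size())  # 10000
```

In this model, a writer with no buffer limit reports 0 until finish() is called, which is the symptom reported in this issue, while a writer with a small limit starts reporting a size as soon as the limit is crossed.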

yihuang · 2023-02-07T05:11:39Z

I haven't dug into it, but I suspect there's more to it. I use something like this to write the SST files:

for key, value in input:
  # rotate to a new file once the current one exceeds 128 MiB
  if sstWriter.FileSize() > 128 * 1024 * 1024:
    sstWriter.Finish()
    sstWriter.Open(next_file)
  sstWriter.Put(key, value)

With zstd dictionary compression, it keeps writing a single 2 GB SST file and never rotates.

Should I just set max_dict_buffer_bytes to the target sst file size?
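One possible way to rotate without relying on FileSize() is to count input bytes in the caller. The sketch below is only an illustration under stated assumptions: `write_rotating`, `open_writer`, and the `RecordingWriter` stub with its `put`/`finish` methods are hypothetical names, not part of any RocksDB binding, and the limit applies to uncompressed key and value bytes, so the compressed files on disk would be smaller than the limit.

```python
def write_rotating(pairs, open_writer, limit_bytes):
    """Write pairs through writers from open_writer(index), rotating on input bytes.

    Returns the number of files that were finished.
    """
    writer, written, files = None, 0, 0
    for key, value in pairs:
        if writer is None:
            writer = open_writer(files)
        writer.put(key, value)
        written += len(key) + len(value)
        if written >= limit_bytes:
            # Rotate based on our own counter, not on the writer's file size.
            writer.finish()
            writer, written = None, 0
            files += 1
    if writer is not None:
        writer.finish()
        files += 1
    return files


class RecordingWriter:
    """Stub writer that only counts entries (hypothetical, for the demo)."""

    def __init__(self, index):
        self.index = index
        self.entries = 0
        self.finished = False

    def put(self, key, value):
        self.entries += 1

    def finish(self):
        self.finished = True


writers = []


def open_writer(index):
    writer = RecordingWriter(index)
    writers.append(writer)
    return writer


pairs = [(b"k%04d" % i, b"v" * 95) for i in range(10)]  # 100 bytes per entry
count = write_rotating(pairs, open_writer, limit_bytes=250)
print(count, [w.entries for w in writers])  # 4 [3, 3, 3, 1]
```

Counting in the caller keeps rotation independent of when the writer leaves the buffered stage, at the cost of an output size that only approximates the target.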

yihuang · 2023-02-07T05:18:44Z

Or maybe we should support setting the target file size in sst file writer? (https://github.com/facebook/rocksdb/blob/main/table/sst_file_writer.cc#L323)

yihuang · 2023-02-07T05:21:35Z

What do you think is the best practice to rotate sst files based on file size?

cbi42 · 2023-02-07T05:41:01Z

I think target file size works the same as setting max_dict_buffer_bytes:

rocksdb/table/block_based/block_based_table_builder.cc

Lines 452 to 459 in 54d7208

    
if (tbo.target_file_size == 0) {
  buffer_limit = compression_opts.max_dict_buffer_bytes;
} else if (compression_opts.max_dict_buffer_bytes == 0) {
  buffer_limit = tbo.target_file_size;
} else {
  buffer_limit = std::min(tbo.target_file_size,
                          compression_opts.max_dict_buffer_bytes);
}

I don't see a way to specify target file size yet, so maybe you can try just setting max_dict_buffer_bytes for now.
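The branch quoted above can be restated as a small Python function to check how the two settings combine. This is a transcription for illustration only; the function name is hypothetical, and the arguments stand in for tbo.target_file_size and compression_opts.max_dict_buffer_bytes.

```python
MIB = 1024 * 1024


def effective_buffer_limit(target_file_size, max_dict_buffer_bytes):
    """Mirror the quoted branch: a value of 0 means that setting is unset."""
    if target_file_size == 0:
        return max_dict_buffer_bytes
    if max_dict_buffer_bytes == 0:
        return target_file_size
    return min(target_file_size, max_dict_buffer_bytes)


# No target file size: the buffer limit is max_dict_buffer_bytes alone.
print(effective_buffer_limit(0, 128 * MIB) // MIB)  # 128
# Both set: the smaller value applies.
print(effective_buffer_limit(64 * MIB, 128 * MIB) // MIB)  # 64
# Neither set: the result is 0.
print(effective_buffer_limit(0, 0))  # 0
```

With no target file size available, as in the SstFileWriter case discussed here, the first case applies and max_dict_buffer_bytes is the only knob that bounds the buffer.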

yihuang · 2023-02-10T06:56:17Z

It doesn't seem to work: the file still doesn't rotate even when its size is more than 1 GB, with a target file size of 128 MB.

ajkr · 2023-02-11T17:10:32Z

Can you share your compression options, including bottommost compression options? I thought setting max_dict_buffer_bytes=128m would cause FileSize() to start reporting the real size once the amount of uncompressed data exceeds 128m.

cbi42 self-assigned this Feb 7, 2023
