Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show number of blocks compressed and not compressed when using "sst_dump --show_compression_sizes". #1474

Closed
thatsafunnyname opened this issue Nov 8, 2016 · 5 comments

Comments

@thatsafunnyname
Copy link
Contributor

thatsafunnyname commented Nov 8, 2016

Hello and thank you for RocksDB,

Rather than sst_dump --show_compression_sizes just showing the compression sizes:

Compression: kNoCompression Size:     65040323 
Compression: kSnappyCompression Size: 64355577 
Compression: kZlibCompression Size:   63778913 
Compression: kLZ4Compression Size:    64209486
Compression: kLZ4HCCompression Size:  64041757

I found also showing the number of blocks compressed and not compressed to be valuable:

Compression: kNoCompression     Size: 65040323 Blocks compressed:   0 Blocks not compressed:    0
Compression: kSnappyCompression Size: 64355577 Blocks compressed: 212 Blocks not compressed: 1849
Compression: kZlibCompression   Size: 63778913 Blocks compressed: 358 Blocks not compressed: 1703
Compression: kLZ4Compression    Size: 64209486 Blocks compressed: 249 Blocks not compressed: 1812
Compression: kLZ4HCCompression  Size: 64041757 Blocks compressed: 292 Blocks not compressed: 1769

I am testing compression options on values that are already compressed with LZ4, and have bumped up against the 12.5% threshold for GoodCompressionRatio (originally from ldb) as I have datasets where many blocks compress by 10%.

Currently NUMBER_BLOCK_NOT_COMPRESSED is only incremented when compression is aborted because the block is too big or did not pass verification ( added in f43c826 ) , it does not require ShouldReportDetailedTime(r->ioptions.env, r->ioptions.statistics) to be true, it is not incremented when a block is not compressed because it does not compress by 12.5%. NUMBER_BLOCK_COMPRESSED is incremented when a compressed block is used/written, it requries ShouldReportDetailedTime(r->ioptions.env, r->ioptions.statistics) to be true, it was added in 9430333 .

When the code ( with NUMBER_BLOCK_NOT_COMPRESSED (aborted compression) ) had NUMBER_BLOCK_COMPRESSED (compression meets threshold) added, I suspect counting blocks that fail to meet the threshold was overlooked, or maybe just not a priority.

The output above with the number of blocks not compressed uses a change that changes NUMBER_BLOCK_NOT_COMPRESSED to also be incremented when a block does not compress by 12.5%. This is a change in behaviour that is likely not acceptable:

> diff -du tools/sst_dump_tool.cc.orig tools/sst_dump_tool.cc
--- tools/sst_dump_tool.cc.orig 2016-11-04 09:33:20.505992575 -0400
+++ tools/sst_dump_tool.cc      2016-11-08 11:37:45.503881967 -0500
@@ -178,6 +178,8 @@
 int SstFileReader::ShowAllCompressionSizes(size_t block_size) {
   ReadOptions read_options;
   Options opts;
+  opts.statistics = rocksdb::CreateDBStatistics();
+  opts.statistics->stats_level_ = StatsLevel::kAll;
   const ImmutableCFOptions imoptions(opts);
   rocksdb::InternalKeyComparator ikc(opts.comparator);
   std::vector<std::unique_ptr<IntTblPropCollectorFactory> >
@@ -206,7 +208,9 @@
                                   false /* skip_filters */, column_family_name);
       uint64_t file_size = CalculateCompressedTableSize(tb_opts, block_size);
       fprintf(stdout, "Compression: %s", i.second);
-      fprintf(stdout, " Size: %" PRIu64 "\n", file_size);
+      fprintf(stdout, " Size: %" PRIu64, file_size);
+      fprintf(stdout, " Number of blocks compressed: %" PRIu64, opts.statistics->getAndResetTickerCount(NUMBER_BLOCK_COMPRESSED));
+      fprintf(stdout, " Number of blocks not compressed: %" PRIu64 "\n", opts.statistics->getAndResetTickerCount(NUMBER_BLOCK_NOT_COMPRESSED));
     } else {
       fprintf(stdout, "Unsupported compression type: %s.\n", i.second);
     }

> diff -du4 table/block_based_table_builder.cc.orig table/block_based_table_builder.cc
--- table/block_based_table_builder.cc.orig     2016-11-08 11:39:50.682322192 -0500
+++ table/block_based_table_builder.cc  2016-11-08 07:46:55.875903674 -0500
@@ -706,17 +706,19 @@
   if (abort_compression) {
     RecordTick(r->ioptions.statistics, NUMBER_BLOCK_NOT_COMPRESSED);
     type = kNoCompression;
     block_contents = raw_block_contents;
-  }
-  else if (type != kNoCompression &&
-            ShouldReportDetailedTime(r->ioptions.env,
-              r->ioptions.statistics)) {
-    MeasureTime(r->ioptions.statistics, COMPRESSION_TIMES_NANOS,
-      timer.ElapsedNanos());
-    MeasureTime(r->ioptions.statistics, BYTES_COMPRESSED,
-      raw_block_contents.size());
-    RecordTick(r->ioptions.statistics, NUMBER_BLOCK_COMPRESSED);
+  } else if (ShouldReportDetailedTime(r->ioptions.env,
+                                     r->ioptions.statistics)) {
+    if (type != kNoCompression) {
+      MeasureTime(r->ioptions.statistics, COMPRESSION_TIMES_NANOS,
+                 timer.ElapsedNanos());
+      MeasureTime(r->ioptions.statistics, BYTES_COMPRESSED,
+                 raw_block_contents.size());
+      RecordTick(r->ioptions.statistics, NUMBER_BLOCK_COMPRESSED);
+    } else if (type != r->compression_type) {
+      RecordTick(r->ioptions.statistics, NUMBER_BLOCK_NOT_COMPRESSED);
+    }
   }

   WriteRawBlock(block_contents, type, handle);
   r->compressed_output.clear();

Alternatively NUMBER_BLOCK_NOT_COMPRESSED could be left unchanged and a new metric, NUMBER_BLOCK_COMPRESSION_RATIO_FAIL added. Avoiding any change in behaviour (for existing counters) for consumers.

Or NUMBER_BLOCK_ABORT_COMPRESSION could be added and used as NUMBER_BLOCK_NOT_COMPRESSED is used now (aborts), and NUMBER_BLOCK_NOT_COMPRESSED used for ratio fails (when ShouldReportDetailedTime) and aborted compression due to size or failed verify.

@thatsafunnyname
Copy link
Contributor Author

For sst_dump --show_compression_sizes the number of blocks not compressed because they did not pass the 12.5% compression threshold could be calculated by using the number of blocks in the SST. Avoiding any changes to table/block_based_table_builder.cc.

> diff -du tools/sst_dump_tool.cc.orig tools/sst_dump_tool.cc
--- tools/sst_dump_tool.cc.orig 2016-11-04 09:33:20.505992575 -0400
+++ tools/sst_dump_tool.cc      2016-11-10 14:39:44.468225894 -0500
@@ -143,7 +143,7 @@
 }

 uint64_t SstFileReader::CalculateCompressedTableSize(
-    const TableBuilderOptions& tb_options, size_t block_size) {
+    const TableBuilderOptions& tb_options, size_t block_size, uint64_t& num_data_blocks) {
   unique_ptr<WritableFile> out_file;
   unique_ptr<Env> env(NewMemEnv(Env::Default()));
   env->NewWritableFile(testFileName, &out_file, soptions_);
@@ -171,6 +171,7 @@
     exit(1);
   }
   uint64_t size = table_builder->FileSize();
+  num_data_blocks = table_builder->GetTableProperties().num_data_blocks;
   env->DeleteFile(testFileName);
   return size;
 }
@@ -178,6 +179,8 @@
 int SstFileReader::ShowAllCompressionSizes(size_t block_size) {
   ReadOptions read_options;
   Options opts;
+  opts.statistics = rocksdb::CreateDBStatistics();
+  opts.statistics->stats_level_ = StatsLevel::kAll;
   const ImmutableCFOptions imoptions(opts);
   rocksdb::InternalKeyComparator ikc(opts.comparator);
   std::vector<std::unique_ptr<IntTblPropCollectorFactory> >
@@ -204,9 +207,20 @@
                                   i.first, compress_opt,
                                   nullptr /* compression_dict */,
                                   false /* skip_filters */, column_family_name);
-      uint64_t file_size = CalculateCompressedTableSize(tb_opts, block_size);
-      fprintf(stdout, "Compression: %s", i.second);
-      fprintf(stdout, " Size: %" PRIu64 "\n", file_size);
+      uint64_t num_data_blocks = 0;
+      uint64_t file_size = CalculateCompressedTableSize(tb_opts, block_size, num_data_blocks);
+      fprintf(stdout, "Compression: %-24s", i.second);
+      fprintf(stdout, " Size: %10" PRIu64, file_size);
+      fprintf(stdout, " Blocks: %6" PRIu64, num_data_blocks);
+      uint64_t compressed_blocks           = opts.statistics->getAndResetTickerCount(NUMBER_BLOCK_COMPRESSED);
+      uint64_t not_compressed_blocks       = opts.statistics->getAndResetTickerCount(NUMBER_BLOCK_NOT_COMPRESSED);
+      uint64_t ratio_not_compressed_blocks = (num_data_blocks - compressed_blocks) - not_compressed_blocks;
+      double compressed_pcnt               = ( 0 == num_data_blocks ) ? 0.0 : ( ( (double)compressed_blocks           / (double)num_data_blocks ) * 100.0 ) ;
+      double ratio_not_compressed_pcnt     = ( 0 == num_data_blocks ) ? 0.0 : ( ( (double)ratio_not_compressed_blocks / (double)num_data_blocks ) * 100.0 ) ;
+      double not_compressed_pcnt           = ( 0 == num_data_blocks ) ? 0.0 : ( ( (double)not_compressed_blocks       / (double)num_data_blocks ) * 100.0 ) ;
+      fprintf(stdout, " Compressed: %6"             PRIu64 " (%5.1f%%)",   compressed_blocks, compressed_pcnt);
+      fprintf(stdout, " Not compressed (ratio): %6" PRIu64 " (%5.1f%%)",   ratio_not_compressed_blocks, ratio_not_compressed_pcnt);
+      fprintf(stdout, " Not compressed (abort): %6" PRIu64 " (%5.1f%%)\n", not_compressed_blocks, not_compressed_pcnt);
     } else {
       fprintf(stdout, "Unsupported compression type: %s.\n", i.second);
     }

> diff -du tools/sst_dump_tool_imp.h.orig tools/sst_dump_tool_imp.h
--- tools/sst_dump_tool_imp.h.orig      2016-11-10 14:43:05.259253965 -0500
+++ tools/sst_dump_tool_imp.h   2016-11-10 13:29:09.562311567 -0500
@@ -40,7 +40,7 @@
                              RandomAccessFileReader* file, uint64_t file_size);

   uint64_t CalculateCompressedTableSize(const TableBuilderOptions& tb_options,
-                                        size_t block_size);
+                                        size_t block_size, uint64_t& num_data_blocks);

   Status SetTableOptionsByMagicNumber(uint64_t table_magic_number);
   Status SetOldTableOptions();

Example output:

Sst file format: block-based
Block Size: 16384
Compression: kNoCompression           Size:   64898800 Blocks:   1918 Compressed:      0 (  0.0%) Not compressed (ratio):   1918 (100.0%) Not compressed (abort):      0 (  0.0%)
Compression: kSnappyCompression       Size:   64278188 Blocks:   1918 Compressed:    186 (  9.7%) Not compressed (ratio):   1732 ( 90.3%) Not compressed (abort):      0 (  0.0%)
Compression: kZlibCompression         Size:   63820123 Blocks:   1918 Compressed:    298 ( 15.5%) Not compressed (ratio):   1620 ( 84.5%) Not compressed (abort):      0 (  0.0%)
Compression: kLZ4Compression          Size:   64156437 Blocks:   1918 Compressed:    215 ( 11.2%) Not compressed (ratio):   1703 ( 88.8%) Not compressed (abort):      0 (  0.0%)
Compression: kLZ4HCCompression        Size:   64008501 Blocks:   1918 Compressed:    254 ( 13.2%) Not compressed (ratio):   1664 ( 86.8%) Not compressed (abort):      0 (  0.0%)

@siying
Copy link
Contributor

siying commented Nov 10, 2016

Are you interested in sending a pull request for this?

@thatsafunnyname
Copy link
Contributor Author

I am interested in a statistic for how many compressed blocks were not used because they failed to meet the compression threshold. At the moment I am testing using sst_dump --show_compression_sizes, but it is likely to be a useful metric for our real data. I think it will be useful to see this metric in the output from util/statistics.cc::StatisticsImpl::ToString(), in the same way:

{NUMBER_BLOCK_COMPRESSED, "rocksdb.number.block.compressed"},
{NUMBER_BLOCK_DECOMPRESSED, "rocksdb.number.block.decompressed"},
{NUMBER_BLOCK_NOT_COMPRESSED, "rocksdb.number.block.not_compressed"},

can currently be logged.

While I can now see that this metric can be calculated from the existing metrics available, the way the names NUMBER_BLOCK_COMPRESSED and NUMBER_BLOCK_NOT_COMPRESSED came to be has resulted in names that could be improved upon to be less potentially misleading. I don't want to submit a PR for something that would not be acceptable because of a change in behaviour/meaning of a metric, but I think changes to names/behaviour could help other future consumers. Because any such changes would be more efficiently made by core project contributors, at the moment I will not be working on such a PR.

I can send a PR for the change that is isolated to tools/sst_dump_tool.(h|cc) (and maybe a test), but equally I am happy if you wish to implement the suggested enhancement to the sst_dump tool (with or without other changes to metric names/behaviour). Thanks.

@gfosco
Copy link
Contributor

gfosco commented Jan 10, 2018

Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.

@gfosco gfosco closed this as completed Jan 10, 2018
thatsafunnyname added a commit to thatsafunnyname/rocksdb that referenced this issue Dec 31, 2018
For: facebook#1474
Helps show when the 12.5% threshold for GoodCompressionRatio (originally from ldb) is hit.
thatsafunnyname added a commit to thatsafunnyname/rocksdb that referenced this issue Dec 31, 2018
For: facebook#1474
Helps show when the 12.5% threshold for GoodCompressionRatio (originally from ldb) is hit.
@thatsafunnyname
Copy link
Contributor Author

thatsafunnyname commented Jan 2, 2019

Could this be opened? #4835 is the PR, thanks.

UPDATE: #5791 is the PR

@ajkr ajkr reopened this Jan 2, 2019
thatsafunnyname added a commit to thatsafunnyname/rocksdb that referenced this issue Sep 11, 2019
thatsafunnyname added a commit to thatsafunnyname/rocksdb that referenced this issue Sep 11, 2019
merryChris pushed a commit to merryChris/rocksdb that referenced this issue Nov 18, 2019
…ook#5791)

Summary:
Closes facebook#1474
Helps show when the 12.5% threshold for GoodCompressionRatio (originally from ldb) is hit.

Example output:

```
> ./sst_dump --file=/tmp/test.sst --command=recompress
from [] to []
Process /tmp/test.sst
Sst file format: block-based
Block Size: 16384
Compression: kNoCompression           Size:  122579836 Blocks:   2300 Compressed:      0 (  0.0%) Not compressed (ratio):   2300 (100.0%) Not compressed (abort):      0 (  0.0%)
Compression: kSnappyCompression       Size:   46289962 Blocks:   2300 Compressed:   2119 ( 92.1%) Not compressed (ratio):    181 (  7.9%) Not compressed (abort):      0 (  0.0%)
Compression: kZlibCompression         Size:   29689825 Blocks:   2300 Compressed:   2301 (100.0%) Not compressed (ratio):      0 (  0.0%) Not compressed (abort):      0 (  0.0%)
Unsupported compression type: kBZip2Compression.
Compression: kLZ4Compression          Size:   44785490 Blocks:   2300 Compressed:   1950 ( 84.8%) Not compressed (ratio):    350 ( 15.2%) Not compressed (abort):      0 (  0.0%)
Compression: kLZ4HCCompression        Size:   37498895 Blocks:   2300 Compressed:   2301 (100.0%) Not compressed (ratio):      0 (  0.0%) Not compressed (abort):      0 (  0.0%)
Unsupported compression type: kXpressCompression.
Compression: kZSTD                    Size:   32208707 Blocks:   2300 Compressed:   2301 (100.0%) Not compressed (ratio):      0 (  0.0%) Not compressed (abort):      0 (  0.0%)
```
Pull Request resolved: facebook#5791

Differential Revision: D17347870

fbshipit-source-id: af10849c010b46b20e54162b70123c2805ffe526
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants