
Index value delta encoding #3983

Closed
wants to merge 10 commits

Conversation

Contributor

@maysamyabandeh maysamyabandeh commented Jun 11, 2018

Given that an index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding to the values. The first value in each index restart interval encodes the full BlockHandle, but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the details of the encoding. This reduces the index size, which helps use the block cache more efficiently. The feature is enabled by using format_version 4.
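To illustrate the scheme described above, here is a minimal sketch of how a delta-encoded index value can be reconstructed. kBlockTrailerSize matches the constant used in the PR's decoding path, but the struct and helper below are illustrative, not the PR's actual IndexBlockIter code.

#include <cstdint>

struct BlockHandle {
  uint64_t offset;
  uint64_t size;
};

// Adjacent blocks in the file are separated by a trailer:
// 1-byte compression type + 32-bit checksum.
constexpr uint64_t kBlockTrailerSize = 5;

// The first entry of a restart interval stores a full (offset, size) pair;
// every later entry stores only its size. The missing offset is derived
// from the previously decoded handle.
BlockHandle DecodeDeltaEncodedHandle(const BlockHandle& prev, uint64_t size) {
  BlockHandle handle;
  handle.offset = prev.offset + prev.size + kBlockTrailerSize;
  handle.size = size;
  return handle;
}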

Contributor

@facebook-github-bot left a comment

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. View: changes, changes since last import

table/block.cc Outdated
@@ -283,9 +291,33 @@ bool BlockIter::ParseNextKey() {

value_ = Slice(p + non_shared, value_length);
while (restart_index_ + 1 < num_restarts_ &&
GetRestartPoint(restart_index_ + 1) < current_) {
GetRestartPoint(restart_index_ + 1) <= current_) {
Contributor Author

This was a bug in the existing code. It did no harm, though, since the restart point would eventually be set correctly.

Contributor

I read the code a bit. Maybe I'm wrong, but before your new code, restart_index_ seems to be useful only when calling Prev(). In that case GetRestartPoint(restart_index_) == current_ is not very useful, as the current entry is not interesting; we are going to do restart_index_-- for it anyway, so we end up with an unnecessary ++ and --, as well as an extra GetRestartPoint() in Prev().

Contributor Author

First, I would say that the existing code is confusing, as restart_index_ does not reflect what its definition says, which makes it likely to create bugs. But if the performance overhead of one more GetRestartPoint is the concern, I can revert to the previous implementation and do the additional call only if value_delta_encoded_ is enabled.

Contributor Author

update: this is reverted now.

@maysamyabandeh
Contributor Author

Switching from format 3 to 4 reduces the index size from 345949 to 197100 (with index_block_restart_interval=16)

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. View: changes, changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. View: changes, changes since last import

table/block.cc Outdated
assert(decode_s.ok());
value_ = Slice(p + non_shared, updated_value.data() - value_.data());
}
}
Contributor

Can we move it to another function?

Contributor Author

Sure.

table/block.h Outdated
// the first value in that restart interval.
bool value_delta_encoded_;
mutable std::string value_buf_;
BlockHandle decoded_value_;
Contributor

Having so much index-block-specific logic in BlockIter can make the code harder to understand. Is there a way to generalize it? For example, could we define an interface like ValueDeltaEncoder (null by default) and call it to delta-decode the value?

Contributor Author

It could be a static method of a helper class as well.

Contributor

How about creating a separate IndexBlockIter, moving all the index-related logic there, and leaving the data-block-only logic behind? The common binary search can be done with a helper class or a subclass. I do worry that we are making the already very complicated BlockIter even more complicated.

Contributor Author

Update: after the rebase, the logic has been moved to IndexBlockIter.

table/block.cc Outdated
uint32_t* value_length,
const bool value_delta_encoded) {
// We need 2 bytes for shared and non_shared size. We also need one more byte
// either for value size or the actual value in case of value delta encoding.
Contributor

Is there a reason we don't always do value delta encoding inside one restart block?

Contributor Author

Any delta encoding needs one initial full encoding. This implementation uses the first entry in the restart block to have the full encoding. What is the alternative that you would recommend?


table/block.cc Outdated
value_ = Slice(p + non_shared, updated_value.data() - value_.data());
} else {
const size_t kHandleMaxSize = 8;
auto v = Slice(p + non_shared, kHandleMaxSize);
Contributor

How do we make sure Slice is within the bytes of the block?

Contributor Author

The slice would exceed the block limits only in the case of corruption. Corruption could be detected at any point during reading. Without delta encoding we had the opportunity to detect such corruption when reading the value. With delta encoding we rely either on earlier checks (checksum) or on later checks in the code to detect a value that does not make sense.

table/block.cc Outdated
const size_t kHandleMaxSize = 8;
auto v = Slice(p + non_shared, kHandleMaxSize);
auto next_value_base =
decoded_value_.offset() + decoded_value_.size() + kBlockTrailerSize;
Contributor

In the Google C++ Style Guide, using auto in cases like this is not encouraged: https://google.github.io/styleguide/cppguide.html#auto ("Sometimes code is clearer when types are manifest, especially when a variable's initialization depends on things that were declared far away"). In this case, the type of next_value_base is not very clear to me.

Contributor Author

Sure.

table/block.cc Outdated
BlockHandle handle;
Status decode_s = handle.DecodeSizeFrom(next_value_base, &v);
decoded_value_ = handle;
auto& updated_value = v;
Contributor

This auto also makes the code hard for me to understand. I suggest we make it explicit.

Contributor Author

Sure.

table/block.cc Outdated
auto& updated_value = v;
assert(decode_s.ok());
value_ = Slice(p + non_shared, updated_value.data() - value_.data());
}
Contributor

Can you document the format in the comment? I haven't managed to understand it yet.

Contributor Author

Oh I forgot to do that.

Contributor Author

It is now documented on IndexBlockIter::DecodeCurrentValue.

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. View: changes, changes since last import

@ot
Contributor

ot commented Jun 17, 2018

Is there a succinct description of the new format anywhere? Are you still encoding the payload (block handle) length? If so, it could be avoided because the following varints are uniquely decodable.

table/block.cc Outdated
return true;
}
}

// The fomrat:
Contributor

Typo

table/block.cc Outdated
// restart_point 1: k, v (offset, size), k, v (size), ..., k, v (size)
// ...
// restart_point n-1: k, v (offset, size), k, v (size), ..., k, v (size)
// where, k is key, v is value, and its encoding is in pranthesis.
Contributor

Also here "pranthesis"

Contributor Author

@ot yeah you found the description, and as it says:

The format of each key is (shared_size, non_shared_size, shared, non_shared)

so the block handle size (i.e., the value size) is not encoded in the new format.

table/block.cc Outdated
return true;
}
}

// The fomrat:
// restart_point 0: k, v (offset, size), k, v (size), ..., k, v (size)
Contributor

Do we consider:

k, v (offset, size1), k, v (size2-size1), ..., k, v (size3-size2)

as what @ot proposed before?

I would also hope for

k, v (offset, size1), k, v [padding_size](size2-size1), ..., k, v [padding_size](size3-size2)

Contributor

I think the main factor in whether to take this approach is whether it is common for the block size difference to be within 64 bytes. That sounds very unlikely, though. So maybe plain size is good.

Contributor

If you're referring to https://www.prod.facebook.com/groups/rocksdb.dev/permalink/815533521878497/, I did not suggest that; I was just proposing to use offset, size, size, size, ... as is done here, and, if we have no other way to distinguish restart points from delta handles, to use a negative size (with zigzag encoding) so that the sign bit signals it.

Other than that, I also still believe that we should try to drop the payload size byte here (is it done in this PR?) and switch to sub-varint for prefix coding.

Contributor Author

@maysamyabandeh Jul 17, 2018

@siying it is an interesting idea. Since size_(i+1) - size_i could be positive or negative, most of the deltas would need to be between -127 and 127 to significantly benefit from this. With 4k blocks, the compressed block will be about 2k, and [-127, 127] is likely to hold for the difference between two sizes in the index.
I think we can leave padding_size for future work.
@ot the payload size byte is dropped in the new format.

Contributor Author

@siying let me correct myself. To benefit from this optimization the changes in block size should be within [-63, 64], since with varint encoding numbers larger than 127 take a second byte.

Contributor

Or we use 0xFF to indicate a fallback to full size.

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. Re-import the pull request


@maysamyabandeh
Contributor Author

Verified that the critical functions are properly inlined:

$ objdump -S db_bench &> /tmp/bytewise.txt
$ grep "rocksdb.*DecodeCurrentValue" /tmp/bytewise.txt
$ grep "rocksdb.*BinarySeek" /tmp/bytewise.txt
$ grep "rocksdb.*DecodeKey" /tmp/bytewise.txt
$ grep "rocksdb.*DecodeEntry" /tmp/bytewise.txt
$

@ot
Contributor

ot commented Jul 17, 2018

@maysamyabandeh have you also experimented with the prefix encoding I suggested in the post?

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. Re-import the pull request

@maysamyabandeh
Contributor Author

@ot I guess you are referring to this by "prefix encoding":

Storing the shared len is known to be inefficient: suppose all the strings have a very long common prefix but differ by a very short suffix (which is common because we shorten the keys to the minimal separator). Storing "{previous string len} - {shared len}" still enables decodability, but it is in most cases a much smaller integer. This is known as "rear coding".

The sample SST files that I have been using as a reference almost always encode the shared size in one byte. For that not to be the case, the key prefix would have to be > 127 bytes, which was not the case in my reference db. I do agree that this is a good idea, but it is not of immediate impact.

Contributor

@facebook-github-bot left a comment

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ot
Contributor

ot commented Jul 18, 2018

@maysamyabandeh I'm talking about using 1 byte for shared+nonshared instead of 2. With the rear coding optimization, it is very likely that both will be < 15 and can be encoded in 2 nibbles.

@maysamyabandeh
Contributor Author

@ot oh yeah, I was discussing this idea with @siying. Since there would be a tradeoff between memory space and decoding CPU usage, I am not sure this would always be a win. We would need proper benchmarking to see if it is worth it. The current decoding of a 1-byte varint is very simple and efficient.

@ot
Contributor

ot commented Jul 18, 2018

@maysamyabandeh why do you think this would be slower than varint? The encoding I'm suggesting is used in LZ4 which is quite fast :)
Would the benefit of saving 1 byte per key be large?

@siying
Contributor

siying commented Jul 18, 2018

@maysamyabandeh I don't buy the argument that it is less CPU efficient. It is simple:

uint64_t shared_len = (*ptr & 0xF0) >> 4;
uint64_t non_shared_len = *ptr & 0x0F;
if (shared_len == 0 && non_shared_len == 0) {
  // fallback
}

@ot
Contributor

ot commented Jul 18, 2018

It's better to use 0xF as sentinel for continuation, so you don't have to do -1 in the fast path:

uint64_t shared_len = *ptr >> 4;
uint64_t non_shared_len = *ptr & 0x0F;
if (shared_len != 0x0F && non_shared_len != 0x0F) {
  // fast path
}

Also note that shared_len should really be len - shared_len. Otherwise, if you have a situation like

<very long prefix>a
<very long prefix>bb
<very long prefix>c
<very long prefix>ddd

(which is common in some databases, think of long column names) then shared_len is always at least |<very long prefix>|, while the difference is always small.
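For readers following this thread, here is a small self-contained sketch of the rear-coding idea being discussed (hypothetical names, not code from this PR): instead of storing shared_len, store how many trailing bytes of the previous key are dropped, which stays small even when the shared prefix is very long.

#include <cstddef>
#include <string>

struct RearCodedEntry {
  size_t dropped;          // prev_key.size() - shared_len; small even for long prefixes
  std::string non_shared;  // suffix appended after the retained prefix
};

RearCodedEntry RearEncode(const std::string& prev_key, const std::string& key) {
  size_t shared = 0;
  while (shared < prev_key.size() && shared < key.size() &&
         prev_key[shared] == key[shared]) {
    ++shared;
  }
  return {prev_key.size() - shared, key.substr(shared)};
}

std::string RearDecode(const std::string& prev_key, const RearCodedEntry& e) {
  return prev_key.substr(0, prev_key.size() - e.dropped) + e.non_shared;
}

With the example keys above, dropped is 1, 2, and 1 respectively, regardless of how long the common prefix is.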

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. Re-import the pull request

@siying
Contributor

siying commented Jul 19, 2018

@ot "Also note that shared_len should really be len - shared_len", do you mean the len from the previous one?

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. Re-import the pull request

Contributor

@facebook-github-bot left a comment

maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

util/coding.h Outdated
inline const char* GetVarsignedint64Ptr(const char* p, const char* limit,
int64_t* value) {
uint64_t u = 0;
auto ret = GetVarint64Ptr(p, limit, &u);
Contributor

Minor: I still feel it's better to name the type explicitly here.

util/coding.h Outdated
@@ -254,6 +261,15 @@ inline void PutVarint64(std::string* dst, uint64_t v) {
dst->append(buf, static_cast<size_t>(ptr - buf));
}

inline void PutVarsignedint64(std::string* dst, int64_t v) {
char buf[10];
Contributor

Can we use kMaxVarint64Length instead?

table/block.h Outdated
// offset of delta encoded BlockHandles is computed by adding the size of
// previous delta encoded values in the same restart interval to the offset of
// the first value in that restart interval.
bool value_delta_encoded_;
Contributor

Minor: Can we pad value_delta_encoded_ together with key_includes_seq_? Now it takes 8 bytes.
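To illustrate the padding concern (the layouts below are hypothetical, not the actual BlockIter members): on a typical 64-bit ABI a lone bool sandwiched between pointer-sized members occupies a full 8-byte slot, whereas grouping the two flags lets them share one slot.

#include <cstdint>

struct Scattered {
  const void* comparator_;    // 8 bytes
  bool key_includes_seq_;     // 1 byte + 7 bytes padding
  const void* data_;          // 8 bytes
  bool value_delta_encoded_;  // 1 byte + 7 bytes tail padding
};  // typically sizeof(Scattered) == 32

struct Grouped {
  const void* comparator_;    // 8 bytes
  const void* data_;          // 8 bytes
  bool key_includes_seq_;     // the two flags share one padded slot
  bool value_delta_encoded_;
};  // typically sizeof(Grouped) == 24

static_assert(sizeof(Grouped) <= sizeof(Scattered), "grouping the flags never costs more");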

PutVarint32Varint32Varint32(&buffer_, static_cast<uint32_t>(shared),
static_cast<uint32_t>(non_shared),
static_cast<uint32_t>(value.size()));
}
Contributor

It's OK for now but maybe we should later refactor so that index and data block builder logic is separated.

@@ -68,6 +68,28 @@ void GenerateRandomKVs(std::vector<std::string> *keys,
}
}

void GenerateRandomKVs(std::vector<std::string> *keys,
Contributor

V is actually a block handle. Can you rename the function so that it is clearer?

@@ -109,12 +119,14 @@ PartitionedFilterBlockReader::PartitionedFilterBlockReader(
const SliceTransform* prefix_extractor, bool _whole_key_filtering,
BlockContents&& contents, FilterBitsReader* /*filter_bits_reader*/,
Statistics* stats, const InternalKeyComparator comparator,
const BlockBasedTable* table, const bool index_key_includes_seq)
const BlockBasedTable* table, const bool index_key_includes_seq,
const bool index_value_is_full)
Contributor

I saw you renamed it to "index_value_is_delta_encoded" in many places. Why not keep it consistent?

Contributor Author

It was named that way before and had caused programming bugs. The new name is consistent with index_key_includes_seq, in that the default for both of them is true.

Contributor

I had a hard time understanding what "index value full" means. You meant it was not delta encoded, right? I didn't read it that way the first time I saw it.

Contributor Author

Any suggestion for a rename is welcome (without changing the default).

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. Re-import the pull request

Contributor

@facebook-github-bot left a comment

maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

HISTORY.md Outdated
@@ -2,6 +2,7 @@
## Unreleased
### Public API Change
### New Features
* Changes the format of index blocks by delta encoding the index values, which are the block handles. This saves the encoding of BlockHandle::offset of the non-head index entries in each restart interval. The feature is backward compatbile but not forward compatible. It is disabled by default unless format_version 4 or above is used.
Contributor

nit: typo compatbile

table/block.cc Outdated
// restart_point n-1: k, v (off,sz), k, v (delta-sz), ..., k, v (delta-sz)
// where, k is key, v is value, and its encoding is in parenthesis.
// The format of each key is (shared_size, non_shared_size, shared, non_shared)
// The format of each value, i.e., block hanlde, is (offset,size) whenever the
Contributor

nit: typo hanlde, also add a space in all tuples after the comma? (offset, size)

static_cast<uint32_t>(non_shared),
static_cast<uint32_t>(value.size()));
if (use_value_delta_encoding_) {
// Add "<shared><non_shared><value_size>" to buffer_
Contributor

This comment is incorrect, there is no <value_size>


// generate different prefix
for (int i = from; i < from + len; i += step) {
// generating keys that shares the prefix
Contributor

Why is one comment in the present tense and the other in the -ing form?

util/coding.h Outdated
// Shift the absolute number right for one bit. Then use the least significant
// bit to indicate the sign.
char* ptr =
v >= 0 ? EncodeVarint64(buf, v * 2) : EncodeVarint64(buf, v * -1 * 2 + 1);
Contributor

Wouldn't EncodeVarint64(buf, v >= 0 ? uint64_t(v) * 2 : -uint64_t(v) * 2 + 1 ) be simpler?

Also be careful with the casts and their order, you're doing signed overflow here (note that -INT_MIN overflows, so you have to negate after switching to unsigned)

Also, have you considered using the branch-free routines used in Thrift? They're probably much faster if the sign is unpredictable:

https://github.com/facebook/fbthrift/blob/449a5f77f9f9bae72c9eb5e78093247eef185c04/thrift/lib/cpp/util/VarintUtils-inl.h#L202-L208
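For reference, here is a minimal overflow-safe variant of the signed-to-unsigned mapping discussed above. It is the standard zigzag mapping, which differs slightly from the 2*|v|+1 scheme in the diff, and it is a sketch rather than the PR's code. All arithmetic stays in the unsigned domain, so INT64_MIN is handled without signed overflow.

#include <cstdint>

inline uint64_t ZigzagEncode(int64_t v) {
  // 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ...
  uint64_t u = static_cast<uint64_t>(v);
  return (u << 1) ^ (v < 0 ? ~uint64_t{0} : uint64_t{0});
}

inline int64_t ZigzagDecode(uint64_t u) {
  // Undo the mapping; (0 - (u & 1)) is all-ones exactly when the value was negative.
  return static_cast<int64_t>((u >> 1) ^ (0 - (u & 1)));
}

With this particular mapping, deltas in [-64, 63] fit in a single varint byte.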

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. Re-import the pull request

Contributor

@facebook-github-bot left a comment

maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh has updated the pull request. Re-import the pull request

Contributor

@facebook-github-bot left a comment

maysamyabandeh is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor Author

maysamyabandeh commented Aug 10, 2018

The previous report of a CPU regression with this PR was most probably an experimental error, as I could not reproduce it.

The feature comes with a bit of CPU overhead, which should be paid back by higher cache hit rates due to the smaller index block size.
Results with sysbench read-only, using 4k blocks and an index restart interval of 16:
Format 2:
19585 rocksdb read-only range=100
Format 3:
19569 rocksdb read-only range=100
Format 4:
19352 rocksdb read-only range=100

facebook-github-bot pushed a commit that referenced this pull request Aug 10, 2018
Summary:
Fix the compile error in "make unity_test" caused by #3983.
Pull Request resolved: #4257

Differential Revision: D9271740

Pulled By: maysamyabandeh

fbshipit-source-id: 94e56d1675bf8bdc0e94439467eb4f40dd107517
@mm304321141
Contributor

mm304321141 commented Aug 11, 2018

WHY THIS PULL ACCEPT ?

If you need an InternalIterator with another value type,
you can use

template<class TValue> InternalIteratorBase{ ... }

typedef InternalIteratorBase<Slice> InternalIterator;

then you will not get such a big diff

and

this is BlockBasedTable's feature, why has nearly the WHOLE project been modified ???
the template param TValue intrudes everywhere

@mm304321141
Contributor

@maysamyabandeh @siying

@siying
Contributor

siying commented Aug 13, 2018

@mm304321141 as I mentioned in a WeChat thread, we don't treat the SST and memtable interfaces as public interfaces. They were under include/rocksdb just because we were lazy about resolving a shared_ptr class definition requirement. They can't be implemented depending only on the public API anyway. We haven't promised and won't promise any API compatibility for those memtable and SST file plug-ins, including anything related to InternalIterator: the name of the class indicates what it is, an internal implementation.

Although I don't see it breaking anything, it is possible that there is a better way to reorganize the code than what is done in this PR. If you have a better suggestion, you can submit a pull request and we can discuss from there. We can merge it if it is better than the way it is.

By the way, we are in the same community and should respect each other and trust each other to have tried our best to make the best judgements. I accepted the pull request, and fortunately I don't feel offended by the way you challenge me in all capital terms and three question marks. If it were another contributor, the way you wrote your comment might be totally counter-productive for solving your problem.

facebook-github-bot pushed a commit that referenced this pull request Sep 12, 2018
Summary:
With #3983 the size of IndexBlockIter was increased. This had resulted in a regression on P50 latencies in one of our benchmarks. The patch reduces IndexBlockIter size by eliminating the active_comparator_ field from the class.
Pull Request resolved: #4358

Differential Revision: D9781737

Pulled By: maysamyabandeh

fbshipit-source-id: 71e2b28d90ff0813db9e04b737ae73e185583c52
anand1976 pushed a commit that referenced this pull request Sep 12, 2018
Summary:
With #3983 the size of IndexBlockIter was increased. This had resulted in a regression on P50 latencies in one of our benchmarks. The patch reduces IndexBlockIter size by eliminating the active_comparator_ field from the class.
Pull Request resolved: #4358

Differential Revision: D9781737

Pulled By: maysamyabandeh

fbshipit-source-id: 71e2b28d90ff0813db9e04b737ae73e185583c52
rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the  block cache more efficiently. The feature is enabled with using format_version 4.

The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585   rocksdb read-only range=100
Format 3:
19569   rocksdb read-only range=100
Format 4:
19352   rocksdb read-only range=100
Pull Request resolved: facebook#3983

Differential Revision: D8361343

Pulled By: maysamyabandeh

fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
Summary:
Fix the compile error in "make unity_test" caused by facebook#3983.
Pull Request resolved: facebook#4257

Differential Revision: D9271740

Pulled By: maysamyabandeh

fbshipit-source-id: 94e56d1675bf8bdc0e94439467eb4f40dd107517
rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
Summary:
With facebook#3983 the size of IndexBlockIter was increased. This had resulted in a regression on P50 latencies in one of our benchmarks. The patch reduces IndexBlockIter size by eliminating the active_comparator_ field from the class.
Pull Request resolved: facebook#4358

Differential Revision: D9781737

Pulled By: maysamyabandeh

fbshipit-source-id: 71e2b28d90ff0813db9e04b737ae73e185583c52
buffer_.append(delta_value->data(), delta_value->size());
} else {
buffer_.append(value.data(), value.size());
}


What if shared == 0 && use_value_delta_encoding_ == true?
The 'value.size()' is not written into the buffer by the logic from line 135 to line 144.

Contributor Author

The inline doc is indeed confusing. Thanks for bringing it up. When use_value_delta_encoding_ is true it makes two changes to the encoding: the value size is no longer encoded, and the value itself is fully encoded only if shared=0 (otherwise a delta of it is encoded).
We should update the inline docs to clarify that.
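A simplified writer-side sketch of the rule clarified above, assuming use_value_delta_encoding_ is on. The varint helper mirrors util/coding.h, but the function below is illustrative, not the PR's BlockBuilder code: no value-size prefix is written, the full handle appears only at the head of a restart interval (shared == 0), and other entries carry only their size.

#include <cstdint>
#include <string>

struct BlockHandle {
  uint64_t offset;
  uint64_t size;
};

// Minimal LEB128-style writer, analogous to PutVarint64 in util/coding.h.
inline void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// shared == 0 marks the head of a restart interval: write the full handle.
// Otherwise write only the size; the reader recomputes the offset from the
// previously decoded handle.
void AppendDeltaEncodedIndexValue(std::string* buffer, const BlockHandle& handle,
                                  uint32_t shared) {
  if (shared == 0) {
    PutVarint64(buffer, handle.offset);
  }
  PutVarint64(buffer, handle.size);
}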

This pull request was closed.