-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pinnableslice (2nd attempt) #1756
Conversation
maysamyabandeh
commented
Jan 8, 2017
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
cd8dd6b
to
d196f95
Compare
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Maysam, some initial comments
db/db_impl.cc
Outdated
@@ -3907,8 +3907,8 @@ ColumnFamilyHandle* DBImpl::DefaultColumnFamily() const { | |||
|
|||
Status DBImpl::Get(const ReadOptions& read_options, | |||
ColumnFamilyHandle* column_family, const Slice& key, | |||
std::string* value) { | |||
return GetImpl(read_options, column_family, key, value); | |||
PinnableSlice* pSlice) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer leaving the name to be value
, what do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
db/db_impl.cc
Outdated
@@ -6395,7 +6405,8 @@ Status DBImpl::GetLatestSequenceForKey(SuperVersion* sv, const Slice& key, | |||
*found_record_for_key = false; | |||
|
|||
// Check if there is a record for this key in the latest memtable | |||
sv->mem->Get(lkey, nullptr, &s, &merge_context, &range_del_agg, seq, | |||
PinnableSlice* pSliceNullPtr = nullptr; | |||
sv->mem->Get(lkey, pSliceNullPtr, &s, &merge_context, &range_del_agg, seq, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be consistent with the style of the rest of the code base, let's change this line to be
sv->mem->Get(lkey, nullptr /* value */, &s, &merge_context, &range_del_agg, seq,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately it would not work. there is an overload of this function with this signature: Get(Slice, String, ...). If we simply pass nullptr compiler would not know which function to invoke.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
db/memtable.cc
Outdated
@@ -534,6 +534,10 @@ struct Saver { | |||
}; | |||
} // namespace | |||
|
|||
//static void UnrefMemTable(void* s, void*) { | |||
// reinterpret_cast<MemTable*>(s)->Unref(); | |||
//} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. i wanted to make it easier for you to see which options we took into account before resorting to string copy for memtables.
db/memtable.cc
Outdated
s->pSlice->PinSelf(); | ||
} else { | ||
//s->mem->Ref(); | ||
//s->pSlice->PinSlice(v, UnrefMemTable, s->mem, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
db/memtable.cc
Outdated
} else { | ||
//s->mem->Ref(); | ||
//s->pSlice->PinSlice(v, UnrefMemTable, s->mem, nullptr); | ||
s->pSlice->PinSelf(v); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we decided to do a memcpy if we are reading from the memtable, why is that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tried to make the commit message well detailed: 2c4cead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about that, at the end of the Get call we call ReturnAndCleanupSuperVersion
, what if we delay this call to the ~PinnableSlice(). this way we can guarantee that the memtable will stay in memory until the value is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed that offline with Islam. The plan is go ahead with the string copy from memtable and revisit the alternatives in future.
db/memtable_list.cc
Outdated
PinnableSlice* pSlicePtr = value != nullptr ? &pSlice : nullptr; | ||
auto res = GetFromList(&memlist_history_, key, pSlicePtr, s, merge_context, | ||
range_del_agg, seq, read_opts); | ||
if (value != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are doing this check anyway, can we rewrite the code to be something like this
if (LIKELY(value != nullptr)) {
PinnableSlice pinnable_val;
res = GetFromList( .... );
value->assign(pinnable_val.data(), pinnable_val.size());
} else {
res = GetFromList( .... );
}
What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah makes sense.
include/rocksdb/slice.h
Outdated
@@ -116,6 +118,52 @@ class Slice { | |||
// Intentionally copyable | |||
}; | |||
|
|||
class PinnableSlice : public Slice, public Cleanable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment section for PinnableSlice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
certainly
include/rocksdb/slice.h
Outdated
cleanable->DelegateCleanupsTo(this); | ||
} | ||
|
||
inline void PinHeap(std::string* s) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used any where, can we remove it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i figured it still demonstrates how pinnable slice could be used with heap objects too. but i do not feel strongly about it. i let you decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not have code that is not used, it makes people wonder (where is this code used) ?
I would prefer if we remove it, but this is my personal opinion, It's your call
include/rocksdb/slice.h
Outdated
size_ = self_space.size(); | ||
} | ||
|
||
inline void PinSelf() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if I call PinSelf
after using PinSlice
What will IsPinned
return ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. IsPinned meant to say if there is any cleanup attached to this. but after PinSelf it has become confusing. i am thinking of removing it entirely. would that work?
table/get_context.cc
Outdated
@@ -106,17 +106,22 @@ bool GetContext::SaveValue(const ParsedInternalKey& parsed_key, | |||
assert(state_ == kNotFound || state_ == kMerge); | |||
if (kNotFound == state_) { | |||
state_ = kFound; | |||
if (value_ != nullptr) { | |||
value_->assign(value.data(), value.size()); | |||
if (LIKELY(pSlice_ != nullptr)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some comments explaining this section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
d196f95
to
c32a521
Compare
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There are 3 failures in sandcastle which all seems irrelevant: db/db_universal_compaction_test.cc:1068: Failure In file included from db/compaction_job_test.cc:8: @IslamAbdelRahman do you think it is ready to land? |
tools/db_bench_tool.cc
Outdated
@@ -229,6 +229,8 @@ DEFINE_bool(reverse_iterator, false, | |||
|
|||
DEFINE_bool(use_uint64_comparator, false, "use Uint64 user comparator"); | |||
|
|||
DEFINE_bool(pin_slice, false, "use pinnable slice for point lookup"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make it true by default
include/rocksdb/slice.h
Outdated
private: | ||
friend class PinnableSlice4Test; | ||
std::string self_space; | ||
static void ReleaseStringHeap(void* s, void*) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this to the test file
include/rocksdb/slice.h
Outdated
|
||
private: | ||
friend class PinnableSlice4Test; | ||
std::string self_space; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on our coding style this should be
std::string self_space_;
db/db_impl.cc
Outdated
@@ -4068,8 +4068,9 @@ Status DBImpl::GetImpl(const ReadOptions& read_options, | |||
ReturnAndCleanupSuperVersion(cfd, sv); | |||
|
|||
RecordTick(stats_, NUMBER_KEYS_READ); | |||
RecordTick(stats_, BYTES_READ, value->size()); | |||
MeasureTime(stats_, BYTES_PER_READ, value->size()); | |||
size_t size = value->size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is not needed, I believe the compiler should do it him self
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the compiler know that two invocation of size() will return the same value?
db/memtable.cc
Outdated
s->env_); | ||
} else if (s->value != nullptr) { | ||
s->value->assign(v.data(), v.size()); | ||
if (LIKELY(s->value != nullptr)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed offline, Since we are doing the memcpy in memtable/memtable list anyway, I think we should keep the old code that is using std::string for now. and pass the PinnableSlice own string in the upper layers
include/rocksdb/db.h
Outdated
std::string* value) { | ||
if (LIKELY(value != nullptr)) { | ||
PinnableSlice pinnable_val; | ||
auto s = Get(options, column_family, key, &pinnable_val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, does that mean that we introduce an extra memcpy for memtable/merge operator get ?
memtable -> PinnableSlice::self_space -> value
Let's try
std::move
or
We can allow PinnableSlice to accept an external space
PinnableSlice {
std::string* own_data_ptr_; // This point to own_data_ except if we change it to something else
std::string own_data_;
}
We can measure the regression by running db_bench and making sure that all keys live in memtable
./db_bench --benchmarks="fillseq,stats,readrandom" --num=<something_small_enough>
we can verify that all the keys are in the memtable by looking at the stats result and see that there are no files generated in L0 or any other levels
@maysamyabandeh updated the pull request - view changes - changes since last import |
cff29d8
to
6ccd54c
Compare
@maysamyabandeh updated the pull request - view changes - changes since last import |
6ccd54c
to
0d05f11
Compare
@maysamyabandeh updated the pull request - view changes - changes since last import |
Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: value 100 byte: 1.8% regular, 1.2% merge values value 1k byte: 11.5% regular, 7.5% merge values value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled out when the user does use the new PinnableSlice API. Here is the summary: value 100 byte: +0.5% regular, -2.4% merge values value 1k byte: -1.8% regular, -0.5% merge values value 10k byte: -1.5% regular, -2.15% merge values Benchmark Details: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=1000 -compression_type=none TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=10000 -compression_type=none TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=100 -compression_type=none --merge_keys=100000 -merge_operator=max TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=1000 -compression_type=none --merge_keys=100000 -merge_operator=max TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=10000 -compression_type=none --merge_keys=100000 -merge_operator=max TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-100-mergenon-pslice.txt TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-100-mergenon-nopslice.txt TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=true 2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-1k-mergenon-pslice.txt TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-1k-mergenon-nopslice.txt TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=true 2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-10k-mergenon-pslice.txt TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-10k-mergenon-nopslice.txt TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=true 2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt Benchmark Results: ls -tr | grep ".*-10m-(100|1k|10k)-merge...-(no|)pslice.txt$" | xargs -L 1 grep AVG /dev/null scanread-10m-100-mergenon-pslice.txt:readrandom [AVG 5 runs] : 3005915 ops/sec; 210.6 MB / sec scanread - 10m - 100 - mergenon - nopslice.txt : readrandom[AVG 5 runs] : 2953754 ops / sec; 207.0 MB / sec scanreadmerge - 10m - 100 - mergemax - pslice.txt : readrandom[AVG 5 runs] : 766150 ops / sec; 8.5 MB / sec scanreadmerge - 10m - 100 - mergemax - nopslice.txt : readrandom[AVG 5 runs] : 757289 ops / sec; 8.4 MB / sec scanread - 10m - 1k - mergenon - pslice.txt : readrandom[AVG 5 runs] : 5965694 ops / sec; 3661.5 MB / sec scanread - 10m - 1k - mergenon - nopslice.txt : readrandom[AVG 5 runs] : 5350749 ops / sec; 3284.1 MB / sec scanreadmerge - 10m - 1k - mergemax - pslice.txt : readrandom[AVG 5 runs] : 36379493 ops / sec; 3524.9 MB / sec scanreadmerge - 10m - 1k - mergemax - nopslice.txt : readrandom[AVG 5 runs] : 33825001 ops / sec; 3277.4 MB / sec scanread - 10m - 10k - mergenon - pslice.txt : readrandom[AVG 5 runs] : 3127471 ops / sec; 18923.0 MB / sec scanread - 10m - 10k - mergenon - nopslice.txt : readrandom[AVG 5 runs] : 2474603 ops / sec; 14972.8 MB / sec scanreadmerge - 10m - 10k - mergemax - pslice.txt : readrandom[AVG 5 runs] : 29406680 ops / sec; 28090.5 MB / sec scanreadmerge - 10m - 10k - mergemax - nopslice.txt : readrandom[AVG 5 runs] : 22828258 ops / sec; 21806.6 MB / sec
Summary: MemTable Ref/Unref must be done under a shared lock. Since the current usages do it under db_mutex then we should do that too. But this would likely result in performance regressions and cancel out all the improvments of using a thread local super vesion to avid exactly the same bottleneck. It is still possible to think of solutions that keep track of memtable ref count in the thread-local SuperVersion and update the MemTable only when we are referesshing thread-local sv (and hence holding the lock on db_mutex). We would need a similar solution to batch the Unrefs invoked by PinnableSlice release in a thread-local data structure and apply it only periodically or when we happen to have the db_mutex locaked (like when are refereshing the cachced SuperVersion). Such solutions however adds a non-negligible complexity (and probabely bugs) to the code base. At this point there are already benefits from using PinnableSlice on BlockCache and does not have to be extended to MemTable.
Summary: DocumentDBImpl is inhertting Get from DocumentDB, which inherit it from StackableDB. In the code however these methods are overriden by returning unimplemented status and yet calling DocumentDB::Get when it is required. There is no apparent rational behind it. We need to remove that since with having Get inline in db.h it will eventually calls DocumentDBImpl::Get through polymorphism.
0d05f11
to
b6ef5ec
Compare
@maysamyabandeh updated the pull request - view changes - changes since last import |
b6ef5ec
to
d2a65c5
Compare
@maysamyabandeh updated the pull request - view changes - changes since last import |
@IslamAbdelRahman I ran the latest patch (with std::mov) against the benchmark that you suggested. It shows 7.7% lower throughput compared to master.
|
It turns out that the bottleneck is creating the PinnableSlice object (on stack) which also has a string member. Making it thread_local the performance of -pin_slice=false improved to this: |
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@maysamyabandeh updated the pull request - view changes - changes since last import |
078cee9
to
96a406f
Compare
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
fyi, the problem with string was that it creates char* on heap on each assign call. By reusing a string it would first try to reuse the existing on-heap space and hence avoid creating a new object. |
I see two errors in phabricator:
The former does not seem relevant and the latter seems flaky. @IslamAbdelRahman what do you think? ready to land? |
@maysamyabandeh updated the pull request - view changes - changes since last import |
I frequently see TSAN failure like this. This is an environment problem. It has nothing to do your change. Sometimes if you relaunch it, it can run. |
Thanks @siying. Rerunning tsan helped. |
Here are the final improvements when using pinnable slice: value 100 byte: -2% regular, 4% merge values Since using string is still implemented via pinnable slice underneath, the lowered throughput in case of non-merge 100-byte values must be experiment error and can be ignored. Here are the benchmark details:
|
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
* to avoid memcpy by having the PinnsableSlice object referring to the data | ||
* that is locked in the memory and release them after the data is consuned. | ||
*/ | ||
class PinnableSlice : public Slice, public Cleanable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google C++ Style doesn't encourage this: https://google.github.io/styleguide/cppguide.html#Multiple_Inheritance