
PinnableSlice (2nd attempt) #1756

Closed

Conversation

@maysamyabandeh
Contributor

@maysamyabandeh maysamyabandeh commented Jan 8, 2017

PinnableSlice

Summary:
Currently, point lookup values are copied into a string provided by the
user, which incurs an extra memcpy cost. This patch allows doing point
lookups via a PinnableSlice, which pins the source memory location
(instead of copying its content) and releases it after the content is
consumed by the user. The old Get(std::string*) API is translated to the
new API underneath.

Here is the summary for improvements:

value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
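To make the pin-vs-copy mechanics concrete, here is a toy model of the idea. ToyPinnableSlice and its method names are illustrative stand-ins, not the real rocksdb::PinnableSlice API: the slice either points directly at source memory (e.g. a pinned block-cache block) and registers a cleanup, or falls back to copying into an owned string.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Toy model of the PinnableSlice idea (not the real RocksDB class):
// either point directly at source memory and register a cleanup that
// releases the pin, or fall back to copying into an owned string.
class ToyPinnableSlice {
 public:
  ~ToyPinnableSlice() { Reset(); }

  // Pin external memory; cleanup runs when the slice is reset/destroyed.
  void PinSlice(const char* data, size_t size, std::function<void()> cleanup) {
    Reset();
    data_ = data;
    size_ = size;
    cleanup_ = std::move(cleanup);
  }

  // Copy the value into the internally owned buffer (the fallback path,
  // equivalent in cost to the old Get(std::string*) API).
  void PinSelf(const std::string& value) {
    Reset();
    self_space_ = value;
    data_ = self_space_.data();
    size_ = self_space_.size();
  }

  // Run the registered cleanup (if any) and detach from the source.
  void Reset() {
    if (cleanup_) cleanup_();
    cleanup_ = nullptr;
    data_ = nullptr;
    size_ = 0;
  }

  const char* data() const { return data_; }
  size_t size() const { return size_; }

 private:
  const char* data_ = nullptr;
  size_t size_ = 0;
  std::string self_space_;         // used by PinSelf
  std::function<void()> cleanup_;  // used by PinSlice
};
```

In the pinned path no memcpy happens at all; the caller reads the value in place and the pin is released when the slice is reset or destroyed.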

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but still using the old Get(std::string*) API. The
difference is small and could be noise. More importantly, it is safely
cancelled out when the user does use the new PinnableSlice API. Here is
the summary:

value 100 byte: +0.5% regular, -2.4% merge values
value 1k byte: -1.8% regular, -0.5% merge values
value 10k byte: -1.5% regular, -2.15% merge values
Benchmark Details:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=100 -compression_type=none
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=1000 -compression_type=none
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=10000 -compression_type=none
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=100
-compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=1000
-compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=10000
-compression_type=none --merge_keys=100000 -merge_operator=max

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-100-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-100-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt

TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-1k-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-1k-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt

TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-10k-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-10k-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt


Benchmark Results:
ls -tr | grep ".*-10m-(100|1k|10k)-merge...-(no|)pslice.txt$" |
xargs -L 1 grep AVG /dev/null
scanread-10m-100-mergenon-pslice.txt: readrandom [AVG 5 runs] : 3005915 ops/sec; 210.6 MB/sec
scanread-10m-100-mergenon-nopslice.txt: readrandom [AVG 5 runs] : 2953754 ops/sec; 207.0 MB/sec
scanreadmerge-10m-100-mergemax-pslice.txt: readrandom [AVG 5 runs] : 766150 ops/sec; 8.5 MB/sec
scanreadmerge-10m-100-mergemax-nopslice.txt: readrandom [AVG 5 runs] : 757289 ops/sec; 8.4 MB/sec
scanread-10m-1k-mergenon-pslice.txt: readrandom [AVG 5 runs] : 5965694 ops/sec; 3661.5 MB/sec
scanread-10m-1k-mergenon-nopslice.txt: readrandom [AVG 5 runs] : 5350749 ops/sec; 3284.1 MB/sec
scanreadmerge-10m-1k-mergemax-pslice.txt: readrandom [AVG 5 runs] : 36379493 ops/sec; 3524.9 MB/sec
scanreadmerge-10m-1k-mergemax-nopslice.txt: readrandom [AVG 5 runs] : 33825001 ops/sec; 3277.4 MB/sec
scanread-10m-10k-mergenon-pslice.txt: readrandom [AVG 5 runs] : 3127471 ops/sec; 18923.0 MB/sec
scanread-10m-10k-mergenon-nopslice.txt: readrandom [AVG 5 runs] : 2474603 ops/sec; 14972.8 MB/sec
scanreadmerge-10m-10k-mergemax-pslice.txt: readrandom [AVG 5 runs] : 29406680 ops/sec; 28090.5 MB/sec
scanreadmerge-10m-10k-mergemax-nopslice.txt: readrandom [AVG 5 runs] : 22828258 ops/sec; 21806.6 MB/sec
@facebook-github-bot

@facebook-github-bot facebook-github-bot commented Jan 8, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@facebook-github-bot facebook-github-bot commented Jan 9, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh maysamyabandeh force-pushed the maysamyabandeh:pinnableslice-new branch Jan 9, 2017
@facebook-github-bot

@facebook-github-bot facebook-github-bot commented Jan 9, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@IslamAbdelRahman IslamAbdelRahman left a comment

Thanks Maysam, some initial comments

db/db_impl.cc Outdated
@@ -3907,8 +3907,8 @@ ColumnFamilyHandle* DBImpl::DefaultColumnFamily() const {

 Status DBImpl::Get(const ReadOptions& read_options,
                    ColumnFamilyHandle* column_family, const Slice& key,
-                   std::string* value) {
-  return GetImpl(read_options, column_family, key, value);
+                   PinnableSlice* pSlice) {


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

I prefer leaving the name as value, what do you think?


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

Sure.

db/db_impl.cc Outdated
@@ -6395,7 +6405,8 @@ Status DBImpl::GetLatestSequenceForKey(SuperVersion* sv, const Slice& key,
*found_record_for_key = false;

// Check if there is a record for this key in the latest memtable
-  sv->mem->Get(lkey, nullptr, &s, &merge_context, &range_del_agg, seq,
+  PinnableSlice* pSliceNullPtr = nullptr;
+  sv->mem->Get(lkey, pSliceNullPtr, &s, &merge_context, &range_del_agg, seq,


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

to be consistent with the style of the rest of the code base, let's change this line to be
sv->mem->Get(lkey, nullptr /* value */, &s, &merge_context, &range_del_agg, seq,


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

Unfortunately that would not work: there is an overload of this function with the signature Get(Slice, std::string*, ...). If we simply pass nullptr, the compiler would not know which overload to invoke.
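The ambiguity described here can be reproduced in isolation. This is a simplified stand-in (the struct and function names are illustrative, not the real RocksDB signatures): a bare nullptr converts equally well to both pointer overloads, so a typed null pointer is needed to disambiguate.

```cpp
#include <string>

struct PinnableSlice {};  // stand-in type, not the real rocksdb::PinnableSlice

// Two overloads mirroring the MemTable::Get situation (simplified).
int Get(const std::string& key, std::string* value) { return 1; }
int Get(const std::string& key, PinnableSlice* value) { return 2; }

// Get("k", nullptr);  // would not compile: ambiguous between the overloads
int CallWithTypedNull() {
  PinnableSlice* no_value = nullptr;  // a typed null pointer disambiguates
  return Get("k", no_value);
}
```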


@IslamAbdelRahman

IslamAbdelRahman Feb 4, 2017
Contributor

Sounds good to me

db/memtable.cc Outdated
@@ -534,6 +534,10 @@ struct Saver {
};
} // namespace

//static void UnrefMemTable(void* s, void*) {
// reinterpret_cast<MemTable*>(s)->Unref();
//}


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

can we remove this ?


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

Sure. I wanted to make it easier for you to see which options we took into account before resorting to the string copy for memtables.

db/memtable.cc Outdated
s->pSlice->PinSelf();
} else {
//s->mem->Ref();
//s->pSlice->PinSlice(v, UnrefMemTable, s->mem, nullptr);


db/memtable.cc Outdated
} else {
//s->mem->Ref();
//s->pSlice->PinSlice(v, UnrefMemTable, s->mem, nullptr);
s->pSlice->PinSelf(v);


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

so we decided to do a memcpy if we are reading from the memtable, why is that ?


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

i tried to make the commit message well detailed: 2c4cead


@IslamAbdelRahman

IslamAbdelRahman Feb 4, 2017
Contributor

What about this: at the end of the Get call we call ReturnAndCleanupSuperVersion. What if we delay this call to ~PinnableSlice()? This way we can guarantee that the memtable will stay in memory until the value is no longer needed.


@maysamyabandeh

maysamyabandeh Feb 7, 2017
Author Contributor

Discussed that offline with Islam. The plan is to go ahead with the string copy from the memtable and revisit the alternatives in the future.

db/memtable_list.cc Outdated
PinnableSlice* pSlicePtr = value != nullptr ? &pSlice : nullptr;
auto res = GetFromList(&memlist_history_, key, pSlicePtr, s, merge_context,
range_del_agg, seq, read_opts);
if (value != nullptr) {


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

if we are doing this check anyway, can we rewrite the code to be something like this

if (LIKELY(value != nullptr)) {
  PinnableSlice pinnable_val;
  res = GetFromList( .... );
  value->assign(pinnable_val.data(), pinnable_val.size());
} else {
  res = GetFromList( .... );
}

What do you think ?


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

yeah makes sense.

include/rocksdb/slice.h Outdated
@@ -116,6 +118,52 @@ class Slice {
// Intentionally copyable
};

class PinnableSlice : public Slice, public Cleanable {


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

Can we add a comment section for PinnableSlice


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

certainly

include/rocksdb/slice.h Outdated
cleanable->DelegateCleanupsTo(this);
}

inline void PinHeap(std::string* s) {


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

This is not used any where, can we remove it ?


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

I figured it still demonstrates how PinnableSlice could be used with heap objects too, but I do not feel strongly about it. I'll let you decide.


@IslamAbdelRahman

IslamAbdelRahman Feb 4, 2017
Contributor

I think we should not have code that is not used; it makes people wonder where it is used.
I would prefer to remove it, but that is my personal opinion. It's your call.

include/rocksdb/slice.h Outdated
size_ = self_space.size();
}

inline void PinSelf() {


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

What will happen if I call PinSelf after using PinSlice
What will IsPinned return ?


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

Good point. IsPinned was meant to say whether there is any cleanup attached to this, but after PinSelf it has become confusing. I am thinking of removing it entirely; would that work?

table/get_context.cc Outdated
@@ -106,17 +106,22 @@ bool GetContext::SaveValue(const ParsedInternalKey& parsed_key,
assert(state_ == kNotFound || state_ == kMerge);
if (kNotFound == state_) {
state_ = kFound;
-        if (value_ != nullptr) {
-          value_->assign(value.data(), value.size());
+        if (LIKELY(pSlice_ != nullptr)) {


@IslamAbdelRahman

IslamAbdelRahman Jan 17, 2017
Contributor

Let's add some comments explaining this section


@maysamyabandeh

maysamyabandeh Jan 17, 2017
Author Contributor

sure

@maysamyabandeh maysamyabandeh force-pushed the maysamyabandeh:pinnableslice-new branch Feb 7, 2017
@facebook-github-bot

@facebook-github-bot facebook-github-bot commented Feb 7, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor Author

@maysamyabandeh maysamyabandeh commented Feb 7, 2017

There are 3 failures in sandcastle, all of which seem unrelated:
util/env_test.cc:700: Failure
writable_file->PositionedAppend(data_b, kBlockSize)

db/db_universal_compaction_test.cc:1068: Failure
Value of: 0
Expected: non_trivial_move
Which is: 1
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): db/db_universal_compaction_test.cc:1068: Failure

In file included from db/compaction_job_test.cc:8:
In file included from libgcc/d6e0a7da6faba45f5e5b1638f9edd7afc2f34e7d/4.9.x/gcc-4.9-glibc-2.20/024dbc3/include/c++/4.9.x/algorithm:61:
libgcc/d6e0a7da6faba45f5e5b1638f9edd7afc2f34e7d/4.9.x/gcc-4.9-glibc-2.20/024dbc3/include/c++/4.9.x/bits/stl_algobase.h:199:15: warning: The left operand of '<' is a garbage value
if (__b < __a)
~~~ ^

@IslamAbdelRahman do you think it is ready to land?

tools/db_bench_tool.cc Outdated
@@ -229,6 +229,8 @@ DEFINE_bool(reverse_iterator, false,

DEFINE_bool(use_uint64_comparator, false, "use Uint64 user comparator");

DEFINE_bool(pin_slice, false, "use pinnable slice for point lookup");


@IslamAbdelRahman

IslamAbdelRahman Feb 9, 2017
Contributor

let's make it true by default

include/rocksdb/slice.h Outdated
private:
friend class PinnableSlice4Test;
std::string self_space;
static void ReleaseStringHeap(void* s, void*) {


@IslamAbdelRahman

IslamAbdelRahman Feb 9, 2017
Contributor

Let's move this to the test file

include/rocksdb/slice.h Outdated

private:
friend class PinnableSlice4Test;
std::string self_space;


@IslamAbdelRahman

IslamAbdelRahman Feb 9, 2017
Contributor

based on our coding style this should be

std::string self_space_;
db/db_impl.cc Outdated
@@ -4068,8 +4068,9 @@ Status DBImpl::GetImpl(const ReadOptions& read_options,
ReturnAndCleanupSuperVersion(cfd, sv);

RecordTick(stats_, NUMBER_KEYS_READ);
-  RecordTick(stats_, BYTES_READ, value->size());
-  MeasureTime(stats_, BYTES_PER_READ, value->size());
+  size_t size = value->size();


@IslamAbdelRahman

IslamAbdelRahman Feb 9, 2017
Contributor

nit: this is not needed, I believe the compiler should do it itself


@maysamyabandeh

maysamyabandeh Mar 8, 2017
Author Contributor

How does the compiler know that two invocations of size() will return the same value?

db/memtable.cc Outdated
s->env_);
} else if (s->value != nullptr) {
s->value->assign(v.data(), v.size());
if (LIKELY(s->value != nullptr)) {


@IslamAbdelRahman

IslamAbdelRahman Feb 9, 2017
Contributor

As discussed offline, since we are doing the memcpy in memtable/memtable_list anyway, I think we should keep the old code that uses std::string for now, and pass the PinnableSlice's own string in the upper layers.

include/rocksdb/db.h Outdated
std::string* value) {
if (LIKELY(value != nullptr)) {
PinnableSlice pinnable_val;
auto s = Get(options, column_family, key, &pinnable_val);


@IslamAbdelRahman

IslamAbdelRahman Feb 9, 2017
Contributor

If I understand correctly, does that mean that we introduce an extra memcpy for memtable/merge operator get ?

memtable -> PinnableSlice::self_space -> value

Let's try

std::move

or
We can allow PinnableSlice to accept an external space

class PinnableSlice {
  std::string* own_data_ptr_;  // points to own_data_ unless we change it to something else
  std::string own_data_;
};

We can measure the regression by running db_bench and making sure that all keys live in memtable

./db_bench --benchmarks="fillseq,stats,readrandom" --num=<something_small_enough>

We can verify that all the keys are in the memtable by looking at the stats result and seeing that no files are generated in L0 or any other level.
@maysamyabandeh maysamyabandeh force-pushed the maysamyabandeh:pinnableslice-new branch Mar 9, 2017
@maysamyabandeh maysamyabandeh force-pushed the maysamyabandeh:pinnableslice-new branch Mar 9, 2017
@facebook-github-bot

@facebook-github-bot facebook-github-bot commented Mar 9, 2017

Summary:
MemTable Ref/Unref must be done under a shared lock. Since the current
usages do it under db_mutex, we should do that too. But this would
likely result in performance regressions and cancel out all the
improvements of using a thread-local super version to avoid exactly the
same bottleneck.
It is still possible to think of solutions that keep track of the
memtable ref count in the thread-local SuperVersion and update the
MemTable only when we are refreshing the thread-local sv (and hence
holding db_mutex). We would need a similar solution to batch the Unrefs
invoked by PinnableSlice release in a thread-local data structure and
apply them only periodically or when we happen to have db_mutex locked
(like when we are refreshing the cached SuperVersion).

Such solutions however add non-negligible complexity (and probably
bugs) to the code base. At this point there are already benefits from
using PinnableSlice on the block cache, and it does not have to be
extended to the MemTable.
Summary:
DocumentDBImpl inherits Get from DocumentDB, which inherits it from
StackableDB. In the code, however, these methods are overridden to
return an unimplemented status, while DocumentDB::Get is still called
when required. There is no apparent rationale behind this.

We need to remove that, since with Get inlined in db.h it will
eventually call DocumentDBImpl::Get through polymorphism.
@maysamyabandeh maysamyabandeh force-pushed the maysamyabandeh:pinnableslice-new branch Mar 10, 2017
@maysamyabandeh maysamyabandeh force-pushed the maysamyabandeh:pinnableslice-new branch to d2a65c5 Mar 10, 2017
@maysamyabandeh
Contributor Author

@maysamyabandeh maysamyabandeh commented Mar 10, 2017

@IslamAbdelRahman I ran the latest patch (with std::move) against the benchmark that you suggested. It shows 7.7% lower throughput compared to master.

[?0] myabandeh@dev15089:~/rocksdb[master)]$ cat memtable.sh
#N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true -duration 60 2>&1 | tee pslice-memtable.txt
#N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false -duration 60 2>&1 | tee nopslice-memtable.txt
# checkout master
N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration 60 2>&1 | tee string-memtable.txt
[?0 0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG string-memtable.txt
readrandom [AVG    5 runs] : 15538647 ops/sec; 1719.0 MB/sec
[?0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG nopslice-memtable.txt
readrandom [AVG    5 runs] : 14362151 ops/sec; 1588.8 MB/sec
[?0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG pslice-memtable.txt
readrandom [AVG    5 runs] : 15560370 ops/sec; 1721.4 MB/sec
@maysamyabandeh
Contributor Author

@maysamyabandeh maysamyabandeh commented Mar 10, 2017

It turns out that the bottleneck is creating the PinnableSlice object (on the stack), which also has a string member. Making it thread_local, the performance of -pin_slice=false improved to:
15160104 ops/sec; 1677.1 MB/s
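The thread_local fix described above can be sketched as follows. OwningSlice and the function names are illustrative stand-ins, not the real RocksDB code: the point is that a per-thread instance is constructed once and reused across Get calls, amortizing the cost of the std::string member, as long as it is cleared before each use.

```cpp
#include <string>

// Stand-in for a slice type that owns a std::string buffer; constructing
// it on the stack per Get call is what showed up as the bottleneck.
struct OwningSlice {
  std::string self_space;
};

// One instance per thread, constructed once and reused across calls;
// it must be cleared before each use to drop state from the previous call.
int GetWithThreadLocalSlice() {
  static thread_local OwningSlice slice;
  slice.self_space.clear();
  slice.self_space.assign("value");
  return static_cast<int>(slice.self_space.size());
}
```

The trade-off is that the value lives in thread-local storage, so it must be consumed (or copied out) before the next call on the same thread.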

@facebook-github-bot

@facebook-github-bot facebook-github-bot commented Mar 10, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh maysamyabandeh force-pushed the maysamyabandeh:pinnableslice-new branch to 96a406f Mar 11, 2017
@facebook-github-bot

@facebook-github-bot facebook-github-bot commented Mar 11, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor Author

@maysamyabandeh maysamyabandeh commented Mar 13, 2017

FYI, the problem with string was that it allocates a char* buffer on the heap on each assign call. By reusing a string, it first tries to reuse the existing heap space and hence avoids a new allocation.
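This reuse behavior is easy to demonstrate in isolation (the function is illustrative; the only standard guarantee is that capacity() is at least the current size, while the buffer reuse on shrinking assigns is common-implementation behavior, not mandated):

```cpp
#include <string>

// Each assign() into a fresh std::string allocates a new heap buffer;
// assigning into a string that already grew can reuse that buffer.
size_t CapacityAfterSmallAssign() {
  std::string buf;
  buf.assign(1000, 'x');  // grows the heap buffer to hold 1000 bytes
  buf.assign(10, 'y');    // typically reuses the existing buffer
  return buf.capacity();  // guaranteed to be at least the current size
}
```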

@maysamyabandeh
Contributor Author

@maysamyabandeh maysamyabandeh commented Mar 13, 2017

I see two errors in phabricator:

  1. tsan:
     FATAL: ThreadSanitizer can not mmap the shadow memory (something is mapped at 0x55bdc3d63000 < 0x7cf000000000)
     FATAL: Make sure to compile with -fPIE and to link with -pie.
  2. db_sst_test-DBSSTTest.DeleteObsoleteFilesPendingOutputs when using the 4.8.1 compiler.

The former does not seem relevant and the latter seems flaky.

@IslamAbdelRahman what do you think? ready to land?

@siying
Contributor

@siying siying commented Mar 13, 2017

I frequently see TSAN failures like this. It is an environment problem and has nothing to do with your change. Sometimes relaunching the run fixes it.

@maysamyabandeh
Contributor Author

@maysamyabandeh maysamyabandeh commented Mar 13, 2017

Thanks @siying. Rerunning tsan helped.

@maysamyabandeh
Contributor Author

@maysamyabandeh maysamyabandeh commented Mar 13, 2017

Here are the final improvements when using pinnable slice:

value 100 byte: -2% regular, 4% merge values
value 1k byte: 14% regular, 10% merge values
value 10k byte: 34% regular, 35% merge values

Since the string-based Get is still implemented via PinnableSlice underneath, the lower throughput for non-merge 100-byte values must be experimental noise and can be ignored.

Here are the benchmark details:

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=1000 -compression_type=none
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=10000 -compression_type=none
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=100 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=1000 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=10000 -compression_type=none --merge_keys=100000 -merge_operator=max

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration=60 -pin_slice=true 2>&1 | tee scanread-10m-100-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration=60 -pin_slice=false 2>&1 | tee scanread-10m-100-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt4


TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-1k-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-1k-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt4

TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-10k-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-10k-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt4
@facebook-github-bot

@facebook-github-bot facebook-github-bot commented Mar 13, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

* to avoid memcpy by having the PinnableSlice object refer to the data
* that is pinned in memory, and release it after the data is consumed.
*/
class PinnableSlice : public Slice, public Cleanable {


facebook-github-bot added a commit that referenced this pull request Mar 31, 2017
Summary:
some fbcode services override it, we need to keep it virtual.

original change: #1756
Closes #2065

Differential Revision: D4808123

Pulled By: ajkr

fbshipit-source-id: 5eaeea7