
PinnableSlice (2nd attempt) #1756

Closed

Conversation

maysamyabandeh
Contributor

PinnableSlice

Summary:
Currently the point lookup values are copied to a string provided by the
user. This incurs an extra memcpy cost. This patch allows doing point lookups
via a PinnableSlice, which pins the source memory location (instead of
copying its content) and releases it after the content is consumed
by the user. The old Get(string) API is translated to the new API
underneath.
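
For illustration, here is a minimal sketch of how the two APIs compare from the caller's side (error handling elided; the overloads shown are the ones discussed in this patch):

#include <string>
#include "rocksdb/db.h"

void PointLookup(rocksdb::DB* db, const rocksdb::Slice& key) {
  // Old API: the value is copied into a user-provided string (extra memcpy).
  std::string value;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key, &value);

  // New API: the value stays pinned at its source (e.g. a block cache block)
  // until the PinnableSlice is Reset or destroyed; no extra copy is made.
  rocksdb::PinnableSlice pinnable_val;
  s = db->Get(rocksdb::ReadOptions(), db->DefaultColumnFamily(), key,
              &pinnable_val);
  if (s.ok()) {
    // Use pinnable_val.data() / pinnable_val.size() while it is pinned.
  }
  pinnable_val.Reset();  // releases the pinned resources
}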

Here is the summary of improvements:

value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be larger if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have left that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch while still using the old Get(string) API. The difference
is small and could be noise. More importantly, it is safely
cancelled out when the user does use the new PinnableSlice API. Here is
the summary:

value 100 byte: +0.5% regular, -2.4% merge values
value 1k byte: -1.8% regular, -0.5% merge values
value 10k byte: -1.5% regular, -2.15% merge values
Benchmark Details:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=100 -compression_type=none
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=1000 -compression_type=none
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=10000 -compression_type=none
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=100
-compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=1000
-compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=10000
-compression_type=none --merge_keys=100000 -merge_operator=max

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-100-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-100-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt

TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-1k-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-1k-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt

TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-10k-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-10k-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt


Benchmark Results:
ls -tr | grep ".*-10m-(100|1k|10k)-merge...-(no|)pslice.txt$" |
xargs -L 1 grep AVG /dev/null
scanread-10m-100-mergenon-pslice.txt:readrandom [AVG 5 runs] : 3005915 ops/sec; 210.6 MB/sec
scanread-10m-100-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 2953754 ops/sec; 207.0 MB/sec
scanreadmerge-10m-100-mergemax-pslice.txt:readrandom [AVG 5 runs] : 766150 ops/sec; 8.5 MB/sec
scanreadmerge-10m-100-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 757289 ops/sec; 8.4 MB/sec
scanread-10m-1k-mergenon-pslice.txt:readrandom [AVG 5 runs] : 5965694 ops/sec; 3661.5 MB/sec
scanread-10m-1k-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 5350749 ops/sec; 3284.1 MB/sec
scanreadmerge-10m-1k-mergemax-pslice.txt:readrandom [AVG 5 runs] : 36379493 ops/sec; 3524.9 MB/sec
scanreadmerge-10m-1k-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 33825001 ops/sec; 3277.4 MB/sec
scanread-10m-10k-mergenon-pslice.txt:readrandom [AVG 5 runs] : 3127471 ops/sec; 18923.0 MB/sec
scanread-10m-10k-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 2474603 ops/sec; 14972.8 MB/sec
scanreadmerge-10m-10k-mergemax-pslice.txt:readrandom [AVG 5 runs] : 29406680 ops/sec; 28090.5 MB/sec
scanreadmerge-10m-10k-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 22828258 ops/sec; 21806.6 MB/sec

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@IslamAbdelRahman left a comment

Thanks Maysam, some initial comments

db/db_impl.cc Outdated
@@ -3907,8 +3907,8 @@ ColumnFamilyHandle* DBImpl::DefaultColumnFamily() const {

Status DBImpl::Get(const ReadOptions& read_options,
ColumnFamilyHandle* column_family, const Slice& key,
std::string* value) {
return GetImpl(read_options, column_family, key, value);
PinnableSlice* pSlice) {
Contributor

I prefer leaving the name to be value, what do you think ?

Contributor Author

Sure.

db/db_impl.cc Outdated
@@ -6395,7 +6405,8 @@ Status DBImpl::GetLatestSequenceForKey(SuperVersion* sv, const Slice& key,
*found_record_for_key = false;

// Check if there is a record for this key in the latest memtable
sv->mem->Get(lkey, nullptr, &s, &merge_context, &range_del_agg, seq,
PinnableSlice* pSliceNullPtr = nullptr;
sv->mem->Get(lkey, pSliceNullPtr, &s, &merge_context, &range_del_agg, seq,
Contributor

to be consistent with the style of the rest of the code base, let's change this line to be
sv->mem->Get(lkey, nullptr /* value */, &s, &merge_context, &range_del_agg, seq,

Contributor Author

Unfortunately it would not work. There is an overload of this function with this signature: Get(Slice, String, ...). If we simply pass nullptr, the compiler would not know which overload to invoke.
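
For readers outside the thread, a tiny standalone example of the ambiguity (hypothetical Get overloads standing in for the MemTable::Get pair):

#include <string>

struct PinnableSlice {};

// Two overloads analogous to the MemTable::Get pair discussed above.
void Get(const std::string& /*key*/, std::string* /*value*/) {}
void Get(const std::string& /*key*/, PinnableSlice* /*value*/) {}

int main() {
  std::string key = "k";
  // Get(key, nullptr);                            // error: ambiguous call
  Get(key, static_cast<PinnableSlice*>(nullptr));  // OK: overload is explicit
  Get(key, static_cast<std::string*>(nullptr));    // OK: picks the other one
  return 0;
}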

Contributor

Sounds good to me

db/memtable.cc Outdated
@@ -534,6 +534,10 @@ struct Saver {
};
} // namespace

//static void UnrefMemTable(void* s, void*) {
// reinterpret_cast<MemTable*>(s)->Unref();
//}
Contributor

can we remove this ?

Contributor Author

sure. i wanted to make it easier for you to see which options we took into account before resorting to string copy for memtables.

db/memtable.cc Outdated
s->pSlice->PinSelf();
} else {
//s->mem->Ref();
//s->pSlice->PinSlice(v, UnrefMemTable, s->mem, nullptr);
Contributor

same

db/memtable.cc Outdated
} else {
//s->mem->Ref();
//s->pSlice->PinSlice(v, UnrefMemTable, s->mem, nullptr);
s->pSlice->PinSelf(v);
Contributor

so we decided to do a memcpy if we are reading from the memtable, why is that ?

Contributor Author

i tried to make the commit message well detailed: 2c4cead

Contributor

What about this: at the end of the Get call we call ReturnAndCleanupSuperVersion; what if we delay that call to ~PinnableSlice()? This way we can guarantee that the memtable will stay in memory until the value is no longer needed.

Contributor Author

Discussed that offline with Islam. The plan is to go ahead with the string copy from the memtable and revisit the alternatives in the future.

PinnableSlice* pSlicePtr = value != nullptr ? &pSlice : nullptr;
auto res = GetFromList(&memlist_history_, key, pSlicePtr, s, merge_context,
range_del_agg, seq, read_opts);
if (value != nullptr) {
Contributor

if we are doing this check anyway, can we rewrite the code to be something like this

if (LIKELY(value != nullptr)) {
  PinnableSlice pinnable_val;
  res = GetFromList( .... );
  value->assign(pinnable_val.data(), pinnable_val.size());
} else {
  res = GetFromList( .... );
}

What do you think ?

Contributor Author

yeah makes sense.

@@ -116,6 +118,52 @@ class Slice {
// Intentionally copyable
};

class PinnableSlice : public Slice, public Cleanable {
Contributor

Can we add a comment section for PinnableSlice

Contributor Author

certainly

cleanable->DelegateCleanupsTo(this);
}

inline void PinHeap(std::string* s) {
Contributor

This is not used anywhere, can we remove it?

Contributor Author

i figured it still demonstrates how pinnable slice could be used with heap objects too. but i do not feel strongly about it. i let you decide.

Contributor

I think we should not have code that is not used; it makes people wonder where this code is used.
I would prefer that we remove it, but this is my personal opinion. It's your call.

size_ = self_space.size();
}

inline void PinSelf() {
Contributor

What will happen if I call PinSelf after using PinSlice?
What will IsPinned return?

Contributor Author

Good point. IsPinned was meant to indicate whether there is any cleanup attached to this, but after PinSelf it has become confusing. I am thinking of removing it entirely. Would that work?
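
For readers following along, here is a simplified, self-contained sketch of the two pinning modes (the method names mirror the patch, but the bodies are illustrative rather than the actual RocksDB implementation). It also shows why "pinned" becomes ambiguous after PinSelf: there is data to serve but no cleanup attached.

#include <cstddef>
#include <functional>
#include <string>
#include <utility>

class PinnableSliceSketch {
 public:
  // PinSlice: refer to external memory and remember how to release it
  // (e.g. unpinning a block cache block) when the slice is reset/destroyed.
  void PinSlice(const char* data, size_t size, std::function<void()> cleanup) {
    data_ = data;
    size_ = size;
    cleanup_ = std::move(cleanup);
  }

  // PinSelf: copy into the slice's own buffer; nothing external to release.
  void PinSelf(const char* data, size_t size) {
    self_space_.assign(data, size);  // the one memcpy
    data_ = self_space_.data();
    size_ = self_space_.size();
  }

  void Reset() {
    if (cleanup_) {
      cleanup_();
      cleanup_ = nullptr;
    }
    data_ = nullptr;
    size_ = 0;
  }

  ~PinnableSliceSketch() { Reset(); }

  const char* data() const { return data_; }
  size_t size() const { return size_; }

 private:
  const char* data_ = nullptr;
  size_t size_ = 0;
  std::string self_space_;
  std::function<void()> cleanup_;
};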

@@ -106,17 +106,22 @@ bool GetContext::SaveValue(const ParsedInternalKey& parsed_key,
assert(state_ == kNotFound || state_ == kMerge);
if (kNotFound == state_) {
state_ = kFound;
if (value_ != nullptr) {
value_->assign(value.data(), value.size());
if (LIKELY(pSlice_ != nullptr)) {
Contributor

Let's add some comments explaining this section

Contributor Author

sure

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor Author

There are 3 failures in sandcastle which all seem irrelevant:
util/env_test.cc:700: Failure
writable_file->PositionedAppend(data_b, kBlockSize)

db/db_universal_compaction_test.cc:1068: Failure
Value of: 0
Expected: non_trivial_move
Which is: 1
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): db/db_universal_compaction_test.cc:1068: Failure

In file included from db/compaction_job_test.cc:8:
In file included from libgcc/d6e0a7da6faba45f5e5b1638f9edd7afc2f34e7d/4.9.x/gcc-4.9-glibc-2.20/024dbc3/include/c++/4.9.x/algorithm:61:
libgcc/d6e0a7da6faba45f5e5b1638f9edd7afc2f34e7d/4.9.x/gcc-4.9-glibc-2.20/024dbc3/include/c++/4.9.x/bits/stl_algobase.h:199:15: warning: The left operand of '<' is a garbage value
if (__b < __a)
~~~ ^

@IslamAbdelRahman do you think it is ready to land?

@@ -229,6 +229,8 @@ DEFINE_bool(reverse_iterator, false,

DEFINE_bool(use_uint64_comparator, false, "use Uint64 user comparator");

DEFINE_bool(pin_slice, false, "use pinnable slice for point lookup");
Contributor

let's make it true by default

private:
friend class PinnableSlice4Test;
std::string self_space;
static void ReleaseStringHeap(void* s, void*) {
Contributor

Let's move this to the test file


private:
friend class PinnableSlice4Test;
std::string self_space;
Contributor

based on our coding style this should be

std::string self_space_;

db/db_impl.cc Outdated
@@ -4068,8 +4068,9 @@ Status DBImpl::GetImpl(const ReadOptions& read_options,
ReturnAndCleanupSuperVersion(cfd, sv);

RecordTick(stats_, NUMBER_KEYS_READ);
RecordTick(stats_, BYTES_READ, value->size());
MeasureTime(stats_, BYTES_PER_READ, value->size());
size_t size = value->size();
Contributor

nit: this is not needed, I believe the compiler should do it itself

Contributor Author

How does the compiler know that two invocations of size() will return the same value?
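
A small illustration of the aliasing concern behind hoisting the size; the stand-in functions below are hypothetical, playing the role of stats calls whose definitions the optimizer cannot see at this call site:

#include <cstddef>
#include <string>

// Stand-ins for RecordTick/MeasureTime; in the real code these are external
// calls the optimizer cannot look through here.
void RecordBytes(size_t) {}
void MeasureBytes(size_t) {}

void ReportRead(const std::string* value) {
  // The compiler must assume the first call may have modified *value through
  // some other pointer, so value->size() is evaluated again:
  RecordBytes(value->size());
  MeasureBytes(value->size());

  // Caching the result makes the single read explicit:
  const size_t size = value->size();
  RecordBytes(size);
  MeasureBytes(size);
}

int main() {
  std::string v = "example";
  ReportRead(&v);
  return 0;
}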

db/memtable.cc Outdated
s->env_);
} else if (s->value != nullptr) {
s->value->assign(v.data(), v.size());
if (LIKELY(s->value != nullptr)) {
Contributor

As discussed offline: since we are doing the memcpy in memtable/memtable list anyway, I think we should keep the old code that uses std::string for now, and pass the PinnableSlice's own string in the upper layers.

std::string* value) {
if (LIKELY(value != nullptr)) {
PinnableSlice pinnable_val;
auto s = Get(options, column_family, key, &pinnable_val);
Contributor

If I understand correctly, does that mean that we introduce an extra memcpy for memtable/merge operator get ?

memtable -> PinnableSlice::self_space -> value

Let's try

std::move

or
We can allow PinnableSlice to accept an external space

PinnableSlice {
  std::string* own_data_ptr_;  // This points to own_data_ unless we change it to something else
  std::string own_data_;
}

We can measure the regression by running db_bench and making sure that all keys live in memtable

./db_bench --benchmarks="fillseq,stats,readrandom" --num=<something_small_enough>

we can verify that all the keys are in the memtable by looking at the stats result and seeing that there are no files generated in L0 or any other level
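
To make the "external space" idea concrete, one possible shape is sketched below; own_data_ptr_ and the method names are hypothetical, illustrating the proposal rather than code that exists in the patch:

#include <cstddef>
#include <string>

class PinnableSliceSketch {
 public:
  PinnableSliceSketch() : own_data_ptr_(&own_data_) {}

  // Redirect the internal buffer to caller-provided storage, e.g. the
  // std::string passed to the old Get(std::string*) API.
  void UseExternalBuffer(std::string* external) { own_data_ptr_ = external; }

  // PinSelf copies into whichever string own_data_ptr_ points at, so with an
  // external buffer the memtable copy lands directly in the caller's string.
  void PinSelf(const char* data, size_t size) {
    own_data_ptr_->assign(data, size);
    data_ = own_data_ptr_->data();
    size_ = own_data_ptr_->size();
  }

  const char* data() const { return data_; }
  size_t size() const { return size_; }

 private:
  std::string* own_data_ptr_;  // points to own_data_ unless redirected
  std::string own_data_;
  const char* data_ = nullptr;
  size_t size_ = 0;
};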

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

Summary:
MemTable Ref/Unref must be done under a shared lock. Since the current
usages do it under db_mutex, we should do that too. But this would
likely result in performance regressions and cancel out all the
improvements of using a thread-local super version to avoid exactly the
same bottleneck.
It is still possible to think of solutions that keep track of the memtable
ref count in the thread-local SuperVersion and update the MemTable only
when we are refreshing the thread-local sv (and hence holding the lock on
db_mutex). We would need a similar solution to batch the Unrefs invoked
by PinnableSlice release in a thread-local data structure and apply them
only periodically or when we happen to have the db_mutex locked (like
when we are refreshing the cached SuperVersion).

Such solutions however add non-negligible complexity (and probably
bugs) to the code base. At this point there are already benefits from
using PinnableSlice on the block cache, and it does not have to be
extended to the MemTable.
Summary:
DocumentDBImpl is inheriting Get from DocumentDB, which inherits it from
StackableDB. In the code, however, these methods are overridden by
returning an unimplemented status, and yet DocumentDB::Get is called when it
is required. There is no apparent rationale behind it.

We need to remove that since, with Get inlined in db.h, it will
eventually call DocumentDBImpl::Get through polymorphism.
@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@maysamyabandeh
Contributor Author

@IslamAbdelRahman I ran the latest patch (with std::move) against the benchmark that you suggested. It shows 7.7% lower throughput compared to master.

[?0] myabandeh@dev15089:~/rocksdb[master)]$ cat memtable.sh
#N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true -duration 60 2>&1 | tee pslice-memtable.txt
#N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false -duration 60 2>&1 | tee nopslice-memtable.txt
# checkout master
N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration 60 2>&1 | tee string-memtable.txt
[?0 0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG string-memtable.txt
readrandom [AVG    5 runs] : 15538647 ops/sec; 1719.0 MB/sec
[?0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG nopslice-memtable.txt
readrandom [AVG    5 runs] : 14362151 ops/sec; 1588.8 MB/sec
[?0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG pslice-memtable.txt
readrandom [AVG    5 runs] : 15560370 ops/sec; 1721.4 MB/sec

@maysamyabandeh
Contributor Author

It turns out that the bottleneck is creating the PinnableSlice object (on the stack), which also has a string member. After making it thread_local, the performance of -pin_slice=false improved to this:
15160104 ops/sec; 1677.1 MB/s
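
A rough sketch of that change in the Get(std::string*) translation path; the wrapper below is a simplification standing in for the real inline DB::Get, not the merged code:

#include <string>
#include "rocksdb/db.h"

// Reuse one PinnableSlice (and its internal string buffer) per thread so the
// old string-based path does not construct them on every lookup.
rocksdb::Status GetWithStringValue(const rocksdb::ReadOptions& options,
                                   rocksdb::ColumnFamilyHandle* column_family,
                                   const rocksdb::Slice& key,
                                   std::string* value, rocksdb::DB* db) {
  static thread_local rocksdb::PinnableSlice pinnable_val;
  pinnable_val.Reset();  // drop any pin left over from the previous call
  rocksdb::Status s = db->Get(options, column_family, key, &pinnable_val);
  if (s.ok()) {
    value->assign(pinnable_val.data(), pinnable_val.size());
  }
  return s;
}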

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor Author

FYI, the problem with string was that it allocates a new char* buffer on the heap on each assign call. By reusing a string, it first tries to reuse the existing on-heap space and hence avoids creating a new allocation.
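
A small standalone example of the capacity reuse being described; the buffer reuse is typical library behavior, not something the standard strictly guarantees:

#include <iostream>
#include <string>

int main() {
  std::string buf;
  buf.assign(1000, 'x');            // first assign: allocates heap storage
  const void* before = buf.data();

  buf.assign(500, 'y');             // same string reused: existing capacity
  const void* after = buf.data();   // is typically reused, no new allocation

  std::cout << "buffer reused: " << std::boolalpha << (before == after)
            << std::endl;
  return 0;
}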

@maysamyabandeh
Contributor Author

I see two errors in Phabricator:

  1. tsan:
FATAL: ThreadSanitizer can not mmap the shadow memory (something is mapped at 0x55bdc3d63000 < 0x7cf000000000)
FATAL: Make sure to compile with -fPIE and to link with -pie.
  2. db_sst_test-DBSSTTest.DeleteObsoleteFilesPendingOutputs when using the 4.8.1 compiler.

The former does not seem relevant and the latter seems flaky.

@IslamAbdelRahman what do you think? ready to land?

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@siying
Contributor

siying commented Mar 13, 2017

I frequently see TSAN failures like this. This is an environment problem; it has nothing to do with your change. Sometimes if you relaunch it, it can run.

@maysamyabandeh
Contributor Author

Thanks @siying. Rerunning tsan helped.

@maysamyabandeh
Contributor Author

Here are the final improvements when using pinnable slice:

value 100 byte: -2% regular, 4% merge values
value 1k byte: 14% regular, 10% merge values
value 10k byte: 34% regular, 35% merge values

Since using string is still implemented via pinnable slice underneath, the lower throughput in the case of non-merge 100-byte values must be experimental error and can be ignored.

Here are the benchmark details:

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=1000 -compression_type=none
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=10000 -compression_type=none
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=100 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=1000 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=10000 -compression_type=none --merge_keys=100000 -merge_operator=max

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration=60 -pin_slice=true 2>&1 | tee scanread-10m-100-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration=60 -pin_slice=false 2>&1 | tee scanread-10m-100-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt4


TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-1k-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-1k-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt4

TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-10k-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-10k-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt4

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

* to avoid memcpy by having the PinnableSlice object referring to the data
* that is locked in memory and release them after the data is consumed.
*/
class PinnableSlice : public Slice, public Cleanable {

facebook-github-bot pushed a commit that referenced this pull request Mar 31, 2017
Summary:
some fbcode services override it, we need to keep it virtual.

original change: #1756
Closes #2065

Differential Revision: D4808123

Pulled By: ajkr

fbshipit-source-id: 5eaeea7