
PinnableSlice (2nd attempt) #1756

Closed

Conversation

maysamyabandeh
Contributor

PinnableSlice

Summary:
Currently the point lookup values are copied to a string provided by the
user. This incurs an extra memcpy cost. This patch allows doing point lookups
via a PinnableSlice, which pins the source memory location (instead of
copying its content) and releases it after the content is consumed
by the user. The old Get(string) API is translated to the new API
underneath.
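
For illustration, here is a minimal sketch of how the two APIs compare from the caller's side (error handling elided; the overloads shown are the ones discussed in this patch):

#include <string>
#include "rocksdb/db.h"

void PointLookup(rocksdb::DB* db, const rocksdb::Slice& key) {
  // Old API: the value is copied into a user-provided string (extra memcpy).
  std::string value;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key, &value);

  // New API: the value stays pinned at its source (e.g. a block cache block)
  // until the PinnableSlice is Reset or destroyed; no extra copy is made.
  rocksdb::PinnableSlice pinnable_val;
  s = db->Get(rocksdb::ReadOptions(), db->DefaultColumnFamily(), key,
              &pinnable_val);
  if (s.ok()) {
    // Use pinnable_val.data() / pinnable_val.size() while it is pinned.
  }
  pinnable_val.Reset();  // releases the pinned resources
}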

Here is the summary of improvements:

value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be larger if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have left that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch while still using the old Get(string) API. The difference
is small and could be noise. More importantly, it is safely
cancelled out when the user does use the new PinnableSlice API. Here is
the summary:

value 100 byte: +0.5% regular, -2.4% merge values
value 1k byte: -1.8% regular, -0.5% merge values
value 10k byte: -1.5% regular, -2.15% merge values
Benchmark Details:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=100 -compression_type=none
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=1000 -compression_type=none
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=10000 -compression_type=none
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=100
-compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=1000
-compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks=mergerandom --num=1000000 -value_size=10000
-compression_type=none --merge_keys=100000 -merge_operator=max

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-100-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-100-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt

TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-1k-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-1k-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt

TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=true 2>&1 | tee
scanread-10m-10k-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -pin_slice=false 2>&1 | tee
scanread-10m-10k-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120 -pin_slice=true
2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench
--benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000
--reads=10000000 --cache_size=10000000000 -threads=32
-compression_type=none -merge_operator=max -duration=120
-pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt


Benchmark Results:
ls -tr | grep ".*-10m-(100|1k|10k)-merge...-(no|)pslice.txt$" |
xargs -L 1 grep AVG /dev/null
scanread-10m-100-mergenon-pslice.txt:readrandom [AVG 5 runs] : 3005915 ops/sec; 210.6 MB/sec
scanread-10m-100-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 2953754 ops/sec; 207.0 MB/sec
scanreadmerge-10m-100-mergemax-pslice.txt:readrandom [AVG 5 runs] : 766150 ops/sec; 8.5 MB/sec
scanreadmerge-10m-100-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 757289 ops/sec; 8.4 MB/sec
scanread-10m-1k-mergenon-pslice.txt:readrandom [AVG 5 runs] : 5965694 ops/sec; 3661.5 MB/sec
scanread-10m-1k-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 5350749 ops/sec; 3284.1 MB/sec
scanreadmerge-10m-1k-mergemax-pslice.txt:readrandom [AVG 5 runs] : 36379493 ops/sec; 3524.9 MB/sec
scanreadmerge-10m-1k-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 33825001 ops/sec; 3277.4 MB/sec
scanread-10m-10k-mergenon-pslice.txt:readrandom [AVG 5 runs] : 3127471 ops/sec; 18923.0 MB/sec
scanread-10m-10k-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 2474603 ops/sec; 14972.8 MB/sec
scanreadmerge-10m-10k-mergemax-pslice.txt:readrandom [AVG 5 runs] : 29406680 ops/sec; 28090.5 MB/sec
scanreadmerge-10m-10k-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 22828258 ops/sec; 21806.6 MB/sec

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@IslamAbdelRahman left a comment

Thanks Maysam, some initial comments

db/db_impl.cc Outdated
@@ -3907,8 +3907,8 @@ ColumnFamilyHandle* DBImpl::DefaultColumnFamily() const {

Status DBImpl::Get(const ReadOptions& read_options,
ColumnFamilyHandle* column_family, const Slice& key,
std::string* value) {
return GetImpl(read_options, column_family, key, value);
PinnableSlice* pSlice) {
Contributor

I prefer leaving the name to be value, what do you think ?

Contributor Author

Sure.

db/db_impl.cc Outdated
@@ -6395,7 +6405,8 @@ Status DBImpl::GetLatestSequenceForKey(SuperVersion* sv, const Slice& key,
*found_record_for_key = false;

// Check if there is a record for this key in the latest memtable
sv->mem->Get(lkey, nullptr, &s, &merge_context, &range_del_agg, seq,
PinnableSlice* pSliceNullPtr = nullptr;
sv->mem->Get(lkey, pSliceNullPtr, &s, &merge_context, &range_del_agg, seq,
Contributor

to be consistent with the style of the rest of the code base, let's change this line to be
sv->mem->Get(lkey, nullptr /* value */, &s, &merge_context, &range_del_agg, seq,

Contributor Author

Unfortunately it would not work. There is an overload of this function with this signature: Get(Slice, String, ...). If we simply pass nullptr, the compiler would not know which overload to invoke.
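
For readers outside the thread, a tiny standalone example of the ambiguity (hypothetical Get overloads standing in for the MemTable::Get pair):

#include <string>

struct PinnableSlice {};

// Two overloads analogous to the MemTable::Get pair discussed above.
void Get(const std::string& /*key*/, std::string* /*value*/) {}
void Get(const std::string& /*key*/, PinnableSlice* /*value*/) {}

int main() {
  std::string key = "k";
  // Get(key, nullptr);                            // error: ambiguous call
  Get(key, static_cast<PinnableSlice*>(nullptr));  // OK: overload is explicit
  Get(key, static_cast<std::string*>(nullptr));    // OK: picks the other one
  return 0;
}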

Contributor

Sounds good to me

db/memtable.cc Outdated
@@ -534,6 +534,10 @@ struct Saver {
};
} // namespace

//static void UnrefMemTable(void* s, void*) {
// reinterpret_cast<MemTable*>(s)->Unref();
//}
Contributor

can we remove this ?

Contributor Author

sure. i wanted to make it easier for you to see which options we took into account before resorting to string copy for memtables.

db/memtable.cc Outdated
s->pSlice->PinSelf();
} else {
//s->mem->Ref();
//s->pSlice->PinSlice(v, UnrefMemTable, s->mem, nullptr);
Contributor

same

db/memtable.cc Outdated
} else {
//s->mem->Ref();
//s->pSlice->PinSlice(v, UnrefMemTable, s->mem, nullptr);
s->pSlice->PinSelf(v);
Contributor

so we decided to do a memcpy if we are reading from the memtable, why is that ?

Contributor Author

i tried to make the commit message well detailed: 2c4cead

Contributor

What about this: at the end of the Get call we call ReturnAndCleanupSuperVersion; what if we delay that call to ~PinnableSlice()? This way we can guarantee that the memtable will stay in memory until the value is no longer needed.

Contributor Author

Discussed that offline with Islam. The plan is to go ahead with the string copy from the memtable and revisit the alternatives in the future.

PinnableSlice* pSlicePtr = value != nullptr ? &pSlice : nullptr;
auto res = GetFromList(&memlist_history_, key, pSlicePtr, s, merge_context,
range_del_agg, seq, read_opts);
if (value != nullptr) {
Contributor

if we are doing this check anyway, can we rewrite the code to be something like this

if (LIKELY(value != nullptr)) {
  PinnableSlice pinnable_val;
  res = GetFromList( .... );
  value->assign(pinnable_val.data(), pinnable_val.size());
} else {
  res = GetFromList( .... );
}

What do you think ?

Contributor Author

yeah makes sense.

@@ -116,6 +118,52 @@ class Slice {
// Intentionally copyable
};

class PinnableSlice : public Slice, public Cleanable {
Contributor

Can we add a comment section for PinnableSlice

Contributor Author

certainly

cleanable->DelegateCleanupsTo(this);
}

inline void PinHeap(std::string* s) {
Contributor

This is not used anywhere, can we remove it?

Contributor Author

i figured it still demonstrates how pinnable slice could be used with heap objects too. but i do not feel strongly about it. i let you decide.

Contributor

I think we should not have code that is not used; it makes people wonder where this code is used.
I would prefer that we remove it, but this is my personal opinion. It's your call.

size_ = self_space.size();
}

inline void PinSelf() {
Contributor

What will happen if I call PinSelf after using PinSlice?
What will IsPinned return?

Contributor Author

Good point. IsPinned was meant to indicate whether there is any cleanup attached to this, but after PinSelf it has become confusing. I am thinking of removing it entirely. Would that work?
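
For readers following along, here is a simplified, self-contained sketch of the two pinning modes (the method names mirror the patch, but the bodies are illustrative rather than the actual RocksDB implementation). It also shows why "pinned" becomes ambiguous after PinSelf: there is data to serve but no cleanup attached.

#include <cstddef>
#include <functional>
#include <string>
#include <utility>

class PinnableSliceSketch {
 public:
  // PinSlice: refer to external memory and remember how to release it
  // (e.g. unpinning a block cache block) when the slice is reset/destroyed.
  void PinSlice(const char* data, size_t size, std::function<void()> cleanup) {
    data_ = data;
    size_ = size;
    cleanup_ = std::move(cleanup);
  }

  // PinSelf: copy into the slice's own buffer; nothing external to release.
  void PinSelf(const char* data, size_t size) {
    self_space_.assign(data, size);  // the one memcpy
    data_ = self_space_.data();
    size_ = self_space_.size();
  }

  void Reset() {
    if (cleanup_) {
      cleanup_();
      cleanup_ = nullptr;
    }
    data_ = nullptr;
    size_ = 0;
  }

  ~PinnableSliceSketch() { Reset(); }

  const char* data() const { return data_; }
  size_t size() const { return size_; }

 private:
  const char* data_ = nullptr;
  size_t size_ = 0;
  std::string self_space_;
  std::function<void()> cleanup_;
};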

@@ -106,17 +106,22 @@ bool GetContext::SaveValue(const ParsedInternalKey& parsed_key,
assert(state_ == kNotFound || state_ == kMerge);
if (kNotFound == state_) {
state_ = kFound;
if (value_ != nullptr) {
value_->assign(value.data(), value.size());
if (LIKELY(pSlice_ != nullptr)) {
Contributor

Let's add some comments explaining this section

Contributor Author

sure

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor Author

There are 3 failures in sandcastle which all seem irrelevant:
util/env_test.cc:700: Failure
writable_file->PositionedAppend(data_b, kBlockSize)

db/db_universal_compaction_test.cc:1068: Failure
Value of: 0
Expected: non_trivial_move
Which is: 1
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): db/db_universal_compaction_test.cc:1068: Failure

In file included from db/compaction_job_test.cc:8:
In file included from libgcc/d6e0a7da6faba45f5e5b1638f9edd7afc2f34e7d/4.9.x/gcc-4.9-glibc-2.20/024dbc3/include/c++/4.9.x/algorithm:61:
libgcc/d6e0a7da6faba45f5e5b1638f9edd7afc2f34e7d/4.9.x/gcc-4.9-glibc-2.20/024dbc3/include/c++/4.9.x/bits/stl_algobase.h:199:15: warning: The left operand of '<' is a garbage value
if (__b < __a)
~~~ ^

@IslamAbdelRahman do you think it is ready to land?

@@ -229,6 +229,8 @@ DEFINE_bool(reverse_iterator, false,

DEFINE_bool(use_uint64_comparator, false, "use Uint64 user comparator");

DEFINE_bool(pin_slice, false, "use pinnable slice for point lookup");
Contributor

let's make it true by default

private:
friend class PinnableSlice4Test;
std::string self_space;
static void ReleaseStringHeap(void* s, void*) {
Contributor

Let's move this to the test file


private:
friend class PinnableSlice4Test;
std::string self_space;
Contributor

based on our coding style this should be

std::string self_space_;

db/db_impl.cc Outdated
@@ -4068,8 +4068,9 @@ Status DBImpl::GetImpl(const ReadOptions& read_options,
ReturnAndCleanupSuperVersion(cfd, sv);

RecordTick(stats_, NUMBER_KEYS_READ);
RecordTick(stats_, BYTES_READ, value->size());
MeasureTime(stats_, BYTES_PER_READ, value->size());
size_t size = value->size();
Contributor

nit: this is not needed, I believe the compiler should do it itself

Contributor Author

How does the compiler know that two invocations of size() will return the same value?
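
A small illustration of the aliasing concern behind hoisting the size; the stand-in functions below are hypothetical, playing the role of stats calls whose definitions the optimizer cannot see at this call site:

#include <cstddef>
#include <string>

// Stand-ins for RecordTick/MeasureTime; in the real code these are external
// calls the optimizer cannot look through here.
void RecordBytes(size_t) {}
void MeasureBytes(size_t) {}

void ReportRead(const std::string* value) {
  // The compiler must assume the first call may have modified *value through
  // some other pointer, so value->size() is evaluated again:
  RecordBytes(value->size());
  MeasureBytes(value->size());

  // Caching the result makes the single read explicit:
  const size_t size = value->size();
  RecordBytes(size);
  MeasureBytes(size);
}

int main() {
  std::string v = "example";
  ReportRead(&v);
  return 0;
}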

db/memtable.cc Outdated
s->env_);
} else if (s->value != nullptr) {
s->value->assign(v.data(), v.size());
if (LIKELY(s->value != nullptr)) {
Contributor

As discussed offline: since we are doing the memcpy in memtable/memtable list anyway, I think we should keep the old code that uses std::string for now, and pass the PinnableSlice's own string in the upper layers.

std::string* value) {
if (LIKELY(value != nullptr)) {
PinnableSlice pinnable_val;
auto s = Get(options, column_family, key, &pinnable_val);
Contributor

If I understand correctly, does that mean that we introduce an extra memcpy for memtable/merge operator get ?

memtable -> PinnableSlice::self_space -> value

Let's try

std::move

or
We can allow PinnableSlice to accept an external space

PinnableSlice {
  std::string* own_data_ptr_;  // This points to own_data_ unless we change it to something else
  std::string own_data_;
}

We can measure the regression by running db_bench and making sure that all keys live in memtable

./db_bench --benchmarks="fillseq,stats,readrandom" --num=<something_small_enough>

we can verify that all the keys are in the memtable by looking at the stats result and seeing that there are no files generated in L0 or any other level
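
To make the "external space" idea concrete, one possible shape is sketched below; own_data_ptr_ and the method names are hypothetical, illustrating the proposal rather than code that exists in the patch:

#include <cstddef>
#include <string>

class PinnableSliceSketch {
 public:
  PinnableSliceSketch() : own_data_ptr_(&own_data_) {}

  // Redirect the internal buffer to caller-provided storage, e.g. the
  // std::string passed to the old Get(std::string*) API.
  void UseExternalBuffer(std::string* external) { own_data_ptr_ = external; }

  // PinSelf copies into whichever string own_data_ptr_ points at, so with an
  // external buffer the memtable copy lands directly in the caller's string.
  void PinSelf(const char* data, size_t size) {
    own_data_ptr_->assign(data, size);
    data_ = own_data_ptr_->data();
    size_ = own_data_ptr_->size();
  }

  const char* data() const { return data_; }
  size_t size() const { return size_; }

 private:
  std::string* own_data_ptr_;  // points to own_data_ unless redirected
  std::string own_data_;
  const char* data_ = nullptr;
  size_t size_ = 0;
};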

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

Summary:
MemTable Ref/Unref must be done under a shared lock. Since the current
usages do it under db_mutex, we should do that too. But this would
likely result in performance regressions and cancel out all the
improvements of using a thread-local super version to avoid exactly the
same bottleneck.
It is still possible to think of solutions that keep track of the memtable
ref count in the thread-local SuperVersion and update the MemTable only
when we are refreshing the thread-local sv (and hence holding the lock on
db_mutex). We would need a similar solution to batch the Unrefs invoked
by PinnableSlice release in a thread-local data structure and apply them
only periodically or when we happen to have the db_mutex locked (like
when we are refreshing the cached SuperVersion).

Such solutions however add non-negligible complexity (and probably
bugs) to the code base. At this point there are already benefits from
using PinnableSlice on the block cache, and it does not have to be
extended to the MemTable.
Summary:
DocumentDBImpl is inheriting Get from DocumentDB, which inherits it from
StackableDB. In the code, however, these methods are overridden by
returning an unimplemented status, and yet DocumentDB::Get is called when it
is required. There is no apparent rationale behind it.

We need to remove that since, with Get inlined in db.h, it will
eventually call DocumentDBImpl::Get through polymorphism.
@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@maysamyabandeh
Contributor Author

@IslamAbdelRahman I ran the latest patch (with std::move) against the benchmark that you suggested. It shows 7.7% lower throughput compared to master.

[?0] myabandeh@dev15089:~/rocksdb[master)]$ cat memtable.sh
#N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true -duration 60 2>&1 | tee pslice-memtable.txt
#N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false -duration 60 2>&1 | tee nopslice-memtable.txt
# checkout master
N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration 60 2>&1 | tee string-memtable.txt
[?0 0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG string-memtable.txt
readrandom [AVG    5 runs] : 15538647 ops/sec; 1719.0 MB/sec
[?0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG nopslice-memtable.txt
readrandom [AVG    5 runs] : 14362151 ops/sec; 1588.8 MB/sec
[?0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG pslice-memtable.txt
readrandom [AVG    5 runs] : 15560370 ops/sec; 1721.4 MB/sec

@maysamyabandeh
Contributor Author

It turns out that the bottleneck is creating the PinnableSlice object (on the stack), which also has a string member. After making it thread_local, the performance of -pin_slice=false improved to this:
15160104 ops/sec; 1677.1 MB/s
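
A rough sketch of that change in the Get(std::string*) translation path; the wrapper below is a simplification standing in for the real inline DB::Get, not the merged code:

#include <string>
#include "rocksdb/db.h"

// Reuse one PinnableSlice (and its internal string buffer) per thread so the
// old string-based path does not construct them on every lookup.
rocksdb::Status GetWithStringValue(const rocksdb::ReadOptions& options,
                                   rocksdb::ColumnFamilyHandle* column_family,
                                   const rocksdb::Slice& key,
                                   std::string* value, rocksdb::DB* db) {
  static thread_local rocksdb::PinnableSlice pinnable_val;
  pinnable_val.Reset();  // drop any pin left over from the previous call
  rocksdb::Status s = db->Get(options, column_family, key, &pinnable_val);
  if (s.ok()) {
    value->assign(pinnable_val.data(), pinnable_val.size());
  }
  return s;
}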

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor Author

FYI, the problem with string was that it allocates a new char* buffer on the heap on each assign call. By reusing a string, it first tries to reuse the existing on-heap space and hence avoids creating a new allocation.
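
A small standalone example of the capacity reuse being described; the buffer reuse is typical library behavior, not something the standard strictly guarantees:

#include <iostream>
#include <string>

int main() {
  std::string buf;
  buf.assign(1000, 'x');            // first assign: allocates heap storage
  const void* before = buf.data();

  buf.assign(500, 'y');             // same string reused: existing capacity
  const void* after = buf.data();   // is typically reused, no new allocation

  std::cout << "buffer reused: " << std::boolalpha << (before == after)
            << std::endl;
  return 0;
}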

@maysamyabandeh
Contributor Author

I see two errors in Phabricator:

  1. tsan:
FATAL: ThreadSanitizer can not mmap the shadow memory (something is mapped at 0x55bdc3d63000 < 0x7cf000000000)
FATAL: Make sure to compile with -fPIE and to link with -pie.
  2. db_sst_test-DBSSTTest.DeleteObsoleteFilesPendingOutputs when using the 4.8.1 compiler.

The former does not seem relevant and the latter seems flaky.

@IslamAbdelRahman what do you think? ready to land?

@facebook-github-bot
Contributor

@maysamyabandeh updated the pull request - view changes - changes since last import

@siying
Contributor

siying commented Mar 13, 2017

I frequently see TSAN failures like this. This is an environment problem; it has nothing to do with your change. Sometimes if you relaunch it, it can run.

@maysamyabandeh
Contributor Author

Thanks @siying. Rerunning tsan helped.

@maysamyabandeh
Contributor Author

Here are the final improvements when using pinnable slice:

value 100 byte: -2% regular, 4% merge values
value 1k byte: 14% regular, 10% merge values
value 10k byte: 34% regular, 35% merge values

Since using string is still implemented via pinnable slice underneath, the lower throughput in the case of non-merge 100-byte values must be experimental error and can be ignored.

Here are the benchmark details:

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=1000 -compression_type=none
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=10000 -compression_type=none
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=100 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=1000 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=10000 -compression_type=none --merge_keys=100000 -merge_operator=max

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration=60 -pin_slice=true 2>&1 | tee scanread-10m-100-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration=60 -pin_slice=false 2>&1 | tee scanread-10m-100-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt4


TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-1k-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-1k-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt4

TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-10k-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-10k-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt4

@facebook-github-bot
Contributor

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

* to avoid memcpy by having the PinnableSlice object referring to the data
* that is locked in memory and release them after the data is consumed.
*/
class PinnableSlice : public Slice, public Cleanable {

facebook-github-bot pushed a commit that referenced this pull request Mar 31, 2017
Summary:
some fbcode services override it, we need to keep it virtual.

original change: #1756
Closes #2065

Differential Revision: D4808123

Pulled By: ajkr

fbshipit-source-id: 5eaeea7