Pinnableslice (2nd attempt) #1756

Closed · wants to merge 16 commits into base: master

Conversation

@maysamyabandeh
Contributor

maysamyabandeh commented Jan 8, 2017

PinnableSlice

Summary:
Currently, point lookup values are copied into a string provided by the user, which incurs an extra memcpy. This patch allows doing point lookups via a PinnableSlice, which pins the source memory location (instead of copying its content) and releases it after the content has been consumed by the user. The old Get(string) API is implemented on top of the new API underneath.
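
For illustration, here is a minimal usage sketch of the PinnableSlice-based lookup; LookupExample and Consume are made-up names, error handling is omitted, and the overload signatures assumed are the ones added by this patch:

#include <cstddef>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/slice.h"

void Consume(const char* data, std::size_t size);  // hypothetical user code

void LookupExample(rocksdb::DB* db) {
  rocksdb::PinnableSlice pinned;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(),
                              db->DefaultColumnFamily(), "key1", &pinned);
  if (s.ok()) {
    // The bytes stay pinned (e.g. in a block cache block) until `pinned` is
    // destroyed or Reset() is called, so no memcpy into a std::string.
    Consume(pinned.data(), pinned.size());
  }

  // The old API keeps working; it is translated to the pinning path inside.
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);
}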

Here is the summary of improvements:

value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values

The improvement for merge could be larger if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have left that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from t5452014 to this patch while still using the old Get(string) API. The difference is small and could be noise. More importantly, it is canceled out when the user does use the new PinnableSlice API. Here is the summary:

value 100 byte: +0.5% regular, -2.4% merge values
value 1k byte: -1.8% regular, -0.5% merge values
value 10k byte: -1.5% regular, -2.15% merge values

Benchmark Details:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=1000 -compression_type=none
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=10000 -compression_type=none
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=100 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=1000 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=10000 -compression_type=none --merge_keys=100000 -merge_operator=max

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-100-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-100-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=true 2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt

TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-1k-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-1k-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=true 2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt

TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-10k-mergenon-pslice.txt
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-10k-mergenon-nopslice.txt

TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=true 2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]" --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=120 -pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt


Benchmark Results:
ls -tr | grep ".*-10m-(100|1k|10k)-merge...-(no|)pslice.txt$" | xargs -L 1 grep AVG /dev/null
scanread-10m-100-mergenon-pslice.txt:readrandom [AVG 5 runs] : 3005915 ops/sec; 210.6 MB/sec
scanread-10m-100-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 2953754 ops/sec; 207.0 MB/sec
scanreadmerge-10m-100-mergemax-pslice.txt:readrandom [AVG 5 runs] : 766150 ops/sec; 8.5 MB/sec
scanreadmerge-10m-100-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 757289 ops/sec; 8.4 MB/sec
scanread-10m-1k-mergenon-pslice.txt:readrandom [AVG 5 runs] : 5965694 ops/sec; 3661.5 MB/sec
scanread-10m-1k-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 5350749 ops/sec; 3284.1 MB/sec
scanreadmerge-10m-1k-mergemax-pslice.txt:readrandom [AVG 5 runs] : 36379493 ops/sec; 3524.9 MB/sec
scanreadmerge-10m-1k-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 33825001 ops/sec; 3277.4 MB/sec
scanread-10m-10k-mergenon-pslice.txt:readrandom [AVG 5 runs] : 3127471 ops/sec; 18923.0 MB/sec
scanread-10m-10k-mergenon-nopslice.txt:readrandom [AVG 5 runs] : 2474603 ops/sec; 14972.8 MB/sec
scanreadmerge-10m-10k-mergemax-pslice.txt:readrandom [AVG 5 runs] : 29406680 ops/sec; 28090.5 MB/sec
scanreadmerge-10m-10k-mergemax-nopslice.txt:readrandom [AVG 5 runs] : 22828258 ops/sec; 21806.6 MB/sec

@facebook-github-bot

facebook-github-bot commented Jan 8, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

facebook-github-bot commented Jan 9, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@IslamAbdelRahman

Thanks Maysam, some initial comments

(Inline review comments, now outdated, on db/db_impl.cc, db/memtable.cc, db/memtable_list.cc, include/rocksdb/slice.h, and table/get_context.cc.)

@facebook-github-bot

facebook-github-bot commented Feb 7, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor

maysamyabandeh commented Feb 7, 2017

There are 3 failures in sandcastle, all of which seem irrelevant:

util/env_test.cc:700: Failure
writable_file->PositionedAppend(data_b, kBlockSize)

db/db_universal_compaction_test.cc:1068: Failure
Value of: 0
Expected: non_trivial_move
Which is: 1
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): db/db_universal_compaction_test.cc:1068: Failure

In file included from db/compaction_job_test.cc:8:
In file included from libgcc/d6e0a7da6faba45f5e5b1638f9edd7afc2f34e7d/4.9.x/gcc-4.9-glibc-2.20/024dbc3/include/c++/4.9.x/algorithm:61:
libgcc/d6e0a7da6faba45f5e5b1638f9edd7afc2f34e7d/4.9.x/gcc-4.9-glibc-2.20/024dbc3/include/c++/4.9.x/bits/stl_algobase.h:199:15: warning: The left operand of '<' is a garbage value
if (__b < __a)
~~~ ^

@IslamAbdelRahman do you think it is ready to land?

(Further inline review comments, now outdated, on tools/db_bench_tool.cc, include/rocksdb/slice.h, db/db_impl.cc, db/memtable.cc, and include/rocksdb/db.h.)

maysamyabandeh added some commits Dec 23, 2016

PinnableSlice
Summary: identical to the pull request description above.

Skip memtable Ref-ing for PinnableSlice
Summary:
MemTable Ref/Unref must be done under a shared lock. Since the current usages do it under db_mutex, we would have to do that too. But this would likely result in performance regressions and cancel out all the improvements of using a thread-local SuperVersion to avoid exactly the same bottleneck.

It is still possible to think of solutions that keep track of the memtable ref count in the thread-local SuperVersion and update the MemTable only when we are refreshing the thread-local sv (and hence holding db_mutex). We would need a similar solution to batch the Unrefs invoked by PinnableSlice release in a thread-local data structure and apply them only periodically, or when we happen to have db_mutex locked (like when refreshing the cached SuperVersion).

Such solutions, however, add non-negligible complexity (and probably bugs) to the code base. At this point there are already benefits from using PinnableSlice for the block cache, and it does not have to be extended to the MemTable.

Remove NotImplemented overrides from DocumentDBImpl
Summary:
DocumentDBImpl inherits Get from DocumentDB, which inherits it from StackableDB. In the code, however, these methods are overridden to return a NotImplemented status, and yet DocumentDB::Get is still called when it is required. There is no apparent rationale behind that.

We need to remove those overrides since, with Get inlined in db.h, it will eventually call DocumentDBImpl::Get through polymorphism.
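
For context on the inlined Get mentioned above, here is a rough free-function sketch of the forwarding idea from the string-based lookup to the PinnableSlice-based one; the actual inlined overload in db.h may differ in details, and GetIntoString is a made-up name:

#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/slice.h"

// Look up via the pinning API and copy into the caller's string only when the
// result ended up pinned somewhere other than the wrapped string itself.
rocksdb::Status GetIntoString(rocksdb::DB* db,
                              const rocksdb::ReadOptions& options,
                              const rocksdb::Slice& key, std::string* value) {
  assert(value != nullptr);
  rocksdb::PinnableSlice pinnable_val(value);  // may write straight into *value
  rocksdb::Status s =
      db->Get(options, db->DefaultColumnFamily(), key, &pinnable_val);
  if (s.ok() && pinnable_val.IsPinned()) {
    value->assign(pinnable_val.data(), pinnable_val.size());
  }  // otherwise the result was already placed in *value
  return s;
}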

@maysamyabandeh
Contributor

maysamyabandeh commented Mar 10, 2017

@IslamAbdelRahman I ran the latest patch (with std::move) against the benchmark that you suggested. It shows 7.7% lower throughput compared to master.

[?0] myabandeh@dev15089:~/rocksdb[master)]$ cat memtable.sh
#N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true -duration 60 2>&1 | tee pslice-memtable.txt
#N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false -duration 60 2>&1 | tee nopslice-memtable.txt
# checkout master
N=10000; TEST_TMPDIR=/dev/shm/memtable/ ./db_bench --benchmarks="fillseq,readrandom[X5],stats"  --num=$N --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration 60 2>&1 | tee string-memtable.txt
[?0 0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG string-memtable.txt
readrandom [AVG    5 runs] : 15538647 ops/sec; 1719.0 MB/sec
[?0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG nopslice-memtable.txt
readrandom [AVG    5 runs] : 14362151 ops/sec; 1588.8 MB/sec
[?0] myabandeh@dev15089:~/rocksdb[master)]$ grep AVG pslice-memtable.txt
readrandom [AVG    5 runs] : 15560370 ops/sec; 1721.4 MB/sec

@maysamyabandeh
Contributor

maysamyabandeh commented Mar 10, 2017

It turns out that the bottleneck is constructing the PinnableSlice object (on the stack), which also has a string member. After making it thread_local, the performance of -pin_slice=false improved to:
15160104 ops/sec; 1677.1 MB/s
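
For illustration, a sketch of that kind of reuse: construct the PinnableSlice once per thread and Reset() it between lookups instead of constructing a fresh object (and its internal string) on every Get. ReadLoop and the key vector are made-up; only the reuse pattern is the point.

#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/slice.h"

void ReadLoop(rocksdb::DB* db, const std::vector<std::string>& keys) {
  // One object per thread; Reset() releases whatever was pinned by the
  // previous lookup while keeping the object (and its string buffer) alive.
  static thread_local rocksdb::PinnableSlice pinned;
  for (const auto& key : keys) {
    pinned.Reset();
    rocksdb::Status s = db->Get(rocksdb::ReadOptions(),
                                db->DefaultColumnFamily(), key, &pinned);
    if (s.ok()) {
      // use pinned.data() / pinned.size()
    }
  }
}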

@facebook-github-bot

facebook-github-bot commented Mar 10, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

facebook-github-bot commented Mar 11, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@maysamyabandeh
Contributor

maysamyabandeh commented Mar 13, 2017

FYI, the problem with std::string was that each assign call allocates a new char* buffer on the heap. By reusing a string, it first tries to reuse the existing heap space and hence avoids a new allocation.
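
To illustrate the point with plain C++ (not RocksDB-specific; CopyResults and values are made-up): a freshly constructed string has no capacity to reuse, so each assign allocates, while a string reused across iterations keeps its heap buffer and lets assign() reuse it when the new value fits.

#include <string>
#include <vector>

void CopyResults(const std::vector<std::string>& values) {
  // Allocates on (almost) every iteration: the brand-new string starts empty.
  for (const auto& v : values) {
    std::string fresh;
    fresh.assign(v.data(), v.size());
  }

  // Reuses the buffer: once `reused` has grown large enough, later assign()
  // calls that fit within its capacity avoid a new heap allocation.
  std::string reused;
  for (const auto& v : values) {
    reused.assign(v.data(), v.size());
  }
}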

@maysamyabandeh
Contributor

maysamyabandeh commented Mar 13, 2017

I see two errors in Phabricator:

1. tsan:
FATAL: ThreadSanitizer can not mmap the shadow memory (something is mapped at 0x55bdc3d63000 < 0x7cf000000000)
FATAL: Make sure to compile with -fPIE and to link with -pie.
2. db_sst_test-DBSSTTest.DeleteObsoleteFilesPendingOutputs when using the 4.8.1 compiler.

The former does not seem relevant and the latter seems flaky.

@IslamAbdelRahman what do you think? Ready to land?

@siying
Contributor

siying commented Mar 13, 2017

I frequently see TSAN failures like this. It is an environment problem and has nothing to do with your change. Sometimes if you relaunch it, it can run.

@maysamyabandeh
Contributor

maysamyabandeh commented Mar 13, 2017

Thanks @siying. Rerunning tsan helped.

@maysamyabandeh
Contributor

maysamyabandeh commented Mar 13, 2017

Here are the final improvements when using pinnable slice:

value 100 byte: -2% regular, 4% merge values
value 1k byte: 14% regular, 10% merge values
value 10k byte: 34% regular, 35% merge values

Since the string-based Get is still implemented via PinnableSlice underneath, the lower throughput in the non-merge 100-byte case must be experimental noise and can be ignored.

Here are the benchmark details:

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=1000 -compression_type=none
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=10000 -compression_type=none
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=100 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=1000 -compression_type=none --merge_keys=100000 -merge_operator=max
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks=mergerandom --num=1000000 -value_size=10000 -compression_type=none --merge_keys=100000 -merge_operator=max

TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration=60 -pin_slice=true 2>&1 | tee scanread-10m-100-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -duration=60 -pin_slice=false 2>&1 | tee scanread-10m-100-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-100-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v100nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-100-mergemax-nopslice.txt4


TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-1k-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v1000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-1k-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-1k-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v1000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-1k-mergemax-nopslice.txt4

TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=true 2>&1 | tee scanread-10m-10k-mergenon-pslice.txt4
TEST_TMPDIR=/dev/shm/v10000nocomp/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -pin_slice=false 2>&1 | tee scanread-10m-10k-mergenon-nopslice.txt4

TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=true 2>&1 | tee scanreadmerge-10m-10k-mergemax-pslice.txt4
TEST_TMPDIR=/dev/shm/v10000nocomp-merge/ ./db_bench --benchmarks="readseq,readrandom[X5]"  --use_existing_db --num=1000000 --reads=10000000 --cache_size=10000000000 -threads=32 -compression_type=none -merge_operator=max -duration=60 -pin_slice=false 2>&1 | tee scanreadmerge-10m-10k-mergemax-nopslice.txt4

@facebook-github-bot

facebook-github-bot commented Mar 13, 2017

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

* to avoid memcpy by having the PinnableSlice object refer to the data
* that is locked in memory, and release it after the data is consumed.
*/
class PinnableSlice : public Slice, public Cleanable {

facebook-github-bot added a commit that referenced this pull request Mar 31, 2017

make all DB::Get overloads virtual
Summary:
some fbcode services override it, so we need to keep it virtual.

original change: #1756
Closes #2065

Differential Revision: D4808123

Pulled By: ajkr

fbshipit-source-id: 5eaeea7