Occasional performance drop on MongoRocks #3198

Closed
SunguckLee opened this issue Nov 23, 2017 · 26 comments

@SunguckLee
Contributor

Expected behavior

I have prepared MongoDB 3.4.10 + RocksDB 5.9
and ran the sysbench (https://github.com/Percona-Lab/sysbench/tree/dev-mongodb-support-1.0) prepare step for the test with these options:

./sysbench/sysbench \
    --mongo-write-concern=1 \
    --mongo-url="mongodb://test-mongodb-server" \
    --mongo-database-name=sbtest \
    --test=sysbench/tests/mongodb/oltp.lua \
    --oltp_table_size=500000000 \
    --oltp_tables_count=16 \
    --num-threads=16 \
    --rand-type=pareto \
    --report-interval=10 \
    --max-requests=0 \
    --max-time=600 \
    --oltp-point-selects=10 \
    --oltp-simple-ranges=1 \
    --oltp-sum-ranges=1 \
    --oltp-order-ranges=1 \
    --oltp-distinct-ranges=1 \
    --oltp-index-updates=1 \
    --oltp-non-index-updates=1 \
    --oltp-inserts=1 prepare

I expected stable insert throughput, because there are no queries (finds); only a single thread inserts documents into the test collections (sbtest).

Actual behavior

For the first few hours everything is peaceful (thanks to L0 -> L0 compaction).
But after about 0.4 billion documents have been inserted into sbtest1 (the first collection), insert performance drops drastically: from a normal rate of about 30K inserts/second down to 2K/sec. And once this happens, the low throughput lasts for about 1~2 minutes.

insert query update delete .. vsize   res qrw arw net_in net_out conn ..
 29877    *0     *0     *0 .. 8.35G 6.90G 0|0 0|2  6.70m   57.7k    3 ..
 30081    *0     *0     *0 .. 8.35G 6.90G 0|0 0|1  6.73m   58.2k    3 ..
 30023    *0     *0     *0 .. 8.35G 6.90G 0|0 0|1  6.73m   58.0k    3 ..
 27982    *0     *0     *0 .. 8.35G 6.90G 0|0 0|0  6.27m   57.1k    3 ..
 30020    *0     *0     *0 .. 8.35G 6.90G 0|0 0|0  6.73m   57.5k    3 ..
 29394    *0     *0     *0 .. 8.35G 6.90G 0|0 0|1  6.71m   57.3k    3 ..
  5543    *0     *0     *0 .. 8.35G 6.90G 0|0 0|2  1.23m   49.0k    3 ..
  1594    *0     *0     *0 .. 8.35G 6.89G 0|0 0|1   461k   54.1k    3 ..
  1532    *0     *0     *0 .. 8.35G 6.89G 0|0 0|1   218k   51.1k    3 ..
  1459    *0     *0     *0 .. 8.35G 6.89G 0|0 0|1   459k   54.0k    3 ..

During this performance drop, I ran the Linux profiler and pstack against the mongod process,
and I found something interesting: during the drop, "rocksdb::Version::AddLiveFiles" took a lot of CPU time. I have not checked the source code yet, but this is not what I expected (and probably not what you expected either). In any case, I am worried about the drop from 30K inserts/sec to 2K/sec.

Profile report(During performance drop) : https://gist.github.com/SunguckLee/e68a02981f6edece6214b32db8b37513#file-perf_report_slow-txt
Profile report(During normal 30K insert) : https://gist.github.com/SunguckLee/9fd03e1397eb2446e858e024200cc2ec#file-perf_report_normal-txt
Stacktrace : https://gist.github.com/SunguckLee/a0fb86b182c259360450b8528433a8f0#file-stacetace-txt

Steps to reproduce the behavior

Build rocksdb with "USE_RTTI=1 CFLAGS=-fPIC make static_lib".
Build mongodb with "scons CPPPATH="/mongodb-rocksdb-3.4/rocksdb/include" LIBPATH="/mongodb-rocksdb-3.4/rocksdb" LIBS="lz4 zstd" -j5 mongod mongo".

And run mongod with the mongod configuration below.

...
storage:
    syncPeriodSecs: 60
    dbPath: /data/mongodb
    journal:
        enabled: true
        commitIntervalMs: 100

    engine: rocksdb
    rocksdb:
        cacheSizeGB: 31
        compression: lz4

And run the sysbench prepare command.

I have built MongoRocks with devtoolset-6 and tested it on ...

Linux mongo-test-server 3.10.0-327.36.1.el7.x86_64 #1 SMP Sun Sep 18 13:04:29 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

32 core, 64GB Memory, NVME SSD
@SunguckLee
Contributor Author

And this is the RocksDB storage engine status during the performance drop.

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      5/0   249.16 MB   1.2    121.6     0.0    121.6     398.3    276.7       0.0   1.4     53.2    174.1      2342      6115    0.383   1195M   3797
  L3     13/0   792.65 MB   1.6    441.1   193.0    248.1     440.9    192.8       0.0   2.3    158.6    158.5      2848       497    5.731   3783M   6242
  L4     27/0    1.46 GB   1.0    617.9   264.9    353.0     566.6    213.6       0.0   2.1    106.5     97.6      5943      2029    2.929   6215M  3042K
  L5    253/2   14.71 GB   1.0    354.9   134.7    220.2     334.4    114.2      87.4   2.5     82.1     77.4      4426      1378    3.212   5559M  9209K
  L6   2362/10  145.24 GB   0.0    504.9   187.3    317.6     462.7    145.1       0.1   2.5     26.6     24.4     19455      1122   17.339     53G   553M
 Sum   2660/12  162.43 GB   0.0   2040.4   779.9   1260.5    2202.8    942.4      87.5   8.0     59.7     64.4     35014     11141    3.143     70G   565M
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0
Uptime(secs): 69063.0 total, 0.0 interval
Flush(GB): cumulative 276.799, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 2202.85 GB write, 32.66 MB/s write, 2040.42 GB read, 30.25 MB/s read, 35014.4 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 211 level0_slowdown, 176 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
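
As a side note for anyone reproducing this: the block above is the report RocksDB exposes through its stats property. A minimal sketch of dumping the same report directly from an open rocksdb::DB handle, assuming you have access to the handle the storage engine opened:

#include <iostream>
#include <string>
#include "rocksdb/db.h"

// Dump the cumulative compaction/stall statistics shown above.
// `db` is an already-open rocksdb::DB* (illustrative only).
void DumpCompactionStats(rocksdb::DB* db) {
  std::string stats;
  if (db->GetProperty("rocksdb.stats", &stats)) {
    std::cout << stats << std::endl;  // per-level compaction stats plus stall counts
  }
}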

@mdcallag
Contributor

Thanks for the detailed report. I reshared it at
https://groups.google.com/forum/#!forum/mongo-rocks

There are compaction stalls. I am not sure whether that is the issue here. lz4 compression for L1, L2, and maybe L3 will make the stalls worse compared to not using compression for L0/L1/L2.
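
For reference, per-level compression is controlled by compression_per_level; a sketch of leaving the upper levels uncompressed while keeping lz4 below, with illustrative values rather than the actual MongoRocks defaults:

#include "rocksdb/options.h"

// Sketch: skip compression on L0-L2 so flushes and early compactions stay
// cheap, keep lz4 for the larger, colder levels (illustrative values).
rocksdb::Options MakeCompressionOptions() {
  rocksdb::Options options;
  options.compression_per_level = {
      rocksdb::kNoCompression,    // L0
      rocksdb::kNoCompression,    // L1
      rocksdb::kNoCompression,    // L2
      rocksdb::kLZ4Compression,   // L3
      rocksdb::kLZ4Compression,   // L4
      rocksdb::kLZ4Compression,   // L5
      rocksdb::kLZ4Compression};  // L6
  return options;
}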

Internally a query is done to confirm that the PK is unique, but that should be in-memory only because the inserts are not in random order with respect to the PK value.

I am curious about what the user thread is doing when processing inserts. Does the overhead and work from the background compaction threads make it stall? My insert-heavy workload has been the insert benchmark, but I didn't pay enough attention to stalls and p99 insert rates to state whether or not it was a problem for me.

Back in the day, write rates of 1000/second were a strong signal that write throttling was in progress. The code has changed and perhaps 2000/second means the same thing today. If that is the problem, then there are my.cnf changes you can make.

Do you have thread stacks that show where user threads are doing work or blocked in mongod? In the trace you provided I only saw background threads. In this case, there would be at most one thread busy for the user because prepare is single threaded.

@igorcanadi
Collaborator

Thanks for the great report! AddLiveFiles using a lot of CPU is definitely not expected, but it is happening in the background thread, so it might not be the cause of the slowdown. Do you have data on total CPU usage during fast and slow periods? Are all CPU cores at 100% during the slow period?

@SunguckLee
Contributor Author

Thanks for your attention.

Sorry, I had forgotten to check the RocksDB LOG.
After filing this issue, I checked the RocksDB LOG and found these L0 compaction stall messages:

2017/11/26-20:20:52.547477 7fa68402f700 [WARN] [db/column_family.cc:704] [default] Stalling writes because we have 20 level-0 files rate 7922813
2017/11/26-20:20:52.874001 7fa68282c700 [WARN] [db/column_family.cc:704] [default] Stalling writes because we have 20 level-0 files rate 9903516
2017/11/26-20:20:54.295086 7fa686033700 [WARN] [db/column_family.cc:704] [default] Stalling writes because we have 20 level-0 files rate 12379395

I have also been evaluating the AddLiveFiles method, but there is nothing buggy there. As @mdcallag mentioned, this is a compaction issue; the problem is periodic bursts of overly frequent compactions.
Whenever insert performance drops, these stall messages are logged in the RocksDB LOG.

@mdcallag I am testing MongoRocks, not MyRocks. And there is only a single insert thread, no query threads. I have increased the memtable size to 128MB and the max compaction threads to 20; the insert stall periods are shorter than before, but stalls still happen.
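
In RocksDB terms, that tuning corresponds roughly to the options below; this is a sketch with illustrative values, not my exact configuration:

#include "rocksdb/options.h"

// Sketch of the tuning described above: larger memtables and more
// background compaction threads (illustrative values).
rocksdb::Options TunedOptions() {
  rocksdb::Options options;
  options.write_buffer_size = 128 << 20;        // 128MB memtable instead of the 64MB default
  options.max_background_compactions = 20;      // more parallel compaction threads
  options.level0_slowdown_writes_trigger = 20;  // the threshold seen in the stall messages above
  return options;
}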

@igorcanadi No, there is only a single insert thread, so CPU usage is not very high. With the default options (max compaction threads = 8, memtable size 64MB), CPU usage is only 11%.

These are the sysbench collection's indexes and a sample document:

shard-01:PRIMARY> db.sbtest1.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "sbtest.sbtest1"
	},
	{
		"v" : 1,
		"key" : {
			"k" : 1
		},
		"name" : "k_1",
		"ns" : "sbtest.sbtest1"
	}
]
shard-01:PRIMARY> db.sbtest1.findOne()
{
	"_id" : 1,
	"k" : 1,
	"c" : " 08566691263-11954212384-49965520504-76648806909-97649256466-70808089694-75700327046-71097693010-15164106634-83838565200",
	"pad" : " 66118511869-22681140063-09693510518-99105097351-49282964876"
}

I think there are not many indexes and the documents are smaller than in a normal use case, so I expected no compaction stalls (because there is L0 -> L0 compaction).

@mdcallag @igorcanadi One more question about the MongoRocks default options.
The MongoRocks default target_file_size_multiplier is 1, and target_file_size_base (and write_buffer_size) is 64MB. Are these defaults recommended for a normal use case? I think these defaults produce many SST files, and I am worried that many SST files force RocksDB into more frequent compactions.

@igorcanadi
Collaborator

@mdcallag @igorcanadi One more question about the MongoRocks default options.
The MongoRocks default target_file_size_multiplier is 1, and target_file_size_base (and write_buffer_size) is 64MB. Are these defaults recommended for a normal use case? I think these defaults produce many SST files, and I am worried that many SST files force RocksDB into more frequent compactions.

These settings will indeed cause a large number of files for bigger databases, but in practice we have not seen many problems related to a large number of files. The biggest benefit of smaller files is that they increase the likelihood of parallel compactions. The index and filter blocks are also smaller, which is generally a good thing. However, feel free to increase the file size, especially if your database is on the bigger side (500GB-1TB).
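
For larger databases, increasing the SST file size is just a couple of options; a sketch with illustrative values, not a recommendation for any particular workload:

#include "rocksdb/options.h"

// Sketch: larger SST files for a bigger database (illustrative values).
rocksdb::Options LargerFileOptions() {
  rocksdb::Options options;
  options.write_buffer_size = 128 << 20;      // flushed memtables become ~128MB L0 files
  options.target_file_size_base = 128 << 20;  // base SST size; the MongoRocks default is 64MB
  options.target_file_size_multiplier = 1;    // keep the same target file size on every level
  return options;
}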

@SunguckLee
Contributor Author

@mdcallag @igorcanadi

You must have thought about this already:
Are there any side effects or problems with moving L0->L0 compaction to the Priority::HIGH thread pool, or to a dedicated thread?

Of course, this may not be the best solution.
But L0->L0 compaction does not depend on other inter-level compactions,
so if we move L0->L0 compaction to the same priority as memtable flushes, can we avoid the level0_slowdown_writes_trigger threshold?
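
For context, the HIGH-priority pool is the one memtable flushes run in today; sizing the two pools looks roughly like this. This is only a sketch of the existing knobs, not a proposal for the actual L0->L0 change:

#include "rocksdb/env.h"
#include "rocksdb/options.h"

// Sketch: the HIGH pool services flushes, the LOW pool services compactions.
// Moving L0->L0 compaction to HIGH priority would mean scheduling it on the
// first pool below (illustrative sizing).
void SizeThreadPools(rocksdb::Options* options) {
  rocksdb::Env* env = rocksdb::Env::Default();
  env->SetBackgroundThreads(2, rocksdb::Env::HIGH);  // flush threads
  env->SetBackgroundThreads(8, rocksdb::Env::LOW);   // compaction threads
  options->env = env;
  options->max_background_flushes = 2;
  options->max_background_compactions = 8;
}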

@igorcanadi
Collaborator

A dedicated thread for L0->L0 compactions (or even L0->L1) sounds like a good idea. It would be great to construct an experiment to see if it speeds up the ingest.

@SunguckLee
Contributor Author

Hi @igorcanadi @mdcallag

After examining this insert throughput drop, I have found a possible cause.

MongoRocks calls manual compaction periodically, and this prevents parallel compaction.
In that case, DBImpl::GetBGJobLimits returns max_compactions==1 because the parallelize_compactions parameter is false.
(https://github.com/facebook/rocksdb/blob/v5.9.2/db/db_impl_compaction_flush.cc#L1101-L1104)

It also looks like all automatic compactions are blocked while a manual compaction is running.
So L0-->L0 and L0-->L1 compactions are blocked, and eventually the number of L0 files hits the level0_slowdown_writes_trigger threshold.

I am not sure about this scenario (I don't know the RocksDB and MongoRocks code well).

Here is what I am curious about:

  1. RocksDB has its own automatic compaction, so why does MongoRocks run manual compaction periodically? (I have already posted a question about this: MongoDB Insert slow down after drop big collection. mongodb-partners/mongo-rocks#130)
  2. Why can't manual compaction and automatic level compaction run concurrently?
    RocksDB already has a conflict-check routine for each compaction, so I think there would be no issue if a manual and an automatic level compaction ran together.

@igorcanadi
Collaborator

@SunguckLee Great findings!

  1. MongoRocks runs compaction periodically to clean up dropped data. RocksDB exposes the concept of manual compaction to give applications more freedom in choosing which data they want to compact.
  2. This is an interesting question. MongoRocks does set exclusive_manual_compaction = false, so it should run in parallel with other compactions: https://github.com/mongodb-partners/mongo-rocks/blob/master/src/rocks_compaction_scheduler.cpp#L269. @ajkr is that true?
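
In API terms, a non-exclusive manual compaction like the one MongoRocks issues looks roughly like this; a sketch based on the public RocksDB API, not the actual rocks_compaction_scheduler.cpp code:

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Sketch: manually compact a dropped collection's key range without
// making the manual compaction exclusive (illustrative only).
rocksdb::Status CompactDroppedRange(rocksdb::DB* db,
                                    const rocksdb::Slice& begin,
                                    const rocksdb::Slice& end) {
  rocksdb::CompactRangeOptions options;
  options.exclusive_manual_compaction = false;  // let automatic compactions run alongside
  return db->CompactRange(options, &begin, &end);
}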

@SunguckLee
Contributor Author

Thanks @igorcanadi

Then what happens if I skip the manual compaction (rangeCompact) in mongo-rocks?
Will the removed key range (the dropped collection) survive forever in the RocksDB data files?

@igorcanadi
Collaborator

Will the removed key range (the dropped collection) survive forever in the RocksDB data files?

No, it will just be cleaned up later, when compaction decides to compact that range.

@SunguckLee
Contributor Author

SunguckLee commented Jan 12, 2018

@igorcanadi

I was wrong; exclusive manual compaction is not the cause of this performance drop.

It looks like an auto-scheduled compaction cannot run while a manual compaction (neither in-progress nor done) is in the queue. The auto-scheduled compaction cannot do its job and is put off until later.
(https://github.com/facebook/rocksdb/blob/v5.9.2/db/db_impl_compaction_flush.cc#L1506-L1513)

But BackgroundCallCompaction() schedules a new compaction right after BackgroundCompaction() returns.
(https://github.com/facebook/rocksdb/blob/v5.9.2/db/db_impl_compaction_flush.cc#L1414)

This loop is very frequent and takes the DBImpl::mutex_ lock. Because of it, DBImpl::mutex_ becomes very hot and user document inserts are eventually delayed.
Worse, this loop (MaybeScheduleFlushOrCompaction --> BackgroundCallCompaction --> BackgroundCompaction) sometimes lasts 5~6 minutes,
and during that period the loop executes about 28K times per second.

About this code block (https://github.com/facebook/rocksdb/blob/v5.9.2/db/db_impl_compaction_flush.cc#L1506-L1513):

  1. Is this check (HaveManualCompaction) mandatory, or does it just prevent a conflict (one that is not so critical)?
  2. Why can't an auto-scheduled compaction run while a manual compaction is in the queue (neither in-progress nor done)?
    I think RocksDB already checks for conflicts while picking compaction inputs.

What do you think about removing the HaveManualCompaction() check block,
or changing HaveManualCompaction() to HasExclusiveManualCompaction()?

@ajkr
Contributor

ajkr commented Jan 12, 2018

Agreed, I don't know why that check is there, as we mark the files as "being compacted" when a manual compaction is picked, so subsequent automatic compactions can't step on its toes. Want to try changing it to check for exclusive manual compaction and see if anything breaks?

@igorcanadi
Collaborator

Cool, looks like this might fix it!

@ajkr seems like the loop causing the mutex to go hot would still be a problem when running exclusive manual compaction. Is that something worth fixing, too?

@ajkr
Contributor

ajkr commented Jan 12, 2018

That's true. Maybe we can keep a list of column families that reached the front of the compaction queue already but were rejected due to exclusive manual. Scheduling will first check whether any of these CFs have finished their exclusive manual before trying the regular compaction queue.

@ajkr
Contributor

ajkr commented Jan 12, 2018

It's pretty tricky though. Retiring the exclusive manual compaction option would be my preference.

@SunguckLee
Contributor Author

SunguckLee commented Jan 13, 2018

@ajkr @igorcanadi

That is not an exclusive manual compaction.
The looping is caused by this check: https://github.com/facebook/rocksdb/blob/v5.9.2/db/db_impl_compaction_flush.cc#L1849

And I have changed the code at https://github.com/facebook/rocksdb/blob/v5.9.2/db/db_impl_compaction_flush.cc#L1506

if (HaveManualCompaction(compaction_queue_.front())) {

==>

if (HaveExclusiveManualCompaction()) {

After this change, the slowdowns are gone and insert performance is really stable.

I am not sure whether this change is safe (no harm to data consistency or anything else).
What do you think of this change?

@igorcanadi
Collaborator

@SunguckLee I think @ajkr said above that this fix should be okay. Would you mind sending a PR?

That's true. Maybe we can keep a list of column families that reached the front of the compaction queue already but were rejected due to exclusive manual. Scheduling will first check whether any of these CFs have finished their exclusive manual before trying the regular compaction queue.

Yeah, or remove them from the compaction queue and make it the manual compaction's responsibility to put them back in the queue once it is done. It should be a small code change, but it will probably need to be built carefully.

@ajkr
Contributor

ajkr commented Jan 15, 2018

@igorcanadi My concern for that approach was a fairness issue where CFs that reach the front of the queue during exclusive manual compaction are penalized. Those CFs would have to wait in the queue again after the manual compaction finished, whereas CFs that hadn't yet reached the front would get no additional wait.

Thinking about it more, a similar fairness issue already existed, since we put CFs at the back of the queue when they're popped during an exclusive manual compaction, which meant that reaching the front earlier could result in a longer wait. So I am OK with the solution you suggested. We could also address the fairness issue by inserting at the front of the compaction queue at the end of the exclusive manual compaction if we know the CF has already reached the front but was rejected due to a conflict.

SunguckLee added a commit to SunguckLee/rocksdb that referenced this issue Jan 16, 2018
If there's manual compaction in the queue, then "HaveManualCompaction(compaction_queue_.front())" will return true, and this cause too frequent MaybeScheduleFlushOrCompaction().

facebook#3198
@SunguckLee
Contributor Author

I have made a pull request with a simple patch for the DBImpl::mutex_ contention:

#3366

facebook-github-bot pushed a commit that referenced this issue Jan 16, 2018
Summary:
If there's manual compaction in the queue, then "HaveManualCompaction(compaction_queue_.front())" will return true, and this cause too frequent MaybeScheduleFlushOrCompaction().

#3198
Closes #3366

Differential Revision: D6729575

Pulled By: ajkr

fbshipit-source-id: 96da04f8fd33297b1ccaec3badd9090403da29b0
@igorcanadi
Collaborator

Hey @ajkr , there were a lot of commits and reverts flying around for this issue, so I'm a bit confused. :) Is this finally resolved?

amytai pushed a commit to amytai/rocksdb that referenced this issue Mar 9, 2018
Summary:
If there's manual compaction in the queue, then "HaveManualCompaction(compaction_queue_.front())" will return true, and this cause too frequent MaybeScheduleFlushOrCompaction().

facebook#3198
Closes facebook#3366

Differential Revision: D6729575

Pulled By: ajkr

fbshipit-source-id: 96da04f8fd33297b1ccaec3badd9090403da29b0
@elirevach12

Hi,
Which version of RocksDB includes this fix?

@ajkr
Contributor

ajkr commented May 30, 2018

EDIT: deleted this comment as it was wrong

@ajkr
Contributor

ajkr commented May 30, 2018

It's fixed in 5.12 and later.

@elirevach12

@ajkr, the most recent stable version I was able to use is 5.9.3. All versions above it, including 5.12, fail to sync normally with their primary and fail with the error "[replExecDBWorker-0] Failed to load timestamp of most recently applied operation: NoMatchingDocument: Did not find any entries in local.oplog.rs".

By any chance, can you create a new branch such as 5.9.4 to include this change?

@ajkr
Contributor

ajkr commented Jul 12, 2018

I'd prefer to fix the issue preventing upgrade to 5.12 instead of backporting non-critical fixes to arbitrary old versions. Want to report the issue you see with upgrading?

@ajkr ajkr closed this as completed Jul 12, 2018