Wip writeback throttling for cache tiering #4792

Merged
merged 6 commits into ceph:master on Jun 23, 2015

Conversation

dragonylffly (Contributor)

This patch does writeback throttling for cache tiering, similar to what the Linux kernel does for page cache writeback. The motivation and original idea were proposed by Nick Fisk, detailed in his email below. In our implementation, we introduce a parameter 'cache_target_dirty_high_ratio' (default 0.6) as the high-speed threshold, while leaving 'cache_target_dirty_ratio' (default 0.4) to represent the low-speed threshold. We control the flush speed by limiting the parallelism of flushing: the maximum parallelism under low speed is half of the parallelism under high speed. If at least one PG has a dirty ratio beyond the high threshold, full-speed mode is entered; if no PG has a dirty ratio beyond the low threshold, idle mode is entered; otherwise, slow-speed mode is entered.
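
A minimal sketch of the mode selection described above (the names here are illustrative, not the actual Ceph identifiers):

```cpp
#include <vector>

// Hypothetical sketch of the three-mode policy described above.
enum class FlushMode { IDLE, LOW, HIGH };

FlushMode choose_flush_mode(const std::vector<double>& pg_dirty_ratios,
                            double low_ratio,   // cache_target_dirty_ratio
                            double high_ratio)  // cache_target_dirty_high_ratio
{
  bool any_above_high = false;
  bool any_above_low = false;
  for (double r : pg_dirty_ratios) {
    any_above_high |= (r > high_ratio);
    any_above_low  |= (r > low_ratio);
  }
  if (any_above_high) return FlushMode::HIGH;  // full-speed flushing
  if (!any_above_low) return FlushMode::IDLE;  // no flushing needed
  return FlushMode::LOW;  // throttled: half the parallelism of high-speed mode
}
```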

-------- Original Message --------
Subject: Ceph Tiering Idea
Date: Fri, 22 May 2015 16:07:46 +0100
From: Nick Fisk nick@fisk.me.uk
To: liwang@ubuntukylin.com

Hi,

I’ve just seen your post to the Ceph Dev Mailing list regarding adding
temperature based eviction to the cache eviction logic. I think this is
a much needed enhancement and can’t wait to test it out once it hits the
next release.

I have been testing Ceph Cache Tiering for a number of months now and
another enhancement which I think would greatly enhance the performance
would be high and low thresholds for flushing and eviction. I have tried
looking through the Ceph source, but with my limited programming skills
I was unable to make any progress and so thought I would share my idea
with you and get your thoughts.

Currently as soon as you exceed the flush/eviction threshold, Ceph
starts aggressively flushing to the base tier which impacts performance.
For long running write operations this is probably unavoidable, however
most workloads are normally quite bursty and my idea of having high and
low thresholds would hopefully improve performance where the writes come
in bursts.

When the cache tier approaches the low threshold, Ceph would start
flushing/evicting with a low priority, so performance is not affected.
If the high threshold is reached, Ceph will flush more aggressively,
similar to the current behaviour. Hopefully during the quiet periods
in-between bursts of writes, the cache would slowly be reduced down to
the low threshold meaning it is ready for the next burst.

For example:

1TB Cache Tier

Low Dirty=0.4

High Dirty=0.6

The cache tier would contain 400GB of dirty data at idle; as dirty data rises above 400GB, Ceph would flush with a low priority or at a throttled MB/s rate.

If the cache tier rises above 600GB, Ceph will aggressively flush to keep dirty data below 60%.

The above should give you 200GB of capacity for bursty writes before performance becomes impacted.

Does this make sense?

Many Thanks,

Nick

Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
Mutex::Locker l(agent_lock);
flush_mode_high_count--;
}

Member

I think this means that if any PG in the cluster needs more flushing then all flushing will go faster, as opposed to just the PGs in the most-full pool. That's the best we can do currently, but we might also consider making the agent queue a priority queue? Hrm.

@liewegas (Member)

I kind of wish we could make this a smooth function instead of a step between high and low. There are the effort calculations, for example. But this has to happen across the whole OSD, so those effort values don't work very well. So this is probably the best we can do with the current infrastructure.
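
For illustration only, the smooth alternative would amount to scaling the flush effort continuously between the two thresholds rather than stepping between modes (a hypothetical sketch, not what this PR implements):

```cpp
// Hypothetical smooth ramp: effort 0.0 at the low threshold, 1.0 at the
// high threshold, linear in between. This PR instead uses a stepped policy.
double flush_effort(double dirty_ratio, double low, double high) {
  if (dirty_ratio <= low)  return 0.0;  // idle
  if (dirty_ratio >= high) return 1.0;  // full speed
  return (dirty_ratio - low) / (high - low);
}
```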

Mingxin Liu added 3 commits June 2, 2015 09:59
Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
…ands

Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
@dragonylffly (Contributor, Author)

Revised according to the comments; please review.

@dragonylffly (Contributor, Author)

I think this should be

uint64_t flush_high_target = MAX(pool.info.cache_target_dirty_ratio_micro, pool.info.cache_target_dirty_high_ratio_micro);

to handle when the high value is 0 (for upgraded clusters!).

We handle that in the decode process: for existing pools, the high ratio is initialized to the ratio.
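
For context, the decode-time backfill amounts to something like the following (a sketch; the micro-ratio field names appear in the review comment above, but the surrounding struct is illustrative):

```cpp
#include <cstdint>

// Illustrative sketch of the decode-time defaulting described above: pools
// created before this PR have no high ratio stored, so it decodes as zero
// and is initialized from the existing dirty ratio.
struct pool_opts_sketch {
  uint64_t cache_target_dirty_ratio_micro;
  uint64_t cache_target_dirty_high_ratio_micro;
};

void backfill_after_decode(pool_opts_sketch& p) {
  if (p.cache_target_dirty_high_ratio_micro == 0)
    p.cache_target_dirty_high_ratio_micro = p.cache_target_dirty_ratio_micro;
}
```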

@liewegas (Member)

liewegas commented Jun 3, 2015

Oh, right. There is still the case where the admin configures a value that is smaller. I think it's worth covering that one.

With that and the config option name change I'm happy with it. Thanks!

Mingxin Liu added 2 commits June 3, 2015 15:56
Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
@dragonylffly (Contributor, Author)

Oh, right. There is still the case where the admin configures a value that is smaller. I think it's worth covering that one.

We considered this before; however, we did not do it for the following three reasons: (1) after checking the other parameters, we found many of them do not do validity checks either; for example, cache_target_full_ratio can be set lower than cache_target_dirty_ratio; (2) it is the administrator's responsibility to understand the semantics and give a correct value; (3) even if the high ratio is lower than the ratio, there seems to be no problem: the flusher will just do the job more aggressively, according to the administrator's wishes. Nevertheless, we are totally happy to do the validity check, it is up to your decision :), and we could submit other patches to do the other missing validity checks as well.

@liewegas liewegas assigned liewegas and athanatos and unassigned liewegas Jun 4, 2015
@liewegas (Member)

liewegas commented Jun 4, 2015

Fair enough, looks good to me!

@XinzeChi (Contributor)

@dragonylffly, in slow flush mode, could we take system load into account? Would that make sense? For example, if system load is high while we are in flush mode, we could stop flushing for a while.

@LiumxNL (Contributor)

LiumxNL commented Jun 15, 2015

In slow flush mode, could we take system load into account? Would that make sense? For example, if system load is high while we are in flush mode, we could stop flushing for a while.

I don't think it is easy to take system load into account and to separate this pool's load from the others'. Secondly, when should we restart flushing, and how would we decide? I think it would get complicated, and it is hard to say whether it would perform better if we considered it. @XinzeChi

@XinzeChi (Contributor)

@LiumxNL, @dragonylffly, what about restricting flushing to an idle window defined by the user, such as between 22:00 and 07:00? That would be simpler.
During busy times, such as 08:00 - 21:00, a PG would not flush any objects unless the high flush mode is reached.
This could be a new feature that the user could choose to turn on or off.

@markhpc (Member)

markhpc commented Jun 17, 2015

Usually I don't like increasing the number of options that the user has to tweak, but in this case I think user-specified idle times seem pretty straightforward if an automatic mechanism can't be made.

@dragonylffly (Contributor, Author)

@XinzeChi @markhpc thanks for the suggestions. I personally also think it is a good idea to add an option giving the user the opportunity to specify a no-flushing time; however, what if we cannot wait until the specified flushing time is reached? Then the advantages of writeback throttling are lost. In addition, there is a little concern about whether the dirty objects would be kept in cache for too long, although the data are persisted... @liewegas what is your opinion?

@dragonylffly (Contributor, Author)

Yes, I think maybe the current implementation suffices: during a user-specified busy time, it would not do any flushing until reaching the low threshold; however, once we do reach the low threshold, we must start to flush.

@fiskn (Contributor)

fiskn commented Jun 18, 2015

I'm not sure adding "no flush" times would be particularly useful. The whole idea of low-speed flushing was to try and make sure the cache has some headroom for the next burst of writes. Currently you have to promote to do a write; for latency reasons you don't also want to be trying to evict an old object for every incoming write IO as well. During busy times you will definitely want to be doing low-speed flushing, otherwise you will soon find yourself bouncing around the high watermark. I would hope that the low-speed flushing should have a minimal impact on performance anyway.

@dragonylffly (Contributor, Author)

@tchaikov thanks for testing

dragonylffly added a commit that referenced this pull request Jun 23, 2015
…or-cache-tiering

Wip writeback throttling for cache tiering

This patch does writeback throttling for cache tiering, similar to what the Linux kernel does for page cache writeback. A parameter 'cache_target_dirty_high_ratio' (default 0.6) is introduced as the high-speed flushing threshold, while 'cache_target_dirty_ratio' (default 0.4) is left to represent the low-speed threshold. The flush speed is controlled by limiting the parallelism of flushing: the maximum parallelism under low speed is half of the parallelism under high speed. If at least one PG has a dirty ratio beyond the high threshold, full-speed mode is entered; if no PG has a dirty ratio beyond the low threshold, idle mode is entered; otherwise, slow-speed mode is entered.

Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
Tested-by: Kefu Chai <kchai@redhat.com>
@dragonylffly dragonylffly merged commit c1bd02c into ceph:master Jun 23, 2015
@VinceOnGit

Why do we have to wait for 'cache_target_dirty_ratio' (default 0.4) to start a low-speed flush? If I do some heavy writing up to 0.39, then nothing during the night while my config is sleeping, I will start a new day or even a new month at 0.39. That is to say, the next time I need to perform heavy writes, I will be penalised by flush operations... Is there a way to do flushing at a very slow rate as soon as the dirty ratio is above 0, then raise the rate at 0.4 and again at 0.6?

@fiskn (Contributor)

fiskn commented Jun 29, 2015

Because during busy times you might end up writing to the same blocks over and over again, in which case you don't want to keep flushing them to disk; you want to keep them in cache. Now that there are two thresholds, you have the power to adjust them to best suit your workload.

If you know you will have very bursty behaviour, keep them fairly close together and hopefully the cache full percentage should oscillate between them, whilst doing its best to keep the hot blocks in cache.

If you know that your workload will have long sustained periods of writes which you know will result in cache misses, then probably setting the low threshold to 0.1 or 0.2 and the high threshold to 0.8 will make sure the cache has plenty of space for the writes without the risk of high intensity flushing taking place.

If you really want to make sure the cache is clean or empty prior to the days work starting, then probably manipulating the thresholds with a nightly cron job or job scheduler is the best bet.
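
For anyone wanting to try the nightly approach, a crontab sketch along these lines would work, using the standard `ceph osd pool set` command and the option discussed in this PR (the pool name 'cachepool' and the threshold values are placeholders):

```sh
# Drain the cache overnight by lowering the low-speed threshold at 22:00...
0 22 * * * ceph osd pool set cachepool cache_target_dirty_ratio 0.1
# ...and restore the daytime threshold at 07:00 before the workload picks up.
0 7 * * * ceph osd pool set cachepool cache_target_dirty_ratio 0.4
```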
