Wip writeback throttling for cache tiering #4792

Merged
merged 6 commits into ceph:master on Jun 23, 2015

Conversation

dragonylffly (Contributor)

This patch does writeback throttling for cache tiering, similar to what the Linux kernel does for page cache writeback. The motivation and original idea were proposed by Nick Fisk, detailed in his email below. In our implementation, we introduce a parameter 'cache_target_dirty_high_ratio' (default 0.6) as the high-speed threshold, while leaving 'cache_target_dirty_ratio' (default 0.4) to represent the low-speed threshold. We control the flush speed by limiting the parallelism of flushing: the maximum parallelism under low speed is half of the parallelism under high speed. If at least one PG has a dirty ratio beyond the high threshold, full-speed mode is entered; if no PG has a dirty ratio beyond the low threshold, idle mode is entered; otherwise, slow-speed mode is entered.
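
A minimal sketch of the mode selection described above (the names here are illustrative, not the actual Ceph identifiers):

```cpp
#include <vector>

// Hypothetical sketch of the three-mode policy described above.
enum class FlushMode { IDLE, LOW, HIGH };

FlushMode choose_flush_mode(const std::vector<double>& pg_dirty_ratios,
                            double low_ratio,   // cache_target_dirty_ratio
                            double high_ratio)  // cache_target_dirty_high_ratio
{
  bool any_above_high = false;
  bool any_above_low = false;
  for (double r : pg_dirty_ratios) {
    any_above_high |= (r > high_ratio);
    any_above_low  |= (r > low_ratio);
  }
  if (any_above_high) return FlushMode::HIGH;  // full-speed flushing
  if (!any_above_low) return FlushMode::IDLE;  // no flushing needed
  return FlushMode::LOW;  // throttled: half the parallelism of high-speed mode
}
```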

-------- Original Message --------
Subject: Ceph Tiering Idea
Date: Fri, 22 May 2015 16:07:46 +0100
From: Nick Fisk nick@fisk.me.uk
To: liwang@ubuntukylin.com

Hi,

I’ve just seen your post to the Ceph Dev Mailing list regarding adding
temperature based eviction to the cache eviction logic. I think this is
a much needed enhancement and can’t wait to test it out once it hits the
next release.

I have been testing Ceph Cache Tiering for a number of months now and
another enhancement which I think would greatly enhance the performance
would be high and low thresholds for flushing and eviction. I have tried
looking through the Ceph source, but with my limited programming skills
I was unable to make any progress and so thought I would share my idea
with you and get your thoughts.

Currently as soon as you exceed the flush/eviction threshold, Ceph
starts aggressively flushing to the base tier which impacts performance.
For long running write operations this is probably unavoidable, however
most workloads are normally quite bursty and my idea of having high and
low thresholds would hopefully improve performance where the writes come
in bursts.

When the cache tier approaches the low threshold, Ceph would start
flushing/evicting with a low priority, so performance is not affected.
If the high threshold is reached, Ceph will flush more aggressively,
similar to the current behaviour. Hopefully during the quiet periods
in-between bursts of writes, the cache would slowly be reduced down to
the low threshold meaning it is ready for the next burst.

For example:

1TB Cache Tier

Low Dirty=0.4

High Dirty=0.6

The cache tier would contain 400GB of dirty data at idle; as dirty data rises above 400GB, Ceph would flush with a low priority or at a throttled MB/s rate.

If the cache tier rises above 600GB, Ceph will aggressively flush to keep dirty data below 60%.

The above should give you 200GB of capacity for bursty writes before performance becomes impacted.

Does this make sense?

Many Thanks,

Nick

Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
Mutex::Locker l(agent_lock);
flush_mode_high_count--;
}

Member

I think this means that if any PG in the cluster needs more flushing then all flushing will go faster, as opposed to just the PGs in the most-full pool. That's the best we can do currently, but we might also consider making the agent queue a priority queue? Hrm.

@liewegas (Member)

I kind of wish we could make this a smooth function instead of a step between high and low. There are the effort calculations, for example. But this has to happen across the whole OSD, so those effort values don't work very well. So this is probably the best we can do with the current infrastructure.
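
For illustration only, the smooth alternative would amount to scaling the flush effort continuously between the two thresholds rather than stepping between modes (a hypothetical sketch, not what this PR implements):

```cpp
// Hypothetical smooth ramp: effort 0.0 at the low threshold, 1.0 at the
// high threshold, linear in between. This PR instead uses a stepped policy.
double flush_effort(double dirty_ratio, double low, double high) {
  if (dirty_ratio <= low)  return 0.0;  // idle
  if (dirty_ratio >= high) return 1.0;  // full speed
  return (dirty_ratio - low) / (high - low);
}
```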

Mingxin Liu added 3 commits June 2, 2015 09:59
Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
…ands

Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
@dragonylffly (Contributor, Author)

Revised according to the comments; please review.

@dragonylffly (Contributor, Author)

I think this should be

uint64_t flush_high_target = MAX(pool.info.cache_target_dirty_ratio_micro, pool.info.cache_target_dirty_high_ratio_micro);

to handle when the high value is 0 (for upgraded clusters!).

We handle that in the decode process: for existing pools, the high ratio is initialized to the ratio.
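
For context, the decode-time backfill amounts to something like the following (a sketch; the micro-ratio field names appear in the review comment above, but the surrounding struct is illustrative):

```cpp
#include <cstdint>

// Illustrative sketch of the decode-time defaulting described above: pools
// created before this PR have no high ratio stored, so it decodes as zero
// and is initialized from the existing dirty ratio.
struct pool_opts_sketch {
  uint64_t cache_target_dirty_ratio_micro;
  uint64_t cache_target_dirty_high_ratio_micro;
};

void backfill_after_decode(pool_opts_sketch& p) {
  if (p.cache_target_dirty_high_ratio_micro == 0)
    p.cache_target_dirty_high_ratio_micro = p.cache_target_dirty_ratio_micro;
}
```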

@liewegas (Member)

liewegas commented Jun 3, 2015

Oh, right. There is still the case where the admin configures a value that is smaller. I think it's worth covering that one.

With that and the config option name change I'm happy with it. Thanks!

Mingxin Liu added 2 commits June 3, 2015 15:56
Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
@dragonylffly (Contributor, Author)

Oh, right. There is still the case where the admin configures a value that is smaller. I think it's worth covering that one.

We considered this before; however, we did not do it for the following three reasons: (1) after checking the other parameters, we found many of them do not do validity checks either; for example, cache_target_full_ratio can be set lower than cache_target_dirty_ratio; (2) it is the administrator's responsibility to understand the semantics and give a correct value; (3) even if the high ratio is lower than the ratio, there seems to be no problem: the flusher will just do the job more aggressively, according to the administrator's wishes. Nevertheless, we are totally happy to do the validity check, it is up to your decision :), and we could submit other patches to do the other missing validity checks as well.

@liewegas liewegas assigned liewegas and athanatos and unassigned liewegas Jun 4, 2015
@liewegas (Member)

liewegas commented Jun 4, 2015

Fair enough, looks good to me!

@XinzeChi (Contributor)

@dragonylffly, in slow flush mode, could we take system load into account? Would that make sense? For example, if system load is high while we are in flush mode, we could stop flushing for a while.

@LiumxNL (Contributor)

LiumxNL commented Jun 15, 2015

In slow flush mode, could we take system load into account? Would that make sense? For example, if system load is high while we are in flush mode, we could stop flushing for a while.

I don't think it is easy to take system load into account and to separate this pool's load from the others'. Secondly, when should we restart flushing, and how would we decide? I think it would get complicated, and it is hard to say whether it would perform better if we considered it. @XinzeChi

@XinzeChi (Contributor)

@LiumxNL, @dragonylffly, what about restricting flushing to an idle window defined by the user, such as between 22:00 and 07:00? That would be simpler.
During busy times, such as 08:00 - 21:00, a PG would not flush any objects unless the high flush mode is reached.
This could be a new feature that the user could choose to turn on or off.

@markhpc (Member)

markhpc commented Jun 17, 2015

Usually I don't like increasing the number of options that the user has to tweak, but in this case I think user-specified idle times seem pretty straightforward if an automatic mechanism can't be made.

@dragonylffly (Contributor, Author)

@XinzeChi @markhpc thanks for the suggestions. I personally also think it is a good idea to add an option giving the user the opportunity to specify a no-flushing time; however, what if we cannot wait until the specified flushing time is reached? Then the advantages of writeback throttling are lost. In addition, there is a little concern about whether the dirty objects would be kept in cache for too long, although the data are persisted... @liewegas what is your opinion?

@dragonylffly (Contributor, Author)

Yes, I think maybe the current implementation suffices: during a user-specified busy time, it would not do any flushing until reaching the low threshold; however, once we do reach the low threshold, we must start to flush.

@fiskn (Contributor)

fiskn commented Jun 18, 2015

I'm not sure adding "no flush" times would be particularly useful. The whole idea of low-speed flushing was to try and make sure the cache has some headroom for the next burst of writes. Currently you have to promote to do a write; for latency reasons you don't also want to be trying to evict an old object for every incoming write IO as well. During busy times you will definitely want to be doing low-speed flushing, otherwise you will soon find yourself bouncing around the high watermark. I would hope that the low-speed flushing should have a minimal impact on performance anyway.

@dragonylffly (Contributor, Author)

@tchaikov thanks for testing

dragonylffly added a commit that referenced this pull request Jun 23, 2015
…or-cache-tiering

Wip writeback throttling for cache tiering

This patch does writeback throttling for cache tiering, similar to what the Linux kernel does for page cache writeback. A parameter 'cache_target_dirty_high_ratio' (default 0.6) is introduced as the high-speed flushing threshold, while 'cache_target_dirty_ratio' (default 0.4) is left to represent the low-speed threshold. The flush speed is controlled by limiting the parallelism of flushing: the maximum parallelism under low speed is half of the parallelism under high speed. If at least one PG has a dirty ratio beyond the high threshold, full-speed mode is entered; if no PG has a dirty ratio beyond the low threshold, idle mode is entered; otherwise, slow-speed mode is entered.

Signed-off-by: Mingxin Liu <mingxinliu@ubuntukylin.com>
Reviewed-by: Li Wang <liwang@ubuntukylin.com>
Suggested-by: Nick Fisk <nick@fisk.me.uk>
Tested-by: Kefu Chai <kchai@redhat.com>
@dragonylffly dragonylffly merged commit c1bd02c into ceph:master Jun 23, 2015
@VinceOnGit

Why do we have to wait for 'cache_target_dirty_ratio' (default 0.4) to start a low-speed flush? If I do some heavy writing up to 0.39, then nothing during the night while my config is sleeping, I will start a new day or even a new month at 0.39. That is to say, the next time I need to perform heavy writes, I will be penalised by flush operations... Is there a way to do flushing at a very slow rate as soon as the dirty ratio is above 0, then raise the rate at 0.4 and again at 0.6?

@fiskn (Contributor)

fiskn commented Jun 29, 2015

Because during busy times you might end up writing to the same blocks over and over again, in which case you don't want to keep flushing them to disk; you want to keep them in cache. Now that there are two thresholds, you have the power to adjust them to best suit your workload.

If you know you will have very bursty behaviour, keep them fairly close together and hopefully the cache full percentage should oscillate between them, whilst doing its best to keep the hot blocks in cache.

If you know that your workload will have long sustained periods of writes which you know will result in cache misses, then probably setting the low threshold to 0.1 or 0.2 and the high threshold to 0.8 will make sure the cache has plenty of space for the writes without the risk of high intensity flushing taking place.

If you really want to make sure the cache is clean or empty prior to the days work starting, then probably manipulating the thresholds with a nightly cron job or job scheduler is the best bet.
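
For anyone wanting to try the nightly approach, a crontab sketch along these lines would work, using the standard `ceph osd pool set` command and the option discussed in this PR (the pool name 'cachepool' and the threshold values are placeholders):

```sh
# Drain the cache overnight by lowering the low-speed threshold at 22:00...
0 22 * * * ceph osd pool set cachepool cache_target_dirty_ratio 0.1
# ...and restore the daytime threshold at 07:00 before the workload picks up.
0 7 * * * ceph osd pool set cachepool cache_target_dirty_ratio 0.4
```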
