
rbd: implement image qos in tokenbucket algorithm. #17032

Merged

merged 2 commits into ceph:master from yangdongsheng:rbd_qos on Nov 15, 2017

Conversation

5 participants
@yangdongsheng
Member

commented Aug 15, 2017

Same as #16865, but recreated on the correct branch of my repo.

@dillaman please help to review. Thanks.

@@ -12,7 +12,9 @@
#include <map>
#include <mutex>

#include "Cond.h"

@amitkumar50

amitkumar50 Aug 15, 2017

Contributor

To maintain consistency throughout the code, can you kindly explain this: why does the header file name start with a capital letter? I believe in Ceph, headers start with lower case.

@dillaman

dillaman Aug 15, 2017

Contributor

@amitkumar50 That's an existing file -- there are several headers that start w/ upper case (including nearly 100% of the librbd headers to match the class name contained within it).

@amitkumar50

amitkumar50 Aug 15, 2017

Contributor

@dillaman Thanks for the clarification. All good from my end.

@@ -330,4 +332,71 @@ class OrderedThrottle {
uint32_t waiters = 0;
};


class Bucket {

@dillaman

dillaman Aug 16, 2017

Contributor

Why is this a standalone class? Can it just be folded into TokenBucketThrottle instead?

@yangdongsheng

yangdongsheng Aug 16, 2017

Author Member

Yes, we can put it in TokenBucketThrottle. The reason I put it outside is that I was wondering whether we might need to implement another LeakyBucketThrottle that would use Bucket as well. But that's just a future possibility, and we can move it out on that day. I will put Bucket into TokenBucketThrottle now.

void cancel_tokens();

public:
int get(int64_t c);

@dillaman

dillaman Aug 16, 2017

Contributor

The IO path needs to be non-blocking, so this would need to support Context callback. Since it would be wasteful to construct a Context for the allowed case, you could provide a templated get method that had a parameters for the callback class and function. If a throttle is required, it could create a Context to invoke the provided class/function when tokens are available and return false; otherwise it returns true to indicate that throttling isn't required.

@yangdongsheng

yangdongsheng Aug 16, 2017

Author Member

Thanks for your detailed suggestion. I will try an async way for this.

public:
int get(int64_t c);
void set_max(int64_t m);
void set_avg(int64_t avg);

@dillaman

dillaman Aug 16, 2017

Contributor

Nit: set_average

@@ -5226,6 +5226,14 @@ static std::vector<Option> get_rbd_options() {
Option("rbd_journal_max_concurrent_object_sets", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(0)
.set_description("maximum number of object sets a journal client can be behind before it is automatically unregistered"),

Option("rbd_image_qos_iops_read_limit", Option::TYPE_UINT, Option::LEVEL_ADVANCED)

@dillaman

dillaman Aug 16, 2017

Contributor

Nit: _image_ is unnecessary since all rbd_ options are for an image. Might also be good to name these rbd_read_iops_max and rbd_write_iops_max.

@@ -88,6 +89,9 @@ namespace librbd {
ImageWatcher<ImageCtx> *image_watcher;
Journal<ImageCtx> *journal;

TokenBucketThrottle *read_iops_throttle;

@dillaman

dillaman Aug 16, 2017

Contributor

Nit: these should be within librbd::io::ImageRequestWQ

@@ -179,6 +183,8 @@ namespace librbd {
uint32_t readahead_trigger_requests;
uint64_t readahead_max_bytes;
uint64_t readahead_disable_after_bytes;
uint64_t write_iops_limit;

@dillaman

dillaman Aug 16, 2017

Contributor

Nit: these are unused

@@ -597,6 +597,12 @@ void *ImageRequestWQ<I>::_void_dequeue() {
return nullptr;
}

if (write_op) {
m_image_ctx.write_iops_throttle->get(1);

@dillaman

dillaman Aug 16, 2017

Contributor

This will block the thread pool's worker thread. These need to be asynchronous callbacks when throttling is required. Also, if I set my write QoS to X and my read QoS to Y (where Y > X), all my reads can be throttled if I go past my write limit. Perhaps just start w/ a single rbd_iops_max configuration parameter that throttles both -- or provide separate read and write queues for throttled IOs.

@yangdongsheng

yangdongsheng Aug 16, 2017

Author Member

If I make this asynchronous, I believe the problem you mentioned will be gone, right?

@yangdongsheng

Member Author

commented Aug 16, 2017

Hi @dillaman, I can't come up with a good idea to make this asynchronous. If I do the work like #15799 and return nullptr when the front of the WQ needs to be throttled, that seems non-blocking, but the worker will check the same item again in the next _void_dequeue() and return nullptr again. Then the read IO behind the front of the WQ will be blocked as well, even though we did not set a read IOPS QoS.

So I think the solution to the problem you mentioned at #17032 (comment) is separating the read and write queues. Then a write IO would be blocked when it can't get a token, but there is no need to worry: all the IO behind this write in the same queue is write IO and should be blocked as well. Meanwhile, the read IO is being processed in another queue without any blocking.

So I am inclined to start with only rbd_iops_max as you suggested, and then provide separate queues for read and write later.

What's your suggestion?

@dillaman

Contributor

commented Aug 16, 2017

@yangdongsheng That's what I was trying to say in my last comment. Either (for now) don't try to separate the read and write IOPS throttles so that it doesn't matter, or else you will need to add two "throttled requests" lists to ImageRequestWQ. If a write request is throttled, push the request to the throttled write requests list, and if a read is throttled push it to the throttled read requests list. The callback from the throttle class can set a "do not throttle me" flag on the request, re-push the request at the top of the queue, and wake up the thread. Back in _void_dequeue, the request would bypass the throttle if the flag is set.

Alternatively, instead of the two lists, if the TokenBucketThrottle::get template method were to take a third template param, it could pass the request pointer and the request could get easily wrapped together w/ the callback to consolidate lists.
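The flag-based scheme described above can be sketched roughly as follows (all names hypothetical and heavily simplified versus the real ImageRequestWQ): a throttled request is parked, the throttle's callback sets a bypass flag and re-pushes it at the front, and _void_dequeue skips the throttle check when the flag is set.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical sketch of the "do not throttle me" flag scheme.
struct Request {
  bool throttle_bypass = false;  // set by the throttle callback on wakeup
};

struct ImageRequestWQ {
  std::deque<Request*> queue;      // main work queue
  std::deque<Request*> throttled;  // parked, waiting for tokens
  int64_t tokens = 0;              // stand-in for TokenBucketThrottle

  // Called by the throttle when tokens become available.
  void handle_throttle_ready(Request *req) {
    req->throttle_bypass = true;
    queue.push_front(req);         // re-push at the top of the queue
  }

  Request *void_dequeue() {
    if (queue.empty()) return nullptr;
    Request *req = queue.front();
    if (!req->throttle_bypass) {
      if (tokens < 1) {            // throttle required: park the request
        queue.pop_front();
        throttled.push_back(req);
        return nullptr;            // worker moves on instead of blocking
      }
      --tokens;
    }
    queue.pop_front();
    return req;
  }

  // Timer tick: refill and wake parked requests.
  void refill(int64_t t) {
    tokens += t;
    while (!throttled.empty() && tokens > 0) {
      Request *req = throttled.front();
      throttled.pop_front();
      --tokens;                    // consume a token for the woken request
      handle_throttle_ready(req);
    }
  }
};
```

The key property is that the worker thread never blocks: a throttled request leaves the queue entirely and only reappears once tokens are available, so requests behind it are not stalled.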

@yangdongsheng yangdongsheng force-pushed the yangdongsheng:rbd_qos branch 3 times, most recently from b2eba40 to e3aad72 Aug 17, 2017

@yangdongsheng

Member Author

commented Aug 17, 2017

@dillaman what about this version as a start, with only rbd_qos_iops_limit for both read and write?

@dillaman

Contributor

commented Aug 17, 2017

@yangdongsheng Fine w/ me

@yangdongsheng

Member Author

commented Aug 30, 2017

Hi @dillaman, I am preparing another PR for separate read and write QoS, but this PR covers both read and write together. Please help to review. Thanks a lot.


bool TokenBucketThrottle::Bucket::_wait(int64_t c)
{
utime_t start;

@amitkumar50

amitkumar50 Aug 30, 2017

Contributor

Looks like start is an unused variable here.

@yangdongsheng

yangdongsheng Aug 30, 2017

Author Member

@amitkumar50 Yeah, thanks.

@dillaman

Contributor

commented Sep 2, 2017

@yangdongsheng Looks like all my existing comments need to be addressed -- specifically with regard to async operation.

@yangdongsheng yangdongsheng force-pushed the yangdongsheng:rbd_qos branch 4 times, most recently from 36e644b to 85357b1 Sep 7, 2017

@yangdongsheng

Member Author

commented Sep 7, 2017

@dillaman Hi Jason, what about this version? There is a list in TokenBucketThrottle that holds the callback for each IO that needs to be blocked.

@yangdongsheng yangdongsheng force-pushed the yangdongsheng:rbd_qos branch from 85357b1 to 83673ec Sep 7, 2017

@dillaman
Contributor

left a comment

If bursting is to eventually be supported, shouldn't the tokens being added to the bucket represent an IO operation where the timer should be draining the bucket at the IOPS average and the max is set to the burst IOPS?

@@ -1464,7 +1466,8 @@ void Operations<I>::execute_metadata_remove(const std::string &key,
ldout(cct, 5) << this << " " << __func__ << ": key=" << key << dendl;

operation::MetadataRemoveRequest<I> *request =
new operation::MetadataRemoveRequest<I>(m_image_ctx, on_finish, key);
new operation::MetadataRemoveRequest<I>(m_image_ctx,
new C_NotifyUpdate<I>(m_image_ctx, on_finish), key);

@dillaman

dillaman Sep 8, 2017

Contributor

Nit: spacing -- align w/ previous line's parameter (if room) or bump all the parameters down and indent.

auto &expect = EXPECT_CALL(get_mock_io_ctx(mock_image_ctx.md_ctx),
exec(mock_image_ctx.header_oid, _, StrEq("rbd"), StrEq("metadata_list"), _, _, _));
if (r < 0) {
expect.WillOnce(Return(r));

@dillaman

dillaman Sep 8, 2017

Contributor

Add a test for the failure path

@@ -1,6 +1,8 @@
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
// vim: ts=8 sw=2 smarttab

#include <boost/bind.hpp>

@dillaman

dillaman Sep 8, 2017

Contributor

Nit: should be able to use std::bind

Bucket(CephContext *cct, const std::string& n, int64_t m = 0);
~Bucket();

public:

@dillaman

dillaman Sep 8, 2017

Contributor

Nit: duplicate

Mutex lock;
list<Cond*> cond;
int64_t get(int64_t c);
friend class TokenBucketThrottle;

@dillaman

dillaman Sep 8, 2017

Contributor

Nit: just make these structs and remove the need for friend class -- the C++11 standard also already provides access to nested classes.

};

class Blocker {
int64_t token_requested;

@dillaman

dillaman Sep 8, 2017

Contributor

Nit: tokens_requested -- why signed int throughout?

CephContext *cct;
const std::string name;
std::atomic<int64_t> remain = { 0 }, max = { 0 };
Mutex lock;

@dillaman

dillaman Sep 8, 2017

Contributor

Why does this require its own lock?

@yangdongsheng

yangdongsheng Sep 11, 2017

Author Member

To protect remain and max. Yes, remain and max are both atomic, but we need to make the function as a whole atomic, e.g. in put(c):

if (remain + c <= max)
    remain += c;

If remain is 2 and max is 5, and two threads A and B both call put(2):

Thread A                               Thread B
if (remain + c <= max)  // 2 + 2 <= 5
                                       if (remain + c <= max)  // 2 + 2 <= 5
                                       remain += c             // remain = 4
remain += c             // remain = 6

Then we can end up with remain > max.
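The point generalizes: atomic fields do not make a check-then-act sequence atomic. A standalone illustration (simplified names, not the Ceph code) of the locked put() being discussed:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Simplified illustration: remain and max are individually atomic, but
// the compound "if (remain + c <= max) remain += c" still races, so the
// whole operation is serialized with a lock.
struct Bucket {
  std::atomic<int64_t> remain{0};
  std::atomic<int64_t> max{0};
  std::mutex lock;

  void put(int64_t c) {
    std::lock_guard<std::mutex> l(lock);  // makes check-then-act atomic
    if (remain + c <= max)
      remain += c;                        // never pushes remain past max
  }
};
```

With the lock held, the two-thread interleaving above is impossible: the second put(2) observes remain == 4 and adds nothing, so remain can never exceed max.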

@dillaman

dillaman Sep 11, 2017

Contributor

... but since Bucket is only accessible via TokenBucketThrottle and it already has a lock, why not just have TokenBucketThrottle use its lock to protect its nested class?

@yangdongsheng

yangdongsheng Sep 12, 2017

Author Member

Do you mean TokenBucketThrottle::m_blockers_lock? That protects the blockers list, not the fields in Bucket. For example, set_max() is not protected by m_blockers_lock, so there would be a race condition if we didn't have Bucket::lock.

@dillaman

dillaman Sep 12, 2017

Contributor

TokenBucketThrottle::set_max can easily take the lock before invoking the method in TokenBucketThrottle::Bucket::set_max -- and TokenBucketThrottle::m_blockers_lock can be renamed TokenBucketThrottle::m_lock

@yangdongsheng

yangdongsheng Sep 12, 2017

Author Member

Hm... okay, not bad. Thanks.

{
}

TokenBucketThrottle::Bucket::~Bucket()

@dillaman

dillaman Sep 8, 2017

Contributor

Nit: unused -- remove

@@ -12,7 +12,9 @@
#include <map>
#include <mutex>

#include "Cond.h"

@dillaman

dillaman Sep 8, 2017

Contributor

Nit: not used

const std::string name;
std::atomic<int64_t> remain = { 0 }, max = { 0 };
Mutex lock;
list<Cond*> cond;

@dillaman

dillaman Sep 8, 2017

Contributor

Nit: not used

@yangdongsheng

Member Author

commented Sep 11, 2017

@dillaman yes, the max could be different from the average if we support bursting. For example, if we set iops_limit = 20 and iops_burst = 1000, then the max of the bucket would be 1000 and the average would be 20. But if we want to support iops_burst_length, like what QEMU is doing, I think we need some more code for it. In my opinion, maybe we need another bucket instance in TokenBucketThrottle, but I am not sure right now.
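The bucket arithmetic being discussed can be sketched as follows (assumed semantics for illustration, not the merged code): the timer adds avg tokens per tick, the capacity is capped at the burst maximum, so a full bucket admits a short burst of up to max IOs before settling at the sustained rate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical burst model: capacity = burst IOPS, refill rate = average.
struct BurstBucket {
  int64_t avg;     // tokens added per tick (sustained IOPS)
  int64_t max;     // bucket capacity (burst IOPS)
  int64_t tokens;

  BurstBucket(int64_t avg_, int64_t max_)
    : avg(avg_), max(max_), tokens(max_) {}  // start full: burst allowed

  // Timer tick: refill at the average rate, never above capacity.
  void tick() { tokens = std::min(tokens + avg, max); }

  // Try to take up to c tokens; returns how many IOs were admitted.
  int64_t get(int64_t c) {
    int64_t took = std::min(c, tokens);
    tokens -= took;
    return took;
  }
};
```

With iops_limit = 20 and iops_burst = 1000 as in the example above, the first tick can admit up to 1000 IOs, after which each tick admits only 20.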

@yangdongsheng yangdongsheng force-pushed the yangdongsheng:rbd_qos branch 2 times, most recently from 46999e0 to 0c2413b Sep 11, 2017

@yangdongsheng

Member Author

commented Sep 12, 2017

@ceph-jenkins retest this please

@yangdongsheng yangdongsheng force-pushed the yangdongsheng:rbd_qos branch 5 times, most recently from 40068ec to 820d451 Sep 12, 2017

Throttle: add a new TokenBucketThrottle
Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

@yangdongsheng yangdongsheng force-pushed the yangdongsheng:rbd_qos branch from f57b8bf to 470c1d8 Nov 8, 2017

@dillaman
Contributor

left a comment

lgtm

ceph:
conf:
client:
rbd qos iops limit = 50

@yangdongsheng

yangdongsheng Nov 14, 2017

Author Member

Oh, sorry, what a stupid mistake.

librbd: limit IO per second by TokenBucketThrottle
Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

@yangdongsheng yangdongsheng force-pushed the yangdongsheng:rbd_qos branch from 470c1d8 to bf4e454 Nov 14, 2017

@dillaman

Contributor

commented Nov 14, 2017

@yangdongsheng the fsx test hangs due to an issue where m_in_flight_writes is not properly handled when throttling. I plan to fix it as part of the merge commit once I re-run it through testing.

@dillaman dillaman merged commit bf4e454 into ceph:master Nov 15, 2017

5 checks passed

Docs: build check: OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded

dillaman added a commit that referenced this pull request Nov 15, 2017

Merge pull request #17032 from yangdongsheng/rbd_qos
rbd: implement image qos in tokenbucket algorithm

Reviewed-by: Jason Dillaman <dillaman@redhat.com>
@yangdongsheng

Member Author

commented Nov 15, 2017

@dillaman Okay, thanks.

@junxfl


commented Apr 18, 2018

I merged these patches into 10.2.5. I found a bug where image.close was unable to return because m_in_progress_writes had been incremented twice. I made the following modifications. Is there any problem with it? Thank you very much.
--- a/src/librbd/AioImageRequest.h
+++ b/src/librbd/AioImageRequest.h
@@ -64,6 +64,12 @@ public:
     return m_throttled_flag & flag;
   }
 
+  bool is_throttled() {
+    if (m_throttled_flag != 0)
+      return true;
+    return false;
+  }
+
   void set_throttled(uint64_t flag) {
     m_throttled_flag |= flag;
   }
diff --git a/src/librbd/AioImageRequestWQ.cc b/src/librbd/AioImageRequestWQ.cc
index 70480d2..32dcf39 100644
--- a/src/librbd/AioImageRequestWQ.cc
+++ b/src/librbd/AioImageRequestWQ.cc
@@ -398,7 +398,7 @@ void *AioImageRequestWQ::_void_dequeue() {
     }
 
     // refresh will requeue the op -- don't count it as in-progress
-    if (!refresh_required) {
+    if (!refresh_required && !peek_item->is_throttled()) {
       m_in_progress_writes.inc();
     }
   } else if (m_require_lock_on_read) {
@dillaman

Contributor

commented Apr 18, 2018

@junxfl This is a new feature for the mimic release, which means it's not planned to be backported to luminous (let alone the jewel release). Merge conflicts are to be expected and are something that you will have to handle if you want to run your own custom jewel-release of Ceph (due to resource limitations on our side).

@junxfl


commented Apr 24, 2018

I tested these patches with LIO + tcmu-runner. I set conf_rbd_qos_iops_limit 100 and conf_rbd_qos_read_iops_limit 30. I found that read IOPS can be limited to 30, but write IOPS only reaches 24, not 70. Could it be caused by all the IO requests being in the same queue? Do you have any suggestion? Thank you in advance! The fio configuration file is as follows:
[root@sds-72-139 conf]# cat config.iops.two
[global]
ioengine=libaio
bs=4k
runtime=30
iodepth=32
direct=1
thread
filename=/dev/sdb

[test]
rw=randread

[test1]
rw=randwrite


@yangdongsheng

Member Author

commented Apr 24, 2018

But there is a CLOSED PR #18762 to implement it. I am sorry I did not finish merging this into upstream. I will try my best to come back and push this feature into upstream.

But you can pick the patches and give them a try.

Thanks

@junxfl


commented Apr 24, 2018

Sorry, I merged other patches from your GitHub branch (https://github.com/yangdongsheng/ceph/commits/rbd_qos_read_write_bps) and then tested that.

@junxfl


commented Apr 24, 2018

Thank you very much for your reply. I merged all the patches from your PR #18762, and then I found that read and write can't be limited at the same time.

@yangdongsheng

Member Author

commented Apr 24, 2018

@junxfl I will rebase my read/write QoS branch against the latest Ceph and do the test you described.

@yangdongsheng

Member Author

commented Apr 24, 2018

@junxfl I rebased my patch and did a test as follows.

[root@atest-guest build]# rbd image-meta list test
There are 3 metadata on this image:

Key                           Value
conf_rbd_qos_iops_limit       100
conf_rbd_qos_read_iops_limit  30
conf_rbd_qos_write_iops_limit 70
[root@atest-guest build]# rbd bench --io-type write --io-size 4096 --io-threads 1 test
bench  type write io_size 4096 io_threads 1 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1        70     71.00  290811.30
    2       140     70.50  288763.34
    3       210     70.33  288080.68
    4       280     70.25  287739.36
    5       350     70.20  287534.56
    6       420     69.98  286658.04
    7       490     69.98  286658.04
    8       560     69.98  286658.04
    9       630     69.98  286658.05
   10       700     69.98  286658.05
   11       770     70.00  286715.38

on another terminal:

[root@atest-guest build]# rbd bench --io-type read --io-size 4096 --io-threads 1 test
bench  type read io_size 4096 io_threads 1 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1        31     32.03  131201.09
    2        61     31.02  127037.47
    3        91     30.68  125650.52
    4       121     30.51  124957.22
    5       151     30.41  124541.30
    6       181     29.99  122853.45
    7       211     29.99  122853.45
    8       241     29.99  122853.45
    9       271     29.99  122853.45

It works well.

@junxfl


commented Apr 24, 2018

@yangdongsheng but if two threads (read and write) operate on the same ImageCtx, the issue will come up, like I said.

@yangdongsheng

Member Author

commented Apr 24, 2018

Okay, then I will try that test case, but anyway, this should be a topic in my other PR. Thanks, and I will investigate it. Writes should not be blocked by the read throttle.

@yangdongsheng

Member Author

commented Apr 24, 2018

@junxfl I did a fio test as below; the result is a little different from what you met:

[root@atest-guest fio]# ./fio examples/rbd.fio --eta-newline 1
rbd_iodepth32: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
fio-2.99
Starting 2 processes
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=752KiB/s,w=216KiB/s][r=188,w=54 IOPS][eta 14h:33m:45s]
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=776KiB/s,w=256KiB/s][r=194,w=64 IOPS][eta 14h:47m:56s]
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=756KiB/s,w=228KiB/s][r=189,w=57 IOPS][eta 15h:18m:18s]
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=772KiB/s,w=128KiB/s][r=193,w=32 IOPS][eta 15h:25m:03s]
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=756KiB/s,w=212KiB/s][r=189,w=53 IOPS][eta 14h:59m:48s]
^C

fio file

[root@atest-guest fio]# cat examples/rbd.fio 
######################################################################
# Example test for the RBD engine.
# 
# Runs a 4k random write test against a RBD via librbd
#
# NOTE: Make sure you have either a RBD named 'fio_test' or change
#       the rbdname parameter.
######################################################################
[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=rbd
rbdname=test
bs=4k

[rbd_iodepth32]
iodepth=32
rw=randread
[rbd_iodepth32]
iodepth=32
rw=randwrite
[root@atest-guest build]# ./bin/rbd image-meta list test
There are 3 metadata on this image:

Key                           Value 
conf_rbd_qos_iops_limit       1000  
conf_rbd_qos_read_iops_limit  200   
conf_rbd_qos_write_iops_limit 50 
@junxfl


commented Apr 25, 2018

@yangdongsheng Yes, it appears normal when tested this way, but it shows an abnormal result when tested using LIO + tcmu. Let me double check it, many thanks.
