
rbd: implement image qos in tokenbucket algorithm. #17032

Merged: 2 commits into ceph:master, Nov 15, 2017

Conversation

yangdongsheng (Contributor) commented Aug 15, 2017

Same as #16865, but recreated on the correct branch of my repo.

@dillaman please help review. thanx

@@ -12,7 +12,9 @@
#include <map>
#include <mutex>

#include "Cond.h"

amitkumar50 commented on the diff:

Comment: To maintain consistency throughout the code
Can you kindly explain this condition...
Why does this header file name start with a capital letter? I believe in ceph, headers start with lower case.

dillaman replied:

@amitkumar50 That's an existing file -- there are several headers that start w/ upper case (including nearly 100% of the librbd headers to match the class name contained within it).

amitkumar50 replied:

@dillaman Thanks for the clarification. All good from my end.

@@ -330,4 +332,71 @@ class OrderedThrottle {
uint32_t waiters = 0;
};


class Bucket {

dillaman commented on the diff:

Why is this a standalone class? Can it just be folded into TokenBucketThrottle instead?

yangdongsheng (author) replied:

Yes, we can put it in TokenBucketThrottle. The reason I put it outside is that I was wondering whether we might need to implement another LeakyBucketThrottle that would use the Bucket as well. But that's just a future possibility, and we can move it out on that day. I will put Bucket into TokenBucketThrottle now.
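For reference, a minimal sketch of that nesting (the field names here are illustrative, not the final code):

#include <cstdint>

class TokenBucketThrottle {
  class Bucket {
    int64_t remain = 0, max = 0;
  public:
    int64_t get(int64_t c);      // take up to c tokens, return the number taken
    void set_max(int64_t m);
  };

  Bucket m_throttle;             // the only consumer, so no standalone class
  // the timer, lock and blocked-request list would live here as well
};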

void cancel_tokens();

public:
int get(int64_t c);

dillaman commented on the diff:

The IO path needs to be non-blocking, so this would need to support Context callback. Since it would be wasteful to construct a Context for the allowed case, you could provide a templated get method that had a parameters for the callback class and function. If a throttle is required, it could create a Context to invoke the provided class/function when tokens are available and return false; otherwise it returns true to indicate that throttling isn't required.
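A rough sketch of the shape of that templated get(), with a plain std::function standing in for Ceph's Context (all member names below are illustrative assumptions):

#include <cstdint>
#include <functional>
#include <list>
#include <mutex>
#include <utility>

class TokenBucketThrottle_sketch {
  std::mutex m_lock;
  int64_t m_remain = 0;                 // tokens currently in the bucket
  std::list<std::pair<int64_t, std::function<void(int)>>> m_blockers;

public:
  // Returns true when c tokens are available, so no throttling is needed.
  // Otherwise records a callback to (handler->*MF)(int) and returns false;
  // the refill timer fires the callback once enough tokens accumulate.
  template <typename T, void (T::*MF)(int)>
  bool get(int64_t c, T *handler) {
    std::lock_guard<std::mutex> l(m_lock);
    if (m_remain >= c) {
      m_remain -= c;                    // common case: nothing is allocated
      return true;
    }
    m_blockers.emplace_back(c, [handler](int r) { (handler->*MF)(r); });
    return false;
  }
};

A caller would look like m_throttle.get<ImageRequestWQ, &ImageRequestWQ::handle_throttle_ready>(1, this), so the callback object is only constructed on the slow path.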

yangdongsheng (author) replied:

Thanx for your detailed suggestion. I will try an async way for this.

public:
int get(int64_t c);
void set_max(int64_t m);
void set_avg(int64_t avg);

dillaman:

Nit: set_average

@@ -5226,6 +5226,14 @@ static std::vector<Option> get_rbd_options() {
Option("rbd_journal_max_concurrent_object_sets", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(0)
.set_description("maximum number of object sets a journal client can be behind before it is automatically unregistered"),

Option("rbd_image_qos_iops_read_limit", Option::TYPE_UINT, Option::LEVEL_ADVANCED)

dillaman:

Nit: _image_ is unnecessary since all rbd_ options are for an image. Might also be good to name these rbd_read_iops_max and rbd_write_iops_max.

@@ -88,6 +89,9 @@ namespace librbd {
ImageWatcher<ImageCtx> *image_watcher;
Journal<ImageCtx> *journal;

TokenBucketThrottle *read_iops_throttle;

dillaman:

Nit: these should be within librbd::io::ImageRequestWQ

@@ -179,6 +183,8 @@ namespace librbd {
uint32_t readahead_trigger_requests;
uint64_t readahead_max_bytes;
uint64_t readahead_disable_after_bytes;
uint64_t write_iops_limit;

dillaman:

Nit: these are unused

@@ -597,6 +597,12 @@ void *ImageRequestWQ<I>::_void_dequeue() {
return nullptr;
}

if (write_op) {
m_image_ctx.write_iops_throttle->get(1);

dillaman:

This will block the thread pool's worker thread. These need to be asynchronous callbacks when throttling is required. Also, if I set my write qos to X and my read to Y (where Y > X), all my reads can be throttled if I go past my write limit. Perhaps just start w/ a single rbd_iops_max configuration parameter that throttles both -- or provide separate read and write queues for throttled IOs.

yangdongsheng (author) replied:

If I make this asynchronous, I believe the problem you mentioned will be gone, right?

yangdongsheng (author):

Hi @dillaman, I haven't been able to come up with a good way to make this asynchronous. If I do what #15799 does and return nullptr when the WQ front needs to be throttled, that seems non-blocking, but the worker will check the same item again in the next _void_dequeue() and return nullptr again. Then the read IO behind the WQ front is blocked as well, even though we did not set a read iops qos.

So I think the solution to the problem you mentioned at #17032 (comment) is separating the read and write queues. Then a write IO would be blocked when it can't get a token, but there is no need to worry: the IOs behind it in the same queue are all writes and should be blocked as well. Meanwhile, read IOs are processed in another queue without any blocking.

So I am inclined to start with only an rbd_iops_max as you suggested, and then provide separate queues for read and write later.

What's your suggestion?

dillaman commented Aug 16, 2017:

@yangdongsheng That's what I was trying to say in my last comment. Either (for now) don't try to separate the read and write IOPS throttle so that it doesn't matter, or else you will need to add two "throttled requests" lists to ImageRequestWQ. If a write request is throttled, push the request to the throttled write requests lists, and if a read is throttled push it to the throttled read requests list. The callback from the throttle class can set a "do not throttle me" flag on the request, re-push the request at the top of the queue, and wake up the thread. Back in _void_dequeue, the request would bypass the throttle if the flag is set.

Alternatively, instead of the two lists, if the TokenBucketThrottle::get template method were to take a third template param, it could pass the request pointer and the request could get easily wrapped together w/ the callback to consolidate lists.
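A sketch of that flag-and-requeue flow, building on the item-passing variant of get() just described (every name below is illustrative, not the real librbd code):

// Worker thread: never blocks; a throttled request leaves the queue and is
// held by the throttle's blocker list until tokens are available.
ImageRequest *ImageRequestWQ_sketch::dequeue() {
  ImageRequest *item = m_queue.front();
  if (!item->bypass_throttle &&
      !m_throttle.get<ImageRequestWQ_sketch, ImageRequest,
                      &ImageRequestWQ_sketch::handle_throttle_ready>(1, this, item)) {
    m_queue.pop_front();
    return nullptr;                     // worker moves on instead of waiting
  }
  m_queue.pop_front();
  return item;
}

// Throttle callback: flag the request so it bypasses the throttle next time,
// re-push it at the top of the queue, and wake up a worker thread.
void ImageRequestWQ_sketch::handle_throttle_ready(int r, ImageRequest *item) {
  item->bypass_throttle = true;
  m_queue.push_front(item);
  signal();
}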

yangdongsheng (author):

@dillaman what about this version as a start, with only rbd_qos_iops_limit covering both read and write?

dillaman:

@yangdongsheng Fine w/ me

yangdongsheng (author):

Hi @dillaman, I am preparing another PR for separate read and write qos, but this PR throttles read and write together. Please help review. thanx a lot.


bool TokenBucketThrottle::Bucket::_wait(int64_t c)
{
utime_t start;

amitkumar50 commented on the diff:

Looks like start is an unused variable here.

yangdongsheng (author) replied:

@amitkumar50 yea, thanx

dillaman commented Sep 2, 2017:

@yangdongsheng Looks like all my existing comments need to be addressed -- specifically with regard to async operation.

yangdongsheng (author):

@dillaman Hi Jason, what about this version? There is a list in TokenBucketThrottle that holds the callback for each IO that needs to be blocked.

dillaman left a comment:

If bursting is to eventually be supported, shouldn't the tokens being added to the bucket represent an IO operation where the timer should be draining the bucket at the IOPS average and the max is set to the burst IOPS?
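For context, a sketch of the refill tick this reading implies, continuing the earlier throttle sketch (the 1-second period and the m_avg/m_burst names are assumptions):

#include <algorithm>

// Invoked by a timer once per second.
void TokenBucketThrottle_sketch::add_tokens() {
  std::lock_guard<std::mutex> l(m_lock);
  // refill at the average IOPS, capped at the burst IOPS
  m_remain = std::min(m_remain + m_avg, m_burst);
  // hand out tokens to blocked requests in FIFO order
  while (!m_blockers.empty() && m_remain >= m_blockers.front().first) {
    m_remain -= m_blockers.front().first;
    m_blockers.front().second(0);       // fire the stored callback
    m_blockers.pop_front();
  }
}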

@@ -1464,7 +1466,8 @@ void Operations<I>::execute_metadata_remove(const std::string &key,
ldout(cct, 5) << this << " " << __func__ << ": key=" << key << dendl;

operation::MetadataRemoveRequest<I> *request =
new operation::MetadataRemoveRequest<I>(m_image_ctx, on_finish, key);
new operation::MetadataRemoveRequest<I>(m_image_ctx,
new C_NotifyUpdate<I>(m_image_ctx, on_finish), key);
dillaman:

Nit: spacing -- align w/ previous line's parameter (if room) or bump all the parameters down and indent.

auto &expect = EXPECT_CALL(get_mock_io_ctx(mock_image_ctx.md_ctx),
exec(mock_image_ctx.header_oid, _, StrEq("rbd"), StrEq("metadata_list"), _, _, _));
if (r < 0) {
expect.WillOnce(Return(r));
dillaman commented Sep 8, 2017:

Add a test for the failure path

@@ -1,6 +1,8 @@
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
// vim: ts=8 sw=2 smarttab

#include <boost/bind.hpp>
dillaman:

Nit: should be able to use std::bind
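For instance, assuming the bind is just wrapping the timer callback (the call site below is a guess, not the actual diff), the swap would look like:

// before (pulls in <boost/bind.hpp>):
new FunctionContext(boost::bind(&TokenBucketThrottle::add_tokens, this));
// after (only <functional> needed):
new FunctionContext(std::bind(&TokenBucketThrottle::add_tokens, this));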

Bucket(CephContext *cct, const std::string& n, int64_t m = 0);
~Bucket();

public:
dillaman:

Nit: duplicate

Mutex lock;
list<Cond*> cond;
int64_t get(int64_t c);
friend class TokenBucketThrottle;
dillaman:

Nit: just make these structs and remove the need for the friend class -- the C++11 standard also already provides access to nested classes.
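i.e. roughly, reusing the fields from the diff above:

struct Bucket {            // members public by default: no friend declaration needed
  CephContext *cct;
  const std::string name;
  int64_t remain = 0, max = 0;

  int64_t get(int64_t c);
};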

};

class Blocker {
int64_t token_requested;
dillaman:

Nit: tokens_requested -- why signed int throughout?

CephContext *cct;
const std::string name;
std::atomic<int64_t> remain = { 0 }, max = { 0 };
Mutex lock;
dillaman:

Why does this require its own lock?

yangdongsheng (author) replied:

To protect remain and max. Yes, remain and max are both atomic, but we need to make the whole function atomic, e.g. put(c):

if (remain + c <= max)
    remain += c;

If remain is 2 and max is 5, and two threads A and B both call put(2):

Thread A                                  Thread B
if (remain + c <= max)   // 4 <= 5, ok
                                          if (remain + c <= max)   // 4 <= 5, ok
                                          remain += c              // remain = 4
remain += c              // remain = 6

Then we can end up with remain > max.
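Holding a lock across the whole check-and-add closes that window; a minimal sketch, assuming Ceph's Mutex::Locker scoped lock and the Bucket fields from the diff:

void Bucket::put(int64_t c) {
  Mutex::Locker l(lock);   // serialize the check and the update
  if (remain + c <= max)
    remain += c;
}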

dillaman replied:

... but since Bucket is only accessible via TokenBucketThrottle and it already has a lock, why not just have TokenBucketThrottle use its lock to protect its nested class?

yangdongsheng (author) replied:

Do you mean TokenBucketThrottle::m_blockers_lock? That protects the blockers_list, not the fields in Bucket. For example, set_max() is not protected by m_blockers_lock, so there would be a race condition without a Bucket::lock.

dillaman replied:

TokenBucketThrottle::set_max can easily take the lock before invoking the method in TokenBucketThrottle::Bucket::set_max -- and TokenBucketThrottle::m_blockers_lock can be renamed TokenBucketThrottle::m_lock
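i.e. roughly, as a sketch of that suggestion:

void TokenBucketThrottle::set_max(int64_t m) {
  Mutex::Locker l(m_lock); // m_blockers_lock renamed to m_lock, as suggested
  m_throttle.set_max(m);   // Bucket::set_max itself no longer needs a lock
}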

yangdongsheng (author) replied:

Hm.... okey, not bad. thanx.

{
}

TokenBucketThrottle::Bucket::~Bucket()
dillaman:

Nit: unused -- remove

@@ -12,7 +12,9 @@
#include <map>
#include <mutex>

#include "Cond.h"
dillaman:

Nit: not used

const std::string name;
std::atomic<int64_t> remain = { 0 }, max = { 0 };
Mutex lock;
list<Cond*> cond;
dillaman:

Nit: not used

yangdongsheng (author):

@dillaman yes, the max could differ from the average if we support bursting. For example, if we set iops_limit = 20 and iops_burst = 1000, then the max of the bucket would be 1000 and the average would be 20. But if we want to support iops_burst_length, like what qemu does, I think we need some more code. In my opinion we might need another bucket instance in TokenBucketThrottle, but I am not sure right now.
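In other words, configuration would map onto the bucket roughly like this (a sketch; the names are illustrative):

void TokenBucketThrottle::set_limit(int64_t iops_limit, int64_t iops_burst) {
  Mutex::Locker l(m_lock);
  m_avg = iops_limit;                              // tokens added per second
  m_throttle.set_max(iops_burst > 0 ? iops_burst   // allow bursts up to max
                                    : iops_limit); // no burst: cap at average
}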

yangdongsheng (author):

@ceph-jenkins retest this please

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
dillaman left a comment:

lgtm

ceph:
conf:
client:
rbd qos iops limit = 50


yangdongsheng (author) replied:

Oh, sorry, what a stupid mistake.

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
dillaman:

@yangdongsheng the fsx test hangs due to an issue where m_in_flight_writes is not properly handled when throttling. I plan to fix it as part of the merge commit once I re-run it through testing.

@dillaman dillaman merged commit bf4e454 into ceph:master Nov 15, 2017
dillaman pushed a commit that referenced this pull request Nov 15, 2017
rbd: implement image qos in tokenbucket algorithm

Reviewed-by: Jason Dillaman <dillaman@redhat.com>
yangdongsheng (author):

@dillaman Okey, thanx

junxfl commented Apr 18, 2018:

I merged these patches into 10.2.5 and found a bug: image.close is unable to return because m_in_progress_writes is incremented twice. I made the following modifications. Is there any problem with it? Thank you very much.

--- a/src/librbd/AioImageRequest.h
+++ b/src/librbd/AioImageRequest.h
@@ -64,6 +64,12 @@ public:
     return m_throttled_flag & flag;
   }
 
+  bool is_throttled() {
+    if (m_throttled_flag != 0)
+      return true;
+    return false;
+  }
+
   void set_throttled(uint64_t flag) {
     m_throttled_flag |= flag;
   }

diff --git a/src/librbd/AioImageRequestWQ.cc b/src/librbd/AioImageRequestWQ.cc
index 70480d2..32dcf39 100644
--- a/src/librbd/AioImageRequestWQ.cc
+++ b/src/librbd/AioImageRequestWQ.cc
@@ -398,7 +398,7 @@ void *AioImageRequestWQ::_void_dequeue() {
   }
 
   // refresh will requeue the op -- don't count it as in-progress
-  if (!refresh_required) {
+  if (!refresh_required && !peek_item->is_throttled()) {
     m_in_progress_writes.inc();
   }
 } else if (m_require_lock_on_read) {

dillaman:

@junxfl This is a new feature for the mimic release, which means it's not planned to be backported to luminous (let alone the jewel release). Merge conflicts are to be expected and are something that you will have to handle if you want to run your own custom jewel-release of Ceph (due to resource limitations on our side).

junxfl commented Apr 24, 2018:

I tested these patches with LIO + tcmu-runner. I set conf_rbd_qos_iops_limit to 100 and conf_rbd_qos_read_iops_limit to 30, and found that read iops is limited to 30, but write iops reaches only 24, not 70. Could this be caused by all the IO requests being in the same queue? Do you have any suggestions? Thank you in advance! The fio configuration file is as follows:
[root@sds-72-139 conf]# cat config.iops.two
[global]
ioengine=libaio
bs=4k
runtime=30
iodepth=32
direct=1
thread
filename=/dev/sdb

[test]
rw=randread

[test1]
rw=randwrite

yangdongsheng (author) commented Apr 24, 2018 via email

yangdongsheng (author):

But there is a CLOSED pr #18762 implementing it. I am sorry I have not finished merging this into upstream; I will try my best to come back and push this feature into upstream.

But you can pick those patches and give them a try.

Thanx

junxfl commented Apr 24, 2018:

Sorry, I merged other patches from your GitHub branch (https://github.com/yangdongsheng/ceph/commits/rbd_qos_read_write_bps), then tested it.

junxfl commented Apr 24, 2018:

Thank you very much for your reply. I merged all the patches from your pr #18762, and then found that read and write can't be limited at the same time.

yangdongsheng (author):

@junxfl I will rebase my read/write qos branch against the latest ceph and run the test you described.

yangdongsheng (author):

@junxfl I rebased my patch and ran the following test.

[root@atest-guest build]# rbd image-meta list test
There are 3 metadata on this image:

Key                           Value
conf_rbd_qos_iops_limit       100
conf_rbd_qos_read_iops_limit  30
conf_rbd_qos_write_iops_limit 70
[root@atest-guest build]# rbd bench --io-type write --io-size 4096 --io-threads 1 test
bench  type write io_size 4096 io_threads 1 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1        70     71.00  290811.30
    2       140     70.50  288763.34
    3       210     70.33  288080.68
    4       280     70.25  287739.36
    5       350     70.20  287534.56
    6       420     69.98  286658.04
    7       490     69.98  286658.04
    8       560     69.98  286658.04
    9       630     69.98  286658.05
   10       700     69.98  286658.05
   11       770     70.00  286715.38

on another terminal:

[root@atest-guest build]# rbd bench --io-type read --io-size 4096 --io-threads 1 test
bench  type read io_size 4096 io_threads 1 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1        31     32.03  131201.09
    2        61     31.02  127037.47
    3        91     30.68  125650.52
    4       121     30.51  124957.22
    5       151     30.41  124541.30
    6       181     29.99  122853.45
    7       211     29.99  122853.45
    8       241     29.99  122853.45
    9       271     29.99  122853.45

It works well.

junxfl commented Apr 24, 2018:

@yangdongsheng but if two threads (read and write) operate on the same ImageCtx, the issue will come up, like I said.

yangdongsheng (author):

Okey, then I will try that test case. But anyway, this should be a topic for my other PR; thanx, I will investigate it. Writes should not be blocked by the read throttle.

yangdongsheng (author) commented Apr 24, 2018:

@junxfl I did a fio test as below; the result is a little different from what you described:

[root@atest-guest fio]# ./fio examples/rbd.fio --eta-newline 1
rbd_iodepth32: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
fio-2.99
Starting 2 processes
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=752KiB/s,w=216KiB/s][r=188,w=54 IOPS][eta 14h:33m:45s]
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=776KiB/s,w=256KiB/s][r=194,w=64 IOPS][eta 14h:47m:56s]
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=756KiB/s,w=228KiB/s][r=189,w=57 IOPS][eta 15h:18m:18s]
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=772KiB/s,w=128KiB/s][r=193,w=32 IOPS][eta 15h:25m:03s]
Jobs: 2 (f=2): [r(1),w(1)][0.0%][r=756KiB/s,w=212KiB/s][r=189,w=53 IOPS][eta 14h:59m:48s]
^C

fio file

[root@atest-guest fio]# cat examples/rbd.fio 
######################################################################
# Example test for the RBD engine.
# 
# Runs a 4k random write test against a RBD via librbd
#
# NOTE: Make sure you have either a RBD named 'fio_test' or change
#       the rbdname parameter.
######################################################################
[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=rbd
rbdname=test
bs=4k

[rbd_iodepth32]
iodepth=32
rw=randread
[rbd_iodepth32]
iodepth=32
rw=randwrite
[root@atest-guest build]# ./bin/rbd image-meta list test
There are 3 metadata on this image:

Key                           Value 
conf_rbd_qos_iops_limit       1000  
conf_rbd_qos_read_iops_limit  200   
conf_rbd_qos_write_iops_limit 50 

junxfl commented Apr 25, 2018:

@yangdongsheng Yes, it appears normal when tested this way, but it shows an abnormal result when tested with LIO + tcmu. Let me double-check it, many thanks.
