Allow the dmclock library to handle the "cost" of an operation/request #46

ivancich · 2018-01-25T20:25:18Z

Allow the dmclock library to handle the "cost" of an operation/request. This is in contrast to just measuring ops. If the cost is set at the value 1, then it's equivalent to the old code.

The cost is calculated on the server side and affects the tag values for the request. Furthermore, the cost is then sent back to the client, so the ServiceTracker can use the cost to properly calculate delta and rho values.

We now allow the delta and rho values sent with a request to be zero, since the request cost is included in the server-side tag calculation and must be positive, which guarantees that tags advance.

The OrigTracker has now been updated so as not to add one when computing delta and rho.
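As a rough illustration of the tag calculation described above, the sketch below follows the `tag_increment` line from the diff in this PR. It is not the library's code: the free function `next_tag`, its parameter types, and the reading of `increment` as the inverse of a client's reservation, weight, or limit are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-alone version of the tag calculation.
// time:         arrival time of the request
// prev:         previous tag of the same kind for this client
// increment:    assumed to be 1 / rate (reservation, weight, or limit)
// dist_req_val: delta or rho sent by the client; may now be zero
// cost:         server-side cost of this request; must be positive
inline double next_tag(double time, double prev, double increment,
                       uint64_t dist_req_val, uint64_t cost) {
  double tag_increment = increment * static_cast<double>(dist_req_val + cost);
  return std::max(time, prev + tag_increment);
}
```

With cost = 1 and a delta that no longer carries the extra one, `dist_req_val + cost` yields the same multiplier as before, and a zero delta still advances the tag by `increment * cost`.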

The unit tests have been updated to use request costs. The simulator has also been updated to use cost, and the cost of a client group's requests can be configured.

The simulation has been changed to use the OrigTracker as the default tracker.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>

ivancich · 2018-01-25T20:33:36Z

@bspark8 @TaewoongKim Please let me know your thoughts on this PR. Thanks!

TaewoongKim · 2018-02-01T09:12:45Z

src/dmclock_server.h

-	  }
-	  return std::max(time, prev + increment);
+	  double tag_increment = increment * (dist_req_val + cost);
+	  return std::max(time, prev + tag_increment);


I haven't dug into the code in depth yet. Sorry about that.
I just looked over the code roughly, and I have a question about how the cost is handled.
In the code, the cost is usually added to the rho or delta value. However, I think the cost needs to be multiplied instead, because cost is a relative value: it should represent how many times more resources a request consumes than a unit request (e.g. an 8KB write request).
In the dmclock paper, the authors say: "So, for tagging purposes, a single request of IO size S is treated as equivalent to: (1+S/(Tm×Bpeak)) IO requests"
I understood that sentence to mean that (1+S/(Tm×Bpeak)) is something like a cost, and that it should be multiplied by the tag increment so that the IO request is treated as multiple IO requests.

yeah, i feel the same. cost is much the same as the number of unit requests. and both \rho_{i} and \delta_{i} denote the number of served requests. if this certain request is treated as a multiple of the unit request, we should certainly use something like

Thank you both for your detailed look into this. This is the type of discussion I was hoping we'd have!

So cost is calculated on the server side by a formula as discussed in the ceph-devel discussion. So yes, on a hard disk it would have to incorporate seek times and read/write times.

One thing to keep in mind is that cost, rho, and delta are all in the same units. On the client side, when the response comes back, delta and possibly rho (if the request was handled during the reservation phase) are incremented by the cost value. See dmclock_client.h:dmclock::SimpleTracker::resp_update.
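A minimal sketch of that client-side update follows. It is deliberately simplified: `TrackerSketch` is a hypothetical stand-in rather than the real `SimpleTracker`, it keeps a single pair of counters, and it leaves out the per-server bookkeeping the library does.

```cpp
#include <cstdint>

using Cost = uint32_t;  // representation assumed for the example

enum class PhaseType { reservation, priority };

// Hypothetical, simplified tracker: counts the cost of service received.
struct TrackerSketch {
  uint64_t delta = 0;  // cost of all responses received
  uint64_t rho = 0;    // cost of responses served in the reservation phase

  // Called when a response arrives carrying the server-computed cost.
  void resp_update(PhaseType phase, Cost cost) {
    delta += cost;
    if (phase == PhaseType::reservation) {
      rho += cost;
    }
  }
};
```

Because the counters accumulate cost rather than a count of ops, they stay in the same unit as the cost the server adds during tag calculation.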

So delta (or rho) represents costs of services already received, and cost represents cost of services being scheduled to be received. Because they are in the same unit, they should not be multiplied but instead be added. It's not clear to me what cost^2 units would represent.

Cost doesn't carry a unit. It's a relative value, like a weight or a ratio, with no unit information.
Therefore, we need a separate unit that represents the abstract amount of resources needed to process a unit request (e.g. an 8KB write), so that a cost can be multiplied by this unit and the result has a meaning in the real world. This means the cost of a specific request is expressed as a ratio to the unit cost (e.g. the cost of an 8KB write).
Rho and delta are also ratio-like values without a unit.
So we can multiply these values with each other as many times as we like.

When I reproduce the experiment from the mclock paper, I find that the behavior of dmclock with cost integrated is not as expected (when cost=1). The reason is line 224: double tag_increment = increment * (dist_req_val + cost). My understanding is that cost should replace delta and rho (or multiply them). In dmclock_client.h, we already add the cost into delta and rho, so here at line 224 we should multiply increment by cost directly.

@tchaikov yes, I have updated the comment. Thanks

I'd like to update everyone on some conversations that @yzhan298 and I had. @yzhan298's suggested fix cannot be correct, because it would mean that rho and delta have no impact on tag calculation. Once we eliminate delta and rho from tag calculation we no longer have "dmclock" but instead "mclock". The modification did produce simulation results more in line with expectations, but that should serve as a reminder of the importance of having a theoretical grounding for changes.

But working through that with @yzhan298 made me question the calculation of delta and rho, which is done in the client-side trackers. So I re-evaluated that, and since we're adding cost in on the server side, we no longer have to ensure delta and rho are at least 1. So I modified the OrigTracker to get rid of the "1 +", and @yzhan298 was kind enough to test it out; it produced the expected results.
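To make the difference concrete, here is a small sketch comparing the two forms of the increment. Both function names are hypothetical and the parameter types are assumptions; only the arithmetic follows the discussion.

```cpp
#include <cstdint>

// Additive form used in this PR: work done elsewhere (delta or rho)
// plus the cost of this request.
inline double increment_additive(double increment, uint64_t dist_req_val,
                                 uint64_t cost) {
  return increment * static_cast<double>(dist_req_val + cost);
}

// Cost-only form from the suggested fix: dist_req_val is ignored, so
// service received from other servers no longer influences the tag.
inline double increment_cost_only(double increment, uint64_t /*dist_req_val*/,
                                  uint64_t cost) {
  return increment * static_cast<double>(cost);
}
```

With the cost-only form, a client that received eight units of service elsewhere gets the same increment as one that received none, which is the single-server mclock behaviour.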

In the past we'd migrated from the OrigTracker to the BorrowingTracker. The OrigTracker was written to implement the algorithm in the dmclock paper. It subtracts out responses it gets from the same server in the calculation of delta and rho.

But it seems like the new OrigTracker is the way to go.
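The idea behind the revised OrigTracker can be sketched roughly as follows. This is an illustration only: `OrigTrackerSketch`, its members, and the delta-only scope are assumptions, and the real tracker also handles rho.

```cpp
#include <cstdint>

// Hypothetical per-server tracker. `the_delta` is a counter shared by all
// of a client's per-server trackers.
struct OrigTrackerSketch {
  uint64_t delta_prev_req = 0;  // shared counter value at the previous request
  uint64_t my_delta = 0;        // cost received from this server since then

  // Delta to send with the next request to this server: cost received from
  // other servers since the last request here. Note there is no "1 +".
  uint64_t prepare_req(uint64_t the_delta) {
    uint64_t delta_out = the_delta - delta_prev_req - my_delta;
    delta_prev_req = the_delta;
    my_delta = 0;
    return delta_out;
  }

  // A response from this server arrived with the given cost.
  void resp_update(uint64_t& the_delta, uint64_t cost) {
    the_delta += cost;
    my_delta += cost;
  }
};
```

A client that has only heard from the server it is about to contact sends a delta of zero, which is acceptable now that the server adds a positive cost.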

@TaewoongKim You are absolutely correct that cost is a scalar. That is why, when we discussed the addition of cost, we decided that we should re-think how we conceptualize reservation and limit. Currently they are in units of ops per second, but it would likely be more appropriate to think of them as cost units per second. So a simple/small operation might serve as the baseline of one cost unit. Operations that require more data to be written/read would represent multiple cost units; the same would be the case for operations that ultimately turn into multiple sub-operations.

We are even contemplating expressing rates in cost units per some time unit other than seconds, to make sure reservations and limits are positive integers, given some issues with sending doubles in messages.

For the initial announcement and follow-up conversation, please see the ceph-devel RFC in:
https://marc.info/?l=ceph-devel&m=151362386931501&w=2

@tchaikov Yes, rho and delta are defined to count operations. But in the transition to thinking about costs and operations of widely varying costs, we need to include cost in rho and delta, which this PR does. They are measures of work done for a client by other servers in a given time interval (time since last request to the same server). And they should include not just # of ops but combined cost of those ops. That is why in the tracker the client increments delta by cost (and rho by cost when done in the reservation phase).

So when we calculate the tag for a new request, we need to incorporate the cost of work done by other servers and the calculated cost of this operation. We can think of it as cost_recent_past and cost_this and we therefore must add them together. Multiplying them together would result in "units" of cost_squared, which is not meaningful.

@TaewoongKim The part of the dmclock paper that you quote deals specifically with characteristics of a spinning hard disk. We need something more general, as we're not working strictly with hard disks but also with SSDs and non-volatile memory. Plus the cost of operations sent to the OSD will vary widely. So we need to work with a more general concept of cost, it seems.

ivancich · 2018-04-26T19:55:47Z

@yzhan298 This is exciting and I can't wait to spend some time looking over your suggested changes. I'm on vacation, but I think I'll have time this weekend. Thanks for your help on this!!

cbodley · 2018-05-07T15:27:48Z

src/dmclock_server.h

+      double   reservation;
+      double   proportion;
+      double   limit;
+      uint64_t cost;


consider exposing a typedef for Cost as part of the interface. this would help disambiguate it from other uint64_t parameters, and calling code would be less likely to rely on its exact representation

Very nice idea. Latest version does this.
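For readers following along, the suggestion amounts to something like the sketch below. The namespace, struct name, and chosen representation are assumptions; only the four fields come from the diff above.

```cpp
#include <cstdint>

namespace sketch {

// One alias to change if the representation of cost ever changes.
// Assumed small enough that pass-by-value is the cheaper choice.
using Cost = uint32_t;

// Hypothetical record mirroring the fields shown in the diff.
struct RequestTagSketch {
  double reservation;
  double proportion;
  double limit;
  Cost   cost;
};

}  // namespace sketch
```

Calling code that spells the type as `sketch::Cost` keeps compiling if the alias is later widened or narrowed.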

cbodley · 2018-05-07T15:28:31Z

src/dmclock_server.h

 	ready(false),
 	arrival(time)
      {
+	assert(cost > 0.0);


cost is no longer floating point, so cost > 0 here? (and again on line 180)

for requests that pass cost=0, it's probably better to fail with EINVAL error instead of aborting

Fixed 0 literal issues.

I don't know that I want to take on improving error returning results in this PR. But I think it's an excellent idea. At the moment there are a lot of asserts that'd need to be tackled.
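A possible shape for that follow-up, sketched under the assumption that the request entry point returns an int error code; `check_request_cost` is a hypothetical helper, not part of the library.

```cpp
#include <cerrno>
#include <cstdint>

using Cost = uint32_t;  // representation assumed for the example

// Hypothetical validation step: reject a zero cost with EINVAL instead of
// aborting the process via assert.
inline int check_request_cost(Cost cost) {
  return cost == 0 ? EINVAL : 0;
}
```

The caller would propagate the non-zero return value back to whoever submitted the request.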

tchaikov · 2018-05-10T02:38:19Z

src/dmclock_recs.h

@@ -33,20 +33,20 @@ namespace crimson {

    struct ReqParams {
      // count of all replies since last request; MUSTN'T BE 0
-      uint32_t delta;
+      uint64_t delta;


i am just curious why we want to use uint64_t for delta and rho now? is it because we could run into integer overflow with uint32_t? IMO, uint32_t is able to represent a fairly large number already.

This is a very good point. I think we had a mention that rho and delta would be u64s on our etherpad when we discussed it. But that seems too large. 4 billion should be more than enough to count the number of replies received from other servers. I'll switch it back.

That raises a related question. Should Cost (which is a type currently defined to be uint64_t) also be uint32_t? If we use cost=1 as a base, would there be an op that had a relative cost of more than 4 billion? If not, uint32_t would be sufficient.

@cbodley @tchaikov Your thoughts?

i can imagine a strategy that sets cost=bytes in a request, and radosgw will happily serve requests bigger than 4G. but given a Cost=uint32_t, you could easily adapt the strategy to some fraction of the bytes instead

are we doing much math with these values, where there'd be risk of an intermediate multiplication overflowing?

it seems like the benefits of uint32_t would be less overhead in osd messages, and less per-request memory for the dmclock prio queue? those savings could definitely add up

I can imagine someone trying that strategy as well, although I'd hope people would think in terms of blocks rather than bytes. Assuming 4K blocks a request of 4G would result in a cost of 1M. You'd then need a request of 16T to overflow cost.
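The arithmetic in that estimate can be checked with a short sketch. The 4K block size comes from the comment above; the helper names, the round-up policy, and mapping an empty request to a cost of 1 (since cost must be positive) are assumptions.

```cpp
#include <cstdint>
#include <limits>

constexpr uint64_t kBlockSize = 4096;  // 4K blocks, as in the estimate above

// Hypothetical strategy: cost = number of blocks, rounded up, never zero.
constexpr uint64_t blocks_for(uint64_t bytes) {
  return bytes == 0 ? 1 : (bytes + kBlockSize - 1) / kBlockSize;
}

constexpr bool fits_uint32(uint64_t cost) {
  return cost <= std::numeric_limits<uint32_t>::max();
}

// A 4G request costs 2^20 (about 1M) blocks.
static_assert(blocks_for(4ull << 30) == (1ull << 20), "4G -> 1M blocks");
// A 16T request is the first size whose block count exceeds uint32_t.
static_assert(!fits_uint32(blocks_for(16ull << 40)), "16T overflows uint32_t");
```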

So I think we're piecing together a case for uint32_t for cost. @tchaikov -- your thoughts?

ivancich · 2018-05-10T12:59:12Z

src/dmclock_server.h

+    // future changes; we're assuming that Cost is relatively small
+    // and that it would be more efficient to pass-by-value than
+    // by-reference.
+    using Cost = uint64_t;


@tchaikov's question regarding whether delta and rho should be uint64_t's raises a similar question w.r.t. Cost. Should Cost also be uint32_t? If we use cost=1 as a base, would there be an op that had a relative cost of more than 4 billion? If not, uint32_t would be sufficient.

@cbodley @tchaikov Your thoughts?

i think 4G is a very large number in the context of a networking protocol at this moment. if the value is actually greater than that, we can cap the cost at std::numeric_limits<uint32_t>::max() (which is about 4G).

you've convinced me that uint32_t is sufficient here
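The cap suggested above could look roughly like this; `clamp_cost` is a hypothetical helper and the 64-bit input type is an assumption.

```cpp
#include <cstdint>
#include <limits>

using Cost = uint32_t;

// Saturate a 64-bit raw cost at the largest value Cost can hold.
constexpr Cost clamp_cost(uint64_t raw) {
  return raw > std::numeric_limits<Cost>::max()
             ? std::numeric_limits<Cost>::max()
             : static_cast<Cost>(raw);
}
```

Saturating keeps very expensive requests at the maximum cost rather than letting the value wrap around to something small.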

ivancich · 2018-05-10T13:51:15Z

src/dmclock_server.h

-	    increment *= dist_req_val;
-	  }
-	  return std::max(time, prev + increment);
+	  double tag_increment = increment * (dist_req_val + cost);


@cbodley This is the key calculation with cost. If type Cost is changed to uint32_t, then we'd be adding two uint32_t's and then multiplying by a double. If we're concerned about overflow, we could cast one of the addends to a uint64_t to force uint64_t addition, as in:
double tag_increment = increment * (uint64_t(dist_req_val) + cost);
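A small sketch of why the cast matters, assuming the common case where uint32_t is unsigned int, so that the addition is performed in 32 bits. The two function names are hypothetical.

```cpp
#include <cstdint>

// 32-bit addition: wraps modulo 2^32 before the multiplication happens.
inline double tag_increment_narrow(double increment, uint32_t dist_req_val,
                                   uint32_t cost) {
  return increment * (dist_req_val + cost);
}

// Casting one addend forces 64-bit addition, so the sum cannot wrap.
inline double tag_increment_wide(double increment, uint32_t dist_req_val,
                                 uint32_t cost) {
  return increment * (uint64_t(dist_req_val) + cost);
}
```

With `dist_req_val` at its maximum and a cost of 1, the narrow version multiplies by zero while the wide version multiplies by 2^32.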

ivancich · 2018-05-11T12:28:43Z

@tchaikov I'd like to get this merged. You had indicated doubts about how cost was integrated. Have your doubts been addressed?

tchaikov · 2018-05-11T13:35:59Z

@ivancich yes. and regarding #46 (comment), i don't feel strongly either way.

cbodley · 2018-05-11T15:01:46Z

sim/src/config.h

@@ -37,6 +37,7 @@ namespace crimson {
      double client_reservation;
      double client_limit;
      double client_weight;
+      uint64_t client_req_cost;


can stuff in sim/ use the Cost typedef?

I ended up doing an odd work-around here. The simulator defines its own Cost and then I assert upon start-up that the sizeofs of the two are the same. I played a bit with SFINAE but couldn't get it to work and didn't want to spend more time on it. Your thoughts?

cbodley · 2018-05-11T15:02:09Z

src/dmclock_client.h

-	++the_delta;
-	++my_delta;
+			      Counter& the_rho,
+			      Counter cost) {


Counter -> Cost?

cbodley · 2018-05-11T15:06:17Z

src/dmclock_server.h

+    // future changes; we're assuming that Cost is relatively small
+    // and that it would be more efficient to pass-by-value than
+    // by-reference.
+    using Cost = uint64_t;


you've convinced me that uint32_t is sufficient here

cbodley · 2018-05-11T15:07:48Z

test/test_test_client.cc

@@ -45,6 +45,7 @@ TEST(test_client, full_bore_timing) {
  sim::TestResponse resp(0);
  dmc::PhaseType resp_params = dmc::PhaseType::priority;
  test::DmcClient* client;
+  const uint64_t request_cost = 1u;


uint64_t -> Cost

ivancich · 2018-05-11T19:18:11Z

sim/src/test_dmclock.cc

@@ -45,3 +46,13 @@ void test::dmc_client_accumulate_f(test::DmcAccum& a,
    ++a.proportion_count;
  }
 }
+
+
+// Note: this would be better if we could use std::is_same and SFINAE


@cbodley Here's where I played with SFINAE....

tchaikov · 2018-05-12T01:42:13Z

sim/src/test_dmclock.cc

+// Note: if this static_assert fails then our two definitions of Cost
+// do not match; change crimson::qos_simulation::Cost to match the
+// definition of crimson::dmclock::Cost.
+static_assert(std::is_same<crimson::qos_simulation::Cost,crimson::dmclock::Cost>::value,


nit, could use std::is_same_v next time.

Thanks; I did not know about it.
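For reference, the variable-template form looks like the sketch below. It requires C++17, and the two namespaces are stand-ins for crimson::dmclock and crimson::qos_simulation rather than the real ones.

```cpp
#include <cstdint>
#include <type_traits>

namespace dmclock_sketch { using Cost = uint32_t; }
namespace sim_sketch     { using Cost = uint32_t; }

// Fails to compile if the two definitions of Cost ever diverge.
static_assert(std::is_same_v<sim_sketch::Cost, dmclock_sketch::Cost>,
              "simulator Cost must match dmclock Cost");
```

Unlike a sizeof comparison, this also catches two different types that happen to have the same size.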

ivancich requested a review from tchaikov January 25, 2018 20:26

ivancich force-pushed the wip-add-cost branch from d71298e to 270908b Compare January 25, 2018 20:35

TaewoongKim reviewed Feb 1, 2018

ivancich force-pushed the wip-add-cost branch from 270908b to 511dca6 Compare March 16, 2018 18:19

ivancich mentioned this pull request Apr 18, 2018

osd: remove cost from mclock op queues; cost not handled well in dmcl… ceph/ceph#21428

Merged

ivancich force-pushed the wip-add-cost branch from 511dca6 to c3815c3 Compare April 19, 2018 22:10

Update .gitignore to deal with cscope files.

c7e4470

ivancich force-pushed the wip-add-cost branch 2 times, most recently from 030ee12 to 30f66fc Compare May 4, 2018 15:26

ivancich requested a review from cbodley May 4, 2018 17:47

cbodley reviewed May 7, 2018

ivancich force-pushed the wip-add-cost branch 3 times, most recently from 0b08294 to 373be44 Compare May 9, 2018 17:36

cbodley approved these changes May 9, 2018

tchaikov reviewed May 10, 2018

ivancich force-pushed the wip-add-cost branch from 373be44 to 17662d8 Compare May 10, 2018 12:50

ivancich commented May 10, 2018

ivancich added the enhancement label May 11, 2018

cbodley reviewed May 11, 2018

ivancich force-pushed the wip-add-cost branch from 17662d8 to 6369eab Compare May 11, 2018 19:14

ivancich commented May 11, 2018

ivancich force-pushed the wip-add-cost branch 2 times, most recently from ee91f28 to 9cbf6c3 Compare May 11, 2018 19:42

ivancich force-pushed the wip-add-cost branch from 9cbf6c3 to 6f8e3d6 Compare May 11, 2018 21:02

ivancich merged commit effea49 into ceph:master May 11, 2018

tchaikov reviewed May 12, 2018

cbodley mentioned this pull request Jul 3, 2018

DNM: rgw dmclock integration ceph/ceph#22834

Closed

theanalyst added a commit to theanalyst/ceph that referenced this pull request Jul 5, 2018

rgw: op: set dmclock_cost to 1

1a6bfa6

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Aug 7, 2018

rgw: op: set dmclock_cost to 1

7dfd1de

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Aug 7, 2018

rgw: op: set dmclock_cost to 1

840d28e

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Aug 30, 2018

rgw: op: set dmclock_cost to 1

6dfb13c

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Sep 7, 2018

rgw: op: set dmclock_cost to 1

f22613c

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Sep 18, 2018

rgw: op: set dmclock_cost to 1

4284f29

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Jan 4, 2019

rgw: op: set dmclock_cost to 1

a95414f

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Jan 7, 2019

rgw: op: set dmclock_cost to 1

49ebaf1

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Jan 30, 2019

rgw: op: set dmclock_cost to 1

0d1bf1b

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Jan 30, 2019

rgw: op: set dmclock_cost to 1

1443014

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>

theanalyst added a commit to theanalyst/ceph that referenced this pull request Jan 31, 2019

rgw: op: set dmclock_cost to 1

97a610d

This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
