New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow the dmclock library to handle the "cost" of an operation/request #46
Conversation
@bspark8 @TaewoongKim Please let me know your thoughts on this PR. Thanks! |
d71298e
to
270908b
Compare
src/dmclock_server.h
Outdated
} | ||
return std::max(time, prev + increment); | ||
double tag_increment = increment * (dist_req_val + cost); | ||
return std::max(time, prev + tag_increment); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not digging the code in depth yet. Sorry about that.
I just looked the code roughly and I have a question about the manner dealing the cost.
In the code, the cost is usually added to rho or delta value. However, I think cost is somewhat needed to be multiplied. Because cost is a relative value. So it should represent how much times consume the resource than the unit request(e.g. 8KB write request.)
In the dmclock paper, author said "So, for tagging purposes, a single request of IO size S is treated as equivalent to: (1+S/(Tm×Bpeak)) IO requests"
I understood that sentence as (1+S/(Tm×Bpeak)) is something like a cost and it should be multiplied with tag incremental value to deal the IO request as multiple IO requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you both for your detailed look into this. This is the type of discussion I was hoping we'd have!
So cost is calculated on the server side by a formula as discussed in the ceph-devel discussion. So yes, on a hard disk it would have to incorporate seek times and read/write times.
One thing to keep in mind is that cost
, rho
, and delta
are all in the same units. On the client side, when the response comes back, delta
and possibly rho
(if the request was handled during the reservation phase) are incremented by the cost
value. See dmclock_client.h:dmclock::SimpleTracker::resp_update
.
So delta
(or rho
) represents costs of services already received, and cost
represents cost of services being scheduled to be received. Because they are in the same unit, they should not be multiplied but instead be added. It's not clear to me what cost^2
units would represent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cost doesn't include unit concept. It's relative value like weight or ratio value that don't have any unit info.
Therefore, we need another unit concept that represents an abstract amount of resources that would be needed to process a unit request(e.g 8KB write) so that a cost can be multiplied with this unit and the result can have a meaning in the real world. This means a cost for a specific request will be represented as a ratio to the unit cost(e.g. cost for an 8KB write).
Rho & delta is also a kind of ratio value and these don't have a unit also.
So, we can multiply these values with each other multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I reproduce the experiment from the mclock paper, I find the behavior of the dmclock with cost integrated is not the same as expected (when cost
=1). The reason for that is due to line 224: double tag_increment = increment * (dist_req_val + cost)
. My understanding is that cost should replace the delta
and rho
(or multiply it). In dmclock_client.h
, we already added the cost into delta
and rho
, so here at line 224, we should multiply increment
with cost
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tchaikov yes, I have updated the comment. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to update everyone with some conversations that @yzhan298 and I had. @yzhan298's suggested fix cannot be correct, because it would mean that rho and delta would have no impact on tag calculation, and that cannot be correct. Once we eliminate delta and rho from tag calculation we no longer have "dmclock" but instead "mclock". The modification did produce results from the simulation more in line with expectations. But that should serve as a reminder for the importance of having a theoretical grounding for changes.
But working through that with @yzhan298 made me question the calculation of delta and rho, and that's done in the client-side trackers. So I re-evaluated that and since we're adding cost in on the server side, we no longer have to insure delta and rho are at least 1. So I modified the OrigTracker to get rid of the "1 +" and @yzhan298 was kind enough to test it out, and it produced the expected results.
In the past we'd migrated from the OrigTracker to the BorrowingTracker. The OrigTracker was written to implement the algorithm in the dmclock paper. It subtracts out responses it gets from the same server in the calculation of delta and rho.
But it seems like the new OrigTracker is the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TaewoongKim You are absolutely correct that cost is a scalar. That is why when we discussed the addition of cost, we decided that we should re-think how we conceptualize how we think about reservation and limit. Currently they are in the unit of ops per second. But it would likely be more appropriate to think of them as cost units per second. So a simple/small operation might be the foundation as one cost unit. Operations that require more data to be written/read would represent multiple cost units; the same would be the case for operations that ultimately turn into multiple sub-operations.
We are even contemplating using cost units per time unit that is not seconds in order to make sure reservations and limits are expressed in positive integers given some issues in sending doubles in messages.
For the initial announcement and follow-up conversation, please see the ceph-devel RFC in:
https://marc.info/?l=ceph-devel&m=151362386931501&w=2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tchaikov Yes, rho and delta are defined to count operations. But in the transition to thinking about costs and operations of widely varying costs, we need to include cost in rho and delta, which this PR does. They are measures of work done for a client by other servers in a given time interval (time since last request to the same server). And they should include not just # of ops but combined cost of those ops. That is why in the tracker the client increments delta by cost (and rho by cost when done in the reservation phase).
So when we calculate the tag for a new request, we need to incorporate the cost of work done by other servers and the calculated cost of this operation. We can think of it as cost_recent_past
and cost_this
and we therefore must add them together. Multiplying them together would result in "units" of cost_squared, which is not meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TaewoongKim The part of the dmclock paper that you quote deals specifically with characteristics of a spinning hard disk. We need something more general as we're not working strictly with hard disks, but SSDs and non-volatile memory. Plus the cost of operations sent to the OSD will vary widely. So we need to work with a more general concept of cost, it seems.
@yzhan298 This is exciting and I can't wait to spend some time looking over your suggested changes. I'm on vacation, but I think I'll have time this weekend. Thanks for your help on this!! |
030ee12
to
30f66fc
Compare
src/dmclock_server.h
Outdated
double reservation; | ||
double proportion; | ||
double limit; | ||
uint64_t cost; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider exposing a typedef for Cost
as part of the interface. this would help disambiguate it from other uint64_t parameters, and calling code would be less likely to rely on its exact representation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice idea. Latest version does this.
src/dmclock_server.h
Outdated
ready(false), | ||
arrival(time) | ||
{ | ||
assert(cost > 0.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cost is no longer floating point, so cost > 0
here? (and again on line 180)
for requests that pass cost=0, it's probably better to fail with EINVAL error instead of aborting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed 0 literal issues.
I don't know that I want to take on improving error returning results in this PR. But I think it's an excellent idea. At the moment there are a lot of asserts that'd need to be tackled.
0b08294
to
373be44
Compare
src/dmclock_recs.h
Outdated
@@ -33,20 +33,20 @@ namespace crimson { | |||
|
|||
struct ReqParams { | |||
// count of all replies since last request; MUSTN'T BE 0 | |||
uint32_t delta; | |||
uint64_t delta; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am just curious why we want to use uint64_t
for delta and rho now? because we could be running into integer overflow with uint32_t
. as IMO, uint32_t
is able to represent a fairly large number already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good point. I think we had a mention that rho and delta would be u64s on our etherpad when we discussed it. But that seems to large. 4 billion should be more than enough to count the number of replies received from other servers. I'll switch it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i can imagine a strategy that sets cost=bytes in a request, and radosgw will happily serve requests bigger than 4G. but given a Cost=uint32_t, you could easily adapt the strategy to some fraction of the bytes instead
are we doing much math with these values, where there'd be risk of an intermediate multiplication overflowing?
it seems like the benefits of uint32_t would be less overhead in osd messages, and less per-request memory for the dmclock prio queue? those saving could definitely add up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can imagine someone trying that strategy as well, although I'd hope people would think in terms of blocks rather than bytes. Assuming 4K blocks a request of 4G would result in a cost of 1M. You'd then need a request of 16T to overflow cost.
So I think we're piecing together a case for uint32_t for cost. @tchaikov -- your thoughts?
src/dmclock_server.h
Outdated
// future changes; we're assuming that Cost is relatively small | ||
// and that it would be more efficient to pass-by-value than | ||
// by-reference. | ||
using Cost = uint64_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tchaikov's question regarding whether delta and rho should be uint64_t's raises a similar question w.r.t. Cost. Should Cost also be uint32_t? If we use cost=1 as a base, would there be an op that had a relative cost of more than 4 billion? If not, uint32_t would be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think 4G is a very large number in the context of a networking protocol at this moment. we can cap the cost at std::numeric_limit<uint32>::max()
(which is 4G). if its value is actually greater than it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you've convinced me that uint32_t is sufficient here
src/dmclock_server.h
Outdated
increment *= dist_req_val; | ||
} | ||
return std::max(time, prev + increment); | ||
double tag_increment = increment * (dist_req_val + cost); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbodley This is the key calculation with cost. If type Cost is changed to uint32_t
, then we'd be adding two uint32_t
's and then multiplying by a double
. If we're concerned about overflow, we could cast one of the addends to a uint64_t
to force uint64_t
addition, as in:
double tag_increment = increment * (uint64_t(dist_req_val) + cost);
@tchaikov I'd like to get this merged. You had indicated doubts about how cost was integrated. Have your doubts been addressed? |
@ivancich yes. and regarding to #46 (comment) , i don't feel strong either way. |
sim/src/config.h
Outdated
@@ -37,6 +37,7 @@ namespace crimson { | |||
double client_reservation; | |||
double client_limit; | |||
double client_weight; | |||
uint64_t client_req_cost; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can stuff in sim/ use the Cost typedef?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up doing an odd work-around here. The simulator defines its own Cost and then I assert upon start-up that the sizeofs of the two are the same. I played a bit with SFINAE but couldn't get it to work and didn't want to spend more time on it. Your thoughts?
src/dmclock_client.h
Outdated
++the_delta; | ||
++my_delta; | ||
Counter& the_rho, | ||
Counter cost) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Counter -> Cost?
src/dmclock_server.h
Outdated
// future changes; we're assuming that Cost is relatively small | ||
// and that it would be more efficient to pass-by-value than | ||
// by-reference. | ||
using Cost = uint64_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you've convinced me that uint32_t is sufficient here
test/test_test_client.cc
Outdated
@@ -45,6 +45,7 @@ TEST(test_client, full_bore_timing) { | |||
sim::TestResponse resp(0); | |||
dmc::PhaseType resp_params = dmc::PhaseType::priority; | |||
test::DmcClient* client; | |||
const uint64_t request_cost = 1u; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint64_t -> Cost
sim/src/test_dmclock.cc
Outdated
@@ -45,3 +46,13 @@ void test::dmc_client_accumulate_f(test::DmcAccum& a, | |||
++a.proportion_count; | |||
} | |||
} | |||
|
|||
|
|||
// Note: this would be better if we could use std::is_same and SFINAE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbodley Here's where I played with SFINAE....
ee91f28
to
9cbf6c3
Compare
operation/request. This is in contrast to just measuring ops. If the cost is set at the value 1, then it's equivalent to the old code. The cost is calculated on the server side and affects the tag values for the request. Furthermore, the cost is then sent back to the client, so the ServiceTracker can use the cost to properly calculate delta and rho values. We now allow the delta and rho values sent with a request to be zero, since the request cost is included on the server-side tag calculation, and the request cost must be positive guaranteeing the advancement of tags. The OrigTracker has now been updated so as not to add one when computing delta and rho. The unit tests have been updated to use request costs. And the simulator has been updated to use cost, and the cost of a client group's requests can be configured. The simulation has been changed to user the OrigTracker as the default tracker. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
// Note: if this static_assert fails then our two definitions of Cost | ||
// do not match; change crimson::qos_simulation::Cost to match the | ||
// definition of crimson::dmclock::Cost. | ||
static_assert(std::is_same<crimson::qos_simulation::Cost,crimson::dmclock::Cost>::value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could use std::is_same_v
next time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks; I did not know about it.
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
This was introduced in ceph/dmclock#46, let the default behaviour be that cost is 1 as we assert this. Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
Allow the dmclock library to handle the "cost" of an operation/request. This is in contrast to just measuring ops. If the cost is set at the value 1, then it's equivalent to the old code.
The cost is calculated on the server side and affects the tag values for the request. Furthermore, the cost is then sent back to the client, so the ServiceTracker can use the cost to properly calculate delta and rho values.
We now allow the delta and rho values sent with a request to be zero, since the request cost is included on the server-side tag calculation, and the request cost must be positive guaranteeing the advancement of tags.
The OrigTracker has now been updated so as not to add one when computing delta and rho.
The unit tests have been updated to use request costs. And the simulator has been updated to use cost, and the cost of a client group's requests can be configured.
The simulation has been changed to user the OrigTracker as the default tracker.
Signed-off-by: J. Eric Ivancich ivancich@redhat.com