osd: add an external operation queue #18280
Conversation
This queue is responsible for external client requests only.

Signed-off-by: bspark <bspark8@sk.com>
Can you explain how this is different from configuring one shard with many threads servicing that shard?
When applying dmClock to Ceph, there are some problems, as mentioned in #16369.
Currently, we are working on the following two solutions (A, B) at the same time.
A. (#16369) + one shard with many threads servicing that shard
B. the external op queue (this PR)
With method A, problems 1 and 2 are solved. With method B (the external op queue), problems 1, 2, and 3 are solved at the same time.
It looks like this introduces another queue and threadpool. I don't think it's a free lunch, especially in this critical path...
Yeah, I'm very interested in hearing whether we can normalize delta/rho instead...
(https://www.slideshare.net/ssusercee823/implementing-distributed-mclock-in-ceph#12) In the case of client I/O, delta/rho are required to calculate the tagging value. Therefore, background I/O also needs delta/rho to provide fair QoS between client I/O and background I/O. However, in the case of background I/O, the location of the dmClock ServiceTracker for calculating delta/rho is somewhat ambiguous, and a direct comparison with the delta/rho of client I/O is also difficult. Therefore, using a normalized (e.g., unit-time average) delta/rho value observed on the current OSD could be the way.
Then, following the opinions of Sage and Haomai, work on the external op queue is paused, and we will try to proceed with the following method first.
Thank you.
@bspark8 I would like to hear more about delta/rho normalization. My understanding is that you feel we need this because the delta and rho values that come in with client requests disadvantage external client ops relative to background ops. But the reason for this is that those external clients are also getting some of their ops serviced by other OSDs, and delta and rho exist to factor that in. So it's not clear to me why this is an issue.

Perhaps a related and overarching issue is how to combine the different (but interrelated) purposes we see dmclock playing. We want dmclock to prioritize different classes of operations (client ops, background ops -- snaptrim, scrub, recovery). On top of that, we want to control the relative priority of different external client ops (initially by pool or by rbd image). Right now we're flattening these various kinds of requests and priorities into a single dmclock queue on the OSD (assuming 1 shard). So each client competes on an equal basis (modulo dmclock params) with every other client and every background process. The more client requests are out there, the more they could diminish necessary background processes, because each brings with it its own reservation and weight.

Perhaps this is what we ultimately want, but I'm thinking about another possibility: a hierarchical dmclock. At the top level we could set "global" priorities of the various background processes against all client ops collectively. So there'd be perhaps four or five top-level categories (not sure whether replication ops would be separate at this level or not). Then, within the client-ops category we'd have another dmclock queue where the clients would compete for their place using the dmclock priorities for, for example, the rbd image or pool. Now we have global controls to weight the background processes against client requests, and we have separate control to prioritize various clients.
[Depending on our ultimate goals, perhaps it would be useful to consider the inversion of this hierarchy -- with each client at the top level along with a collective of all background ops, and then a lower level for the various types of background ops. That, however, seems to get us further from our goals as I understand them.]

I don't think the implementation would be that difficult. I imagine a dmclock queue at the top level where, as each client op comes in, it receives a proxy op with the global client configuration (reservation, weight, limit). Then when we're ready to pull a client op for execution, we descend to the next-level dmclock queue to choose among the clients using client-specific configurations (reservation, weight, limit). I'd be curious what others think of this.
Thank you for your feedback; I totally agree with you. The main reason for the need for a hierarchical dmclock: the external and internal dmclock op queues in the current PR can also be considered a hierarchical dmclock. The first thing to discuss is the implementation of the hierarchical structure.

In addition, unrelated to the topic, my thoughts on delta/rho normalization are as follows.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |
As with the number of shards in the op queue (#16369),
an external operation queue was introduced to solve the problem of the dmClock queue depth deepening due to multiple sharded op queues. (https://www.slideshare.net/ssusercee823/implementing-distributed-mclock-in-ceph#13)
Measuring the performance of the external operation queue showed no significant difference from the case where the external operation queue was not used.
FIO 4KB random write without external operation queue, bluestore (original): 137352 IOPS
FIO 4KB random write with external operation queue, bluestore (external opqueue: WPQ, internal sharded opqueue: WPQ): 136914 IOPS