
osd: add an external operation queue #18280

Closed
wants to merge 1 commit into from

Conversation

@bspark8 (Contributor) commented Oct 13, 2017

Following the "number of shards in the op queue" issue raised in #16369, this PR introduces an external operation queue to address the dmClock queue depth problem caused by having multiple sharded op queues (https://www.slideshare.net/ssusercee823/implementing-distributed-mclock-in-ceph#13).

Measured performance with the external operation queue shows no significant difference from the baseline without it:

  1. FIO 4 KB random write without the external operation queue, BlueStore (original)
    137352 IOPS

  2. FIO 4 KB random write with the external operation queue, BlueStore (external opqueue: WPQ, internal sharded opqueue: WPQ)
    136914 IOPS

The external operation queue is responsible for external client requests only.
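Roughly, the data flow looks like this (a standalone sketch with illustrative names only, not the actual Ceph classes): external client ops pass through a single QoS queue before being distributed to the existing sharded op queues, so the shard/thread structure itself is left unchanged.

```cpp
// Illustrative sketch only -- not the actual Ceph classes. A single external
// queue orders client ops (where one dmClock instance would see the full
// queue depth) and then hands them to the existing per-shard queues.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct ClientOp {
  uint64_t client_id;
  uint64_t hash;   // used to pick a shard, just as the sharded queue does today
};

// Single queue seen by all external client requests.
class ExternalOpQueue {
  std::mutex lock;
  std::deque<ClientOp> q;   // stand-in for a dmClock/WPQ priority queue
public:
  void enqueue(ClientOp op) {
    std::lock_guard<std::mutex> l(lock);
    q.push_back(op);        // a real QoS queue would order by dmClock tags here
  }
  bool dequeue(ClientOp *out) {
    std::lock_guard<std::mutex> l(lock);
    if (q.empty()) return false;
    *out = q.front();
    q.pop_front();
    return true;
  }
};

// Existing per-shard queues: after the external QoS ordering, ops are sharded
// exactly as before, so background ops and the shard workers are untouched.
class ShardedQueues {
  std::vector<std::deque<ClientOp>> shards;
public:
  explicit ShardedQueues(std::size_t n) : shards(n) {}
  void enqueue(const ClientOp &op) {
    shards[op.hash % shards.size()].push_back(op);
  }
};

int main() {
  ExternalOpQueue external;
  ShardedQueues internal(5);     // e.g. the default of 5 op shards

  external.enqueue({1, 42});     // client ops enter via the external queue
  external.enqueue({2, 7});

  ClientOp op;
  while (external.dequeue(&op))  // drain in QoS order, then shard as usual
    internal.enqueue(op);
  return 0;
}
```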

Signed-off-by: bspark <bspark8@sk.com>
@myoungwon (Member) commented Oct 17, 2017

@liewegas @ivancich We need to discuss this PR. Does a single external operation queue really have no negative effect on performance?

@liewegas (Member)

Can you explain how this is different than configuring one shard with many threads servicing that shard?

@bspark8 (Contributor, Author) commented Oct 17, 2017

When applying dmClock to Ceph, there are some problems, as mentioned in #16369.

Problems

  1. The identifier of dmClock ServiceTracker
  2. The number of shards in Op Queue
  3. Weight Control's Delta/Rho for Background I/O

We are currently working on the following two approaches (A and B) in parallel, because A is not considered a complete solution.

A. (#16369) + one shard with many threads servicing that shard
B. Apply an external op queue (this PR)

Approach A solves problems 1 and 2. However, it forces the user to configure one shard with many threads servicing that shard, which effectively removes the existing Ceph shard structure. Problem 3 is also not solved (an alternative is currently being proposed: use normalized delta/rho values derived from what the current OSD observes).

Approach B (the external op queue) solves problems 1, 2, and 3 at the same time. The user does not need to be aware of the shards, and the external queue operates independently of the existing Ceph shard structure. As for problem 3, the external op queue performs QoS only among client I/Os, so there is no need to handle QoS between client I/Os and background I/Os there.
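For reference, approach A corresponds roughly to a configuration like this (option names as exposed for the sharded op queue around this time; the values are only an example, not a recommendation):

```ini
[osd]
# Approach A: collapse the op queue to a single shard so one dmClock queue
# sees every op, and give that shard many worker threads instead.
osd_op_num_shards = 1
osd_op_num_threads_per_shard = 16
```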

@yuyuyu101 (Member)

It looks like this introduces another queue and thread pool. I don't think it's a free lunch, especially in this critical path...

@liewegas (Member)

Yeah, I'm very interested in hearing whether we can normalize delta/rho instead...

@bspark8 (Contributor, Author) commented Oct 18, 2017

(https://www.slideshare.net/ssusercee823/implementing-distributed-mclock-in-ceph#12)
As mentioned in the slide above, when applying per-client QoS in Ceph, the current op queue structure has to handle client I/O and background I/O in the same queue at the same time.

For client I/O, delta/rho values are required to calculate the dmClock tag values. Background I/O therefore also needs delta/rho values so that QoS between client I/O and background I/O can be performed fairly.

However, for background I/O the location of the dmClock ServiceTracker that would calculate delta/rho is somewhat ambiguous, and a direct comparison with the delta/rho of client I/O is also difficult.

Therefore, using a normalized delta/rho value (e.g. a per-unit-time average) derived from what the current OSD observes could be the way forward.
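Roughly, something like the following could maintain such a normalized value (hypothetical names, not the dmclock library API): the OSD keeps a running average of the delta/rho carried by external client requests and hands that average to background ops that have no ServiceTracker of their own.

```cpp
// Illustrative sketch of "normalized" delta/rho for background I/O.
#include <cstdint>
#include <utility>

class DeltaRhoNormalizer {
  uint64_t delta_sum = 0, rho_sum = 0, samples = 0;
public:
  // Called for every external client request with the delta/rho it carried.
  void sample(uint64_t delta, uint64_t rho) {
    delta_sum += delta;
    rho_sum += rho;
    ++samples;
  }
  // Called when tagging a background op that has no delta/rho of its own.
  std::pair<uint64_t, uint64_t> normalized() const {
    if (samples == 0)
      return {1, 1};   // assumed neutral single-server values when nothing has been sampled
    return {delta_sum / samples, rho_sum / samples};
  }
  void reset() { delta_sum = rho_sum = samples = 0; }   // e.g. once per averaging window
};
```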

@bspark8 (Contributor, Author) commented Oct 18, 2017

Then, following the opinions of Sage and Haomai, work on the external op queue is paused, and we will try to proceed with the approach described above first.

Thank you.

@ivancich (Member)

@bspark8 I would like to hear more about delta/rho normalization. My understanding is that you feel we need this because the delta and rho values that come in with client requests disadvantage external client ops relative to background ops. But the reason for this is that those external clients are also getting some of their ops serviced by different OSDs, and delta and rho exist to factor that in. So it's not clear to me why this is an issue.

Perhaps a related and overarching issue is how to combine the different (but interrelated) purposes we see dmclock playing. We want dmclock to prioritize different classes of operations (client ops, background ops -- snaptrim, scrub, recovery). On top of that, we want to control the relative priority of different external client ops (initially by pool or by rbd image).

Right now we're flattening these various kinds of requests and priorities into a single dmclock queue on the OSD (assuming 1 shard). So each client competes on an equal basis (modulo dmclock params) with every other client and every background process. The more client requests there are, the more they could diminish necessary background processes, because each brings with it its own reservation and weight. Perhaps this is what we ultimately want, but I'm thinking about another possibility.

I'm thinking about a hierarchical dmclock. At the top level we could set "global" priorities of the various background processes against all client ops collectively. So there'd be perhaps four or five top-level categories (not sure whether replication ops would be separate at this level or not). Then, within the client ops category we'd have another dmclock queue where the clients would compete for their place using the dmclock priorities for, for example, the rbd image or pool.

Now we have global controls to weight the background processes against client requests. And we have separate control to prioritize various clients.

[Depending on our ultimate goals, perhaps it would be useful to consider the inversion of this hierarchy -- with each client at the top level along with a collective of all background ops, and then a lower level for the various types of background ops. That, however, seems to get us further from our goals as I understand them.]

I don't think the implementation would be that difficult. I imagine a dmclock queue at the top level where, as each client op comes in, it receives a proxy op with the global client configuration (reservation, weight, limit). Then, when we're ready to pull a client op for execution, we descend to the next-level dmclock queue to choose among the clients using client-specific configurations (reservation, weight, limit).
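Roughly, something like this (simple placeholder queues stand in for the real dmClock tag math at both levels; every type name here is hypothetical):

```cpp
// Rough sketch of the two-level ("hierarchical") dmclock idea described above.
#include <cstdint>
#include <deque>
#include <map>
#include <optional>
#include <string>

enum class Category { Client, Recovery, Scrub, SnapTrim };

struct Op { std::string payload; };

// Second level: chooses among external clients using per-client QoS parameters.
class ClientLevelQueue {
  std::map<uint64_t, std::deque<Op>> per_client;   // keyed by client / pool / rbd image
public:
  void enqueue(uint64_t client, Op op) { per_client[client].push_back(std::move(op)); }
  std::optional<Op> pull() {
    for (auto &kv : per_client) {                  // real code: pick by dmClock tag
      if (!kv.second.empty()) {
        Op op = std::move(kv.second.front());
        kv.second.pop_front();
        return op;
      }
    }
    return std::nullopt;
  }
  bool empty() const {
    for (auto &kv : per_client)
      if (!kv.second.empty()) return false;
    return true;
  }
};

// Top level: weights the client category as a whole against background categories.
class TopLevelQueue {
  std::map<Category, std::deque<Op>> background;   // recovery / scrub / snaptrim ops
  ClientLevelQueue clients;                        // all client ops, as one top-level entry
public:
  void enqueue_background(Category c, Op op) { background[c].push_back(std::move(op)); }
  void enqueue_client(uint64_t client, Op op) { clients.enqueue(client, std::move(op)); }

  std::optional<Op> pull() {
    // Real code: run dmClock across {Client, Recovery, Scrub, SnapTrim} here;
    // this placeholder simply prefers the client category when it is non-empty.
    if (!clients.empty())
      return clients.pull();                       // descend into the second level
    for (auto &kv : background) {
      if (!kv.second.empty()) {
        Op op = std::move(kv.second.front());
        kv.second.pop_front();
        return op;
      }
    }
    return std::nullopt;
  }
};
```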

I'd be curious what others think of this.

@bspark8 (Contributor, Author) commented Oct 25, 2017

Thank you for your feedback; I totally agree with you. Based on what you have said, I have reorganized my thoughts as follows.

The main reason a hierarchical dmclock is needed: if these various kinds of requests and priorities are flattened into a single dmclock queue on the OSD, then, as you say, the share available to necessary background ops diminishes as the number of individual clients grows.

The external and internal dmclock op queues in the current PR can also be considered a hierarchical dmclock.

The first thing to discuss is how to implement the hierarchical structure:

  1. Use two layers of dmclock queues directly in Ceph, as in the current PR.
  2. Implement the two-level hierarchy inside the dmclock library itself.

In addition, somewhat apart from this topic, my thoughts on delta/rho normalization are as follows (a small sketch follows the list):

  • client ops from clients: use the delta/rho generated by the client-side tracker
  • client ops from other OSDs (replication ops): use normalized delta/rho
  • background ops (snaptrim, scrub, recovery): use normalized delta/rho
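As a small illustration of the rule above (the names, including pick_delta_rho, are hypothetical and for illustration only):

```cpp
#include <cstdint>

enum class OpSource { ExternalClient, ReplicationFromOsd, Background };

struct ReqTag { uint64_t delta; uint64_t rho; };

// Choose which delta/rho a request carries into the dmClock queue, depending
// on where it came from: the client tracker's own values, or the OSD-wide
// normalized values sketched earlier in this thread.
ReqTag pick_delta_rho(OpSource src, ReqTag from_client_tracker, ReqTag normalized) {
  switch (src) {
  case OpSource::ExternalClient:
    return from_client_tracker;
  case OpSource::ReplicationFromOsd:   // replication ops from other OSDs
  case OpSource::Background:           // snaptrim, scrub, recovery
    return normalized;
  }
  return normalized;                   // unreachable; silences compiler warnings
}
```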

stale bot commented Oct 18, 2018

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If you are a maintainer or core committer, please follow-up on this issue to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Oct 18, 2018
@liewegas liewegas closed this Oct 19, 2018
@liewegas liewegas reopened this Oct 19, 2018
@stale stale bot removed the stale label Oct 19, 2018
stale bot commented Dec 18, 2018

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If you are a maintainer or core committer, please follow-up on this issue to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Dec 18, 2018
stale bot commented Apr 22, 2019

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@stale stale bot closed this Apr 22, 2019