New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon: no delay for single message MSG_ALIVE and MSG_PGTEMP #12107

Merged
merged 1 commit into from Jan 16, 2017

Conversation

Projects
None yet
7 participants
@mslovy
Contributor

mslovy commented Nov 21, 2016

Currently, monitor may wait 'paxos_propose_interval' before propose_pending
However, we don't want this long wait for message MSG_ALIVE and MSG_PGTEMP since it will block the peering process so that all IO will block a long time when adding or removing osd in the cluster.
Thus, wait paxos_min_wait(default = 0.05) for those messages. Even through, we can set paxos_min_wait = 0 to make the propose_pending immediately if we want.

Signed-off-by: yaoning yaoning@unitedstack.com

@tchaikov tchaikov self-assigned this Nov 21, 2016

@gregsfortytwo

While it would be nice to push these through immediately, ALIVE and PGTEMP are exactly the message types we need to throttle. These are generated very quickly during thrash storms and if we generated a new OSD map for each ones, you'd be much more likely to destroy your cluster.

Unless @athanatos or @liewegas have other thoughts I don't think this is something we can do. Maybe if there were a bunch of extra tracking logic so it only does this for n pgtemp updates within a given time period, so you could let one OSD change go through quickly but throttle once the cluster is in trouble?

@mslovy

This comment has been minimized.

Contributor

mslovy commented Nov 22, 2016

Yeah, paxos_min_wait (default set: 0.05) is one of the throttle policy which make sure that it is only possible to have 20 maximum OSDMap changes per second, so the things we want to answers is whether it can finish at least 20 paxos updates in one second for Monitor. Any one here thinks leveldb and rocksdb cannot finish 20 updates per second?

Also, if during thrash storms, I think most OSDs are remain in unavailable state, so the purpose is still to make the whole cluster state picture clear as fast as possible if the monitor has enough ability.

Furthermore, I think it is possible to add throttle count to track the OSDMap update within one second and reset it in tick(). In this way, we exactly controll the throughput of the update? and do not wait any times if the throttle allows the request going on.

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented Nov 22, 2016

The question isn't about what the monitor's leveldb can push through; it's about how many cluster state updates the OSDs can handle. They need to do a fixed amount of fairly expensive processing on every map update and spamming out 20 updates every second is guaranteed to put most clusters into a death spiral.

@athanatos

This comment has been minimized.

Contributor

athanatos commented Nov 22, 2016

Yeah, OSDMaps are costly to process on the OSD side. That's the main concern.

@mslovy

This comment has been minimized.

Contributor

mslovy commented Nov 23, 2016

Ok, I think the main concern is that it is quite costy for pg to catch up the lastest map, right? the handle_osd_map in class OSD is not quite expensive.

Here, generally, it seems there are two cases(that is whether acting_up_affected):

  1. If yes, state machine should break the current state and enter Reset state, like the case you mentioned such like OSD is flapping. If the OSDMap generates too fast, then it will be sequentially dealing with each OSDMap updating Event and make the things worse.
    I think the impact is related to the number of PGs affected during flapping and the number of PGs in each OSD.
  2. If no, then it will be much better such like Updating up_thru according to the MSG_ALIVE. The OSDMap does not affect and break any pgs state machine and all things go well?

So can we immediately rise a propose for MSG_ALIVE, but need to do some further constraint on MSG_PGTEMP?

The goal of the improvement will be:

  1. if the things is predicted such like we just start one OSDs. we want MSG_PGTEMP and MSG_ALIVE acks as fast as possible.
  2. if too many flapping occurs, throttle the OSDMap updating in order to make sure it works fine.

so we can use
ENUM PROPOSE_WAIT {
NO_WAIT,
MIN_WAIT,
WAIT
}
to indicate what should be done.

Based on the throttle strategy, message types and something else to determine the delay time, right?

I will reconsider and proposal new strategy later.

@ghost ghost added the core label Nov 23, 2016

@athanatos

This comment has been minimized.

Contributor

athanatos commented Nov 23, 2016

Mmm, we still probably want to combine a bunch of up_thru messages into 1 map if possible.

@mslovy mslovy changed the title from mon: delay for a short period (paxos_min_wait) for message MSG_ALIVE and MSG_PGTEMP to mon: no delay for single message MSG_ALIVE and MSG_PGTEMP Nov 25, 2016

@mslovy

This comment has been minimized.

Contributor

mslovy commented Nov 25, 2016

Yeah, I agree that it is better to combine a bunch of up_thru messages.

Actually, what we want is:

  1. start a single osd, then we want the peering process finished immediately without waiting
  2. after backfilling, pg_temp will be cleared and then activating pg again, we want this process finish immediately

So what about current update PR?

@liewegas liewegas self-assigned this Nov 26, 2016

last_no_wait_time = now;
return true;
}
}

This comment has been minimized.

@gregsfortytwo

gregsfortytwo Nov 28, 2016

Member

This logic doubles the number of OSDMaps we'd otherwise generate — we'll do two every paxos_propose_interval. You'd need something at least a little more sophisticated. Maybe it would work if you renamed last_no_wait_time -> last_attempted_nowait_time, set it every time you run this check, and only allow it to succeed every two paxos_propose_intervals? Or just only let it go through if at least a paxos_propose_interval has passed since the previous OSDMap was generated?

@gregsfortytwo gregsfortytwo assigned mslovy and unassigned tchaikov and liewegas Nov 28, 2016

@liewegas

This comment has been minimized.

Member

liewegas commented Nov 28, 2016

What about something like this: In general, we don't want to propose more often than paxos_propose_interval, unless it is truly a high-priority even (like a command from the cli) that has no delay. But if we get an up_thru message and haven't proposed a map in a while (> propose interval) then then we could propose immediately without exceeding our rate limit. We still delay the min amount to hopefully batch a bit, but it's the min_interval value, so not long. If another up_thru comes after that, it has to wait a full interval before proposing again (because we just published a map).

I think this boils down to noting the time when the last round finished, and for one of these expidited messages, setting the delay such that we propose no sooner than previous round + the normal interval or now + min interval, whichever is later.

@mslovy

This comment has been minimized.

Contributor

mslovy commented Jan 11, 2017

hi,all,based on the comments, what's about this refined version?

@@ -131,6 +131,9 @@ class OSDMonitor : public PaxosService {
bool check_failure(utime_t now, int target_osd, failure_info_t& fi);
void force_failure(utime_t now, int target_osd);
// the time of last msg(MSG_ALIVE and MSG_PGTEMP) proposed without delay
utime_t last_attempped_minwait_time;

This comment has been minimized.

@gregsfortytwo

gregsfortytwo Jan 11, 2017

Member

"attempted", not "attempped"?

This comment has been minimized.

@mslovy

mslovy Jan 11, 2017

Contributor

sry, fixed

dout(15) << " propose as fast as possible for up_thru/pg_temp" << dendl;
utime_t now = ceph_clock_now();
if (now - last_attempped_minwait_time > g_conf->paxos_propose_interval) {

This comment has been minimized.

@gregsfortytwo

gregsfortytwo Jan 11, 2017

Member

This needs to look at the last map generated without using minwait as well -- it might be more recent!

This comment has been minimized.

@mslovy

mslovy Jan 11, 2017

Contributor

do you mean this?

@gregsfortytwo

This looks fine to me now. Testing required.

yaoning
mon: no delay for single message MSG_ALIVE and MSG_PGTEMP
delay and batch proposed as before if lots of messages arrive at the same time

if too many messages are MSG_ALIVE and MSG_PGTEMP, wait a long period as before

Signed-off-by: yaoning <yaoning@unitedstack.com>
// want to merge OSDMap changes as much as possible
if ((pending_inc.new_primary_temp.size() == 1
|| pending_inc.new_up_thru.size() == 1)
&& pending_inc.new_state.size() < 2) {

This comment has been minimized.

@tchaikov

tchaikov Jan 13, 2017

Contributor

@mslovy why we are checking the new_state.size() here? just to minimize the impact of flapping osd? and only let the single osd change pass through? am i right?

This comment has been minimized.

@mslovy

mslovy Jan 14, 2017

Contributor

Yeah, you are right. If lots of osd flapping and update its state. I think it predictably a lots of primary_temp and up_thru request arrived in a following short period. At this time, we do not want the first primary_temp or up_thru updated individually but expect to wait the following request arrived and update at once.

@tchaikov tchaikov assigned tchaikov and unassigned mslovy Jan 14, 2017

@yuriw yuriw merged commit f1618fe into ceph:master Jan 16, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details
@xiaoxichen

This comment has been minimized.

Contributor

xiaoxichen commented May 23, 2018

This change doesn't looks right and causing twice as much proposal as we targeted to (limited by paxos_propose_interval).

Imaging we have a sequence of pg_temp/up_thru during a large recovery.

now =T
The 1st up_thru/pg_temp will go through fast path and trigger propose in T + paxos_min_wait, last_attempted_minwait_time = T.

now = T+ paxos_min_wait
The [2, K] up_thru will failed by now - last_attempted_minwait_time > g_conf->paxos_propose_interval and go through PaxosService::should_propose, which will schedule the propose in T+paxos_propose_interval

now= T+ paxos_propose_interval + paxos_min_wait
The K+1 up_thru/pg_temp comes, both (now - last_attempted_minwait_time > g_conf->paxos_propose_interval and now - paxos->get_last_commit_time() > g_conf->paxos_min_wait satisfied, so we trigger another propose in now+ paxos_min_wait = T+ paxos_propose_interval +paxos_min_wait.

clearly we made TWO proposal in each paxos_propose_interval, which almost taking down our cluster when we reweight a brunch of OSDs

It looks to me like PaxosService::should_propose is exactly implement what @liewegas suggested in previous comment, this commit is not necessary.

I think this boils down to noting the time when the last round finished, and for one of these expidited messages, setting the delay such that we propose no sooner than previous round + the normal interval or now + min interval, whichever is later.

@xiaoxichen

This comment has been minimized.

Contributor

xiaoxichen commented May 25, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment