mon: no delay for single message MSG_ALIVE and MSG_PGTEMP #12107
Currently, the monitor may wait 'paxos_propose_interval' before calling propose_pending.
Signed-off-by: yaoning email@example.com
While it would be nice to push these through immediately, ALIVE and PGTEMP are exactly the message types we need to throttle. These are generated very quickly during thrash storms, and if we generated a new OSDMap for each one, you'd be much more likely to destroy your cluster.
Unless @athanatos or @liewegas have other thoughts I don't think this is something we can do. Maybe if there were a bunch of extra tracking logic so it only does this for n pgtemp updates within a given time period, so you could let one OSD change go through quickly but throttle once the cluster is in trouble?
Yeah, paxos_min_wait (default: 0.05) is one of the throttle policies; it ensures there can be at most 20 OSDMap changes per second. So the thing we want to answer is whether the monitor can finish at least 20 paxos updates in one second. Does anyone here think leveldb or rocksdb cannot finish 20 updates per second?
Also, during thrash storms I think most OSDs remain in an unavailable state, so the goal is still to make the whole cluster-state picture clear as fast as possible, if the monitor has enough capacity.
Furthermore, I think it is possible to add a throttle counter to track OSDMap updates within one second and reset it in tick(). In this way we control the update throughput exactly, and add no delay as long as the throttle allows the request to go through.
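The per-second counter idea above could be sketched roughly like this. This is an illustration only, not Ceph's actual code: the class and method names (`ProposalThrottle`, `allow`, `tick`) are made up for the sketch, and the real monitor would reset the window from its periodic tick().

```python
# Hypothetical sketch of the per-second throttle described above:
# count OSDMap proposals, allow up to a cap per second, and reset
# the counter from a periodic tick().
class ProposalThrottle:
    def __init__(self, max_per_second):
        self.max_per_second = max_per_second
        self.count = 0

    def allow(self):
        """Return True if a proposal may go out immediately."""
        if self.count < self.max_per_second:
            self.count += 1
            return True
        return False  # caller falls back to the delayed path

    def tick(self):
        """Called once per second; opens a fresh window."""
        self.count = 0

throttle = ProposalThrottle(max_per_second=20)
results = [throttle.allow() for _ in range(25)]
# first 20 requests pass, the remaining 5 are throttled
throttle.tick()  # next second: counter resets, requests pass again
```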
The question isn't about what the monitor's leveldb can push through; it's about how many cluster state updates the OSDs can handle. They need to do a fixed amount of fairly expensive processing on every map update and spamming out 20 updates every second is guaranteed to put most clusters into a death spiral.
Ok, I think the main concern is that it is quite costly for a PG to catch up to the latest map, right? handle_osd_map in class OSD is not that expensive.
Here, generally, it seems there are two cases (that is, whether acting_up_affected):
So can we immediately raise a proposal for MSG_ALIVE, but put some further constraint on MSG_PGTEMP?
The goal of the improvement will be:
so we can use the throttle strategy, the message type, and other factors to determine the delay time, right?
I will reconsider and propose a new strategy later.
Yeah, I agree that it is better to combine a bunch of up_thru messages.
Actually, what we want is:
So what about the current update to the PR?
What about something like this: In general, we don't want to propose more often than paxos_propose_interval, unless it is truly a high-priority event (like a command from the cli) that has no delay. But if we get an up_thru message and haven't proposed a map in a while (> propose interval), then we could propose immediately without exceeding our rate limit. We still delay the min amount to hopefully batch a bit, but it's the min_interval value, so not long. If another up_thru comes after that, it has to wait a full interval before proposing again (because we just published a map).
I think this boils down to noting the time when the last round finished, and for one of these expedited messages, setting the delay such that we propose no sooner than previous round + the normal interval or now + min interval, whichever is later.
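The scheduling rule above reduces to a single max(). Here is a minimal sketch, with times as plain floats in seconds; the function name is illustrative and the 1.0s propose interval is an assumed default, not quoted from the discussion:

```python
# Sketch of the rule: an expedited message may propose no sooner than
# (last round end + propose interval) or (now + min wait), whichever
# is later.
PAXOS_PROPOSE_INTERVAL = 1.0   # assumed value for illustration
PAXOS_MIN_WAIT = 0.05          # default noted in the discussion

def expedited_propose_time(last_round_end, now):
    return max(last_round_end + PAXOS_PROPOSE_INTERVAL,
               now + PAXOS_MIN_WAIT)

# A map was published at t=10.0 and an up_thru arrives at t=10.2:
# the rate limit dominates, so we wait until t=11.0.
print(round(expedited_propose_time(10.0, 10.2), 2))  # 11.0

# Nothing published since t=10.0 and an up_thru arrives at t=12.0:
# only the min batching delay applies, so t=12.05.
print(round(expedited_propose_time(10.0, 12.0), 2))  # 12.05
```

The second case is why the rate limit is never exceeded: once a map has been quiet for longer than the interval, an immediate-ish proposal still keeps the long-run rate at one per interval.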
This change doesn't look right; it causes twice as many proposals as we targeted (limited by paxos_propose_interval).
Imagine we have a sequence of pg_temp/up_thru messages during a large recovery.
now = T + paxos_min_wait
now = T + paxos_propose_interval + paxos_min_wait
Clearly we made TWO proposals in each paxos_propose_interval, which almost took down our cluster when we reweighted a bunch of OSDs.
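The doubling can be reproduced with a toy timeline. This simulation encodes one plausible reading of the complaint (my assumption, not stated in the PR): the expedited path rate-limits itself independently of the regular batched proposal, so with a message always pending the two paths interleave. Times are in hundredths of a second to keep the arithmetic exact.

```python
# Toy timeline showing ~two proposals per interval when the expedited
# path and the regular batched path each enforce their own schedule.
INTERVAL = 100   # paxos_propose_interval, in hundredths of a second
MIN_WAIT = 5     # paxos_min_wait

proposals = []
last_expedited = -INTERVAL   # let the first expedited proposal fire at t=0
next_regular = INTERVAL

for t in range(0, 300, MIN_WAIT):   # a pg_temp/up_thru is always pending
    if t - last_expedited >= INTERVAL:   # expedited path: now + min_wait
        last_expedited = t + MIN_WAIT
        proposals.append(t + MIN_WAIT)
    if t >= next_regular:                # regular batched path
        proposals.append(t)
        next_regular += INTERVAL

print([p / 100 for p in proposals])   # [0.05, 1.0, 1.1, 2.0, 2.15]
```

After the startup tick, each interval contains one regular proposal and one expedited proposal right behind it, i.e. T + min_wait and T + interval + min_wait as in the timeline above.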
It looks to me like