
Integration of background manager with Handoff and AAE tree rebuild #481

Closed
wants to merge 5 commits

Conversation

buddhisthead (Contributor)

Replaced by #484

NOTE: DO NOT MERGE YET (requires #475 + a rebase to remove the bg_manager itself from this PR)

The riak_core background manager PR is here: #475

The riak_repl PR is here: basho/riak_repl#488

The background manager allows subsystems to coordinate via locks (and tokens) through a non-blocking API when desired. In our case, we've seen several customer support issues related to a production cluster being overloaded by the combined load of MDC fullsync replication running alongside handoff or AAE tree rebuilds (or all three!).

Each of these subsystems is being taught, for 2.0, to "take" a lock on the kv vnode/partition before it does a fold over that partition. Locks are released when the process dies or exits normally. Each subsystem (at this point in time) is coded to keep working on other partitions if it is denied a lock on one partition (signalled by a max_concurrency return value), which means progress is made even when one partition is busy.
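For illustration, a minimal Erlang sketch of that calling pattern follows. The function name (riak_core_bg_manager:get_lock/2) and the lock key shape are assumptions drawn from this description and #475, not necessarily the final API:

    %% Sketch: guard a per-partition fold with the background manager.
    %% get_lock/2 and the {vnode_lock, Partition} key are illustrative assumptions.
    maybe_fold(Partition, FoldFun) ->
        case riak_core_bg_manager:get_lock({vnode_lock, Partition}, self()) of
            max_concurrency ->
                %% Partition is busy; skip it and keep working on other partitions.
                {skipped, Partition};
            {ok, _LockRef} ->
                %% The lock is released automatically when this process exits.
                FoldFun(Partition)
        end.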

Use of this feature can be bypassed in production. Besides a global kill switch on the background manager itself, the riak_core configuration supports two new parameters, aae_skip_bg_manager and handoff_skip_bg_manager, which both default to false. Setting either to true makes the corresponding subsystem skip the background manager and proceed as it did before.
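For example, the flags could be set through the riak_core application environment (standard app.config/advanced.config terms; this is a sketch, and a cuttlefish schema is still a TODO, per the notes below):

    %% Sketch: bypass the background manager for AAE tree rebuilds only.
    [
     {riak_core, [
       {aae_skip_bg_manager, true},      %% AAE rebuilds skip the bg manager
       {handoff_skip_bg_manager, false}  %% handoff still takes per-vnode locks
     ]}
    ].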

@buddhisthead (Contributor Author)

Notes to self:

  • squash commits into a single entry.
  • add cuttlefish schema for bg_mgr configuration parameters

@@ -145,6 +146,7 @@ init([Mod, Index, InitialInactivityTimeout, Forward]) ->
    process_flag(trap_exit, true),
    State = #state{index=Index, mod=Mod, forward=Forward,
                   inactivity_timeout=InitialInactivityTimeout},
    try_set_vnode_lock_limit(Mod, Index),
Contributor:
As we discussed previously, I'm not sure riak_core_vnode is the correct place to be doing this. As the PR states, the goal of this work is to limit folds over riak_kv_vnodes. This registers a lock per partition, per module, so there is also, for example, a {vnode_lock, riak_pipe_vnode, Partition} lock for each partition. This may have the desired effect of also providing the locks for riak_kv_vnode, but the scope is too wide. I'll leave some comments elsewhere to try and illustrate.

It still seems to me that if we want to limit folds over riak_kv_vnode backends then riak_kv is the correct place to be registering these locks and acquiring them. The challenge is integrating that with handoff, because handoff is a general mechanism in core and its hooks at the riak_kv level are insufficient for grabbing locks. The correct way is probably to extend those hooks, but alas it is probably too late for that. As I previously expressed, I think we should skip handoff this go-around because of these issues. Handoff can still be limited via the node-wide concurrency limit, so providing exclusive access to the partition can be done via a mixture of two knobs instead of just one. I understand that handoff has been a large pain point and a motivation for this work, and that a more ideal approach is desired, but we have to start being realistic about what we can get in with the time we have.

Without handoff this becomes a much simpler problem. Register the locks in riak_kv_vnode as {riak_kv_vnode_lock, Partition}, and the Repl and AAE changes should basically remain the same. However, I really think the solution we should be discussing is doing locking across all backend folds that use ?FOLD_REQ.
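For illustration, a sketch of that kv-side registration; the bg-manager call (set_concurrency_limit/2) and the key shape are assumptions, not code from either PR:

    %% Sketch only: register the per-partition lock from riak_kv, not riak_core_vnode.
    %% A limit of 1 allows at most one coordinated fold per partition at a time.
    register_kv_vnode_lock(Partition) ->
        riak_core_bg_manager:set_concurrency_limit({riak_kv_vnode_lock, Partition}, 1).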

Contributor Author:

I agree that we should not register the pipe locks. I missed that. In the other integration points, I match on riak_kv_vnode and I think we should do the same here. There are plenty of other places in core that do the same thing - it's not like kv is completely isolated from core. I have run riak tests with this configuration and it works fine with handoff.

But maybe you have a good point about where to do the registration. We can do it from kv and still integrate with handoff.

Contributor:

We should definitely not match on the name of a specific vnode here. That is akin to a super-type being knowledgeable of its child types. In addition, riak_core is not only being used by Riak. This is registering locks on behalf of every vnode in every riak_core application. Core is certainly not the cleanest abstraction but the only references to a specific vnode are examples in tests.

Contributor Author:

Ok, I understand. Moving the registration to kv makes total sense to me. Sorry about the confusion.

* The goal is to allow riak sub-systems to coordinate use of shared resources,
* e.g. protect from concurrent vnode folds on the same partition.
* Locks and tokens have a settable maximum concurrency limit.
* "Taken" locks and tokens are tracked in an ETS table.
* max_concurrency is returned when the set limits are reached.
* Processes that take locks are monitored.
* Locks are released when the taking processes terminate.
* Tokens are refreshed at a specified periodic rate.
* Token processes are not monitored because tokens never "release".
* A table manager is introduced to add persistence across process crashes,
* and to allow proper table transfer to occur without losing the table.
* An EQC test exercises the majority of the API; see test/bg_manager_eqc.erl
* See the original PR for background manager here: #364
…le is unavailable
* Allows callers to try again later
* Use per-vnode locks to guard concurrent kv folds on the same partition.
* Include the kv module name in the lock to differentiate it from pipe/other vnode types.
@buddhisthead (Contributor Author)

replaced by #484

@jrwest jrwest modified the milestones: 2.0-beta, 2.0 Mar 24, 2014
@seancribbs seancribbs deleted the cet-bg-mgr-vnode-lock branch April 1, 2015 23:04