
Service Fabric cluster membership providers #2542

Merged

Conversation

ReubenBond
Member

@ReubenBond ReubenBond commented Dec 22, 2016

This is a step towards Deep Service Fabric Integration (#1059)

This adds support for using Service Fabric itself for cluster membership.

Unlike existing membership providers (Azure Tables, Consul, etc.), this implements IMembershipOracle, not IMembershipTable. There is also an implementation of IMembershipTable which you can see in this PR (in ServiceFabricPropertyManagerMembershipProvider, which will be deleted).

I've verified in manual testing that this recovers from faults, by killing random hosts and by removing and redeploying the service. There are also some automated tests that verify the behavior of the FabricMembershipOracle.
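For context, here is a rough, abridged sketch of the difference between the two integration points. The member lists below are simplified and may not exactly match the Orleans interfaces in this release:

// Abridged sketch for context only; simplified, not the exact Orleans signatures.

// IMembershipTable: the provider is just shared storage. Orleans runs the
// membership and failure-detection protocol itself and persists its view here.
public interface IMembershipTable
{
    Task<MembershipTableData> ReadAll();
    Task<bool> InsertRow(MembershipEntry entry, TableVersion tableVersion);
    Task<bool> UpdateRow(MembershipEntry entry, string etag, TableVersion tableVersion);
    Task UpdateIAmAlive(MembershipEntry entry);
}

// IMembershipOracle: the provider owns the cluster view (here, Service Fabric's
// service-resolution notifications); Orleans only consumes the resulting statuses.
public interface IMembershipOracle // extends ISiloStatusOracle
{
    SiloStatus GetApproximateSiloStatus(SiloAddress silo);
    bool SubscribeToSiloStatusEvents(ISiloStatusListener listener);
}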

@ReubenBond ReubenBond added this to the 1.4.0 milestone Dec 22, 2016
@ReubenBond ReubenBond force-pushed the feature-servicefabric-membership branch 2 times, most recently from 8abd85d to 2b33fa2 Compare December 23, 2016 04:28
@ReubenBond ReubenBond force-pushed the feature-servicefabric-membership branch 2 times, most recently from 52d5f47 to 10d5140 Compare December 24, 2016 09:10
@ReubenBond ReubenBond force-pushed the feature-servicefabric-membership branch 3 times, most recently from f8e8f5d to 1d32075 Compare January 2, 2017 23:53
@ReubenBond
Member Author

Note that the first commit (56cf4e2) has a separate PR opened for it (#2557), at the request of @galvesribeiro, since the intention is to use IMembershipOracle for integration with other systems (namely ones from the Docker ecosystem).

The commit which this PR adds has only one change in non-Service-Fabric projects: specifying the base for SF error codes.

@ReubenBond ReubenBond force-pushed the feature-servicefabric-membership branch from 1d32075 to 3f61812 Compare January 3, 2017 22:47
@ReubenBond ReubenBond force-pushed the feature-servicefabric-membership branch from 3f61812 to ae3fe81 Compare January 4, 2017 03:48
@sergeybykov
Contributor

Similar to what I asked about Docker in #2569 (comment): is there an implied equivalency here between "service instance/partition is up" and "silo is running"? Can't we get into a situation where a silo is unresponsive, e.g. its messaging queues or threads are blocked, but SF still considers the instance/partition healthy and keeps it in the list of silos in the cluster?

/// </summary>
/// <param name="configuration">The Orleans cluster configuration.</param>
/// <returns>A <see cref="ServiceInstanceListener"/> which manages an Orleans silo.</returns>
public static ServiceReplicaListener CreateStateful(ClusterConfiguration configuration)
Contributor

CreateStateful

This doesn't appear to be used anywhere. Did you intend to add another sample for it? StatefulCalculator?

Member Author

I do intend to add such a sample, but the intention was to add one when we have deeper integration for stateful services. The code for stateless and stateful services is identical except for the type returned - so I'm comfortable including both without having samples for both.

If you'd rather, I can include a 'stateful' calculator service now, too, but the stateful part would be a misnomer since the Orleans part couldn't reasonably store state within SF until we specifically support that.

/// <param name="silos">The updated set of partitions.</param>
public void OnUpdate(FabricSiloInfo[] silos)
{
this.gateways = silos.Select(silo => silo.GatewayAddress.ToGatewayUri()).ToList();
Contributor

silos.Select(silo => silo.GatewayAddress.ToGatewayUri()).ToList();

Doesn't this assume that every silo is a gateway? Or does it rely on cluster config being the same for all nodes?

Member Author

@ReubenBond ReubenBond Jan 5, 2017

Yes, it does; thanks for pointing that out. I've replaced that line with this:

var newGateways = new List<Uri>(silos.Length);
foreach (var silo in silos)
{
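    // Only include silos that are configured as client gateways.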
    if (!silo.IsGateway) continue;
    newGateways.Add(silo.GatewayAddress.ToGatewayUri());
}

this.gateways = newGateways;

@ReubenBond
Member Author

ReubenBond commented Jan 5, 2017

@sergeybykov you're right.

I'm considering the best approach for creating a more global view of an individual silo's health. The system used by the default oracle is quite good, but it's not easily transferable.

How about if I divorce the ping functionality from the inbuilt MembershipOracle into a separate IPingResponder? It would have just one method, Ping().

Today the ping method is never actually hit, since we short-circuit it in the message receiver. (Search the code for PING_APPLICATION_HEADER.)

I'll then create an ISiloPingService in OrleansRuntime with a PingSilo(SiloAddress) method. The implementation will be a system target.

Finally, I'll write a FabricSiloWatchdog class which will listen for service updates and periodically ping each of them, reporting health warnings to Service Fabric when a failure occurs. If too many health events are fired for a particular silo, the watchdog will instruct Service Fabric to kill that replica/instance so that it can be moved elsewhere. I'll make sure that a restart only occurs if the faulty health events span a certain minimum time, e.g. 2 minutes.
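A rough sketch of what that watchdog loop might look like. ISiloPingService, ProbeAsync, and ReportSiloUnhealthy are hypothetical names that just illustrate the proposal, and it assumes FabricSiloInfo exposes a parsed SiloAddress; none of this is in the PR:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch of the proposed FabricSiloWatchdog; ISiloPingService and
// ReportSiloUnhealthy are illustrative names, not shipped APIs.
internal class FabricSiloWatchdog
{
    private readonly ISiloPingService pingService; // proposed system-target client
    private readonly Dictionary<SiloAddress, int> failureCounts = new Dictionary<SiloAddress, int>();

    public FabricSiloWatchdog(ISiloPingService pingService)
    {
        this.pingService = pingService;
    }

    // Called periodically with the latest set of silos reported by Service Fabric.
    public async Task ProbeAsync(FabricSiloInfo[] silos)
    {
        foreach (var silo in silos)
        {
            try
            {
                await this.pingService.PingSilo(silo.SiloAddress);
                this.failureCounts[silo.SiloAddress] = 0;
            }
            catch (Exception)
            {
                int failures;
                this.failureCounts.TryGetValue(silo.SiloAddress, out failures);
                this.failureCounts[silo.SiloAddress] = ++failures;

                // Report a health warning to Service Fabric; only escalate to a restart
                // once the failures span the minimum window (e.g. 2 minutes).
                ReportSiloUnhealthy(silo, failures);
            }
        }
    }

    private void ReportSiloUnhealthy(FabricSiloInfo silo, int failures)
    {
        // Report to Service Fabric's health store and, past the threshold,
        // ask SF to restart the replica/instance (omitted here).
    }
}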

@sergeybykov
Contributor

I'm considering the best approach for creating a more global view of an individual silo's health.

Before we rush to a solution, I think we need to step back and try to think through this challenge holistically, as it applies to any 'outsourced' membership implementation, e.g. Docker. I'm curious to hear @gabikliot's opinion on this.

Today, in the membership oracle those two concerns (membership management and failure detection) are intertwined, and for a (seemingly) good reason: we are able to detect network partitions and asymmetric disconnects in a cluster. Distribution of the remote monitoring (pinging) responsibility is integrated with updates to cluster state, so that we always have enough monitoring coverage. That part is conceptually easy to decouple from membership management, although that would mean we can't fully outsource failure detection and would instead have to morph it into health monitoring, while still keeping it fully functional as failure detection for the default case.

Maybe we don't have a choice, but it feels clumsy to have to run an 'overlay' failure detection protocol on top of another cluster membership protocol like SF or Docker. The problem, I think, stems from the fact that we cannot decide locally whether or not a node is healthy. Nor can we have a simple monitoring service that pings every node. We have to have a sufficient mesh of cross-pinging to ensure we detect network partitions and partial connectivity cases, because we require 100% bidirectional p2p traffic.

I don't know if I'm overcomplicating this, but these are my initial thoughts. We could, of course, do nothing, and just document that on those cluster management systems certain classes of failures wouldn't be detected. But that doesn't strike me as a good long term strategy.

/// <summary>
/// Provides information about an Orleans cluster hosted on Service Fabric.
/// </summary>
internal class FabricServiceSiloResolver : IFabricServiceSiloResolver
Contributor

FabricServiceSiloResolver

Nit: this name confused me initially. But once I figured out what it is responsible for, I couldn't come up with a better name. Maybe FabricServiceMonitor or FabricClusterMonitor? But I'm not convinced they are better.

@ReubenBond
Member Author

ReubenBond commented Jan 5, 2017

Good points, I suppose these are the options:

  • Ignore any issue and convince ourselves that silos are alive as long as {Service Fabric | Docker | whatever} thinks they are.
  • Write ad-hoc solutions for every hosting environment (which is what I implied in my suggestion above)
  • Separate the entire failure detection system from MembershipOracle into a separate system and create an interface for hooking into/handling failures. In the case of MembershipOracle, those failures are handled simply by removing the failing silo from membership. In the case of Service Fabric, those failures also come with some recovery actions (eg, instruct fabric to restart the service).

I like the last one for the long term, but it does come with a complication: the current implementation stores health info in the membership table via MembershipEntry's List<Tuple<SiloAddress, DateTime>> SuspectTimes property, which doesn't yet exist in the case of Service Fabric. I think I can provide that info via Fabric's health events, though.

EDIT: by the way, I see the SiloControl system target has a Ping() method already, so that could be used for pings.

@sergeybykov
Contributor

I think I can provide that info via fabric's health events, though.

Is it possible to provide health events for nodes other than yourself?

@ReubenBond
Member Author

Is it possible to provide health events for nodes other than yourself?
@sergeybykov yes, it is.

@gabikliot
Contributor

gabikliot commented Jan 6, 2017

OK, about pings, health, etc. Here is what I think:

  1. We should not "divorce the ping functionality from the inbuilt MembershipOracle into a separate IPingResponder". The Ping functionality that we have now is an internal implementation detail of our MBR Oracle. We would keep it as is and leave the responding part low in the Networking layer, just like now (don't move responding up to a System Target). I can explain in more detail why I think moving it up is not a good idea.
  2. We can, if we want to, add another silo component, called a LocalSiloWatchdog. In fact, we already have both the local watchdog AND IHealthCheckParticipant.
  3. In a system with an MBR Oracle implementation that allows hooking in local watchdog health, we may want to hook this local watchdog into the health monitoring of that silo, but only locally. The distributed decision should still be outsourced to the implementation of the MBR Oracle.
  4. For SF this means that we keep using SF MBR and do NOT do our own distributed pings (as you guys have optionally suggested); if SF allows the SF service/process to report its own health to SF via some SF API, sure, we can do that. You will need to tune that reporting and make sure you don't report being unhealthy too eagerly.
  5. If we want to do the same for our native MBR, we can do that too. The ping responder that is in the Networking layer now can, upon a ping, check the local silo health and decide, for example, not to respond (or to throw) if it deems this silo unhealthy.

That way you clearly distinguish distributed detection and networking failures from in-process health. A silo is ALWAYS better at determining its own health locally than someone remotely determining its health. Its decision will always be more accurate, since it does not need to deal with delays in those decisions, lost messages, and so on. All the distributed decision-making is about reachability, the network, and agreeing on membership, and it is fully outsourced to the MBR implementation.

I think having both SF MBR AND our pings is not a good idea, for multiple reasons.

Does that make sense? It's basically the architecture I've seen multiple times in the past: sort of a hierarchical decision.
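A small sketch of the idea in point 5 above; the responder type and the CheckHealth signature are simplified assumptions, not the exact Orleans code:

using System;
using System.Collections.Generic;

// Sketch only: the low-level ping responder consults the silo's local health
// checks before acknowledging a ping; types and signatures are simplified assumptions.
internal class HealthAwarePingResponder
{
    private readonly IEnumerable<IHealthCheckParticipant> participants;

    public HealthAwarePingResponder(IEnumerable<IHealthCheckParticipant> participants)
    {
        this.participants = participants;
    }

    public void OnPing(DateTime lastCheckTime)
    {
        foreach (var participant in this.participants)
        {
            // If any local component reports itself unhealthy, refuse to acknowledge
            // the ping, so the remote failure detector counts this as a miss.
            if (!participant.CheckHealth(lastCheckTime))
            {
                throw new InvalidOperationException("Silo deems itself unhealthy; not acknowledging ping.");
            }
        }
        // Otherwise acknowledge as usual (today this is short-circuited in the message receiver).
    }
}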

@gabikliot
Contributor

@ReubenBond, the "SiloControl system target has a Ping() method already, so that could be used for pings" comment was not meant for MBR. It was meant for manual/admin-like tools (Orleans Manager and the like) to check whether a silo is alive, but not for automatic MBR.

@ReubenBond
Member Author

Thanks for the input, @gabikliot.

Silo locally is ALWAYS better in determining its own health than someone remotely determining its health.

The true test of health is whether or not a silo functions correctly when accessed remotely, so that's what needs to be tested. Almost every serious production system will have off-box 'synthetic transactions' ranging from a simple ping to a test of more complex functionality. There are too many issues that can't be detected locally, and we can't trust IHealthCheckParticipant implementations to pick them all up.

So having some way to detect which silos are functional and which are not is beneficial. SF won't check that our service functions correctly, just that it's running and the node is accessible.

My point is not to separate pings out, it's about making Orleans' fault detector available to components other than the membership provider.

Maybe I'm misunderstanding what you're saying.

@sergeybykov
Contributor

@gabikliot

Silo locally is ALWAYS better in determining its own health than someone remotely determining its health.

This is a controversial statement to me. For example, in a network partition a node can be unaware that it's unreachable to others. Yes, cluster management systems are supposed to detect a complete loss of network connectivity. My concern, though, is that such detection does not work up the stack, when failures happen at the messaging layer.

We had a couple of bugs in the past where we ran out of buffers, and hence were failing to send or receive messages. A cluster management system wouldn't detect that. Local health check? Maybe, maybe not.

That's why, as @ReubenBond already mentioned, serious services set up external canary test solutions that periodically perform end-to-end transactions to determine whether the system is up from the external users' perspective, even if it appears 100% healthy from the inside.

I think it is possible to refactor the built-in p2p monitoring system in such a way that it continues to feed death votes into the existing membership oracle as is, but could also provide such votes to an external cluster management system. My concern here is the risk of such a refactoring. It seems like it can be treated as a P2 for the SF case, because SF at least performs p2p node failure detection. I'm less sure about Docker. If it merely relies on process exit to detect a container failure, then we have a problem.
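One hypothetical shape for such a refactoring (names are illustrative; nothing like this exists yet): the p2p monitor keeps feeding votes into the built-in oracle, but the sink becomes pluggable so an external cluster manager (SF, Docker) could receive the same votes.

// Illustrative only; no such interface exists in Orleans today.
internal interface ISiloDeathVoteSink
{
    // Called when the local failure detector suspects a remote silo.
    void ReportSuspicion(SiloAddress suspect, DateTime suspectedAt);

    // Called when the detector wants to declare a remote silo dead.
    void ReportDead(SiloAddress victim, DateTime declaredAt);
}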

@gabikliot
Contributor

We can expose a health check API that the membership implementation will invoke, in addition to doing its own membership. That is fine.
What I would like to avoid is us doing our own pings (or invoking this health check API) and making some kind of decision based on that, AND having someone else's membership protocol on top.

If we use someone else's membership, let them do all the pings, and if they support it, let them call our health check. That way they fully decide when to declare a silo dead and reach global agreement on that.

In our current MBR Oracle we can add those health checks in addition to pings, and then we, as part of our MBR Oracle, will decide how to combine the pings plus health checks into one coherent decision. I gave one option above (the silo throws from the ping responder), but that is just an example. What is important is that ONE protocol makes the decision, potentially based on multiple sources of data (pings + health checks).

My point is not to separate pings out, it's about making Orleans' fault detector available to components other than the membership provider.

@ReubenBond
Member Author

ReubenBond commented Jan 7, 2017

I should have been clearer yesterday, @gabikliot - I was almost asleep when I realized.

What I'm advocating is not 2 systems making decisions on cluster membership, but more like this:

  1. SF provides list of nodes to us
  2. We determine actual health of nodes using votes (must be for a minimum amount of time, eg, 5 minutes of complaints)
  3. We tell SF to kill nodes which are unhealthy
  4. Repeat

So in the most degenerate case, the entire cluster is partitioned approximately in half at the application level (e.g., the firewall config on half the nodes is messed up), and the smaller partition is killed by the other every 5 minutes.

Does it sound sound?

@gabikliot
Contributor

Does that mean that in this case we don't use SF for health pinging, but just for rendezvous (finding the total set of potential nodes)? In such a case we won't really use SF liveness, which I thought was a goal here. Similarly, we could use the standard Azure API to find the total set, but we don't do that either. We could. Both ways could work, but one needs to decide why to do it this way, and make sure it works correctly overall. It depends on your goals. Lots of choices. Maybe you don't want to use SF liveness at all, if it doesn't bring you value.

@ReubenBond
Member Author

SF's health pinging is its own business, and it covers probably 95+% of faults, but it can never tell us if our system is actually live.

I'm saying:

  • SF provides the set of silos (FabricMembershipOracle)
  • We perform health checks at the application level (ClusterHealthMonitor)
  • We tell SF to kill silos which we deem dead (FabricSiloHealthMonitor subscribes to decisions from the above and performs these actions)

i.e., we don't have two systems deciding on membership, just SF. We do, however, have our own system for making decisions about silo health, and we act on those decisions in a way which happens to affect membership.

If I'm still not making myself clear, then perhaps this is best discussed via a call.

@gabikliot
Contributor

In this proposal, are we sending remote health checks (health pings) from one silo to another?

@ReubenBond
Member Author

ReubenBond commented Jan 10, 2017

@gabikliot yes, the silos would monitor each other. We could alternatively have an entirely different SF service which performed the job of monitoring, but bundling them makes deployment easier on end-users.

@gabikliot
Contributor

OK, so that's what I am recommending NOT to do: two distributed monitoring systems, SF doing its own and us doing ours. I recommend using only one. I understand your proposal of layering one on top of the other. It CAN work; the question is whether this is a good/best design. My intuition says it is not.

Notice that we could have done a similar two-level membership with regular Azure, taking the total membership from Azure and doing pings within it. It would have worked, but we decided it adds extra complication without benefit.

So what you are suggesting CAN work, but the question is whether that's the best way to build it. You can try: think through all the edge cases, write a pseudo-protocol, and if you convince yourself it can work, then... no problem, go ahead.

@ReubenBond
Member Author

ReubenBond commented Jan 10, 2017

These 2 distributed monitoring systems cover different things:

  • SF covers SF
  • (I propose) We cover ourselves

SF cannot ensure our liveness and we cannot disable SF's internal monitoring (not that we would want to).

Again, practically every real distributed system has application-level monitoring, including most systems built on Service Fabric. We did this with our services in the Azure Active Directory team at Microsoft: we had synthetic transactions which would exercise various functionality and fire alerts. Another system would monitor those alerts and perform automatic recovery actions (e.g., rebooting the host).

Edit: Sergey opened a proposal in #2580

@gabikliot
Contributor

OK, you can build it this way as well. As I wrote, I would recommend having 1 system to do both: either us doing liveness + health like now, OR SF doing liveness and us reporting health to SF (but NOT doing our own distributed health monitoring).
But it can work the way you described too.

@sergeybykov sergeybykov merged commit a045dd5 into dotnet:master Jan 14, 2017
@ReubenBond ReubenBond deleted the feature-servicefabric-membership branch January 14, 2017 02:20
@gabikliot
Contributor

Congrats @ReubenBond and @sergeybykov !
It is a big milestone.

@sergeybykov
Contributor

@gabikliot Using some of the code you wrote back in 2012. :-)

@gabikliot
Contributor

I am glad it was useful after all. It's always nice to know your work was not a throwaway, even if it took 4 years to come around :-)

Hopefully users will find it useful.

@galvesribeiro
Member

Definitely it will, @gabikliot :)

@AceHack

AceHack commented Nov 2, 2017

Service Fabric allows you to report health about other services to its API. It's not 100% clear what path you guys decided on, but if one silo detects a failure in another, you can report that failure to Service Fabric via its health reporting APIs. They give an example of a watchdog here:
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-health-introduction
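For reference, a minimal sketch of reporting a health event about an Orleans silo service through the System.Fabric client; the service URI and the source/property strings below are placeholders:

using System;
using System.Fabric;
using System.Fabric.Health;

internal static class SiloHealthReporting
{
    // Minimal sketch: a watchdog reports that a silo failed an application-level ping.
    // The service URI and report strings below are placeholders.
    public static void ReportSiloUnresponsive(FabricClient fabricClient)
    {
        var healthInformation = new HealthInformation("OrleansSiloWatchdog", "SiloResponsiveness", HealthState.Warning)
        {
            Description = "Silo failed an application-level ping.",
            TimeToLive = TimeSpan.FromMinutes(5),
            RemoveWhenExpired = true
        };

        fabricClient.HealthManager.ReportHealth(
            new ServiceHealthReport(new Uri("fabric:/MyApp/MySiloService"), healthInformation));
    }
}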

@ReubenBond
Member Author

I'm aware, but currently we're not performing application-level health checks when Service Fabric membership is being used. We created an issue to track that separately: #2580
