determine how to respond to stale silo metric data #4

Open
danvanderboom opened this issue Aug 13, 2016 · 0 comments
@danvanderboom (Owner)

Orleans silos sometimes fail, and they can't continue sending metrics to the ClusterMetricsGrain when, for example, the hardware they run on is down. The metrics grain holds onto the last known metrics snapshot from each silo.

Some metrics, like CPU or RAM usage, are transient and can't be predicted with any precision from moment to moment, so it doesn't make sense to preserve them across silo restarts. Others, like some counters, are useful even when they're stale. We can say with certainty, for example, that a counter of all exceptions across all silos will only ever increase, never decrease. So the last-known set of counter values for a silo that has blown up, melted down, or is restarting remains a valid component of the total counter value.
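To illustrate that point, here's a minimal sketch of how a cluster-wide counter total can safely include the last-known values from silos that are no longer reporting. The `SiloMetricsSnapshot` type and the aggregation helper are assumptions for illustration, not the extension's actual types:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical snapshot type: the last metrics a silo reported, and when.
public class SiloMetricsSnapshot
{
    public string SiloId { get; set; }
    public DateTime LastReported { get; set; }
    public Dictionary<string, long> Counters { get; set; }
}

public static class CounterAggregation
{
    // Because counters are monotonic, a dead or restarting silo's last-known
    // value is still a correct component of the cluster-wide total.
    public static long TotalCounter(IEnumerable<SiloMetricsSnapshot> snapshots, string counterName)
    {
        long total = 0;
        foreach (var snapshot in snapshots)
        {
            long value;
            if (snapshot.Counters != null && snapshot.Counters.TryGetValue(counterName, out value))
                total += value;
        }
        return total;
    }
}
```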

How counters behave on silo restart is another piece to figure out. Currently this extension has no persistence to disk-based storage (and is actively trying to avoid that dependency). One idea is for the MetricsTracker extension, on initialization, to ask the ClusterMetricsGrain for the last known metrics snapshot for that silo and continue counting from there. This is one part of making silo counter caches resilient to individual silo failures.
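A rough sketch of that idea follows, reusing the hypothetical `SiloMetricsSnapshot` type from above. The grain interface and method names are assumptions rather than the extension's current API:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;

// Assumed shape of the cluster metrics grain's interface.
public interface IClusterMetricsGrain : IGrainWithIntegerKey
{
    Task<SiloMetricsSnapshot> GetLastKnownSnapshot(string siloId);
    Task ReportSiloMetrics(SiloMetricsSnapshot snapshot);
}

public class MetricsTrackerBootstrap
{
    // On silo startup, ask the cluster grain for this silo's last reported
    // counters and seed the local counter cache from them.
    public async Task InitializeAsync(IGrainFactory grainFactory, string siloId,
                                      IDictionary<string, long> localCounterCache)
    {
        var clusterMetrics = grainFactory.GetGrain<IClusterMetricsGrain>(0);
        var snapshot = await clusterMetrics.GetLastKnownSnapshot(siloId);

        if (snapshot != null && snapshot.Counters != null)
        {
            foreach (var pair in snapshot.Counters)
                localCounterCache[pair.Key] = pair.Value; // continue counting from the last known value
        }
    }
}
```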

The other part is distributing the singleton ClusterMetricsGrain to several mirror activations on other silos, perhaps using a custom storage provider and some carefully placed local grains, one on each silo. Or maybe using Raft, as in Orleans.Consensus, or somehow piggybacking on top of Orleans's internal consensus mechanism.

It would be good to have some way of indicating, for each counter or metric, whether it makes sense to preserve it across silo restarts, to decide what the default behavior should be for new metrics, and to figure out how to expose those choices in the API.
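One possible shape for exposing that choice is sketched below; the enum and property names are assumptions and not part of MetricsTracker today:

```csharp
// Hypothetical per-metric persistence setting.
public enum MetricPersistence
{
    Transient,              // e.g. CPU or RAM usage: discard on silo restart
    PreserveAcrossRestarts  // e.g. monotonic counters: seed from the last known snapshot
}

public class MetricDefinition
{
    public string Name { get; set; }

    // Default for newly registered metrics; what the default should be is an open question.
    public MetricPersistence Persistence { get; set; } = MetricPersistence.Transient;
}
```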

What should ClusterMetricsGrain do about stale data from one or more silos?

  1. Nothing, satisfied that the grain already provides all the data needed to make these determinations elsewhere.
  2. Detect silo disconnections using configured timeout durations, and generate an event: through a GrainObserver, an event message on a virtual stream, or some other Orleans-compatible mechanism (a rough sketch of this option follows the list).
  3. Analyze the aggregate silo staleness data to detect network partitions and other distributed anomalies.
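
To make option 2 concrete, here is a rough sketch of what staleness detection inside the cluster grain could look like, assuming the Orleans 1.x grain timer and GrainObserver APIs. The class below is not the extension's actual implementation; the timeout value, observer interface, and subscription method are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;

// Observer contract a monitoring client could implement to be told about stale silos.
public interface IMetricsObserver : IGrainObserver
{
    void SiloMetricsStale(string siloId, DateTime lastReported);
}

public class ClusterMetricsGrain : Grain, IClusterMetricsGrain
{
    private readonly Dictionary<string, SiloMetricsSnapshot> latestSnapshots =
        new Dictionary<string, SiloMetricsSnapshot>();
    private readonly List<IMetricsObserver> observers = new List<IMetricsObserver>();

    // Assumed timeout; in practice this would come from configuration.
    private static readonly TimeSpan StaleTimeout = TimeSpan.FromSeconds(30);

    public override Task OnActivateAsync()
    {
        // Periodically scan for silos that haven't reported within the timeout.
        RegisterTimer(CheckForStaleSilos, null, StaleTimeout, StaleTimeout);
        return base.OnActivateAsync();
    }

    // Would also be declared on a grain interface to be callable remotely.
    public Task Subscribe(IMetricsObserver observer)
    {
        if (!observers.Contains(observer))
            observers.Add(observer);
        return Task.CompletedTask;
    }

    public Task ReportSiloMetrics(SiloMetricsSnapshot snapshot)
    {
        snapshot.LastReported = DateTime.UtcNow;
        latestSnapshots[snapshot.SiloId] = snapshot;
        return Task.CompletedTask;
    }

    public Task<SiloMetricsSnapshot> GetLastKnownSnapshot(string siloId)
    {
        SiloMetricsSnapshot snapshot;
        latestSnapshots.TryGetValue(siloId, out snapshot);
        return Task.FromResult(snapshot);
    }

    private Task CheckForStaleSilos(object state)
    {
        var cutoff = DateTime.UtcNow - StaleTimeout;
        foreach (var snapshot in latestSnapshots.Values)
        {
            if (snapshot.LastReported < cutoff)
            {
                // Notify every subscribed observer that this silo's data has gone stale.
                foreach (var observer in observers)
                    observer.SiloMetricsStale(snapshot.SiloId, snapshot.LastReported);
            }
        }
        return Task.CompletedTask;
    }
}
```

The same last-reported timestamps gathered here would also feed option 3, since detecting many silos going stale at once is the raw signal for spotting network partitions and other distributed anomalies.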