determine how to respond to stale silo metric data #4

Open
danvanderboom opened this issue Aug 13, 2016 · 0 comments
@danvanderboom (Owner)

Orleans silos sometimes fail, and they can't continue sending metrics to the ClusterMetricsGrain when, for example, the hardware they run on is down. The metrics grain holds onto the last known metrics snapshot from each silo.

Some metrics, like CPU or RAM usage, are transient and can't be predicted with any precision from moment to moment, so it doesn't make sense to preserve them across silo restarts. Others, like some counters, are useful even when they're stale. We can say with certainty, for example, that a counter of all exceptions across all silos will only ever increase, never decrease. So the last-known set of counter values for a silo that has blown up, melted down, or is restarting remains a valid component of the total counter value.
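To illustrate that point, here's a minimal sketch of how a cluster-wide counter total can safely include the last-known values from silos that are no longer reporting. The `SiloMetricsSnapshot` type and the aggregation helper are assumptions for illustration, not the extension's actual types:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical snapshot type: the last metrics a silo reported, and when.
public class SiloMetricsSnapshot
{
    public string SiloId { get; set; }
    public DateTime LastReported { get; set; }
    public Dictionary<string, long> Counters { get; set; }
}

public static class CounterAggregation
{
    // Because counters are monotonic, a dead or restarting silo's last-known
    // value is still a correct component of the cluster-wide total.
    public static long TotalCounter(IEnumerable<SiloMetricsSnapshot> snapshots, string counterName)
    {
        long total = 0;
        foreach (var snapshot in snapshots)
        {
            long value;
            if (snapshot.Counters != null && snapshot.Counters.TryGetValue(counterName, out value))
                total += value;
        }
        return total;
    }
}
```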

How counters behave on silo restart is another piece to figure out. Currently this extension has no persistence to disk-based storage (and is actively trying to avoid that dependency). One idea is for the MetricsTracker extension, on initialization, to ask the ClusterMetricsGrain for the last known metrics snapshot for that silo and continue counting from there. This is one part of making silo counter caches resilient to individual silo failures.
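A rough sketch of that idea follows, reusing the hypothetical `SiloMetricsSnapshot` type from above. The grain interface and method names are assumptions rather than the extension's current API:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;

// Assumed shape of the cluster metrics grain's interface.
public interface IClusterMetricsGrain : IGrainWithIntegerKey
{
    Task<SiloMetricsSnapshot> GetLastKnownSnapshot(string siloId);
    Task ReportSiloMetrics(SiloMetricsSnapshot snapshot);
}

public class MetricsTrackerBootstrap
{
    // On silo startup, ask the cluster grain for this silo's last reported
    // counters and seed the local counter cache from them.
    public async Task InitializeAsync(IGrainFactory grainFactory, string siloId,
                                      IDictionary<string, long> localCounterCache)
    {
        var clusterMetrics = grainFactory.GetGrain<IClusterMetricsGrain>(0);
        var snapshot = await clusterMetrics.GetLastKnownSnapshot(siloId);

        if (snapshot != null && snapshot.Counters != null)
        {
            foreach (var pair in snapshot.Counters)
                localCounterCache[pair.Key] = pair.Value; // continue counting from the last known value
        }
    }
}
```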

The other part is distributing the singleton ClusterMetricsGrain to several mirror activations on other silos, perhaps using a custom storage provider and some carefully placed local grains, one on each silo. Or maybe using Raft, as in Orleans.Consensus, or somehow piggybacking on top of Orleans's internal consensus mechanism.

It would be good to have some way of indicating, for each counter or metric, whether it makes sense to preserve it across silo restarts, to decide what the default behavior should be for new metrics, and to figure out how to expose those choices in the API.
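One possible shape for exposing that choice is sketched below; the enum and property names are assumptions and not part of MetricsTracker today:

```csharp
// Hypothetical per-metric persistence setting.
public enum MetricPersistence
{
    Transient,              // e.g. CPU or RAM usage: discard on silo restart
    PreserveAcrossRestarts  // e.g. monotonic counters: seed from the last known snapshot
}

public class MetricDefinition
{
    public string Name { get; set; }

    // Default for newly registered metrics; what the default should be is an open question.
    public MetricPersistence Persistence { get; set; } = MetricPersistence.Transient;
}
```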

What should ClusterMetricsGrain do about stale data from one or more silos?

  1. Nothing, satisfied that the grain already provides all the data needed to make these determinations elsewhere.
  2. Detect silo disconnections using configured timeout durations, and generate an event: through a GrainObserver, an event message on a virtual stream, or some other Orleans-compatible mechanism (a rough sketch of this option follows the list).
  3. Analyze the aggregate silo staleness data to detect network partitions and other distributed anomalies.
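
To make option 2 concrete, here is a rough sketch of what staleness detection inside the cluster grain could look like, assuming the Orleans 1.x grain timer and GrainObserver APIs. The class below is not the extension's actual implementation; the timeout value, observer interface, and subscription method are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;

// Observer contract a monitoring client could implement to be told about stale silos.
public interface IMetricsObserver : IGrainObserver
{
    void SiloMetricsStale(string siloId, DateTime lastReported);
}

public class ClusterMetricsGrain : Grain, IClusterMetricsGrain
{
    private readonly Dictionary<string, SiloMetricsSnapshot> latestSnapshots =
        new Dictionary<string, SiloMetricsSnapshot>();
    private readonly List<IMetricsObserver> observers = new List<IMetricsObserver>();

    // Assumed timeout; in practice this would come from configuration.
    private static readonly TimeSpan StaleTimeout = TimeSpan.FromSeconds(30);

    public override Task OnActivateAsync()
    {
        // Periodically scan for silos that haven't reported within the timeout.
        RegisterTimer(CheckForStaleSilos, null, StaleTimeout, StaleTimeout);
        return base.OnActivateAsync();
    }

    // Would also be declared on a grain interface to be callable remotely.
    public Task Subscribe(IMetricsObserver observer)
    {
        if (!observers.Contains(observer))
            observers.Add(observer);
        return Task.CompletedTask;
    }

    public Task ReportSiloMetrics(SiloMetricsSnapshot snapshot)
    {
        snapshot.LastReported = DateTime.UtcNow;
        latestSnapshots[snapshot.SiloId] = snapshot;
        return Task.CompletedTask;
    }

    public Task<SiloMetricsSnapshot> GetLastKnownSnapshot(string siloId)
    {
        SiloMetricsSnapshot snapshot;
        latestSnapshots.TryGetValue(siloId, out snapshot);
        return Task.FromResult(snapshot);
    }

    private Task CheckForStaleSilos(object state)
    {
        var cutoff = DateTime.UtcNow - StaleTimeout;
        foreach (var snapshot in latestSnapshots.Values)
        {
            if (snapshot.LastReported < cutoff)
            {
                // Notify every subscribed observer that this silo's data has gone stale.
                foreach (var observer in observers)
                    observer.SiloMetricsStale(snapshot.SiloId, snapshot.LastReported);
            }
        }
        return Task.CompletedTask;
    }
}
```

The same last-reported timestamps gathered here would also feed option 3, since detecting many silos going stale at once is the raw signal for spotting network partitions and other distributed anomalies.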