Orleans silos fail sometimes, and when they do (because of a hardware failure, for example) they stop sending metrics to the ClusterMetricsGrain. The metrics grain holds onto the last known metrics snapshot from each silo.
Some metrics are transient and unpredictable with any precision from moment to moment, like CPU or RAM usage, so it doesn't make sense to preserve them across silo restarts. Others, like some counters, are useful even when they're stale. We can say with certainty, for example, that a counter of all exceptions across all silos will only ever increase, never decrease. So the last-known set of counter values for a silo that's blown up, melted down, or is restarting remains a valid component of the total counter value.
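To make that reasoning concrete, here's a minimal sketch (in Python, purely illustrative; the extension itself is C#/Orleans, and the snapshot shape and counter name are assumptions) showing why a stale counter snapshot from a dead silo can safely be included in the cluster total:

```python
# Counters are monotonically non-decreasing per silo, so the cluster
# total can include the last-known value from a failed silo: that
# silo's exception count can never have gone down.

def cluster_counter_total(snapshots, counter_name):
    """Sum a named counter across per-silo snapshots.

    `snapshots` maps silo id -> {"counters": {...}, "stale": bool}.
    Stale entries (from failed or restarting silos) are included.
    """
    return sum(s["counters"].get(counter_name, 0) for s in snapshots.values())

snapshots = {
    "silo-a": {"counters": {"TotalExceptions": 12}, "stale": False},
    "silo-b": {"counters": {"TotalExceptions": 7}, "stale": True},  # silo down
}
print(cluster_counter_total(snapshots, "TotalExceptions"))  # 19
```

The same logic would not hold for a gauge like CPU usage, which is why the transient/persistent distinction matters.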
How counters behave on silo restart is another piece to figure out. Currently this extension has no persistence to disk-based storage (actively trying to avoid that dependency). One idea is for the MetricsTracker extension, on initialization, to ask the ClusterMetricsGrain for its last known metrics snapshot for that silo, from which it can continue counting. This is one part of making silo counter caches resilient to individual silo failures.
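The restart-recovery idea above can be sketched like this (Python for illustration only; `LocalCounterCache`, `get_last_snapshot`, and the fake store are hypothetical names, not the extension's actual API):

```python
# On initialization, the local tracker seeds its counter cache from the
# cluster aggregator's last-known snapshot for this silo, so counters
# continue from where they left off instead of resetting to zero.

class LocalCounterCache:
    def __init__(self, silo_id, cluster_store):
        # cluster_store stands in for ClusterMetricsGrain; assumed to
        # expose get_last_snapshot(silo_id) -> dict of counter values.
        last = cluster_store.get_last_snapshot(silo_id) or {}
        self.counters = dict(last)  # continue from last-known values

    def increment(self, name, by=1):
        self.counters[name] = self.counters.get(name, 0) + by

class FakeClusterStore:
    """In-memory stand-in for the cluster metrics grain."""
    def __init__(self, snapshots):
        self.snapshots = snapshots
    def get_last_snapshot(self, silo_id):
        return self.snapshots.get(silo_id)

store = FakeClusterStore({"silo-a": {"TotalExceptions": 12}})
cache = LocalCounterCache("silo-a", store)  # silo-a restarting
cache.increment("TotalExceptions")
print(cache.counters["TotalExceptions"])  # 13
```

Note this only works if the aggregator itself survives, which is exactly the problem the next paragraph raises.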
The other part is distributing the singleton ClusterMetricsGrain to several mirror activations on other silos, perhaps using a custom storage provider and some carefully placed local grains, one on each silo. Or maybe using Raft, as Orleans.Consensus does, or somehow piggybacking on top of the internal Orleans consensus mechanism.
It would be good to have some way of indicating for each counter or metric whether it makes sense to preserve across silo restarts, to decide what the default behavior should be for new metrics, and to figure out how to expose those choices in the API.
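One possible shape for that per-metric choice (a hypothetical Python sketch; the names `RestartBehavior` and `MetricDefinition` are made up for illustration) is a declared policy with a sensible default: counters persist across restarts, point-in-time gauges reset.

```python
from enum import Enum

class RestartBehavior(Enum):
    PERSIST = "persist"  # carry last-known value across silo restarts
    RESET = "reset"      # start fresh on each silo start

class MetricDefinition:
    def __init__(self, name, is_counter, restart=None):
        self.name = name
        # Default policy when the caller doesn't specify one:
        # counters persist, gauges reset.
        self.restart = restart or (
            RestartBehavior.PERSIST if is_counter else RestartBehavior.RESET
        )

exceptions = MetricDefinition("TotalExceptions", is_counter=True)
cpu = MetricDefinition("CpuUsage", is_counter=False)
print(exceptions.restart.value, cpu.restart.value)  # persist reset
```

Exposing an explicit override (the `restart` parameter) while defaulting by metric kind keeps the common case zero-configuration.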
What should ClusterMetricsGrain do about stale data from one or more silos?

1. Nothing, on the grounds that the grain already provides all the data needed to make these determinations elsewhere.
2. Detect silo disconnections with configured timeout durations, and generate an event: through a GrainObserver, an event message on a virtual stream, or some other Orleans-compatible mechanism.
3. Analyze the aggregate silo staleness data to detect network partitions and other distributed anomalies.
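The timeout-based detection option could look something like this (illustrative Python; in Orleans the notification would go out through a GrainObserver or a stream rather than a returned list, and all names here are assumptions):

```python
# The grain timestamps each silo's report; a periodic check flags any
# silo whose last report is older than a configured timeout.

from datetime import datetime, timedelta

class StalenessDetector:
    def __init__(self, timeout=timedelta(seconds=30)):
        self.timeout = timeout
        self.last_seen = {}  # silo id -> timestamp of last report

    def report(self, silo_id, now):
        self.last_seen[silo_id] = now

    def stale_silos(self, now):
        # Silos that have missed the timeout window are presumed
        # disconnected; callers can publish events for each of them.
        return [s for s, t in self.last_seen.items() if now - t > self.timeout]

d = StalenessDetector(timeout=timedelta(seconds=30))
t0 = datetime(2016, 1, 1, 12, 0, 0)
d.report("silo-a", t0)
d.report("silo-b", t0 + timedelta(seconds=40))
print(d.stale_silos(t0 + timedelta(seconds=45)))  # ['silo-a']
```

The third option builds on the same data: if many silos go stale at once, that pattern suggests a network partition rather than independent failures.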