Visualising Orleans #2059
This issue is to capture ideas/requirements for visualising Orleans, with the hope that we can form a strategy to build some cohesive tooling (probably not in the main Orleans repo to begin with).

There are a number of initiatives to make Orleans easier to visualise and its runtime performance easier to monitor. Outside of Orleans, Service Fabric includes a dashboard out of the box, and Netflix also has some impressive visualisations.

What do people want to see? A dashboard, an API, a control panel?

Comments
I'd like to be able to monitor custom metrics defined and reported by Orleans systems. While activations per silo or grain type are a good high-level gut check that the system is running as expected and the partitioning strategy seems to be working, a much more granular and domain-specific approach to metrics is needed for full transparency and deep insight into live systems running at production loads. This could be as simple as tracking business-value events as Key Performance Indicators (KPIs) in production, running performance-monitoring metrics only in debug builds in QA, or using silo interception to track uncaught exceptions and grain method calls for full traceability, enabling traffic-flow visualizations (like Netflix's Vizceral).

The metrics tracking telemetry consumer is UI agnostic and provides no UI of its own. Instead, it can be queried for current metric values at any time, or it can push metrics snapshots out to a virtual stream at any rate (defaulting to once per second). Visualization tools could either subscribe to this metrics stream directly, or an intermediary could pull from the metrics extension and push to the visualization surface.

I like the idea of separating back-end metrics functionality from front-end UI visualization in general, even if the metrics tracker isn't ultimately the best fit in any given scenario. It lets good UI developers focus on making the UI fantastic, and a shared backend means it can be evolved into a mature set of great features which many tool UIs can take advantage of, rather than having to reinvent the backend wheel for each new monitoring, dashboard, or traffic visualization tool.
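For illustration, a minimal sketch of what a visualizer subscribing to such a snapshot stream might look like. The provider name "MetricsStreamProvider", the "MetricsSnapshots" namespace, and the MetricsSnapshot type are all assumptions for this sketch, not part of the tracker's actual API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

// Hypothetical snapshot shape; the real tracker may differ.
[Serializable]
public class MetricsSnapshot
{
    public string SiloName { get; set; }
    public DateTime CapturedUtc { get; set; }
    public Dictionary<string, double> Metrics { get; set; }
}

public static class MetricsStreamListener
{
    // Subscribes a client-side observer to the (assumed) snapshot stream.
    public static async Task<StreamSubscriptionHandle<MetricsSnapshot>> ListenAsync(Guid streamId)
    {
        var provider = GrainClient.GetStreamProvider("MetricsStreamProvider");
        var stream = provider.GetStream<MetricsSnapshot>(streamId, "MetricsSnapshots");

        return await stream.SubscribeAsync((snapshot, token) =>
        {
            // A real visualizer would update its UI model here.
            Console.WriteLine($"{snapshot.SiloName}: {snapshot.Metrics.Count} metrics at {snapshot.CapturedUtc:O}");
            return Task.CompletedTask;
        });
    }
}
```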
Awesome initiatives! As part of #368, I'm working right now on making the counters push data to the telemetry APIs. So basically both internal KPIs (from the counters/statistics framework) and custom domain-specific ones would be pushed to the telemetry API, making those engagements even easier. I would like to have a control panel that visualises the cluster state but at the same time can make changes to the configuration and maybe perform some operations in it, like simple grain calls. But interaction with it, I think, is a second step and should be discussed in another issue. Visualizing it with a nice tool would be perfect.
I concur with @danvanderboom. I would like to ingest events sourced both from the Orleans system and from the runtime. Specific examples are technical data such as performance counter data (also without a third-party component) and Windows Events, business-specific KPIs, and also events derived from these. I would also define the persistence sink, or several of them. I don't know how access control factors into this, but it should be doable.
I should also include a reference to @ReubenBond's console. The dashboard requires temporary storage of metrics (i.e. values aggregated over short periods (around a second) and stored for ~1 minute). It also wants to query metrics in a variety of ways, e.g. slicing by grain/method, silo, etc., and aggregating results (sum, max, min, avg). It strikes me that some kind of database (SQLite?) could fulfil this. It would be nice to have some clean, simple interfaces, and then we could have a pluggable architecture where we can combine different approaches for sampling and collecting data, storing data, and displaying data (something like the sketch below).
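One possible shape for those pluggable interfaces, purely as an illustration. IMetricsSampler, IMetricsStore, MetricSample, and MetricAggregation are invented names, not anything that exists in Orleans today:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public enum MetricAggregation { Sum, Max, Min, Avg }

public class MetricSample
{
    public string MetricName { get; set; }
    public string Silo { get; set; }
    public string GrainType { get; set; }
    public string Method { get; set; }
    public double Value { get; set; }
    public DateTime TimestampUtc { get; set; }
}

// Produces raw samples (e.g. from silo statistics or grain call interception).
public interface IMetricsSampler
{
    Task<IReadOnlyList<MetricSample>> SampleAsync();
}

// Short-lived storage for aggregated values (~1 second buckets, kept ~1 minute).
public interface IMetricsStore
{
    Task AppendAsync(IEnumerable<MetricSample> samples);

    Task<double> QueryAsync(
        string metricName,
        MetricAggregation aggregation,
        TimeSpan window,
        string sliceBy = null); // e.g. "grain", "method", "silo"
}
```

A display layer (dashboard, console, visualizer) would then depend only on IMetricsStore, letting sampling, storage, and UI evolve independently.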
@richorama do you think we should add this interface on the telemetry APIs? I'm about to make changes to them soon, so shoot anything over so I can put it all together.
@galvesribeiro I don't know enough about the telemetry API to make the call.
I mean, some people once asked me for a way to "cache" the values and batch-push them in a timed loop. So what you are suggesting with SQLite is to make it resilient to silo failures, so that metrics don't get lost if the silo crashes, right?
@richorama About "some kind of database": I suppose that means an interface for it as a sink and another one for it as a source. I see an in-memory cache and durable storage as two sinks and/or sources which I'd like to define at my leisure. Edit: And for these sinks and sources, I think we should take cues from other new developments in the .NET world and in Orleans, specifically the streaming initiatives.
Ok, now I understand :)
I consider some reliability factors of metrics caching in danvanderboom/Orleans.TelemetryConsumers.MetricsTracker#4, including an idea for recovering silo metric data on restart, so long as the cluster itself is up and the ClusterMetricsGrain hasn't disappeared with the failed silo. In that issue, I talk about how some metrics don't have any meaning across silo restarts (like current CPU or memory utilization), while some metrics and counters (like the total number of exceptions across the cluster, or dollars of sales accumulated for the day) do have meaning and should be persisted across restarts. So it would be nice to specify, for each metric, whether its meaning spans individual silo restarts, to differentiate it from metrics which can be safely discarded and restarted from zero each time. A sketch of what such a per-metric declaration could look like follows.
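Purely as an illustration of that idea, a hypothetical per-metric lifetime declaration. MetricLifetime, MetricDefinition, and MetricCatalog are invented names, not part of the MetricsTracker API:

```csharp
using System.Collections.Generic;

// Whether a metric's value is meaningful across silo restarts.
public enum MetricLifetime
{
    Transient,  // e.g. current CPU or memory utilization; discard on restart
    Persistent  // e.g. total exceptions, daily sales; recover on restart
}

public class MetricDefinition
{
    public string Name { get; set; }
    public MetricLifetime Lifetime { get; set; }
}

public static class MetricCatalog
{
    public static readonly IReadOnlyList<MetricDefinition> Definitions = new[]
    {
        new MetricDefinition { Name = "CpuUtilization",  Lifetime = MetricLifetime.Transient },
        new MetricDefinition { Name = "TotalExceptions", Lifetime = MetricLifetime.Persistent },
        new MetricDefinition { Name = "DailySalesUsd",   Lifetime = MetricLifetime.Persistent }
    };
}
```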
Hmm, about "dollars of sales accumulated for the day": to be more exact than in my previous comment, I'd consider that an application-specific KPI which I would likely accumulate and display some other way. The KPIs I was thinking of were more about technical SLAs. Though I would assume that if someone wanted the system to show such metrics, it should be doable. That being written, it looks like we have something concrete to grasp on. :)
Metrics APIs: TrackMetric, IncrementMetric, and DecrementMetric make a good pattern. They work well for tracking double values. I'd love to see these added:
Though I don't have a strong opinion on whether these are added to IMetricsTelemetryConsumer, or whether they get split into different interfaces, etc.
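For context, a minimal sketch of a custom consumer following the TrackMetric/IncrementMetric/DecrementMetric pattern discussed above. This intentionally does not implement the actual IMetricsTelemetryConsumer interface; signatures are abbreviated and assumed (the real interface also has optional properties and TimeSpan overloads, so check your Orleans version):

```csharp
using System.Collections.Concurrent;

// A toy consumer that keeps current double-valued metrics in memory.
public class InMemoryMetricsConsumer
{
    private readonly ConcurrentDictionary<string, double> values =
        new ConcurrentDictionary<string, double>();

    public void TrackMetric(string name, double value)
    {
        values[name] = value;
    }

    public void IncrementMetric(string name, double value = 1.0)
    {
        values.AddOrUpdate(name, value, (_, current) => current + value);
    }

    public void DecrementMetric(string name, double value = 1.0)
    {
        IncrementMetric(name, -value);
    }

    public double GetCurrentValue(string name)
    {
        double v;
        return values.TryGetValue(name, out v) ? v : 0.0;
    }
}
```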
@veikkoeeva I don't actually have any business-value KPI needs that I'm planning to address with my recent work on metrics tracking. Those scenarios might be better handled within grain logic, where grain persistence (or other persistence mechanisms) is in place, rather than in something built as a cross-cutting concern to the Orleans system's domain and primary focus. (I'm brainstorming a little as I write here. :)
Let me get back to the telemetry API and I'll hit you again with it.
We'll need to be careful about persisting any metrics to disk. If we're persisting every metric update and every counter increment, it will be difficult not to slow the system down. At least in some high-frequency metrics capturing scenarios, if not everywhere, it might be preferable to "compress" a bunch of counter increment/decrement calls into net deltas before flushing them to disk, so long as they occur very close together.

The question is: are we interested in preserving individual deltas and updates, responding to each with durable storage activity? Or are we interested more in occasional snapshots, even if by "occasional" we mean several or many times per second? What level of "durability" is needed for tracking silo- and cluster-level metrics? So long as the cluster survives, reliability mechanisms like Raft have been proposed in Orleans Gitter chats to quickly recover metrics from silo failures and restarts. Some metrics will need to be more durably and reliably tracked than others. A sketch of the delta-compression idea follows.
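To make the "compress into net deltas" idea concrete, a minimal sketch. All names are invented, and the flush callback is a stand-in for whatever durable store is chosen:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Accumulates increment/decrement calls in memory and periodically
// flushes only the net delta per counter, instead of every update.
public class DeltaCompressingCounterBuffer : IDisposable
{
    private readonly ConcurrentDictionary<string, long> deltas =
        new ConcurrentDictionary<string, long>();
    private readonly Timer flushTimer;
    private readonly Action<IReadOnlyDictionary<string, long>> flushToStorage;

    public DeltaCompressingCounterBuffer(
        Action<IReadOnlyDictionary<string, long>> flushToStorage,
        TimeSpan flushInterval)
    {
        this.flushToStorage = flushToStorage;
        flushTimer = new Timer(_ => Flush(), null, flushInterval, flushInterval);
    }

    public void Increment(string counter, long amount = 1)
    {
        deltas.AddOrUpdate(counter, amount, (_, current) => current + amount);
    }

    public void Decrement(string counter, long amount = 1)
    {
        Increment(counter, -amount);
    }

    private void Flush()
    {
        var batch = new Dictionary<string, long>();
        foreach (var key in deltas.Keys)
        {
            long value;
            if (deltas.TryRemove(key, out value) && value != 0)
                batch[key] = value;
        }
        if (batch.Count > 0)
            flushToStorage(batch); // one storage write per interval, not per update
    }

    public void Dispose()
    {
        flushTimer.Dispose();
    }
}
```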
@danvanderboom that is what I'm trying to say... It's a judgment call that a custom telemetry consumer must make when you are implementing it. I don't think your extension should care about it. If someone cares about persisting it, they can create another consumer that just stores it. Remember that telemetry ingestion is multicast (you can have multiple consumers for the same kind of telemetry).
@galvesribeiro Agreed. I'm doing that myself, having implemented a second telemetry consumer to route logging calls to a logging service and database.
With talk of building an OrleansHostManager which could host multiple Orleans silos on a single machine (and manage rollouts/rollbacks and health monitoring), we're probably talking about another level of aggregation for statistics/metrics. We'll want to see how a machine is performing, not just a silo, if multiple silos can be hosted on one machine. It wouldn't make sense to report CPU and memory utilization per silo, for example, if there are several silos running on each node. This was only one suggested approach for how Orleans might adopt some Service Fabric-like features, and may not be the best alternative. But the question of how this could affect the collection and aggregation of metrics is valid regardless of the specific approach, and it'd be nice to see that coming instead of being bitten by it down the road.
One concern I have: it looks like there can be only one silo interceptor set at a time, is that right? If multiple extensions or visualizers try to intercept all grain method calls via silo interceptors, I worry about them stepping on each other, with the last interceptor to be registered taking control. Can silo interceptors be changed to allow multiple to be registered, similar to the way multiple telemetry consumers can be registered? What are the "rules of the road" for setting these interceptors, especially from Orleans extensions?
The goals for Orniscient are primarily along the lines of visualising virtual actors specifically. We see it as both a training tool and something that could be used in production to navigate the grain "metaverse". We view virtual actors as self-contained nano-services which in theory should be relatively safe to invoke methods on ad hoc, so the ability to find a grain and invoke a method on it with parameters is something our ops team can benefit from.
Also, we'd like to provide an example project running in the cloud with a test cluster, so users unfamiliar with Orleans, or even with actors, can understand the programming model visually on a website.
@danvanderboom when adding a silo interceptor, you should call any previously set interceptor.
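For illustration, a sketch of that chaining pattern. This assumes the provider runtime exposes GetInvokeInterceptor/SetInvokeInterceptor with the delegate shape used in Orleans 1.x interceptors; verify the exact signatures against your Orleans release:

```csharp
using System.Reflection;
using System.Threading.Tasks;
using Orleans;
using Orleans.CodeGeneration;
using Orleans.Providers;

public static class InterceptorChaining
{
    // Registers a new interceptor that delegates to whichever interceptor
    // was previously set, so extensions compose instead of silently
    // replacing each other.
    public static void Register(IProviderRuntime runtime)
    {
        var previous = runtime.GetInvokeInterceptor();

        runtime.SetInvokeInterceptor(async (method, request, grain, invoker) =>
        {
            // Do this extension's work here, e.g. record the call
            // for metrics or traffic visualization.

            if (previous != null)
                return await previous(method, request, grain, invoker);

            return await invoker.Invoke(grain, request);
        });
    }
}
```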
Adding a link to the meetup where visualization solutions were shown - https://github.com/OrleansContrib/meetups#meetup-11-a-monitoring-and-visualisation-show-with-richard-astbury-dan-vanderboom-and-roger-Creyke.
We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone. |