Add latencies on graph edges ? #43
That's a very good idea ! 👍
We've been working on some network observability tooling @Shopify, inspired by Microsoft's PingMesh paper. We've come up with this so far:
We were wondering if this visualization (plus the round trip additions) would be of interest to contribute upstream to resolve this issue? [EDIT]
Hey @thegedge thanks a lot for putting this forward. TL;DR: We really like this idea, and would absolutely welcome the contribution. A few thoughts, in no particular order:
So to answer your question: yes please. What can we do to help with that ?
I've discussed this with my team, and it's definitely another possibility. The hard thing about it (EDIT: "hard" here meaning configurability, in case someone wanted bucketing and someone else wanted continuous) is that these are just static files, so the best I think we'll be able to do is perhaps put some JS constants at the top of the file that people could tweak for their own preferences. Another option would be to set up a

We have lots of clusters, so we're actually planning on having a side service to persist/aggregate all of the data, and present a global view.
Unfortunately, that would mean persisting the data, or having the JS keep some of it in memory. I'm sure this wouldn't be terribly difficult, but likely outside of the scope of what we'll be doing in goldpinger.
Actually, this is live data from one of our own clusters (minus the black cross, which was an artificially introduced failure). I was pretty surprised also that the diagonal wasn't green. FYI green in that image would mean <100 ms round trip.
We already have this running internally, with our own fork of goldpinger :) I'll polish it up a bit, and then get some PRs rolling. |
I'm not sure I understand that bit. I initially thought it was an actual image being produced - do you mean that's dynamically built HTML + CSS ? Or do you mean SVG or equivalent ? We could probably just have a dropdown at the top bar of the UI, that allows you to pick some options ? Or something along these lines ?
This is intriguing. I'd probably assume that something's seriously wrong, if a ping to localhost takes >100ms.
Very sweet, looking forward to taking it for a spin !
Yep, it's an
I'll just make these settings be JS variables with hardcoded values for the first PR, with (eventually) a follow-up PR that adds in some UI elements to configure them. How does that sound?
Agreed, and I'm looking into that. I'm thinking the problem could potentially be the use of wall-clock time with so many goroutines running, although I've seen ~100ms timings from something as simple as
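One thing worth noting on the wall-clock theory: since Go 1.9, `time.Now` carries a monotonic clock reading and `time.Since` uses it, so wall-clock adjustments shouldn't skew these timings. What *can* inflate them is scheduler delay: if the timer starts before the goroutine doing the work actually gets scheduled, that wait is billed to the "ping". A minimal sketch of measuring that gap (`schedulingDelay` is a hypothetical helper, not goldpinger code):

```go
package main

import (
	"fmt"
	"time"
)

// schedulingDelay starts a timer, spawns a goroutine, and reports how
// long the goroutine waited before it actually ran. With many
// goroutines competing under tight CPU limits, this gap can reach tens
// of milliseconds and gets counted as part of the measured round trip.
func schedulingDelay() time.Duration {
	start := time.Now() // monotonic since Go 1.9; wall-clock jumps can't skew it
	done := make(chan time.Duration, 1)
	go func() {
		done <- time.Since(start)
	}()
	return <-done
}

func main() {
	fmt.Println("scheduling delay:", schedulingDelay())
}
```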
Hey @thegedge just checking in how that's going ? Do you need any assistance ?
Sorry for the lack of communication, @seeker89. I've been caught up in some changes for our own internal project, which have been keeping me busy. Unfortunately this means we are no longer using goldpinger, but I did want to make this visualization available to the project. You can find it here: master...Shopify:add-latency. There's still some work to be done, so I'll hand it off to someone who would like to take it to the finish line.
That's a real shame. What did you decide to build instead ?
We ended up rebuilding a stripped-down pinger, without all the bells and whistles (no API, no swagger, no static file serving). Now we're focused on a federated Prometheus cluster, with a central dashboard to combine all of this data across clusters in a useful visualization (likely something similar to the screenshot I posted above).

Honestly, the primary reason for us making this move is the ability to move faster. Maintaining a fork with internal, experimental, and public work would be too much friction right now.

One other finding I can share: we had very low CPU requirements set up in k8s, so our round-trip times were way off (a combination of goroutine scheduling + cgroups throttling). A simple change that dramatically improved our timing was to do the pings serially instead of spawning goroutines for all of the pings at once. Eventually I plan on staggering the pings, but for now doing everything in serial mostly results in good timings.
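The serial-instead-of-concurrent change described above can be sketched as below. `pingOnce`, `pingSerial`, and the target names are hypothetical stand-ins, not goldpinger's actual API; the point is that pinging one peer at a time keeps the other pings' CPU contention out of each measurement, and a sleep between calls gives the eventual "staggering":

```go
package main

import (
	"fmt"
	"time"
)

// pingOnce is a hypothetical stand-in for a single check against a
// peer; here it just simulates a short round trip.
func pingOnce(target string) time.Duration {
	start := time.Now()
	time.Sleep(time.Millisecond) // stand-in for the real network call
	return time.Since(start)
}

// pingSerial measures peers one at a time, optionally sleeping between
// calls. With N goroutines racing for a throttled CPU, each goroutine's
// elapsed time includes waiting on the others; going serial removes
// that contention from the measurement.
func pingSerial(targets []string, stagger time.Duration) map[string]time.Duration {
	results := make(map[string]time.Duration, len(targets))
	for _, t := range targets {
		results[t] = pingOnce(t)
		time.Sleep(stagger)
	}
	return results
}

func main() {
	for t, rtt := range pingSerial([]string{"pod-a", "pod-b"}, 10*time.Millisecond) {
		fmt.Printf("%s: %v\n", t, rtt)
	}
}
```

The trade-off is a longer overall sweep time across all peers, which is usually fine for a health heatmap that refreshes periodically.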
I would be curious to know why the bells and whistles were a problem ? They don't really add much overhead in any meaningful way ?
That's something the community could definitely benefit from. If you keep the Prometheus metrics compatible with

Either way, good luck, and keep rocking!
I noticed the fun spinoff here https://github.com/vmarchaud/consul-topology-visualizer#inspiration
And in the spirit of cross-pollination I'm wondering - would people find it useful to display the latencies on the edges ? It looks kind of neat, although it will only really be useful for smaller graphs, and is also redundant with the metrics already exported. It could be a checkbox on the UI.
Thoughts ?