Add latencies on graph edges ? #43

Closed
seeker89 opened this issue Jan 7, 2019 · 11 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@seeker89
Contributor

seeker89 commented Jan 7, 2019

I noticed the fun spinoff here https://github.com/vmarchaud/consul-topology-visualizer#inspiration

And in the spirit of cross-pollination I'm wondering - would people find it useful to display the latencies on the edges ? It looks kind of neat, although it will only really be useful for smaller graphs, and it is also redundant with the metrics already exported. It could be a checkbox on the UI.

Thoughts ?

@seeker89 seeker89 added enhancement New feature or request question Further information is requested labels Jan 7, 2019
@jpmondet
Contributor

jpmondet commented Jan 9, 2019

That's a very good idea ! 👍
This could even go further by colorizing the edges depending on the latency.
I don't think it's heavy/redundant since it's not necessarily the same guys that use the UI and the ones that leverage the metrics :-)

@thegedge

thegedge commented Feb 5, 2019

We've been working on some network observability tooling @Shopify, inspired by Microsoft's PingMesh paper. We've come up with this so far:

[Image: latency grid]

  • Each row represents a node sending out a ping, each column a node receiving a ping.
  • The dark cross is a failing node.
  • Green, yellow, and red indicate low, medium, and high round-trip times, respectively. Currently bucketed with fixed thresholds.

We were wondering if this visualization (plus the round trip additions) would be of interest to contribute upstream to resolve this issue?

[EDIT]
I should mention we're actually interested in emitting more than just round trips. We'll probably want to output TLS handshake, connection open, and DNS resolution times.

@seeker89
Contributor Author

seeker89 commented Feb 6, 2019

Hey @thegedge thanks a lot for putting this forward.

TL;DR: We really like this idea, and would absolutely welcome the contribution.

A few thoughts, in no particular order:

  • I like how this representation allows for a more compact view of larger clusters than what can easily be viewed as a graph,
  • there are probably a lot of tweaks that can be made to it, and they could probably be made configurable to suit various use cases. Some ideas that spring to mind:
    • using a continuous scale, instead of bucketing (it would look more like a height map)
    • using grayscale for the whole thing (to allow exporting very small images)
    • wondering if it would be interesting to produce animations that show evolution over time
  • I assume that the image is an artist's impression, otherwise there should be a repeating green on the diagonal ?
  • measuring TLS handshake, connection open, and DNS resolution times would all expand the spectrum of utility of goldpinger, so are a great idea.

So to answer your question: yes please. What can we do to help with that ?

@thegedge

thegedge commented Feb 6, 2019

using a continuous scale

I've discussed this with my team, and it's definitely another possibility. The hard thing about it (EDIT: "hard" here meaning configurability, in case someone wanted bucketing and someone else wanted continuous) is that these are just static files, so the best I think we'll be able to do is perhaps put some JS constants at the top of the file that people could tweak for their own preferences. Another option would be to set up a make target to compile the static files from templates.

We have lots of clusters, so we're actually planning on having a side service to persist/aggregate all of the data, and present a global view.

wondering if it would be interesting to produce animations that show evolution over time

Unfortunately, that would mean persisting the data, or having the JS keep some of it in memory. I'm sure this wouldn't be terribly difficult, but likely outside of the scope of what we'll be doing in goldpinger.

I assume that the image is an artist's impression, otherwise there should be a repeating green on the diagonal ?

Actually, this is live data from one of our own clusters (minus the black cross, which was an artificially introduced failure). I was pretty surprised also that the diagonal wasn't green. FYI green in that image would mean <100 ms round trip.

So to answer your question: yes please. What can we do to help with that ?

We already have this running internally, with our own fork of goldpinger :)

I'll polish it up a bit, and then get some PRs rolling.

@seeker89
Contributor Author

seeker89 commented Feb 6, 2019

I've discussed this with my team, and it's definitely another possibility. The hard thing about it (EDIT: "hard" here meaning configurability, in case someone wanted bucketing and someone else wanted continuous) is that these are just static files, so the best I think we'll be able to do is perhaps put some JS constants at the top of the file that people could tweak for their own preferences. Another option would be to set up a make target to compile the static files from templates.

I'm not sure I understand that bit. I initially thought it was an actual image being produced - do you mean that's dynamically built HTML + CSS ? Or do you mean SVG or equivalent ?

We could probably just have a dropdown at the top bar of the UI, that allows you to pick some options ? Or something along these lines ?

Actually, this is live data from one of our own clusters (minus the black cross, which was an artificially introduced failure). I was pretty surprised also that the diagonal wasn't green. FYI green in that image would mean <100 ms round trip.

This is intriguing. I'd probably assume that something's seriously wrong, if a ping to localhost takes >100ms.

We already have this running internally, with our own fork of goldpinger :)

I'll polish it up a bit, and then get some PRs rolling.

Very sweet, looking forward to taking it for a spin !

@thegedge

thegedge commented Feb 6, 2019

I'm not sure I understand that bit. I initially thought it was an actual image being produced - do you mean that's dynamically built HTML + CSS ? Or do you mean SVG or equivalent ?

Yep, it's an <svg> that gets populated using d3.js, with the data coming from the /check_all endpoint.

We could probably just have a dropdown at the top bar of the UI, that allows you to pick some options ? Or something along these lines ?

I'll just make these settings be JS variables with hardcoded values for the first PR, with (eventually) a follow-up PR that adds in some UI elements to configure them. How does that sound?

This is intriguing. I'd probably assume that something's seriously wrong, if a ping to localhost takes >100ms.

Agreed, and I'm looking into that. I'm thinking the problem could potentially be the use of wall-clock time with so many goroutines running, although I've seen ~100ms timings from something as simple as echo 'test' | nc localhost 8080 within the container. Maybe a combination of scheduling and some general slowness in the server. Definitely needs some more digging.

@seeker89
Contributor Author

Hey @thegedge just checking in how that's going ? Do you need any assistance ?

@thegedge

Sorry for the lack of communication, @seeker89. I've been caught up on some changes for our own internal project, which have been keeping me busy.

Unfortunately this means we are no longer using goldpinger, but I did want to make this visualization available to the project. You can find it here: master...Shopify:add-latency. There's still some work to be done, so I'll hand it off there to someone who would like to take it to the finish line.

@seeker89
Contributor Author

That's a real shame. What did you decide to build instead ?

@thegedge

We ended up rebuilding a stripped down pinger, without all the bells and whistles (no API, no swagger, no static file serving). Now we're focused on a federated Prometheus cluster, with a central dashboard to combine all of this data across clusters in a useful visualization (likely something similar to the screenshot I posted above).

Honestly, the primary reason for us making this move is the ability to move faster. Maintaining a fork with internal, experimental, and public work would be too much friction right now.

One other finding I can share: we had very low CPU requirements set up in k8s, so our round-trip times were way off (a combination of goroutine scheduling + cgroups throttling). A simple change that dramatically improved our timing was to do the pings serially instead of spawning goroutines for all of the pings at once. Eventually I plan on staggering the pings, but for now doing everything in serial mostly results in good timings.

@seeker89
Contributor Author

We ended up rebuilding a stripped down pinger, without all the bells and whistles (no API, no swagger, no static file serving).

I would be curious to know why the bells and whistles were a problem ? They don't really add much of an overhead in any meaningful way ?

Now we're focused on a federated Prometheus cluster, with a central dashboard to combine all of this data across clusters in a useful visualization (likely something similar to the screenshot I posted above).

That's something the community could definitely benefit from. If you keep the prometheus metrics compatible with goldpinger's, maybe we could reuse the same dashboard !

Either way, good luck, and keep rocking!

@seeker89 seeker89 mentioned this issue Feb 21, 2019