Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation for hystrix dashboard support feature (issue #1244) #523

Closed

Conversation

trabetti
Copy link

@trabetti trabetti commented Mar 6, 2018

Documentation for PR: envoyproxy/envoy#2736


.. http:get:: /hystrix_event_stream

Start streaming :ref:`event stream<https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events>` to hystrix dashboard.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/hystrix/Hystrix/

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your comments. Will update everything with new commit, also DCO was missing.. In any case, this depends on envoyproxy/envoy#2736 which currently is blocked by failing checks (taking care of this as well)


The dashboard should set the stream source to this admin endpoint.

Note on usage of Hystrix dashboard to visualize Envoy statistics:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you verify this looks as expected in the doc output? You will need to fix the docs build.


.. http:get:: /hystrix_event_stream

Start streaming :ref:`event stream<https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events>` to hystrix dashboard.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to use external RST link syntax rather than internal here, this is what is breaking docs build.

The dashboard should set the stream source to this admin endpoint.

Note on usage of Hystrix dashboard to visualize Envoy statistics:
Detalied description on Hystrix and its dashboard can be found :ref:\here<https://github.com/Netflix-Skunkworks/hystrix-dashboard>`. Not all the data presented in the Hystrix dashboard is relevant in Envoy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same point on RST link syntax as above here.

* "Bad request" does not exist in Envoy, and therefore is set to '0'.
* Requests rejected as a result of exceeding max number of connections or queue size (in Envoy, referred as "short circuited") are presented as "Rejected", which better suits Hystrix's terminology.
* Latency information is currently not availbale.
* Window size is set to 10sec, with 10 buckets of 1sec each. Currently, the value is not configurable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: prefer " ", e.g. 10 seconds, rather than 10sec.

* Hystrix's "Short circuit" is similar to Envoy's outlier detection, in that it is triggered by error responses. In Envoy the result of high error rate is the host being ejected from the load balancer pool, but it does not cause rejected requests, and therefore we set "Short circuited" to '0', and circuit breaker is set to "Forced Closed".
* "Bad request" does not exist in Envoy, and therefore is set to '0'.
* Requests rejected as a result of exceeding max number of connections or queue size (in Envoy, referred as "short circuited") are presented as "Rejected", which better suits Hystrix's terminology.
* Latency information is currently not availbale.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/availbale/available/. Please spell check.

@htuch htuch self-assigned this Mar 6, 2018
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, couple more comments and we can merge.

error responses. In Envoy the result of high error rate is the host being ejected from the load
balancer pool, but it does not cause rejected requests, and therefore we set "Short circuited"
to '0', and circuit breaker is set to "Forced Closed".
* "Bad request" does not exist in Envoy, and therefore is set to '0'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer avoiding quoting numbers in these docs, so I'd just write set to 0..

Not all the data presented in the Hystrix dashboard is relevant in Envoy.

* Success, Failure, Timeout, Error rate, Number of host, are shown in the dashboard.
* Hystrix's "Short circuit" is similar to Envoy's outlier detection, in that it is triggered by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we do *Short circuit* rather than quoting in Envoy docs to call out distinguished terms, but that's mostly done for headers. I don't feel that strongly, just pointing it out.

Signed-off-by: trabetti <talis@il.ibm.com>
@htuch
Copy link
Member

htuch commented Mar 7, 2018

In offline discussion with @mrice32, once thing that came up is that we should probably think a bit about the security model here. The admin port is not secured today, e.g. it doesn't do auth or serve up via SSL. This is probably not ideal for general operational deployment if this is relied upon for stats serving. It's not ideal to add features to Envoy unless we have a sensible way for users to deploy them in a network topology that can be secured.

This actually reminds me a lot of earlier conversations on how we handled Prometheus, see #210 and #220. CCing some of the folks who were involved there: @mattklein123 @douglas-reid @ramaraochavali

@trabetti
Copy link
Author

trabetti commented Mar 7, 2018

Also note that currently, the hystrix dashboard is only intended to be run in trusted network, or behind authentication and authorization. https://github.com/Netflix-Skunkworks/hystrix-dashboard#security

@htuch
Copy link
Member

htuch commented Mar 7, 2018

@trabetti Interesting. I think there's multiple security concerns here; how do we lock down the Envoy --> Hystrix path and how do we lock down Hystrix --> user, I think that link is talking only about the latter. It's worth just taking a quick poll to see if there are any thoughts from folks CCed on where we want to head with securing the non-data plane incoming ports on Envoy for pull based monitoring. FYI, this hasn't been an issue to date as all stats (other than the manual /stats on admin) is push based, so could make use of outgoing connections secured by normal gRPC SSL.

@mattklein123
Copy link
Member

@htuch I would suggest we defer discussion of admin security until we open the issue on adding admin security (from the security audit). I think so far there is a general assumption that there is no security on the admin endpoint and that it is accessed over a fully trusted network. That may not always be the case but I think that's a substantial design/development effort which we should track separately?

@htuch
Copy link
Member

htuch commented Mar 8, 2018

@mattklein123 I have a concern, that since endpoints are part of a stable API that we don't want to break, we want to be more deliberate about which endpoints we open up to external systems that will depends on them (CC: @wora). We are now seeing a multiplication of systems depending on the admin console, see DataDog/integrations-core#1156 as well.

Do we really want the admin console to be the point of integration for services that access Envoy via a pull model? Or should we have a gRPC serving entity that does mTLS for auth and SSL? The admin console has multiple roles conflated today, e.g. should the same person who can access stats also be able to /quitquitquit? The only access control we have today is to block the port completely.

@trabetti
Copy link
Author

trabetti commented Mar 8, 2018

Should we add an enable to this feature from config file, and set it to disabled as default?

@mattklein123
Copy link
Member

mattklein123 commented Mar 8, 2018

@htuch I think i'm really not that worried about admin security and back-compat because I think it's very straightforward to add-in optional security later by basically promoting the admin listener to a full listener with filters. This would mean it can support mTLS, a route table with inline RBAC filter (when we implement it soon), etc.

We could also eventually allow splitting admin endpoints across multiple listeners/ports, etc.

Given this I really don't think we need to worry about this right now and should just assume that admin is currently not secured.

@htuch
Copy link
Member

htuch commented Mar 8, 2018

@mattklein123 I've opened envoyproxy/envoy#2763 to start the discussion on admin endpoint security futures. Can you post the above points there?

With this in mind, I don't have any objections to merging this and the Hystrix endpoint.

@mattklein123
Copy link
Member

Sure will do.

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can wait for envoyproxy/envoy#2736 to merge before merging this in case there are any additional changes needed.

@mattklein123
Copy link
Member

@trabetti this is not going to merge before we move data-plane-api over to envoy this weekend so I'm going to go ahead and close this. Please add these docs to your Envoy side PR once the merge happens. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants