-
Notifications
You must be signed in to change notification settings - Fork 350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design(docs/api): control plane metrics monitoring #1981
design(docs/api): control plane metrics monitoring #1981
Conversation
d99b201
to
2fd0dd1
Compare
2fd0dd1
to
9cf7048
Compare
Codecov Report
@@ Coverage Diff @@
## main #1981 +/- ##
==========================================
- Coverage 65.25% 65.22% -0.04%
==========================================
Files 97 97
Lines 14199 14213 +14
==========================================
+ Hits 9266 9270 +4
- Misses 4359 4366 +7
- Partials 574 577 +3
|
// DefaultEnvoyGatewayPrometheus returns a new EnvoyGatewayMetrics with default configuration parameters. | ||
func DefaultEnvoyGatewayPrometheus() *EnvoyGatewayPrometheusProvider { | ||
return &EnvoyGatewayPrometheusProvider{ | ||
// Enable prometheus pull by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we dont enable prom
by default in Envoy Proxy, I think we should be consistent and do the same here, and not enable it by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed in Slack, we should enable prometheus in default, let us Use disable
for it.
a496868
to
58e6504
Compare
| Xds EnvoyProxy Server | 0.0.0.0 | 18000 | Yes | | ||
| Xds RateLimit Server | 0.0.0.0 | 18001 | Yes | | ||
| Admin Server | 127.0.0.1 | 19000 | Yes | | ||
| Metrics Server | 0.0.0.0 | 19001 | Yes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none of these are configurable atm, can we add No
and later move it to Yes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admin Server and Metrics Server can be configured by EnvoyGateway Config.
hey @Xunzhuo can you rm the |
@Xunzhuo this PR mostly LGTM from my end but it also has the design doc which includes the specific metric names as well as the design for the internal metrics lib which is being discussed in #1982. Suggest either only keeping the API in the design doc and adding the metrics lib design later or moving the design doc out of this PR |
58e6504
to
b066956
Compare
Removed from the docs. |
fdb9469
to
c56f83a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks !
a6e3c99
to
0e31051
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, leave some comments.
@@ -0,0 +1,238 @@ | |||
--- | |||
date: 2023-10-10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove, consistent with other documents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we should set date from now now, it sorted docs with date instead of naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The design doc needs to be kept updated, use alphabetical order is fine.
author: Xunzhuo Liu | ||
linkTitle: "Control Plane Observability: Metrics" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove, consistent with other documents
{{% alert title="State" color="warning" %}} | ||
|
||
+ Author: [Xunzhuo Liu](https://github.com/Xunzhuo) | ||
+ Affiliation: Tencent | ||
+ Data: 2023-10-12 | ||
+ Status: Done | ||
{{% /alert %}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove, consistent with other documents
{{% alert title="Note" color="secondary" %}} | ||
**Data plane** observability (while important) is outside of scope for this document. | ||
{{% /alert %}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced with right link, For dataplane, see xxxxxx
## Current State | ||
|
||
At present, the Envoy Gateway control plane provides logs and controller-runtime metrics, without traces. Logs are managed through our proprietary library (`internal/logging`, a shim to `zap`) and are written to `/dev/stdout`. | ||
|
||
The absence of comprehensive and robust control plane metrics observability hinders the effective monitoring of Envoy Gateway in a production environment, a critical requirement before deploying Envoy Gateway into production. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer to removed this paragraph.
|
||
### Standards | ||
|
||
Our metrics, and traces in the future, will be built upon the [OpenTelemetry](https://opentelemetry.io/) standards. All metrics will be configured via the [OpenTelemetry SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/), which offers neutral libraries that can be connected to various backends. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's focus on metrics, remove traces
.
|
||
#### Example | ||
|
||
+ The following is an example to enable prometheus metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prometheus is enable by default
+ ***Deprecated***: a metric that is intended to be phased out. | ||
+ ***Experimental***: a metric that is off by default. | ||
+ ***Alpha***: a metric that is on by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can you categorize these?
Our objectives include: | ||
|
||
+ Supporting **PULL** mode for Prometheus metrics and exposing these metrics on the admin address. | ||
+ Supporting **PUSH** mode for Prometheus metrics, thereby sending metrics to the Open Telemetry Stats sink. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't mention HTTP or gRPC here, it seems that you use HTTP by default, so what about gRPC?
|
||
The OTEL libraries allow for the registration of Providers/Handlers. While we will offer the default ones (PULL via Prometheus, PUSH via OTEL HTTP metrics exporter) mentioned in Envoy Gateway's extensibility, we can easily allow custom builds of Envoy Gateway to plug in alternatives if the default options don't meet their needs. | ||
|
||
For instance, users may prefer to write metrics over the OTLP gRPC metrics exporter instead of the HTTP metrics exporter. This is perfectly acceptable -- and almost impossible to prevent. The OTEL has ways to register their providers/exporters, and Envoy Gateway can ensure its usage is such that it's not overly difficult to swap out a different provider/exporter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this doc spend too much paragraph on what's OTel, how OTel/OTel-collector work?
Signed-off-by: bitliu <bitliu@tencent.com>
Signed-off-by: bitliu <bitliu@tencent.com>
0e31051
to
77ea45f
Compare
please do not use force push, it's hard to track the changes for a large PR. |
Host string `json:"host"` | ||
// Protocol define the sink service protocol. | ||
// +kubebuilder:validation:Enum=grpc;http | ||
Protocol string `json:"protocol"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the default value of port, I suspect default value for this field is grpc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo its fine to start w/o a default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this first
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
design(docs/api): control plane metrics monitoring
Part of: #700