Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design(docs/api): control plane metrics monitoring #1981

Merged
merged 4 commits into from
Oct 24, 2023

Conversation

Xunzhuo
Copy link
Member

@Xunzhuo Xunzhuo commented Oct 17, 2023

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

design(docs/api): control plane metrics monitoring

Part of: #700

@Xunzhuo Xunzhuo requested a review from a team as a code owner October 17, 2023 08:31
@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics-docs-with-api branch from d99b201 to 2fd0dd1 Compare October 17, 2023 08:32
@Xunzhuo Xunzhuo added the area/observability Observability related issues label Oct 17, 2023
@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics-docs-with-api branch from 2fd0dd1 to 9cf7048 Compare October 17, 2023 08:46
@codecov
Copy link

codecov bot commented Oct 17, 2023

Codecov Report

Merging #1981 (57a2ffa) into main (a858547) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1981      +/-   ##
==========================================
- Coverage   65.25%   65.22%   -0.04%     
==========================================
  Files          97       97              
  Lines       14199    14213      +14     
==========================================
+ Hits         9266     9270       +4     
- Misses       4359     4366       +7     
- Partials      574      577       +3     
Files Coverage Δ
api/v1alpha1/validation/envoygateway_validate.go 86.66% <100.00%> (+1.76%) ⬆️
api/v1alpha1/validation/envoyproxy_validate.go 75.00% <100.00%> (+1.14%) ⬆️

... and 2 files with indirect coverage changes

// DefaultEnvoyGatewayPrometheus returns a new EnvoyGatewayMetrics with default configuration parameters.
func DefaultEnvoyGatewayPrometheus() *EnvoyGatewayPrometheusProvider {
return &EnvoyGatewayPrometheusProvider{
// Enable prometheus pull by default.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont enable prom by default in Envoy Proxy, I think we should be consistent and do the same here, and not enable it by default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed in Slack, we should enable prometheus in default, let us Use disable for it.

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics-docs-with-api branch 2 times, most recently from a496868 to 58e6504 Compare October 18, 2023 08:57
| Xds EnvoyProxy Server | 0.0.0.0 | 18000 | Yes |
| Xds RateLimit Server | 0.0.0.0 | 18001 | Yes |
| Admin Server | 127.0.0.1 | 19000 | Yes |
| Metrics Server | 0.0.0.0 | 19001 | Yes |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none of these are configurable atm, can we add No and later move it to Yes ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admin Server and Metrics Server can be configured by EnvoyGateway Config.

@arkodg
Copy link
Contributor

arkodg commented Oct 18, 2023

hey @Xunzhuo can you rm the png from the diff, thanks

@arkodg
Copy link
Contributor

arkodg commented Oct 18, 2023

@Xunzhuo this PR mostly LGTM from my end but it also has the design doc which includes the specific metric names as well as the design for the internal metrics lib which is being discussed in #1982. Suggest either only keeping the API in the design doc and adding the metrics lib design later or moving the design doc out of this PR

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics-docs-with-api branch from 58e6504 to b066956 Compare October 20, 2023 08:01
@Xunzhuo
Copy link
Member Author

Xunzhuo commented Oct 20, 2023

@Xunzhuo this PR mostly LGTM from my end but it also has the design doc which includes the specific metric names as well as the design for the internal metrics lib which is being discussed in #1982. Suggest either only keeping the API in the design doc and adding the metrics lib design later or moving the design doc out of this PR

Removed from the docs.

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics-docs-with-api branch 5 times, most recently from fdb9469 to c56f83a Compare October 20, 2023 23:57
arkodg
arkodg previously approved these changes Oct 21, 2023
Copy link
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks !

@arkodg arkodg requested review from zirain, a team, kflynn, Alice-Lilith and chauhanshubham and removed request for a team October 21, 2023 00:28
@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics-docs-with-api branch 2 times, most recently from a6e3c99 to 0e31051 Compare October 23, 2023 06:59
@Xunzhuo Xunzhuo requested a review from arkodg October 23, 2023 07:19
Copy link
Contributor

@zirain zirain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, leave some comments.

@@ -0,0 +1,238 @@
---
date: 2023-10-10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove, consistent with other documents

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we should set date from now now, it sorted docs with date instead of naming.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design doc needs to be kept updated, use alphabetical order is fine.

Comment on lines 4 to 5
author: Xunzhuo Liu
linkTitle: "Control Plane Observability: Metrics"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove, consistent with other documents

Comment on lines 8 to 12
{{% alert title="State" color="warning" %}}

+ Author: [Xunzhuo Liu](https://github.com/Xunzhuo)
+ Affiliation: Tencent
+ Data: 2023-10-12
+ Status: Done
{{% /alert %}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove, consistent with other documents

{{% alert title="Note" color="secondary" %}}
**Data plane** observability (while important) is outside of scope for this document.
{{% /alert %}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced with right link, For dataplane, see xxxxxx

Comment on lines 22 to 26
## Current State

At present, the Envoy Gateway control plane provides logs and controller-runtime metrics, without traces. Logs are managed through our proprietary library (`internal/logging`, a shim to `zap`) and are written to `/dev/stdout`.

The absence of comprehensive and robust control plane metrics observability hinders the effective monitoring of Envoy Gateway in a production environment, a critical requirement before deploying Envoy Gateway into production.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to removed this paragraph.


### Standards

Our metrics, and traces in the future, will be built upon the [OpenTelemetry](https://opentelemetry.io/) standards. All metrics will be configured via the [OpenTelemetry SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/), which offers neutral libraries that can be connected to various backends.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's focus on metrics, remove traces.


#### Example

+ The following is an example to enable prometheus metric.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prometheus is enable by default

Comment on lines +104 to +101
+ ***Deprecated***: a metric that is intended to be phased out.
+ ***Experimental***: a metric that is off by default.
+ ***Alpha***: a metric that is on by default.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can you categorize these?

Our objectives include:

+ Supporting **PULL** mode for Prometheus metrics and exposing these metrics on the admin address.
+ Supporting **PUSH** mode for Prometheus metrics, thereby sending metrics to the Open Telemetry Stats sink.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't mention HTTP or gRPC here, it seems that you use HTTP by default, so what about gRPC?


The OTEL libraries allow for the registration of Providers/Handlers. While we will offer the default ones (PULL via Prometheus, PUSH via OTEL HTTP metrics exporter) mentioned in Envoy Gateway's extensibility, we can easily allow custom builds of Envoy Gateway to plug in alternatives if the default options don't meet their needs.

For instance, users may prefer to write metrics over the OTLP gRPC metrics exporter instead of the HTTP metrics exporter. This is perfectly acceptable -- and almost impossible to prevent. The OTEL has ways to register their providers/exporters, and Envoy Gateway can ensure its usage is such that it's not overly difficult to swap out a different provider/exporter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this doc spend too much paragraph on what's OTel, how OTel/OTel-collector work?

Signed-off-by: bitliu <bitliu@tencent.com>
Signed-off-by: bitliu <bitliu@tencent.com>
Signed-off-by: bitliu <bitliu@tencent.com>
@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics-docs-with-api branch from 0e31051 to 77ea45f Compare October 24, 2023 07:10
Signed-off-by: bitliu <bitliu@tencent.com>
@Xunzhuo Xunzhuo requested a review from zirain October 24, 2023 07:39
@zirain
Copy link
Contributor

zirain commented Oct 24, 2023

please do not use force push, it's hard to track the changes for a large PR.

Host string `json:"host"`
// Protocol define the sink service protocol.
// +kubebuilder:validation:Enum=grpc;http
Protocol string `json:"protocol"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the default value of port, I suspect default value for this field is grpc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo its fine to start w/o a default

Copy link
Contributor

@zirain zirain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this first

@arkodg arkodg merged commit 78f2d4a into envoyproxy:main Oct 24, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/observability Observability related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants