
proposal to new tags/labels natively metrics system #44433

Draft
wbpcode wants to merge 1 commit into envoyproxy:main from wbpcode:dev-new-tags-friendly-stats-system-3

@wbpcode wbpcode commented Apr 14, 2026

Background

See also #20289 for the broader discussion.

Envoy's current stats model starts from a flattened string name such as
cluster.foo.upstream_rq_total, and later tries to recover structure from that string by running
tag extraction rules. That model gave good compatibility with StatsD and workable Prometheus output,
but it has two fundamental problems:

  • It is fragile when a resource value itself contains a dot (.). A cluster name such as payments.v1
    should be one logical value, but it is embedded in a dotted stat path and must later be recovered
    heuristically.
  • It is wasteful. The same tag extraction rules are evaluated repeatedly across many metrics that
    share the same structure.
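
To make the first problem concrete, here is a deliberately naive sketch of token-based extraction. It is not Envoy's real tag extractor (which uses configurable regex and token rules); splitOnDots and naiveClusterName are hypothetical helpers written only for this illustration.

```cpp
#include <optional>
#include <string>
#include <vector>

// Split a flattened stat name on '.' into tokens.
std::vector<std::string> splitOnDots(const std::string& name) {
  std::vector<std::string> tokens;
  std::string::size_type start = 0;
  while (true) {
    const auto dot = name.find('.', start);
    if (dot == std::string::npos) {
      tokens.push_back(name.substr(start));
      return tokens;
    }
    tokens.push_back(name.substr(start, dot - start));
    start = dot + 1;
  }
}

// Hypothetical rule: in "cluster.<name>.<stat>", treat the second token as
// the cluster name. This is the kind of heuristic that breaks on dotted values.
std::optional<std::string> naiveClusterName(const std::string& flat_name) {
  const auto tokens = splitOnDots(flat_name);
  if (tokens.size() < 3 || tokens[0] != "cluster") {
    return std::nullopt;
  }
  return tokens[1];
}
```

With this rule, cluster.foo.upstream_rq_total yields foo, but cluster.payments.v1.upstream_rq_total yields payments rather than payments.v1, because the flattened name no longer records where the cluster name ends.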

We already have *WithTags APIs, but they are additive. They do not solve the main problem for the
bulk of Envoy stats, which are still created from flattened names.

Goal

  • Stop depending on regex or token-based tag extraction as the primary source of metric structure.
  • Keep the same flattened stat name for StatsD and other backends that rely on it.
  • Keep compatibility with today's default tag behavior, but derive tags directly at metric creation
    time instead of recovering them later.

Non-goal

  • This proposal is not about store internals, cache layout, or migration mechanics.
  • This proposal does not require changing every existing call site at once.

Core idea

Instead of creating a stat from one opaque string and then reverse-engineering meaning from that
string, we create a stat from a structured sequence of elements.

Each metric is built from one source of truth and yields three related outputs:

| Output | Purpose |
| --- | --- |
| Full name | Legacy flattened name, preserved for StatsD and compatibility. |
| Canonical name | The name with tagged resource values removed, used as the stable metric family identity. |
| Explicit tags | The structured labels attached at creation time. |

The key change is that the canonical name and tags are produced directly from the structured input,
not inferred later from the full name.

StatElement

StatElement is the basic building block for the new API.

template <class T> class StatElementBase {
public:
  T value_{};
  T name_{};
  bool ignore_name_{false};
};

It can be read as:

| Field | Meaning |
| --- | --- |
| value_ | The path token, or the tag value if this element is a tag. |
| name_ | Optional tag name. If empty, this element is a normal path element. |
| ignore_name_ | If true, keep the tag semantically, but do not emit the tag key into the legacy flattened name. |

This gives three useful forms:

  1. Plain path element

    {.value_ = "upstream_rq_total"}

    This contributes to both the full name and the canonical name.

  2. Named tag

    {.value_ = route_name, .name_ = well_known.route_}

    This contributes a tag route=<route_name>, and also contributes route.<route_name> to the
    legacy full name.

  3. Compatibility tag

    {.value_ = cluster_name, .name_ = well_known.cluster_name_, .ignore_name_ = true}

    This contributes a tag cluster_name=<cluster_name>, but the legacy name stays
    cluster.<cluster_name>... instead of becoming cluster.cluster_name.<cluster_name>....

That last form is what lets us preserve existing flattened names while still making the tag explicit.
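
The three forms can be folded into one derivation pass. The following is a minimal sketch, not the PR's implementation: Element is a std::string stand-in for StatElementBase<T>, and Derived, appendToken, and derive are hypothetical names. It also assumes that tag elements contribute nothing to the canonical name, which the description implies for the compatibility form but does not spell out for named tags.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for StatElementBase<T>, using std::string (assumption).
struct Element {
  std::string value;         // path token, or tag value if name is set
  std::string name;          // empty means "plain path element"
  bool ignore_name = false;  // keep the tag, but omit its key from the full name
};

struct Derived {
  std::string full_name;       // legacy flattened name
  std::string canonical_name;  // name with tagged values removed
  std::vector<std::pair<std::string, std::string>> tags;
};

// Append a token to a dotted name, inserting '.' between tokens.
void appendToken(std::string& out, const std::string& token) {
  if (!out.empty()) {
    out += '.';
  }
  out += token;
}

// Derive all three outputs in one pass over the structured elements.
Derived derive(const std::vector<Element>& elements) {
  Derived d;
  for (const auto& e : elements) {
    if (e.name.empty()) {
      // Form 1: plain path element goes into both names.
      appendToken(d.full_name, e.value);
      appendToken(d.canonical_name, e.value);
      continue;
    }
    // Forms 2 and 3: record the tag explicitly.
    d.tags.emplace_back(e.name, e.value);
    if (!e.ignore_name) {
      appendToken(d.full_name, e.name);  // form 2 only: emit the tag key
    }
    appendToken(d.full_name, e.value);
  }
  return d;
}
```

Under these assumptions, the elements {"cluster"}, {"payments.v1", "cluster_name", ignore_name = true}, {"upstream_rq_total"} give the full name cluster.payments.v1.upstream_rq_total, the canonical name cluster.upstream_rq_total, and the single tag cluster_name=payments.v1.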

New scope API

The new API lets a scope carry structured prefix information instead of only a flat string prefix.

ScopeSharedPtr createScope(const std::string& name, ...);
ScopeSharedPtr createScope(StatElementViewSpan elements, ...);
ScopeSharedPtr createScope(StatElementSpan elements, ...);

Counter& getOrCreateCounter(StatElementSpan elements);
Gauge& getOrCreateGauge(StatElementSpan elements, Gauge::ImportMode import_mode);
Histogram& getOrCreateHistogram(StatElementSpan elements, Histogram::Unit unit);
TextReadout& getOrCreateTextReadout(StatElementSpan elements);

The design intent is:

  • createScope(std::string) remains the legacy entry point.
  • createScope(StatElementViewSpan) is convenient for configuration-time code that starts from
    string views.
  • createScope(StatElementSpan) and getOrCreate* are the structured API for code that already
    has interned stat names and wants to avoid flatten-then-recover behavior.

A scope created from structured elements keeps those elements as its prefix. Child scopes and child
metrics append more structured elements to that prefix. The final metric name, canonical name, and
tag set are all derived from the combined sequence.

So the new model is:

  1. Build the scope prefix as structured elements.
  2. Build each metric suffix as structured elements.
  3. Concatenate them.
  4. Derive the full name, canonical name, and explicit tags once.
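
The four steps above can be sketched with a toy scope that stores its prefix as elements rather than as a string. ToyScope, combined, and fullName are hypothetical names, Element is a std::string stand-in for the real element type, and only the full name is derived here to keep the sketch short.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the proposal's element type (assumption).
struct Element {
  std::string value;
  std::string name;          // empty means "plain path element"
  bool ignore_name = false;
};

class ToyScope {
public:
  explicit ToyScope(std::vector<Element> prefix) : prefix_(std::move(prefix)) {}

  // Step 1: a child scope keeps the parent's prefix plus its own elements.
  std::shared_ptr<ToyScope> createScope(const std::vector<Element>& elements) const {
    return std::make_shared<ToyScope>(combined(elements));
  }

  // Steps 2 and 3: concatenate the scope prefix with a metric suffix.
  std::vector<Element> combined(const std::vector<Element>& suffix) const {
    std::vector<Element> all = prefix_;
    all.insert(all.end(), suffix.begin(), suffix.end());
    return all;
  }

private:
  std::vector<Element> prefix_;
};

// Step 4 (full name only): flatten the combined sequence once.
std::string fullName(const std::vector<Element>& elements) {
  std::string out;
  for (const auto& e : elements) {
    if (!e.name.empty() && !e.ignore_name) {
      out += (out.empty() ? "" : ".") + e.name;
    }
    out += (out.empty() ? "" : ".") + e.value;
  }
  return out;
}
```

The point of the sketch is that nothing is flattened until the very end: the scope never holds a string prefix that would later need to be parsed.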

Well-known tag names

For hot-path code, the tag keys should not be raw strings. Stats::Context provides interned,
well-known tag names:

stats_context.wellKnownTagStatNames().cluster_name_
stats_context.wellKnownTagStatNames().virtual_host_
stats_context.wellKnownTagStatNames().route_

This keeps common tag keys centralized and avoids repeatedly constructing the same tag-name symbols.

StatElementView still has value for config-time or cold-path code, where string input is natural.
But the long-term model is that frequently used stats should be assembled from StatName-backed
elements, not raw strings.
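
As a rough illustration of why interned tag keys help, here is a toy interning table. It is not Envoy's SymbolTable or StatName implementation; ToySymbolTable and ToyWellKnownTagNames are hypothetical, and the tag strings "virtual_host" and "route" are assumed spellings.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Symbol = std::uint32_t;

// Toy interning table: each distinct string is stored once and mapped to an id.
class ToySymbolTable {
public:
  Symbol intern(const std::string& s) {
    const auto it = ids_.find(s);
    if (it != ids_.end()) {
      return it->second;  // already interned, reuse the id
    }
    const Symbol id = static_cast<Symbol>(strings_.size());
    strings_.push_back(s);
    ids_.emplace(s, id);
    return id;
  }

  const std::string& toString(Symbol id) const { return strings_.at(id); }

private:
  std::unordered_map<std::string, Symbol> ids_;
  std::vector<std::string> strings_;
};

// Well-known tag keys are interned once at construction, so hot-path code
// compares and copies small ids instead of building strings.
struct ToyWellKnownTagNames {
  explicit ToyWellKnownTagNames(ToySymbolTable& table)
      : cluster_name_(table.intern("cluster_name")),
        virtual_host_(table.intern("virtual_host")),  // assumed spelling
        route_(table.intern("route")) {}              // assumed spelling

  const Symbol cluster_name_;
  const Symbol virtual_host_;
  const Symbol route_;
};
```

Hot-path code would then hold a reference to one ToyWellKnownTagNames instance and pass names.cluster_name_ around, rather than constructing the key from a string for every metric.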

Cluster scope example

The cluster scope is a good example because it shows the compatibility requirement clearly.

Assume the cluster name is payments.v1.

Step 1: create the cluster scope

auto cluster_scope = store.rootScope()->createScope(
    {Stats::StatElementView{.value_ = "cluster"},
     Stats::StatElementView{.value_ = "payments.v1",
                            .name_ = Config::TagNames::get().CLUSTER_NAME,
                            .ignore_name_ = true}});

Semantically, that means:

  • "cluster" is a normal path element.
  • "payments.v1" is the value of the cluster_name tag.
  • ignore_name_ = true says: keep the tag, but preserve the legacy flattened path shape.

At this point, the scope prefix represents:

| Derived value | Result |
| --- | --- |
| Legacy full prefix | cluster.payments.v1 |
| Canonical prefix | cluster |
| Scope tags | cluster_name="payments.v1" |

Step 2: create a metric inside that scope

auto& upstream_rq_total = cluster_scope->getOrCreateCounter(
    {Stats::StatElement{.value_ = stat_names.upstream_rq_total_}});

Now the combined structured input is effectively:

[
  {.value_ = "cluster"},
  {.value_ = "payments.v1", .name_ = "cluster_name", .ignore_name_ = true},
  {.value_ = "upstream_rq_total"},
]

From that, the metric becomes:

| Derived value | Result |
| --- | --- |
| Legacy full name | cluster.payments.v1.upstream_rq_total |
| Canonical name | cluster.upstream_rq_total |
| Tags | cluster_name="payments.v1" |

This is the important property of the design:

  • StatsD compatibility is preserved because the exported flat name is still
    cluster.payments.v1.upstream_rq_total.
  • Prometheus-style labeling is explicit because the metric already knows that
    cluster_name="payments.v1".
  • Dots in resource values stop being a parsing problem because payments.v1 is carried as one
    logical tagged value from the start, rather than discovered later by inspecting the string name.
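
To show how the two export styles could consume the same derived metric, here is a simplified sketch. DerivedMetric, statsdName, and prometheusSeries are hypothetical names; the rendering ignores label-value escaping and any name prefixing or sanitizing that a real Prometheus formatter would apply.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// The three outputs described in the proposal, as plain strings (assumption).
struct DerivedMetric {
  std::string full_name;
  std::string canonical_name;
  std::vector<std::pair<std::string, std::string>> tags;
};

// StatsD-style sinks keep using the legacy flattened name unchanged.
std::string statsdName(const DerivedMetric& m) { return m.full_name; }

// Prometheus-style sinks use the canonical name plus explicit labels.
// Simplified: dots become underscores, label values are not escaped.
std::string prometheusSeries(const DerivedMetric& m) {
  std::string out = m.canonical_name;
  for (char& c : out) {
    if (c == '.') {
      c = '_';
    }
  }
  if (m.tags.empty()) {
    return out;
  }
  out += '{';
  for (std::size_t i = 0; i < m.tags.size(); ++i) {
    if (i > 0) {
      out += ',';
    }
    out += m.tags[i].first + "=\"" + m.tags[i].second + "\"";
  }
  out += '}';
  return out;
}
```

For the cluster example, statsdName returns cluster.payments.v1.upstream_rq_total and prometheusSeries returns cluster_upstream_rq_total{cluster_name="payments.v1"}. Neither function has to parse the flattened name.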

Summary

The proposal is to make structured stat construction the primary API.

StatElement describes metric structure explicitly. The new scope APIs let that structure be
carried in scope prefixes and metric suffixes. The store then derives the legacy full name,
canonical metric identity, and explicit tags from one source of truth.

The cluster scope example shows the intended outcome: keep cluster.<name>.* exactly as it exists
today for compatibility, while also making cluster_name=<name> explicit and reliable without any
post-hoc extraction.

Graceful migration

The new API is additive. I will add a new CLI parameter or environment variable to enable the new API for specific scopes, and then migrate the call sites in those scopes at a reasonable pace. The old API will continue to work until we remove it, which will not happen for a very long time. The cluster scope will be the first migration candidate, and I will use it as a test case to validate the design and implementation of the new API before migrating other scopes.

Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com>
@repokitteh-read-only

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

@wbpcode wbpcode marked this pull request as draft April 14, 2026 05:13
wbpcode commented Apr 14, 2026

Please DON'T try to review this PR yet. It is messy and contains a lot of concepts that haven't been verified. I will split it into reasonable pieces, then test and verify each one so that a proper review is possible. For now, please check only the description to see whether the proposal makes sense to you.

wbpcode commented Apr 14, 2026

cc @zirain

wbpcode commented Apr 14, 2026

@github-actions

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot added the stale label May 14, 2026
@wbpcode wbpcode added the no stalebot label and removed the stale label May 14, 2026