Skip to content

Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names#12356

Merged
gxglass merged 9 commits intoapple:mainfrom
gxglass:metrics5
Sep 5, 2025
Merged

Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names#12356
gxglass merged 9 commits intoapple:mainfrom
gxglass:metrics5

Conversation

@gxglass
Copy link
Copy Markdown
Collaborator

@gxglass gxglass commented Sep 5, 2025

Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of
XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input.

Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names
by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes
flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when
creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting
to future metrics platforms down the road.

Testing:
20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass

…alias. Also, move _WIN32 ifdefd code out of general purpose source files and into Platform.cpp
…forces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with
…alidation from Trace.cpp. This stuff is going to Splunk. Splunk takes what we give it.
@gxglass gxglass requested review from kakaiu and spraza September 5, 2025 01:19
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 0:25:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 4bfa14f
  • Duration 0:39:22
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 4bfa14f
  • Duration 0:49:12
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 1:00:12
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 1:03:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 4bfa14f
  • Duration 1:04:00
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 1:11:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Copy link
Copy Markdown
Contributor

@saintstack saintstack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it.

Comment thread flow/SimpleCounter.cpp
// Convert hierarchical metric names into Prometheus-compatible metric
// names. Do this by a) removing initial '/' chars, and b) converting
// remaining '/' chars to '_' chars.
static std::string hierarchicalToPrometheus(const std::string input) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thing dependent upon on old format? Tooling?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I can tell. I think we'll find out as we test in production, which seems to be the only way forward.

Comment thread flow/Trace.cpp
@@ -266,7 +266,6 @@ struct TraceLog {
};
void action(WriteBuffer& a) {
for (const auto& event : a.events) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because assert has been moved elsewhere?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because this code should not have an opinion about this stuff unless it is well justified opinion. No rationale was ever given in the code for the prior check. It would have cost somebody <= 3 minutes to type out a comment consisting of <= 3 sentences to explain this, but no, whoever wrote it couldn't be bothered to do this. Perhaps the reason for that is the common case that they were not actually confident that what they were doing was justifiable, but they felt they had to do SOMETHING, so they typed out some code that embodied some notion of what they vaguely thought the right thing might well have possibly been at the time they were writing the code.

And so I am removing it with something midway between prejudice and spite because it's getting in the way of progress.

@gxglass
Copy link
Copy Markdown
Collaborator Author

gxglass commented Sep 5, 2025

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 1:00:12
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack this is failing because of s3_backup_tests again. This is why I want this stuff to be optional

@gxglass
Copy link
Copy Markdown
Collaborator Author

gxglass commented Sep 5, 2025

@gxglass gxglass closed this Sep 5, 2025
@gxglass
Copy link
Copy Markdown
Collaborator Author

gxglass commented Sep 5, 2025

reopening to rerun CIs

@gxglass gxglass reopened this Sep 5, 2025
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 0:23:44
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 4bfa14f
  • Duration 0:38:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 4bfa14f
  • Duration 0:48:18
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 4bfa14f
  • Duration 1:03:55
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 1:08:09
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 1:08:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 4bfa14f
  • Duration 1:11:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@saintstack
Copy link
Copy Markdown
Contributor

@saintstack this is failing because of s3_backup_tests again. This is why I want this stuff to be optional

Ugh. Let me fix.

@gxglass gxglass merged commit 6530475 into apple:main Sep 5, 2025
7 checks passed
gxglass added a commit to gxglass/foundationdb that referenced this pull request Sep 15, 2025
…mpleCounter to emit Prometheus-compatible metric names (apple#12356)

Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of
XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input.

Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names
by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes
flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when
creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting
to future metrics platforms down the road.

Testing:
20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass

* replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with

* move Prometheus metric name validation to SimpleCounter.cpp.  Remove validation from Trace.cpp.  This stuff is going to Splunk.  Splunk takes what we give it.
gxglass added a commit that referenced this pull request Sep 16, 2025
… Net2 cherrypick to 7.4 (#12372)

* Arena/FastAlloc: add comments where potential metrics can be added (#12306)

* Arena/FastAlloc: add comments where potential metrics can be incremented.
The intent is to count allocations and bytes.

Remove a commented-out ifdef block that has been disabled for many years and
which does not work (per the explanation in the comment).  We don't need to
keep reading about the results of a small failed experiment from many years ago.

* add more one FIXME comment

* add one more METRICS-FIXME

* Add a SimpleCounter template for counter metrics (#12326)

* ignore TAGS (from etags/ctags)

* Add initial SimpleCounter interface/implementation/unit tests

* Add initial SimpleCounter interface/implementation/unit tests

* Fix unit test

* Improve clarity on unit test

* Update FIXME comments

* Address review comments.  Must use function local static mutex

* update comment

* Go back to template specializations to handle older C++ versions

* SimpleCounter: periodically log the counters to TraceEvent (#12329)

* SimpleCounter: periodically log the counters to TraceEvent.  Muck with hierarchical names to comply with random rules.

* Update doc about Prometheus metric names

* relax assertion about counter count, because unit test is actually running in fdbserver and that causes a unrelated counter to be created

* Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks (#12333)

* Add a pointed comment in UnitTest.h about some weaknesses

* Use counter names that will get converted to field names that TraceEvent does not complain about

* unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation

* update comment about caveats with unittests breaking simulation

* yet another field name fix

* just call validateField() directly from simple counters

* run report loop in unit tests

* fix build, fix comment

* blah blah blah

* always be munging metric names

* Instrument Arena, FastAlloc, Platform.cpp with SimpleCounter metrics to count allocations and bytes (#12339)

* emit a simpleCounterReport when we declare out of memory

* FastAlloc.h: initial pass of adding byte/object allocation/deallocation metrics

* Avoid conflict over the name SimpleCounter by eliminating this private definition of a name which is too valuable for this one random file to claim for its own use

* FastAlloc.cpp, Platform.actor.cpp: initial pass at adding SimpleCounter metrics to count allocations and bytes

* rename wrapper calls and update comments

* Count bytes copied in StringRef

* Arena.cpp: instrument allocations and some other stuff

* Arena.cpp: simplify use of SimpleCounter

* simpleCounterReport: generate TraceEvent in batches of MAX_TRACE_EVENT_LENGTH / 100 counters to avoid trace buffer overflow

* Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names (#12356)

Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of
XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input.

Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names
by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes
flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when
creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting
to future metrics platforms down the road.

Testing:
20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass

* replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with

* move Prometheus metric name validation to SimpleCounter.cpp.  Remove validation from Trace.cpp.  This stuff is going to Splunk.  Splunk takes what we give it.

* Use simple counter to replace recently added net2 counters (#12358)

* use simple counter to replace recent added net2 counters

* allow unit test to use SimpleCounter Trace event

---------

Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
gxglass added a commit to gxglass/foundationdb that referenced this pull request Apr 7, 2026
… Net2 cherrypick to 7.4 (apple#12372)

* Arena/FastAlloc: add comments where potential metrics can be added (apple#12306)

* Arena/FastAlloc: add comments where potential metrics can be incremented.
The intent is to count allocations and bytes.

Remove a commented-out ifdef block that has been disabled for many years and
which does not work (per the explanation in the comment).  We don't need to
keep reading about the results of a small failed experiment from many years ago.

* add more one FIXME comment

* add one more METRICS-FIXME

* Add a SimpleCounter template for counter metrics (apple#12326)

* ignore TAGS (from etags/ctags)

* Add initial SimpleCounter interface/implementation/unit tests

* Add initial SimpleCounter interface/implementation/unit tests

* Fix unit test

* Improve clarity on unit test

* Update FIXME comments

* Address review comments.  Must use function local static mutex

* update comment

* Go back to template specializations to handle older C++ versions

* SimpleCounter: periodically log the counters to TraceEvent (apple#12329)

* SimpleCounter: periodically log the counters to TraceEvent.  Muck with hierarchical names to comply with random rules.

* Update doc about Prometheus metric names

* relax assertion about counter count, because unit test is actually running in fdbserver and that causes a unrelated counter to be created

* Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks (apple#12333)

* Add a pointed comment in UnitTest.h about some weaknesses

* Use counter names that will get converted to field names that TraceEvent does not complain about

* unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation

* update comment about caveats with unittests breaking simulation

* yet another field name fix

* just call validateField() directly from simple counters

* run report loop in unit tests

* fix build, fix comment

* blah blah blah

* always be munging metric names

* Instrument Arena, FastAlloc, Platform.cpp with SimpleCounter metrics to count allocations and bytes (apple#12339)

* emit a simpleCounterReport when we declare out of memory

* FastAlloc.h: initial pass of adding byte/object allocation/deallocation metrics

* Avoid conflict over the name SimpleCounter by eliminating this private definition of a name which is too valuable for this one random file to claim for its own use

* FastAlloc.cpp, Platform.actor.cpp: initial pass at adding SimpleCounter metrics to count allocations and bytes

* rename wrapper calls and update comments

* Count bytes copied in StringRef

* Arena.cpp: instrument allocations and some other stuff

* Arena.cpp: simplify use of SimpleCounter

* simpleCounterReport: generate TraceEvent in batches of MAX_TRACE_EVENT_LENGTH / 100 counters to avoid trace buffer overflow

* Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names (apple#12356)

Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of
XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input.

Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names
by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes
flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when
creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting
to future metrics platforms down the road.

Testing:
20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass

* replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with

* move Prometheus metric name validation to SimpleCounter.cpp.  Remove validation from Trace.cpp.  This stuff is going to Splunk.  Splunk takes what we give it.

* Use simple counter to replace recently added net2 counters (apple#12358)

* use simple counter to replace recent added net2 counters

* allow unit test to use SimpleCounter Trace event

---------

Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants