Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks#12333
Merged
gxglass merged 10 commits intoapple:mainfrom Aug 28, 2025
Merged
Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks#12333gxglass merged 10 commits intoapple:mainfrom
gxglass merged 10 commits intoapple:mainfrom
Conversation
…ent does not complain about
…t errors in simulation
Contributor
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Contributor
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Contributor
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Contributor
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Contributor
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Contributor
Result of foundationdb-pr on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Contributor
Result of foundationdb-pr on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Contributor
Result of foundationdb-pr on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Contributor
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Contributor
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Contributor
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Contributor
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Contributor
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Contributor
Result of foundationdb-pr on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Contributor
Result of foundationdb-pr on Linux RHEL 9
|
Contributor
Result of foundationdb-pr on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Contributor
Result of foundationdb-pr on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-clang on Linux RHEL 9
|
Contributor
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
spraza
approved these changes
Aug 28, 2025
gxglass
added a commit
to gxglass/foundationdb
that referenced
this pull request
Sep 15, 2025
…ace.cpp simulation-only checks (apple#12333) * Add a pointed comment in UnitTest.h about some weaknesses * Use counter names that will get converted to field names that TraceEvent does not complain about * unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation * update comment about caveats with unittests breaking simulation * yet another field name fix * just call validateField() directly from simple counters * run report loop in unit tests * fix build, fix comment * blah blah blah * always be munging metric names
gxglass
added a commit
that referenced
this pull request
Sep 16, 2025
… Net2 cherrypick to 7.4 (#12372) * Arena/FastAlloc: add comments where potential metrics can be added (#12306) * Arena/FastAlloc: add comments where potential metrics can be incremented. The intent is to count allocations and bytes. Remove a commented-out ifdef block that has been disabled for many years and which does not work (per the explanation in the comment). We don't need to keep reading about the results of a small failed experiment from many years ago. * add more one FIXME comment * add one more METRICS-FIXME * Add a SimpleCounter template for counter metrics (#12326) * ignore TAGS (from etags/ctags) * Add initial SimpleCounter interface/implementation/unit tests * Add initial SimpleCounter interface/implementation/unit tests * Fix unit test * Improve clarity on unit test * Update FIXME comments * Address review comments. Must use function local static mutex * update comment * Go back to template specializations to handle older C++ versions * SimpleCounter: periodically log the counters to TraceEvent (#12329) * SimpleCounter: periodically log the counters to TraceEvent. Muck with hierarchical names to comply with random rules. * Update doc about Prometheus metric names * relax assertion about counter count, because unit test is actually running in fdbserver and that causes a unrelated counter to be created * Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks (#12333) * Add a pointed comment in UnitTest.h about some weaknesses * Use counter names that will get converted to field names that TraceEvent does not complain about * unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation * update comment about caveats with unittests breaking simulation * yet another field name fix * just call validateField() directly from simple counters * run report loop in unit tests * fix build, fix comment * blah blah blah * always be munging metric names * Instrument Arena, FastAlloc, Platform.cpp with SimpleCounter metrics to count allocations and bytes (#12339) * emit a simpleCounterReport when we declare out of memory * FastAlloc.h: initial pass of adding byte/object allocation/deallocation metrics * Avoid conflict over the name SimpleCounter by eliminating this private definition of a name which is too valuable for this one random file to claim for its own use * FastAlloc.cpp, Platform.actor.cpp: initial pass at adding SimpleCounter metrics to count allocations and bytes * rename wrapper calls and update comments * Count bytes copied in StringRef * Arena.cpp: instrument allocations and some other stuff * Arena.cpp: simplify use of SimpleCounter * simpleCounterReport: generate TraceEvent in batches of MAX_TRACE_EVENT_LENGTH / 100 counters to avoid trace buffer overflow * Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names (#12356) Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input. Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting to future metrics platforms down the road. Testing: 20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass * replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with * move Prometheus metric name validation to SimpleCounter.cpp. Remove validation from Trace.cpp. This stuff is going to Splunk. Splunk takes what we give it. * Use simple counter to replace recently added net2 counters (#12358) * use simple counter to replace recent added net2 counters * allow unit test to use SimpleCounter Trace event --------- Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
gxglass
added a commit
to gxglass/foundationdb
that referenced
this pull request
Apr 7, 2026
… Net2 cherrypick to 7.4 (apple#12372) * Arena/FastAlloc: add comments where potential metrics can be added (apple#12306) * Arena/FastAlloc: add comments where potential metrics can be incremented. The intent is to count allocations and bytes. Remove a commented-out ifdef block that has been disabled for many years and which does not work (per the explanation in the comment). We don't need to keep reading about the results of a small failed experiment from many years ago. * add more one FIXME comment * add one more METRICS-FIXME * Add a SimpleCounter template for counter metrics (apple#12326) * ignore TAGS (from etags/ctags) * Add initial SimpleCounter interface/implementation/unit tests * Add initial SimpleCounter interface/implementation/unit tests * Fix unit test * Improve clarity on unit test * Update FIXME comments * Address review comments. Must use function local static mutex * update comment * Go back to template specializations to handle older C++ versions * SimpleCounter: periodically log the counters to TraceEvent (apple#12329) * SimpleCounter: periodically log the counters to TraceEvent. Muck with hierarchical names to comply with random rules. * Update doc about Prometheus metric names * relax assertion about counter count, because unit test is actually running in fdbserver and that causes a unrelated counter to be created * Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks (apple#12333) * Add a pointed comment in UnitTest.h about some weaknesses * Use counter names that will get converted to field names that TraceEvent does not complain about * unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation * update comment about caveats with unittests breaking simulation * yet another field name fix * just call validateField() directly from simple counters * run report loop in unit tests * fix build, fix comment * blah blah blah * always be munging metric names * Instrument Arena, FastAlloc, Platform.cpp with SimpleCounter metrics to count allocations and bytes (apple#12339) * emit a simpleCounterReport when we declare out of memory * FastAlloc.h: initial pass of adding byte/object allocation/deallocation metrics * Avoid conflict over the name SimpleCounter by eliminating this private definition of a name which is too valuable for this one random file to claim for its own use * FastAlloc.cpp, Platform.actor.cpp: initial pass at adding SimpleCounter metrics to count allocations and bytes * rename wrapper calls and update comments * Count bytes copied in StringRef * Arena.cpp: instrument allocations and some other stuff * Arena.cpp: simplify use of SimpleCounter * simpleCounterReport: generate TraceEvent in batches of MAX_TRACE_EVENT_LENGTH / 100 counters to avoid trace buffer overflow * Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names (apple#12356) Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input. Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting to future metrics platforms down the road. Testing: 20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass * replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with * move Prometheus metric name validation to SimpleCounter.cpp. Remove validation from Trace.cpp. This stuff is going to Splunk. Splunk takes what we give it. * Use simple counter to replace recently added net2 counters (apple#12358) * use simple counter to replace recent added net2 counters * allow unit test to use SimpleCounter Trace event --------- Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Expose validateField() from Trace.cpp and just call it directly from "unit tests" in SimpleCounter.cpp. Add some
comments in various places about the unit test execution environment. The updated test cases would catch the
prior issue where they themselves manufacture counters with invalid label names. Updated the tests to use
hierarchical metrix names that get munged into compliant field names.
Unit test:
[root@gglass-dev-okteto-56b45ddbf4-kkwcl flow]# git log -n 1
commit 4e5d9b2 (HEAD -> tracenames, origin/tracenames)
Author: Gideon Glass gxglassgithub@gmail.com
Date: Wed Aug 27 18:30:20 2025 -0700
[root@gglass-dev-okteto-56b45ddbf4-kkwcl flow]# !fdbserver
fdbserver -r unittests -f /flow/simplecounter
startingConfiguration: start
useDB: false
Run test:UnitTests start
setting up test (UnitTests)...
Test received trigger for setup...
running test (UnitTests)...
Found 2 tests
Testing /flow/simplecounter/double
Testing /flow/simplecounter/int64
UnitTests complete
UnitTests complete
checking test (UnitTests)...
fetching metrics (UnitTests)...
Metric (0, 0): UnitTests.Test Cases Available, 2.000000, 2
Metric (0, 1): UnitTests.Test Cases Executed, 2.000000, 2
Metric (0, 2): UnitTests.Test Cases Failed, 0.000000, 0
Metric (0, 3): UnitTests.Total wall clock time (s), 1.590469, 1.59
Metric (0, 4): UnitTests.Total flow time (s), 0.000000, 0
Test complete
1 test clients passed; 0 test clients failed
Run test:UnitTests Done.
1 tests passed; 0 tests failed.
[root@gglass-dev-okteto-56b45ddbf4-kkwcl flow]#
Simulation:
[root@gglass-dev-okteto-56b45ddbf4-kkwcl ~]# j list --stopped | grep gglass | tail -1
20250828-013346-gglass-ce2142694517650b compressed=True data_size=60675767 duration=6423706 ended=100000 fail=2 fail_fast=10 max_runs=100000 pass=99998 priority=100 remaining=0 runtime=0:35:35 sanity=False started=100000 stopped=20250828-020921 submitted=20250828-013346 timeout=5400 username=gglass
Here are the 2 failures:
[root@gglass-dev-okteto-56b45ddbf4-kkwcl ~]# j_errors_last | grep TestFile 2>/dev/null
Results for test ensemble: 20250828-013346-gglass-ce2142694517650b
.
.
.
meh, 2>/dev/null did not result in no garbage warnings on the terminal so I will just edit them out
.
.
.
FaultInjectionEnabled="1" TestFile="tests/slow/BulkDumpingS3.toml"
TestFile="tests/restarting/from_7.3.0/ConfigureTestRestart-1.toml"
TestFile="tests/restarting/from_7.3.0/ConfigureTestRestart-2.toml"
[root@gglass-dev-okteto-56b45ddbf4-kkwcl ~]#
With the old bug this would have failed in tests/fast/RandomUnitTests.toml I think.