Merge SimpleCounters related PRs into release-7.3#12937
Merge SimpleCounters related PRs into release-7.3#12937gxglass merged 5 commits intoapple:release-7.3from
Conversation
… Net2 cherrypick to 7.4 (apple#12372) * Arena/FastAlloc: add comments where potential metrics can be added (apple#12306) * Arena/FastAlloc: add comments where potential metrics can be incremented. The intent is to count allocations and bytes. Remove a commented-out ifdef block that has been disabled for many years and which does not work (per the explanation in the comment). We don't need to keep reading about the results of a small failed experiment from many years ago. * add more one FIXME comment * add one more METRICS-FIXME * Add a SimpleCounter template for counter metrics (apple#12326) * ignore TAGS (from etags/ctags) * Add initial SimpleCounter interface/implementation/unit tests * Add initial SimpleCounter interface/implementation/unit tests * Fix unit test * Improve clarity on unit test * Update FIXME comments * Address review comments. Must use function local static mutex * update comment * Go back to template specializations to handle older C++ versions * SimpleCounter: periodically log the counters to TraceEvent (apple#12329) * SimpleCounter: periodically log the counters to TraceEvent. Muck with hierarchical names to comply with random rules. * Update doc about Prometheus metric names * relax assertion about counter count, because unit test is actually running in fdbserver and that causes a unrelated counter to be created * Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks (apple#12333) * Add a pointed comment in UnitTest.h about some weaknesses * Use counter names that will get converted to field names that TraceEvent does not complain about * unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation * update comment about caveats with unittests breaking simulation * yet another field name fix * just call validateField() directly from simple counters * run report loop in unit tests * fix build, fix comment * blah blah blah * always be munging metric names * Instrument Arena, FastAlloc, Platform.cpp with SimpleCounter metrics to count allocations and bytes (apple#12339) * emit a simpleCounterReport when we declare out of memory * FastAlloc.h: initial pass of adding byte/object allocation/deallocation metrics * Avoid conflict over the name SimpleCounter by eliminating this private definition of a name which is too valuable for this one random file to claim for its own use * FastAlloc.cpp, Platform.actor.cpp: initial pass at adding SimpleCounter metrics to count allocations and bytes * rename wrapper calls and update comments * Count bytes copied in StringRef * Arena.cpp: instrument allocations and some other stuff * Arena.cpp: simplify use of SimpleCounter * simpleCounterReport: generate TraceEvent in batches of MAX_TRACE_EVENT_LENGTH / 100 counters to avoid trace buffer overflow * Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names (apple#12356) Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input. Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting to future metrics platforms down the road. Testing: 20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass * replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with * move Prometheus metric name validation to SimpleCounter.cpp. Remove validation from Trace.cpp. This stuff is going to Splunk. Splunk takes what we give it. * Use simple counter to replace recently added net2 counters (apple#12358) * use simple counter to replace recent added net2 counters * allow unit test to use SimpleCounter Trace event --------- Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
…pent executing them (apple#12824) It was observed that compute management platforms do provide CPU usage graphs, but retention is not necessarily as much as we'd like. CPU scheduling is directly controlled by FDB and is very, very easy to instrument to count how busy we are, so just do it. Testing: ../build_output4/bin/fdbserver -r test -f tests/noSim/RandomUnitTests.toml ../build_output4/bin/fdbserver -r unittests -f 'noSim' then reviewed trace files.
* count Arenas created * remove a deleted method signature * Remove out of date guidance in header file
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
saintstack
left a comment
There was a problem hiding this comment.
Great. Nits in below.
| countConnEstablished.init("Net2.CountConnEstablished"_sr); | ||
| countConnClosedWithError.init("Net2.CountConnClosedWithError"_sr); | ||
| countConnClosedWithoutError.init("Net2.CountConnClosedWithoutError"_sr); | ||
| countConnIncompatible.init("Net2.CountConnIncompatible"_sr); |
There was a problem hiding this comment.
What's the thinking here.? The change is totally internal. No chance of an external dependency on say, these counter names (is there a map of old counter to new? Do we need one? Are any counters dropped? Might be good to list in commit message at least)? Asking because this change will appear in patch release late in a mature release line.
There was a problem hiding this comment.
IIRC this replaces an earlier mid-2025 attempt to add metrics that Zhe had put in and is subject to higher per-metric overhead, so we took it out.
| // and vc++ rejects "true". | ||
| // FIXME: this has been set to true for 4+ years. We probably do not need the "not thread safe" | ||
| // version of the code. Consider removing this and just making it thread safe. | ||
| // Also, explain why thread safety is required here and not elsewhere (e.g. Arena and ArenaBlock). |
| // | ||
| // Counters are periodically logged as "SimpleCounters". | ||
| // | ||
| // More background: https://quip-apple.com/PyfZA6Qkbc7w |
There was a problem hiding this comment.
Is this doc. publicly available? Wiki page on fdb GitHub?
There was a problem hiding this comment.
I'm unifying 7.3 with code already in main for 6+ months. On the doc specifically, it wasn't written for external consumption and may (I can't remember) have a bunch of Apple internal content. We don't have to address that question here.
| * simulation, then you may break simulation even though your unit | ||
| * test itself runs fine via fdbserver -r unittests. As a result, to test | ||
| * that your unit tests themselves do not break simulation, you should | ||
| * also run a 100k simulation run. If you think this sounds |
| "ClientTLSHandshakeLocked", | ||
| (netData.countClientTLSHandshakeLocked - statState->networkState.countClientTLSHandshakeLocked) / | ||
| currentStats.elapsed) | ||
| .detail("IncomingConnConnected", |
There was a problem hiding this comment.
We are going to give notice that metrics have moved in logs?
There was a problem hiding this comment.
No plans to. Going out of the way to avoid possibly breaking logging on a feature that itself basically does not work very well is not really a high priority I don't think.
For observability around memory and CPU usage, this merges 3 prior diffs:
#12372: a combined commit into release-7.4
#12824: CPU metrics
#12930: a bug fix and some comment cleanups from Claude
Testing:
20260407-171501-gglass-21941d607b0eec14 compressed=True data_size=35378210 duration=5339652 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:53:31 sanity=False started=100000 stopped=20260407-180832 submitted=20260407-171501 timeout=5400 username=gglass
Also ran ctest:
.
.
.
Start 66: update_bindings_go_src_fdb_generated_go
66/66 Test #66: update_bindings_go_src_fdb_generated_go ............................. Passed 0.00 sec
100% tests passed, 0 tests failed out of 66
Total Test time (real) = 872.03 sec