Skip to content

Merge SimpleCounters related PRs into release-7.3#12937

Merged
gxglass merged 5 commits intoapple:release-7.3from
gxglass:simplecounters73-try2
Apr 9, 2026
Merged

Merge SimpleCounters related PRs into release-7.3#12937
gxglass merged 5 commits intoapple:release-7.3from
gxglass:simplecounters73-try2

Conversation

@gxglass
Copy link
Copy Markdown
Collaborator

@gxglass gxglass commented Apr 7, 2026

For observability around memory and CPU usage, this merges 3 prior diffs:

#12372: a combined commit into release-7.4

#12824: CPU metrics

#12930: a bug fix and some comment cleanups from Claude

Testing:
20260407-171501-gglass-21941d607b0eec14 compressed=True data_size=35378210 duration=5339652 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:53:31 sanity=False started=100000 stopped=20260407-180832 submitted=20260407-171501 timeout=5400 username=gglass

Also ran ctest:

.
.
.
Start 66: update_bindings_go_src_fdb_generated_go
66/66 Test #66: update_bindings_go_src_fdb_generated_go ............................. Passed 0.00 sec

100% tests passed, 0 tests failed out of 66

Total Test time (real) = 872.03 sec

gxglass and others added 5 commits April 7, 2026 09:23
… Net2 cherrypick to 7.4 (apple#12372)

* Arena/FastAlloc: add comments where potential metrics can be added (apple#12306)

* Arena/FastAlloc: add comments where potential metrics can be incremented.
The intent is to count allocations and bytes.

Remove a commented-out ifdef block that has been disabled for many years and
which does not work (per the explanation in the comment).  We don't need to
keep reading about the results of a small failed experiment from many years ago.

* add more one FIXME comment

* add one more METRICS-FIXME

* Add a SimpleCounter template for counter metrics (apple#12326)

* ignore TAGS (from etags/ctags)

* Add initial SimpleCounter interface/implementation/unit tests

* Add initial SimpleCounter interface/implementation/unit tests

* Fix unit test

* Improve clarity on unit test

* Update FIXME comments

* Address review comments.  Must use function local static mutex

* update comment

* Go back to template specializations to handle older C++ versions

* SimpleCounter: periodically log the counters to TraceEvent (apple#12329)

* SimpleCounter: periodically log the counters to TraceEvent.  Muck with hierarchical names to comply with random rules.

* Update doc about Prometheus metric names

* relax assertion about counter count, because unit test is actually running in fdbserver and that causes a unrelated counter to be created

* Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks (apple#12333)

* Add a pointed comment in UnitTest.h about some weaknesses

* Use counter names that will get converted to field names that TraceEvent does not complain about

* unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation

* update comment about caveats with unittests breaking simulation

* yet another field name fix

* just call validateField() directly from simple counters

* run report loop in unit tests

* fix build, fix comment

* blah blah blah

* always be munging metric names

* Instrument Arena, FastAlloc, Platform.cpp with SimpleCounter metrics to count allocations and bytes (apple#12339)

* emit a simpleCounterReport when we declare out of memory

* FastAlloc.h: initial pass of adding byte/object allocation/deallocation metrics

* Avoid conflict over the name SimpleCounter by eliminating this private definition of a name which is too valuable for this one random file to claim for its own use

* FastAlloc.cpp, Platform.actor.cpp: initial pass at adding SimpleCounter metrics to count allocations and bytes

* rename wrapper calls and update comments

* Count bytes copied in StringRef

* Arena.cpp: instrument allocations and some other stuff

* Arena.cpp: simplify use of SimpleCounter

* simpleCounterReport: generate TraceEvent in batches of MAX_TRACE_EVENT_LENGTH / 100 counters to avoid trace buffer overflow

* Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names (apple#12356)

Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of
XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input.

Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names
by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes
flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when
creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting
to future metrics platforms down the road.

Testing:
20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass

* replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with

* move Prometheus metric name validation to SimpleCounter.cpp.  Remove validation from Trace.cpp.  This stuff is going to Splunk.  Splunk takes what we give it.

* Use simple counter to replace recently added net2 counters (apple#12358)

* use simple counter to replace recent added net2 counters

* allow unit test to use SimpleCounter Trace event

---------

Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
…pent executing them (apple#12824)

It was observed that compute management platforms do provide CPU usage graphs, but retention is not necessarily as much as we'd like.  CPU scheduling is directly controlled by FDB and is very, very easy to instrument to count how busy we are, so just do it.

Testing:
../build_output4/bin/fdbserver -r test -f tests/noSim/RandomUnitTests.toml
../build_output4/bin/fdbserver -r unittests -f 'noSim'

then reviewed trace files.
* count Arenas created
* remove a deleted method signature
* Remove out of date guidance in header file
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 33a7828
  • Duration 0:04:18
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 33a7828
  • Duration 0:08:24
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 33a7828
  • Duration 0:08:49
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 33a7828
  • Duration 0:08:59
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass gxglass requested review from saintstack and spraza April 7, 2026 18:59
@gxglass gxglass changed the title Merge SImpleCounters related PRs into release-7.3 Merge SimpleCounters related PRs into release-7.3 Apr 7, 2026
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 33a7828
  • Duration 0:32:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 33a7828
  • Duration 0:39:50
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Copy link
Copy Markdown
Contributor

@saintstack saintstack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Nits in below.

countConnEstablished.init("Net2.CountConnEstablished"_sr);
countConnClosedWithError.init("Net2.CountConnClosedWithError"_sr);
countConnClosedWithoutError.init("Net2.CountConnClosedWithoutError"_sr);
countConnIncompatible.init("Net2.CountConnIncompatible"_sr);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the thinking here.? The change is totally internal. No chance of an external dependency on say, these counter names (is there a map of old counter to new? Do we need one? Are any counters dropped? Might be good to list in commit message at least)? Asking because this change will appear in patch release late in a mature release line.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this replaces an earlier mid-2025 attempt to add metrics that Zhe had put in and is subject to higher per-metric overhead, so we took it out.

// and vc++ rejects "true".
// FIXME: this has been set to true for 4+ years. We probably do not need the "not thread safe"
// version of the code. Consider removing this and just making it thread safe.
// Also, explain why thread safety is required here and not elsewhere (e.g. Arena and ArenaBlock).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

//
// Counters are periodically logged as "SimpleCounters".
//
// More background: https://quip-apple.com/PyfZA6Qkbc7w
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this doc. publicly available? Wiki page on fdb GitHub?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unifying 7.3 with code already in main for 6+ months. On the doc specifically, it wasn't written for external consumption and may (I can't remember) have a bunch of Apple internal content. We don't have to address that question here.

* simulation, then you may break simulation even though your unit
* test itself runs fine via fdbserver -r unittests. As a result, to test
* that your unit tests themselves do not break simulation, you should
* also run a 100k simulation run. If you think this sounds
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smile

Comment thread flow/SystemMonitor.cpp
"ClientTLSHandshakeLocked",
(netData.countClientTLSHandshakeLocked - statState->networkState.countClientTLSHandshakeLocked) /
currentStats.elapsed)
.detail("IncomingConnConnected",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are going to give notice that metrics have moved in logs?

Copy link
Copy Markdown
Collaborator Author

@gxglass gxglass Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No plans to. Going out of the way to avoid possibly breaking logging on a feature that itself basically does not work very well is not really a high priority I don't think.

@gxglass gxglass merged commit b8605bb into apple:release-7.3 Apr 9, 2026
1 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants