Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names#12356
Conversation
…alias. Also, move _WIN32 ifdefd code out of general purpose source files and into Platform.cpp
…_backtrace to be defined for _WIN32
…forces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with
…alidation from Trace.cpp. This stuff is going to Splunk. Splunk takes what we give it.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
| // Convert hierarchical metric names into Prometheus-compatible metric | ||
| // names. Do this by a) removing initial '/' chars, and b) converting | ||
| // remaining '/' chars to '_' chars. | ||
| static std::string hierarchicalToPrometheus(const std::string input) { |
There was a problem hiding this comment.
Any thing dependent upon on old format? Tooling?
There was a problem hiding this comment.
Not that I can tell. I think we'll find out as we test in production, which seems to be the only way forward.
| @@ -266,7 +266,6 @@ struct TraceLog { | |||
| }; | |||
| void action(WriteBuffer& a) { | |||
| for (const auto& event : a.events) { | |||
There was a problem hiding this comment.
Because assert has been moved elsewhere?
There was a problem hiding this comment.
No, because this code should not have an opinion about this stuff unless it is well justified opinion. No rationale was ever given in the code for the prior check. It would have cost somebody <= 3 minutes to type out a comment consisting of <= 3 sentences to explain this, but no, whoever wrote it couldn't be bothered to do this. Perhaps the reason for that is the common case that they were not actually confident that what they were doing was justifiable, but they felt they had to do SOMETHING, so they typed out some code that embodied some notion of what they vaguely thought the right thing might well have possibly been at the time they were writing the code.
And so I am removing it with something midway between prejudice and spite because it's getting in the way of progress.
@saintstack this is failing because of s3_backup_tests again. This is why I want this stuff to be optional |
|
reopening to rerun CIs |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Ugh. Let me fix. |
…mpleCounter to emit Prometheus-compatible metric names (apple#12356) Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input. Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting to future metrics platforms down the road. Testing: 20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass * replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with * move Prometheus metric name validation to SimpleCounter.cpp. Remove validation from Trace.cpp. This stuff is going to Splunk. Splunk takes what we give it.
… Net2 cherrypick to 7.4 (#12372) * Arena/FastAlloc: add comments where potential metrics can be added (#12306) * Arena/FastAlloc: add comments where potential metrics can be incremented. The intent is to count allocations and bytes. Remove a commented-out ifdef block that has been disabled for many years and which does not work (per the explanation in the comment). We don't need to keep reading about the results of a small failed experiment from many years ago. * add more one FIXME comment * add one more METRICS-FIXME * Add a SimpleCounter template for counter metrics (#12326) * ignore TAGS (from etags/ctags) * Add initial SimpleCounter interface/implementation/unit tests * Add initial SimpleCounter interface/implementation/unit tests * Fix unit test * Improve clarity on unit test * Update FIXME comments * Address review comments. Must use function local static mutex * update comment * Go back to template specializations to handle older C++ versions * SimpleCounter: periodically log the counters to TraceEvent (#12329) * SimpleCounter: periodically log the counters to TraceEvent. Muck with hierarchical names to comply with random rules. * Update doc about Prometheus metric names * relax assertion about counter count, because unit test is actually running in fdbserver and that causes a unrelated counter to be created * Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks (#12333) * Add a pointed comment in UnitTest.h about some weaknesses * Use counter names that will get converted to field names that TraceEvent does not complain about * unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation * update comment about caveats with unittests breaking simulation * yet another field name fix * just call validateField() directly from simple counters * run report loop in unit tests * fix build, fix comment * blah blah blah * always be munging metric names * Instrument Arena, FastAlloc, Platform.cpp with SimpleCounter metrics to count allocations and bytes (#12339) * emit a simpleCounterReport when we declare out of memory * FastAlloc.h: initial pass of adding byte/object allocation/deallocation metrics * Avoid conflict over the name SimpleCounter by eliminating this private definition of a name which is too valuable for this one random file to claim for its own use * FastAlloc.cpp, Platform.actor.cpp: initial pass at adding SimpleCounter metrics to count allocations and bytes * rename wrapper calls and update comments * Count bytes copied in StringRef * Arena.cpp: instrument allocations and some other stuff * Arena.cpp: simplify use of SimpleCounter * simpleCounterReport: generate TraceEvent in batches of MAX_TRACE_EVENT_LENGTH / 100 counters to avoid trace buffer overflow * Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names (#12356) Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input. Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting to future metrics platforms down the road. Testing: 20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass * replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with * move Prometheus metric name validation to SimpleCounter.cpp. Remove validation from Trace.cpp. This stuff is going to Splunk. Splunk takes what we give it. * Use simple counter to replace recently added net2 counters (#12358) * use simple counter to replace recent added net2 counters * allow unit test to use SimpleCounter Trace event --------- Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
… Net2 cherrypick to 7.4 (apple#12372) * Arena/FastAlloc: add comments where potential metrics can be added (apple#12306) * Arena/FastAlloc: add comments where potential metrics can be incremented. The intent is to count allocations and bytes. Remove a commented-out ifdef block that has been disabled for many years and which does not work (per the explanation in the comment). We don't need to keep reading about the results of a small failed experiment from many years ago. * add more one FIXME comment * add one more METRICS-FIXME * Add a SimpleCounter template for counter metrics (apple#12326) * ignore TAGS (from etags/ctags) * Add initial SimpleCounter interface/implementation/unit tests * Add initial SimpleCounter interface/implementation/unit tests * Fix unit test * Improve clarity on unit test * Update FIXME comments * Address review comments. Must use function local static mutex * update comment * Go back to template specializations to handle older C++ versions * SimpleCounter: periodically log the counters to TraceEvent (apple#12329) * SimpleCounter: periodically log the counters to TraceEvent. Muck with hierarchical names to comply with random rules. * Update doc about Prometheus metric names * relax assertion about counter count, because unit test is actually running in fdbserver and that causes a unrelated counter to be created * Update SimpleCounter unit tests not to use metric names that break Trace.cpp simulation-only checks (apple#12333) * Add a pointed comment in UnitTest.h about some weaknesses * Use counter names that will get converted to field names that TraceEvent does not complain about * unit test: do not use a counter name that will cause Trace.cpp to emit errors in simulation * update comment about caveats with unittests breaking simulation * yet another field name fix * just call validateField() directly from simple counters * run report loop in unit tests * fix build, fix comment * blah blah blah * always be munging metric names * Instrument Arena, FastAlloc, Platform.cpp with SimpleCounter metrics to count allocations and bytes (apple#12339) * emit a simpleCounterReport when we declare out of memory * FastAlloc.h: initial pass of adding byte/object allocation/deallocation metrics * Avoid conflict over the name SimpleCounter by eliminating this private definition of a name which is too valuable for this one random file to claim for its own use * FastAlloc.cpp, Platform.actor.cpp: initial pass at adding SimpleCounter metrics to count allocations and bytes * rename wrapper calls and update comments * Count bytes copied in StringRef * Arena.cpp: instrument allocations and some other stuff * Arena.cpp: simplify use of SimpleCounter * simpleCounterReport: generate TraceEvent in batches of MAX_TRACE_EVENT_LENGTH / 100 counters to avoid trace buffer overflow * Eliminate poorly motivated trace field name validation, and change SimpleCounter to emit Prometheus-compatible metric names (apple#12356) Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input. Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting to future metrics platforms down the road. Testing: 20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass * replace undocumented trace event field name rules with a rule that enforces that field names must be valid Prometheus metric names. No idea why the old code declines to even state what it is trying to be compatible with * move Prometheus metric name validation to SimpleCounter.cpp. Remove validation from Trace.cpp. This stuff is going to Splunk. Splunk takes what we give it. * Use simple counter to replace recently added net2 counters (apple#12358) * use simple counter to replace recent added net2 counters * allow unit test to use SimpleCounter Trace event --------- Co-authored-by: Zhe Wang <zhe.wang@wustl.edu>
Trace.cpp does not provide a rationale for validateField() and validateFormat(). It appears to be some kind of
XML related validation. Why we should care about this is not clear. The output is going to Splunk. As far as I know, Splunk is supposed to be pretty liberal in what it accepts as input.
Add logic in SimpleCounter.cpp to convert hierarchical metric names to Prometheus compatible metric names
by the simple rule of converting intermediate '/' chars into '_', i.e. something like /flow/arena/bytesAllocated becomes
flow_arena_bytesAllocated. I feel hierarchical names are still slightly better, and very easy to reason about when
creating new metric names on the fly, but ensuring that they are at least Prometheus compatible should allow targeting
to future metrics platforms down the road.
Testing:
20250905-010257-gglass-25f3ef43ccc1c130 compressed=True data_size=41538755 duration=6389116 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:00:34 sanity=False started=100000 stopped=20250905-020331 submitted=20250905-010257 timeout=5400 username=gglass