Describe the bug
Since commit 12fb539 ("Use monotonic time instead of wall time" / #3268, merged around 1.17.0-rc1), the bvar sampler thread no longer sleeps between sampling rounds and prints
sampler.cpp:192 bvar is busy at sampling for 2 seconds!
continuously (tens of thousands of log lines per minute on a process with a moderate number of bvars).
Root cause is a unit-conversion mistake in src/bvar/detail/sampler.cpp::SamplerCollector::run():
while (!_stop) {
int64_t abstime = butil::cpuwide_time_ns(); // ns (was gettimeofday_us() / us)
...
int64_t now = butil::cpuwide_time_ns(); // ns
_cumulated_time_us += now - abstime;
abstime += 1000000L; // BUG: + 1 ms, not + 1 s
while (abstime > now) {
::usleep(abstime - now); // BUG: usleep() takes us, value is ns
slept = true;
now = butil::cpuwide_time_ns();
}
if (slept) { consecutive_nosleep = 0; }
else if (++consecutive_nosleep >= WARN_NOSLEEP_THRESHOLD) { LOG(WARNING) ... }
}
PR #3268 switched the three time-source calls from gettimeofday_us() (microseconds) to cpuwide_time_ns() (nanoseconds), but left abstime += 1000000L and ::usleep(abstime - now) untouched. The intended "sample once per second" becomes "advance abstime by 1 ms"; the inner while loop almost never enters (so slept stays false), and after two consecutive nosleep iterations the warning fires. In practice the sampler runs at ~1 kHz instead of 1 Hz, burning CPU and flooding logs.
To Reproduce
-
Build any service linking brpc >= 1.17.0-rc1 (commit 12fb539 or newer).
-
Register any bvar (e.g. bvar::LatencyRecorder).
-
Start the process and tail its log:
grep "bvar is busy at sampling" <logfile> | wc -l
Observed: tens of thousands of warnings per minute.
Expected behavior
The sampler runs once per second; no "busy" warning unless the host is truly overloaded (the prior <=1.16.x behavior).
Versions
- brpc: 1.17.0-rc1 / current master (commit 12fb539 and descendants)
- OS: Linux 5.14 (x86_64)
- Compiler: gcc 12, C++20
Fix
Follow-up PR to be linked below.
Describe the bug
Since commit 12fb539 ("Use monotonic time instead of wall time" / #3268, merged around 1.17.0-rc1), the bvar sampler thread no longer sleeps between sampling rounds and prints
continuously (tens of thousands of log lines per minute on a process with a moderate number of bvars).
Root cause is a unit-conversion mistake in
src/bvar/detail/sampler.cpp::SamplerCollector::run():PR #3268 switched the three time-source calls from
gettimeofday_us()(microseconds) tocpuwide_time_ns()(nanoseconds), but leftabstime += 1000000Land::usleep(abstime - now)untouched. The intended "sample once per second" becomes "advance abstime by 1 ms"; the innerwhileloop almost never enters (sosleptstaysfalse), and after two consecutive nosleep iterations the warning fires. In practice the sampler runs at ~1 kHz instead of 1 Hz, burning CPU and flooding logs.To Reproduce
Build any service linking brpc >= 1.17.0-rc1 (commit 12fb539 or newer).
Register any bvar (e.g.
bvar::LatencyRecorder).Start the process and tail its log:
Observed: tens of thousands of warnings per minute.
Expected behavior
The sampler runs once per second; no "busy" warning unless the host is truly overloaded (the prior <=1.16.x behavior).
Versions
Fix
Follow-up PR to be linked below.