Skip to content

bvar: sampler busy-loops at ~1 kHz after switch to cpuwide_time_ns (regression from #3268) #3277

@hjwsm1989

Description

@hjwsm1989

Describe the bug

Since commit 12fb539 ("Use monotonic time instead of wall time" / #3268, merged around 1.17.0-rc1), the bvar sampler thread no longer sleeps between sampling rounds and prints

sampler.cpp:192 bvar is busy at sampling for 2 seconds!

continuously (tens of thousands of log lines per minute on a process with a moderate number of bvars).

Root cause is a unit-conversion mistake in src/bvar/detail/sampler.cpp::SamplerCollector::run():

while (!_stop) {
    int64_t abstime = butil::cpuwide_time_ns();   // ns  (was gettimeofday_us() / us)
    ...
    int64_t now = butil::cpuwide_time_ns();       // ns
    _cumulated_time_us += now - abstime;
    abstime += 1000000L;                          // BUG: + 1 ms, not + 1 s
    while (abstime > now) {
        ::usleep(abstime - now);                  // BUG: usleep() takes us, value is ns
        slept = true;
        now = butil::cpuwide_time_ns();
    }
    if (slept) { consecutive_nosleep = 0; }
    else if (++consecutive_nosleep >= WARN_NOSLEEP_THRESHOLD) { LOG(WARNING) ... }
}

PR #3268 switched the three time-source calls from gettimeofday_us() (microseconds) to cpuwide_time_ns() (nanoseconds), but left abstime += 1000000L and ::usleep(abstime - now) untouched. The intended "sample once per second" becomes "advance abstime by 1 ms"; the inner while loop almost never enters (so slept stays false), and after two consecutive nosleep iterations the warning fires. In practice the sampler runs at ~1 kHz instead of 1 Hz, burning CPU and flooding logs.

To Reproduce

  1. Build any service linking brpc >= 1.17.0-rc1 (commit 12fb539 or newer).

  2. Register any bvar (e.g. bvar::LatencyRecorder).

  3. Start the process and tail its log:

    grep "bvar is busy at sampling" <logfile> | wc -l

    Observed: tens of thousands of warnings per minute.

Expected behavior

The sampler runs once per second; no "busy" warning unless the host is truly overloaded (the prior <=1.16.x behavior).

Versions

  • brpc: 1.17.0-rc1 / current master (commit 12fb539 and descendants)
  • OS: Linux 5.14 (x86_64)
  • Compiler: gcc 12, C++20

Fix

Follow-up PR to be linked below.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions