Fix reëntrant events/spans escaping per-layer filters (round 2)#2
Conversation
Hamilton Chapman (hamchapman)
left a comment
There was a problem hiding this comment.
Not gonna lie, it took me about 4 complete read throughs before I feel like I really grokked it but I feel confident I do now 😅
Nicely done getting this solved - it makes sense to me 👍
At least once I got hung up on the fact that there's self.enabled which then has a disabled bitmask. My simple brain was struggling to reconcile that for a while
Oh, thanks for saying something about that - that's just a missed rename from when I basically re-applied that part of the diff from the old PR, but from inside my brain. I'll rename |
…irectives (tokio-rs#3487) ## Motivation `EnvFilter` maintains two `RwLock`-protected `HashMap`s (`by_id` and `by_cs`) for tracking per-span field-level match state. These are only populated when `has_dynamics` is `true` — i.e., when the filter includes span-name or field-level directives like `my_crate[span_name]=debug`. However, the span lifecycle callbacks (`on_new_span`, `on_enter`, `on_exit`, `on_close`, `on_record`) unconditionally acquire the `by_id` read lock (or write lock for `on_close`) on every invocation, even when `has_dynamics` is `false` and the maps are guaranteed empty. For the common case of pure `target=level` directives (e.g. `RUST_LOG=info,hyper=warn`), these locks are acquired and released on every span enter/exit/close/record with no effect. ## Problem Under high concurrency this causes measurable lock contention. A GDB backtrace captured from a Tokio-based HTTP server (80+ async workers, 3 `EnvFilter` instances as per-layer filters) during a load test showed ~25 threads contending on `RwLock::read` inside `EnvFilter::on_enter`, `on_exit`, and `cares_about_span`, with threads blocked in `read_contended` by `on_close`'s write lock: ``` #0 std::sys::sync::rwlock::futex::RwLock::read_contended #1 RwLock::read (self=0x55e0c7694178) #2 RwLock<HashMap<SpanId, MatchSet<SpanMatch>>>::read #3 EnvFilter::on_enter at filter/env/mod.rs:585 ``` ## Solution Add an early `if !self.has_dynamics { return; }` guard to each of the five span lifecycle methods. When dynamic directives are present, the existing code path runs unchanged. When they're absent (the common case), the lock is never touched.
Replaces #1, still a fork-internal PR. Still motivated by getditto/ditto#13306, still see tokio-rs#2704 for context & discussion.
This fix includes two major changes, the second of which was also in #1:
did_enable()) is called, it will check that bit, see that the filter state has been cleared, and recalculate the enablement status based on the current event/span's metadata. In that way, the filter state is now treated as a cache, which is invalidated when we would have read from stale state. Reëntrant events can therefore no longer clobber the filter state.Filteredlayers, which are layers that have a per-layer filter attached, no longer ever store their interest as "always", instead knocking down an "always" interest to "sometimes". This likely represents a performance regression, but it's a necessary one in order to force the reëvaluation ofenabled()for every span/event - if we use "always" and short-circuit that call, we see incorrect behavior for some kinds of reëntrant event scenarios (such as the new testenabled_event_inside_event)