Skip to content

Conversation

@AndersonQ
Copy link
Member

@AndersonQ AndersonQ commented Aug 8, 2025

Resolves a race condition in the monitoring.New* functions that could cause a panic during concurrent variable creation.

The Problem

The New* constructor functions are intended to be goroutine-safe. However, the logic to first check if a variable exists and then create it is not an atomic operation. This creates a race condition where multiple goroutines can simultaneously determine that a variable does not exist and then all attempt to create it.

This leads to a panic in the following scenario:

  1. Goroutine A calls NewUint, checks for the variable, and sees it doesn't exist.
  2. Goroutine B calls NewUint, also checks for the variable, and sees it doesn't exist.
  3. Goroutine B proceeds to call addVar and successfully creates the new variable.
  4. Goroutine A then attempts to call addVar. Since the variable now exists (created by Goroutine B), addVar panics.

The Solution

This PR resolves the race condition by wrapping the entire "check-and-create" sequence within a single lock. This ensures that the check for a variable's existence and its subsequent creation are performed atomically, preventing other goroutines from interfering and causing a panic.

Also a benchmark was added to ensure the new synchronisation does not have a significant impact:

❯ benchstat without-fix with-fix

goos: linux
goarch: amd64
pkg: github.com/elastic/elastic-agent-libs/monitoring
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
                        │ without-fix │              with-fix               │
                        │   sec/op    │   sec/op     vs base                │
NewVars/NewInt-16         674.1n ± 4%   681.3n ± 3%        ~ (p=0.247 n=10)
NewVars/NewUint-16        655.9n ± 2%   698.3n ± 6%   +6.47% (p=0.000 n=10)
NewVars/NewFloat-16       654.8n ± 3%   680.2n ± 6%   +3.88% (p=0.002 n=10)
NewVars/NewBool-16        653.9n ± 2%   749.0n ± 8%  +14.54% (p=0.000 n=10)
NewVars/NewString-16      717.0n ± 2%   790.9n ± 2%  +10.31% (p=0.000 n=10)
NewVars/NewFunc-16        665.6n ± 2%   718.1n ± 3%   +7.90% (p=0.000 n=10)
NewVars/NewTimestamp-16   723.8n ± 6%   820.8n ± 3%  +13.40% (p=0.000 n=10)
geomean                   677.3n        732.3n        +8.13%

                        │ without-fix │             with-fix              │
                        │    B/op     │    B/op     vs base               │
NewVars/NewInt-16          205.5 ± 2%   207.5 ± 3%       ~ (p=0.107 n=10)
NewVars/NewUint-16         204.5 ± 2%   209.5 ± 2%  +2.44% (p=0.000 n=10)
NewVars/NewFloat-16        205.0 ± 1%   207.5 ± 2%  +1.22% (p=0.007 n=10)
NewVars/NewBool-16         205.5 ± 2%   218.5 ± 6%  +6.33% (p=0.002 n=10)
NewVars/NewString-16       254.0 ± 1%   265.0 ± 2%  +4.33% (p=0.001 n=10)
NewVars/NewFunc-16         189.5 ± 2%   198.0 ± 3%  +4.49% (p=0.000 n=10)
NewVars/NewTimestamp-16    271.5 ± 1%   282.5 ± 2%  +4.05% (p=0.000 n=10)
geomean                    217.6        225.0       +3.39%

                        │ without-fix │              with-fix               │
                        │  allocs/op  │ allocs/op   vs base                 │
NewVars/NewInt-16          6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
NewVars/NewUint-16         6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
NewVars/NewFloat-16        6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
NewVars/NewBool-16         6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
NewVars/NewString-16       6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
NewVars/NewFunc-16         5.000 ± 0%   5.000 ± 0%       ~ (p=1.000 n=10) ¹
NewVars/NewTimestamp-16    6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                    5.846        5.846       +0.00%
¹ all samples are equal

To reproduce the results:

  • on this branch
go test -bench=^BenchmarkNewVars -run=^$ -benchmem -timeout=0 -count 10 > with-fix
  • on main
go test -bench=^BenchmarkNewVars -run=^$ -benchmem -timeout=0 -count 10 > without-fix
  • then
benchstat without-fix with-fix

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

Related issues

@AndersonQ AndersonQ self-assigned this Aug 8, 2025
@AndersonQ AndersonQ added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Aug 8, 2025
@AndersonQ AndersonQ force-pushed the monitoring-fix-race-condition branch from 41795e7 to 22489d8 Compare August 8, 2025 16:24
@AndersonQ AndersonQ marked this pull request as ready for review August 8, 2025 16:24
@AndersonQ AndersonQ requested a review from a team as a code owner August 8, 2025 16:24
@AndersonQ AndersonQ requested review from mauri870 and rdner and removed request for a team August 8, 2025 16:24
mauri870
mauri870 previously approved these changes Aug 11, 2025
@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @AndersonQ

@AndersonQ AndersonQ merged commit 10b7865 into elastic:main Aug 13, 2025
2 of 5 checks passed
@cmacknz
Copy link
Member

cmacknz commented Aug 15, 2025

What branches are affected by this bug, and is elastic-agent-libs updated with this fix in each of them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Race condition on monitoring.New[Uint, Int, ...]

4 participants