Skip to content

Comments

Fix race condition when we init lambda#746

Merged
tiedotguy merged 3 commits intomasterfrom
tdg/fix-lambda-race
Mar 24, 2025
Merged

Fix race condition when we init lambda#746
tiedotguy merged 3 commits intomasterfrom
tdg/fix-lambda-race

Conversation

@tiedotguy
Copy link
Collaborator

There's a couple of problems here:

  1. We subscribe a telemetry endpoint before the service is ready. If we receive a flush to that endpoint, then we try to perform a flush and get an NPE.

  2. The access to the flusher is unprotected across multiple goroutines. We just use an RWMutex for this, because it's rarely accessed and doesn't need anything fancy.

  3. There's a race condition in the test because the nop write is (was) async.

@tiedotguy tiedotguy requested a review from irisgve March 18, 2025 16:53
@tiedotguy tiedotguy marked this pull request as draft March 18, 2025 17:51
@tiedotguy
Copy link
Collaborator Author

Got some test problems, I'll force-push the fix.

There's a couple of problems here:

1. We subscribe a telemetry endpoint before the service is ready.  If we
   receive a flush to that endpoint, then we try to perform a flush and
   get an NPE.

2. The access to the flusher is unprotected across multiple goroutines.  We
   just use an RWMutex for this, because it's rarely accessed and doesn't
   need anything fancy.

3. There's a race condition in the test because the nop write is (was) async.
@tiedotguy tiedotguy force-pushed the tdg/fix-lambda-race branch from 0e95a08 to fb71920 Compare March 18, 2025 18:06
Because we start the http server in the background, sometimes it's not ready
when we try to call it.
@tiedotguy tiedotguy marked this pull request as ready for review March 18, 2025 18:56

ch := &channeledHandler{
chMaps: make(chan *gostatsd.MetricMap),
chMaps: make(chan *gostatsd.MetricMap, 1),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this because the sendNop is now sync. If that doesn't go in to the channel, HttpForwarderHandlerV2.Run never starts using the virtual clock, and fixtures.NextStep() waits forever (because it's waiting for something to use the clock)

func (hfh *HttpForwarderHandlerV2) Run(ctx context.Context) {
var wg wait.Group
wg.StartWithContext(ctx, hfh.sendNop)
hfh.sendNop(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my context, why do we need to send a nop before we do actual flushes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deepcheck checks if we've ever sent metrics - if we can't send metrics, we don't want to start.

We don't check how longs it's been though, since this is about misconfiguration, not backend going down.

@tiedotguy
Copy link
Collaborator Author

Actually, I guess I should include a changelog.

@tiedotguy tiedotguy merged commit a5459f6 into master Mar 24, 2025
5 checks passed
@tiedotguy tiedotguy deleted the tdg/fix-lambda-race branch May 16, 2025 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants