Fix race in FQDN values during startup #35736
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
This pull request does not have a backport label. To fix this, add the backport labels for the needed branches.
```go
if p.config.CacheTTL <= 0 {
	return true
}

if p.lastUpdate.Add(p.config.CacheTTL).After(time.Now()) {
	return false
}
p.lastUpdate.Time = time.Now()
```
It looks to me like this is still a race -- if the cache is expired, then calling `expired()` twice in a row will return true the first time and false the second time. But to give a correct result, once `expired()` becomes true, it should stay true until the next `loadData` completes.

The PR description mentions not wanting everything to grind to a halt on every `loadData`, and that makes sense, but everything should grind to a halt when the cache is expired, until at least one `loadData` completes. The tricky part here is making sure that we don't do redundant work -- the first caller to notice that the cache is expired should call `loadData`, then any additional callers should notice that the cache is expired and the result is already being recomputed, and wait for it to finish.
So, my thinking is that with `loadData()`, there's a tradeoff involved:
- Normal cache expiry doesn't hold the mutex: a certain number of events get stale hostname data, but updating the hostname will only block one event.
- Normal cache expiry holds the mutex: no events get stale hostname data, but we globally block events until the hostname is updated.

The latter case makes me pretty nervous. Note that the call to `FQDN()` that happens in `loadData()` can, depending on the platform and setup, involve network calls to look up DNS info. Blocking all event ingestion while we do network calls is kinda scary, so I figured the tradeoff was worth it. Open to other ideas, though.
Hmm, so one simplifying factor that I think we weren't relying on: a processor is run on a single goroutine. On the one hand, this means that some things that look like race conditions aren't: updating the cache TTL in `expired()` when it is checked, rather than in `loadData` when the data is updated, makes the code confusing, but it doesn't actually break anything, because the only important caller of `expired()` is `loadData`, and if `expired()` returns true then `loadData` synchronously reloads everything.

(Important exception: the tests. The existing tests used `expired()` to "prime" the cache's `lastUpdate` value, even though `expired()` itself doesn't load any data. If the tests need a special call to mark the cache as non-expired, then there should be separate helper functions for checking expiration and resetting the expiration time, instead of making `expired()` itself do both; otherwise, at the very least, it's hard to reason about whether they're testing the right thing.)
The upshot, though, is that the internal behavior was confusing but sequentially correct. The error comes in with FQDN flag updates, which are the only changes that happen on a different goroutine than the processor itself. The previous behavior was almost correct: while holding a mutex, set the cache update time to zero, so the next call to `expired()` would return true, which would trigger a synchronous update of the FQDN on the next `loadData()` call. Since the processor itself is locked to a single goroutine, no other callers will try using `loadData()` while that update is happening.

I think the core problem might be the callback issue I pointed out elsewhere: the FQDN callbacks overwrite each other, so only one copy of `add_host_metadata` receives the update, and the rest just have to wait until their `cacheTTL` expires.
So, you're right that the behavior is "almost" correct. When I was still trying to figure out a fix, one thing I experimented with was how many data streams it takes to reproduce the issue. If you just have one data stream installed, you don't hit the bug, as there's only one goroutine involved, like you said. The issue happens when we have multiple data streams, and thus multiple threads.

Not sure what you mean by "the FQDN callbacks overwrite each other". Do you mean how the cache expiry happens between threads?
No, I mean the issue mentioned in this comment -- when `add_host_metadata` registers its FQDN callback, it is keyed by just the string `"add_host_metadata"`, so only one callback can be active at a time. That means whenever the FQDN flag changes, one lucky `add_host_metadata` instance will receive the callback and update its cache, while the rest will be ignored.
Ah, okay. I did change that. Now registering a duplicate key returns an error, and the processor uses a key name that includes its UUID.

I could go back and run some tests, but my understanding is that that's largely unrelated, and that under Agent only one processor instance will register itself globally during startup.

@faec I think that's all the changes you mentioned over Zoom. Also did the panic-on-register test, no issues.
Looks good, despite the scary context. I wish I were more confident that there aren't other synchronization errors around this code, but I believe your change strictly improves the situation 😜
/test
What does this PR do?
Fixes elastic/elastic-agent#2697
This fixes a rather complex race condition that can happen when FQDN hostnames are enabled during startup:

1. `add_host_metadata` starts up before everything else and sets the hostname value without FQDN.
2. The FQDN feature-flag update calls `expireCache()` in `add_host_metadata`, telling the processor to update the FQDN value next time it gets an event.
3. An event reaches the processor, which calls `loadData()`; it checks to see if the cache is expired. It is, so `expired()` returns `true` and then resets the cache update time.
4. Another event arrives on a different thread, calling `loadData()` again. `loadData()` calls `expired()`, which returns `false`. `loadData()` then returns without updating the FQDN, as the cache is not expired. An event with the non-FQDN value is returned from the processor.
5. Only a later cache expiry brings `add_host_metadata`'s processor to the correct FQDN.

This fix makes two major changes:
1. `expireCache()` now synchronously calls `loadData()`, holding a mutex until the load completes or a timeout is reached. This was done in `expireCache()` in order to minimize the amount of time we spend synchronously updating `add_host_metadata`'s state; I didn't want to create a fix where all the threads grind to a halt whenever one thread calls `loadData()`.
2. Reworks the mutexes in `features`. This splits apart the library into two mutexes, one for the features flag, and another for the callback hashmap. This was done out of an abundance of caution, so that if a callback blocks for a certain period of time, other threads can continue to call `features.FQDN()`.
Note that this only fixes the races I could reproduce directly; there are still a few other conditions I'm worried about:

- Agent calls `features.UpdateFromConfig` after the inputs have started. From an Agent client perspective, I'm not sure if this is possible, since if `fqdn` is set to `true`, we should receive that feature flag for all units (I think), but there's nothing that strictly prevents it from happening.
- Agent calls `features.UpdateFromConfig()` before the processor calls `AddFQDNOnChangeCallback()`. I'm not sure this is possible, since processors are initialized rather early, but again, there's nothing that formally stops it from happening.

Why is it important?
This is a race condition that can produce incorrect data.
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`

How to test this PR locally
There are reproduction steps in elastic/elastic-agent#2697, although this is a race condition and it's a bit hard to reproduce (or not reproduce) reliably. My steps for development were: