
Fix race in FQDN values during startup #35736

Merged (6 commits) on Jun 13, 2023

Conversation

@fearful-symmetry (Contributor) commented on Jun 9, 2023

What does this PR do?

Fixes elastic/elastic-agent#2697

This fixes a rather complex race condition that can occur when FQDN hostnames are enabled during startup (a simplified sketch of the pattern follows the list):

  1. The add_host_metadata processor starts up before everything else and sets the hostname value without the FQDN.
  2. The config manager gets a unit update with fqdn: true and sets the global FQDN flag.
  3. The config manager calls expireCache() in add_host_metadata, telling the processor to update the FQDN value the next time it gets an event.
  4. A new event reaches the processor and calls loadData(), which checks whether the cache is expired. It is, so expired() returns true and resets the cache update time.
  5. Another event comes in and calls loadData() again. loadData() calls expired(), which now returns false, so loadData() returns without updating the FQDN, since the cache no longer looks expired. An event with the non-FQDN value leaves the processor.
  6. Step 5 repeats one to three times.
  7. The original event that first triggered the cache-time reset finally loads the FQDN, and the processor starts reporting the correct value.
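
To make the sequence above concrete, here is a minimal sketch of the problematic pattern, using simplified stand-in names (hostCache, loadData, lookup) rather than the actual add_host_metadata types: expired() resets the timestamp as a side effect, so only the first caller after expiry triggers a reload, and callers that arrive before that reload finishes keep returning the stale hostname.

package sketch

import (
	"sync"
	"time"
)

// hostCache is an illustrative stand-in for the processor's cached host data.
type hostCache struct {
	mu         sync.Mutex
	lastUpdate time.Time
	ttl        time.Duration
	hostname   string // stays at the non-FQDN value until a reload completes
}

// expired resets lastUpdate as a side effect, so only the first caller after
// expiry sees true; callers that arrive while that first reload is still
// running see false and keep the stale value (step 5 above).
func (c *hostCache) expired() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastUpdate) < c.ttl {
		return false
	}
	c.lastUpdate = time.Now()
	return true
}

// loadData only reloads when expired() reports true; the reload itself (an
// FQDN lookup) can be slow, and that is the window in which other goroutines
// keep emitting events with the old hostname.
func (c *hostCache) loadData(lookup func() string) string {
	if c.expired() {
		c.hostname = lookup()
	}
	return c.hostname
}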

This fix makes two major changes:

  • In order to force an update to the hostname, expireCache() now synchronously calls loadData(), holding a mutex until the load completes or a timeout is reached. Doing this inside expireCache() minimizes the time spent synchronously updating add_host_metadata's state; I didn't want a fix where every thread grinds to a halt whenever any thread calls loadData().
  • Refactor how mutexes work in features. The library now uses two mutexes, one for the feature flag and another for the callback map (a rough sketch follows). This was done out of an abundance of caution, so that if a callback blocks for a while, other threads can still call features.FQDN().
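
A rough sketch of the mutex split, assuming the flag and the callback map are the two pieces of shared state; the type and function names (fqdnState, setFQDN) are illustrative, not the exact libbeat/features code:

package sketch

import "sync"

// boolValueOnChangeCallback is the assumed callback shape: (new value, old value).
type boolValueOnChangeCallback func(new, old bool)

// fqdnState is an illustrative name for the state split across two mutexes.
type fqdnState struct {
	flagMu  sync.RWMutex // guards only the FQDN flag
	enabled bool

	cbMu      sync.Mutex // guards only the callback map
	callbacks map[string]boolValueOnChangeCallback
}

// FQDN touches only flagMu, so readers stay responsive even if a callback
// registered by a processor blocks for a while under cbMu.
func (s *fqdnState) FQDN() bool {
	s.flagMu.RLock()
	defer s.flagMu.RUnlock()
	return s.enabled
}

// setFQDN updates the flag under flagMu, then notifies callbacks under the
// separate cbMu; a slow callback can no longer block readers of the flag.
func (s *fqdnState) setFQDN(enabled bool) {
	s.flagMu.Lock()
	old := s.enabled
	s.enabled = enabled
	s.flagMu.Unlock()

	s.cbMu.Lock()
	defer s.cbMu.Unlock()
	for _, cb := range s.callbacks {
		cb(enabled, old)
	}
}

With this shape, a callback that stalls delays only the other callbacks waiting on cbMu, while readers of the flag contend only on flagMu.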

Note that this only fixes the races I could reproduce directly; there are still a few other conditions I'm worried about:

  • We might call features.UpdateFromConfig() after the inputs have started. From an Agent client perspective I'm not sure this is possible, since if fqdn is set to true we should receive that feature flag for all units (I think), but nothing strictly prevents it from happening.
  • The Beat's agent client somehow calls features.UpdateFromConfig() before the processor calls AddFQDNOnChangeCallback(). I'm not sure this is possible either, since processors are initialized fairly early, but again, nothing formally stops it from happening.

Why is it important?

This is a race condition that can produce incorrect data.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

How to test this PR locally

There are reproduction steps in elastic/elastic-agent#2697, although this is a race condition and it's a bit hard to reproduce (or not reproduce) reliably. My steps for development were:

  • Build elastic-agent with this branch
  • Run elastic-agent with:
agent.features:
  fqdn:
    enabled: true

@fearful-symmetry added the bug and Team:Elastic-Agent-Control-Plane labels on Jun 9, 2023
@fearful-symmetry requested a review from a team as a code owner on Jun 9, 2023, 18:44
@fearful-symmetry self-assigned this on Jun 9, 2023
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@botelastic bot added and then removed the needs_team label on Jun 9, 2023
@mergify bot (Contributor) commented on Jun 9, 2023:

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the digit)

libbeat/features/features.go: two review comments (outdated, resolved)
Quoted diff context (the processor's cache-expiry check discussed in this thread):

if p.config.CacheTTL <= 0 {
	// A non-positive TTL disables caching, so always report the cache as expired.
	return true
}

if p.lastUpdate.Add(p.config.CacheTTL).After(time.Now()) {
	// The cache is still within its TTL.
	return false
}
// Note: the update time is reset here, as a side effect of the check, even
// though the actual reload happens later in loadData().
p.lastUpdate.Time = time.Now()
Contributor (reviewer) commented:

It looks to me like this is still a race -- if the cache is expired, then calling expired() twice in a row will return true the first time and false the second time. But to give a correct result, once expired() becomes true, it should stay true until the next loadData completes.

The PR description mentions not wanting everything to grind to a halt on every loadData, and that makes sense, but everything should grind to a halt when the cache is expired, until at least one loadData completes. The tricky part here is making sure that we don't do redundant work -- the first caller to notice that the cache is expired should call loadData, then any additional callers should notice that the cache is expired and the result is already being recomputed, and wait for it to finish.
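
One possible shape of that behavior, as an illustrative sketch rather than the code in this PR: the first caller that finds the cache past its TTL performs the reload while holding the lock, concurrent callers block until it finishes instead of returning stale data, and callers arriving afterwards see a fresh timestamp and skip the work.

package sketch

import (
	"sync"
	"time"
)

type cachedHost struct {
	mu         sync.Mutex
	lastUpdate time.Time
	ttl        time.Duration
	hostname   string
}

// get reloads at most once per expiry window: the first caller past the TTL
// performs the (possibly slow) lookup, and callers that arrive in the
// meantime wait on the mutex and then return the freshly loaded value
// instead of a stale one.
func (c *cachedHost) get(lookup func() string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastUpdate) >= c.ttl {
		c.hostname = lookup()
		c.lastUpdate = time.Now()
	}
	return c.hostname
}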

@fearful-symmetry (Contributor, Author) replied on Jun 9, 2023:

So, my thinking is that with loadData(), there's a tradeoff involved:

  • Normal cache expiry doesn't hold the mutex: a certain number of events get stale hostname data, but updating the hostname only blocks one event.
  • Normal cache expiry holds the mutex: no events get stale hostname data, but we globally block events until the hostname is updated.

The latter case makes me pretty nervous. Note that the call to FQDN() that happens in loadData() can, depending on the platform & setup, involve some network calls to look up DNS info. Blocking all event ingestion while we do network calls is kinda scary, so I figured the tradeoff was worth it. Open to other ideas, though.

Contributor (reviewer) replied:

Hmm, so one simplifying factor that I think we weren't relying on: a processor is run on a single goroutine. On the one hand, this means that some things that look like race conditions aren't: updating the cache TTL in expired when it is checked rather than in loadData when it is updated makes the code confusing, but it doesn't actually break anything, because the only important caller of expired() is loadData, and if expired() returns true then loadData synchronously reloads everything.

(Important exception: the tests. The existing tests used expired() to "prime" the cache's lastUpdated value, even though expired() itself doesn't load any data. If the tests need a special call to mark the cache as non-expired, then there should be separate helper functions for checking expiration and resetting expiration time, instead of making expired() itself do both, otherwise at the very least it's hard to reason about whether they're testing the right thing.)

The upshot though is: the internal behavior was confusing but sequentially correct. The error comes in with FQDN flag updates, which are the only changes that happen on a different goroutine than the processor itself. The previous behavior was almost correct: while holding a mutex, set the cache update time to zero, so the next call to expired() would return true, which would trigger a synchronous update of FQDN on the next loadData() call. Since the processor itself is locked to a single goroutine, no other callers will try using loadData() while that update is happening.

I think the core problem might be the callback issue I pointed out elsewhere: the FQDN callbacks overwrite each other, so only one copy of add_host_metadata receives the update, and the rest just have to wait until their cacheTTL expires.

@fearful-symmetry (Contributor, Author) replied:

So, you're right that the behavior is "almost" correct. When I was still trying to figure out a fix, one thing I experimented with was how many data streams it takes to reproduce the issue. With just one data stream installed, you don't hit the bug, as there's only one goroutine involved, like you said. The issue happens when we have multiple data streams, and thus multiple threads.

Not sure what you mean by "the FQDN callbacks overwrite each other". Do you mean how the cache expiry happens between threads?

Contributor (reviewer) replied:

No, I mean the issue mentioned in this comment -- when add_host_metadata registers its fqdn callback, it is just keyed by the string "add_host_metadata", so only one callback can be active at a time. That means whenever the FQDN flag changes, one lucky add_host_metadata instance will receive the callback and update its cache, while the rest will be ignored.

@fearful-symmetry (Contributor, Author) replied on Jun 12, 2023:

Ah, okay. I did change that. Now registering a duplicate key returns an error, and the processor registers under a key that includes a UUID.

I could go back and run some tests, but my understanding is that this is largely unrelated, and that under Agent only one processor instance will register itself globally during startup.
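
For illustration, a sketch of what that registration change could look like; the real signatures in libbeat/features and add_host_metadata may differ, and callbackRegistry/add are made-up names:

package sketch

import (
	"fmt"
	"sync"
)

type onChangeCallback func(new, old bool)

type callbackRegistry struct {
	mu        sync.Mutex
	callbacks map[string]onChangeCallback
}

// add refuses duplicate keys instead of silently overwriting a previously
// registered callback, so a second add_host_metadata instance can no longer
// knock out the first instance's FQDN-change notification.
func (r *callbackRegistry) add(key string, cb onChangeCallback) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.callbacks == nil {
		r.callbacks = map[string]onChangeCallback{}
	}
	if _, exists := r.callbacks[key]; exists {
		return fmt.Errorf("callback %q is already registered", key)
	}
	r.callbacks[key] = cb
	return nil
}

With the key built from the processor name plus a per-instance UUID, each add_host_metadata instance keeps its own entry, and a duplicate registration surfaces as an error instead of silently replacing an earlier callback.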

@elasticmachine (Collaborator) commented on Jun 9, 2023:

💚 Build Succeeded

Build stats

  • Start Time: 2023-06-13T21:37:09.842+0000

  • Duration: 67 min 0 sec

Test stats 🧪

Test Results: 0 failed, 27338 passed, 2013 skipped, 29351 total

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@fearful-symmetry (Contributor, Author) commented:

@faec I think that's all the changes you mentioned over Zoom. Also ran the panic-on-register test; no issues.

@faec (Contributor) left a review:

Looks good, despite the scary context. I wish I were more confident that there aren't other synchronization errors around this code, but I believe your change strictly improves the situation 😜

@fearful-symmetry (Contributor, Author) commented:

/test
