
Fix race in FQDN values during startup #35736

Merged (6 commits) on Jun 13, 2023

Conversation

@fearful-symmetry (Contributor) commented on Jun 9, 2023

What does this PR do?

Fixes elastic/elastic-agent#2697

This fixes a rather complex race condition that can occur when FQDN hostnames are enabled during startup (a simplified sketch of the pattern follows the list):

  1. The add_host_metadata processor starts up before everything else and sets the hostname value without the FQDN.
  2. The config manager gets a unit update with fqdn: true and sets the global FQDN flag.
  3. The config manager calls expireCache() in add_host_metadata, telling the processor to update the FQDN value the next time it gets an event.
  4. A new event reaches the processor and calls loadData(), which checks whether the cache is expired. It is, so expired() returns true and resets the cache update time.
  5. Another event comes in and calls loadData() again. loadData() calls expired(), which now returns false, so loadData() returns without updating the FQDN, since the cache no longer looks expired. An event with the non-FQDN value leaves the processor.
  6. Step 5 repeats one to three times.
  7. The original event that first triggered the cache-time reset finally loads the FQDN, and the processor starts reporting the correct value.
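
To make the sequence above concrete, here is a minimal sketch of the problematic pattern, using simplified stand-in names (hostCache, loadData, lookup) rather than the actual add_host_metadata types: expired() resets the timestamp as a side effect, so only the first caller after expiry triggers a reload, and callers that arrive before that reload finishes keep returning the stale hostname.

package sketch

import (
	"sync"
	"time"
)

// hostCache is an illustrative stand-in for the processor's cached host data.
type hostCache struct {
	mu         sync.Mutex
	lastUpdate time.Time
	ttl        time.Duration
	hostname   string // stays at the non-FQDN value until a reload completes
}

// expired resets lastUpdate as a side effect, so only the first caller after
// expiry sees true; callers that arrive while that first reload is still
// running see false and keep the stale value (step 5 above).
func (c *hostCache) expired() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastUpdate) < c.ttl {
		return false
	}
	c.lastUpdate = time.Now()
	return true
}

// loadData only reloads when expired() reports true; the reload itself (an
// FQDN lookup) can be slow, and that is the window in which other goroutines
// keep emitting events with the old hostname.
func (c *hostCache) loadData(lookup func() string) string {
	if c.expired() {
		c.hostname = lookup()
	}
	return c.hostname
}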

This fix makes two major changes:

  • In order to force an update to the hostname, expireCache() now synchronously calls loadData(), holding a mutex until the load completes or a timeout is reached. Doing this inside expireCache() minimizes the time spent synchronously updating add_host_metadata's state; I didn't want a fix where every thread grinds to a halt whenever any thread calls loadData().
  • Refactor how mutexes work in features. The library now uses two mutexes, one for the feature flag and another for the callback map (a rough sketch follows). This was done out of an abundance of caution, so that if a callback blocks for a while, other threads can still call features.FQDN().
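
A rough sketch of the mutex split, assuming the flag and the callback map are the two pieces of shared state; the type and function names (fqdnState, setFQDN) are illustrative, not the exact libbeat/features code:

package sketch

import "sync"

// boolValueOnChangeCallback is the assumed callback shape: (new value, old value).
type boolValueOnChangeCallback func(new, old bool)

// fqdnState is an illustrative name for the state split across two mutexes.
type fqdnState struct {
	flagMu  sync.RWMutex // guards only the FQDN flag
	enabled bool

	cbMu      sync.Mutex // guards only the callback map
	callbacks map[string]boolValueOnChangeCallback
}

// FQDN touches only flagMu, so readers stay responsive even if a callback
// registered by a processor blocks for a while under cbMu.
func (s *fqdnState) FQDN() bool {
	s.flagMu.RLock()
	defer s.flagMu.RUnlock()
	return s.enabled
}

// setFQDN updates the flag under flagMu, then notifies callbacks under the
// separate cbMu; a slow callback can no longer block readers of the flag.
func (s *fqdnState) setFQDN(enabled bool) {
	s.flagMu.Lock()
	old := s.enabled
	s.enabled = enabled
	s.flagMu.Unlock()

	s.cbMu.Lock()
	defer s.cbMu.Unlock()
	for _, cb := range s.callbacks {
		cb(enabled, old)
	}
}

With this shape, a callback that stalls delays only the other callbacks waiting on cbMu, while readers of the flag contend only on flagMu.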

Note that this only fixes the races I could reproduce directly; there are still a few other conditions I'm worried about:

  • We might call features.UpdateFromConfig() after the inputs have started. From an Agent client perspective I'm not sure this is possible, since if fqdn is set to true we should receive that feature flag for all units (I think), but nothing strictly prevents it from happening.
  • The Beat's agent client somehow calls features.UpdateFromConfig() before the processor calls AddFQDNOnChangeCallback(). I'm not sure this is possible either, since processors are initialized fairly early, but again, nothing formally stops it from happening.

Why is it important?

This is a race condition that can produce incorrect data.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

How to test this PR locally

There are reproduction steps in elastic/elastic-agent#2697, although this is a race condition and it's a bit hard to reproduce (or not reproduce) reliably. My steps for development were:

  • Build elastic-agent with this branch
  • Run elastic-agent with:
agent.features:
  fqdn:
    enabled: true

@fearful-symmetry added the bug and Team:Elastic-Agent-Control-Plane labels on Jun 9, 2023
@fearful-symmetry requested a review from a team as a code owner on Jun 9, 2023, 18:44
@fearful-symmetry self-assigned this on Jun 9, 2023
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@botelastic bot added and then removed the needs_team label on Jun 9, 2023
@mergify bot (Contributor) commented on Jun 9, 2023:

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the digit)

libbeat/features/features.go: two review comments (outdated, resolved)
Quoted diff context (the processor's cache-expiry check discussed in this thread):

if p.config.CacheTTL <= 0 {
	// A non-positive TTL disables caching, so always report the cache as expired.
	return true
}

if p.lastUpdate.Add(p.config.CacheTTL).After(time.Now()) {
	// The cache is still within its TTL.
	return false
}
// Note: the update time is reset here, as a side effect of the check, even
// though the actual reload happens later in loadData().
p.lastUpdate.Time = time.Now()
Contributor (reviewer) commented:

It looks to me like this is still a race -- if the cache is expired, then calling expired() twice in a row will return true the first time and false the second time. But to give a correct result, once expired() becomes true, it should stay true until the next loadData completes.

The PR description mentions not wanting everything to grind to a halt on every loadData, and that makes sense, but everything should grind to a halt when the cache is expired, until at least one loadData completes. The tricky part here is making sure that we don't do redundant work -- the first caller to notice that the cache is expired should call loadData, then any additional callers should notice that the cache is expired and the result is already being recomputed, and wait for it to finish.
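
One possible shape of that behavior, as an illustrative sketch rather than the code in this PR: the first caller that finds the cache past its TTL performs the reload while holding the lock, concurrent callers block until it finishes instead of returning stale data, and callers arriving afterwards see a fresh timestamp and skip the work.

package sketch

import (
	"sync"
	"time"
)

type cachedHost struct {
	mu         sync.Mutex
	lastUpdate time.Time
	ttl        time.Duration
	hostname   string
}

// get reloads at most once per expiry window: the first caller past the TTL
// performs the (possibly slow) lookup, and callers that arrive in the
// meantime wait on the mutex and then return the freshly loaded value
// instead of a stale one.
func (c *cachedHost) get(lookup func() string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastUpdate) >= c.ttl {
		c.hostname = lookup()
		c.lastUpdate = time.Now()
	}
	return c.hostname
}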

@fearful-symmetry (Contributor, Author) replied on Jun 9, 2023:

So, my thinking is that with loadData(), there's a tradeoff involved:

  • Normal cache expiry doesn't hold the mutex: a certain number of events get stale hostname data, but updating the hostname only blocks one event.
  • Normal cache expiry holds the mutex: no events get stale hostname data, but we globally block events until the hostname is updated.

The latter case makes me pretty nervous. Note that the call to FQDN() that happens in loadData() can, depending on the platform & setup, involve some network calls to look up DNS info. Blocking all event ingestion while we do network calls is kinda scary, so I figured the tradeoff was worth it. Open to other ideas, though.

Contributor (reviewer) replied:

Hmm, so one simplifying factor that I think we weren't relying on: a processor is run on a single goroutine. On the one hand, this means that some things that look like race conditions aren't: updating the cache TTL in expired when it is checked rather than in loadData when it is updated makes the code confusing, but it doesn't actually break anything, because the only important caller of expired() is loadData, and if expired() returns true then loadData synchronously reloads everything.

(Important exception: the tests. The existing tests used expired() to "prime" the cache's lastUpdated value, even though expired() itself doesn't load any data. If the tests need a special call to mark the cache as non-expired, then there should be separate helper functions for checking expiration and resetting expiration time, instead of making expired() itself do both, otherwise at the very least it's hard to reason about whether they're testing the right thing.)

The upshot though is: the internal behavior was confusing but sequentially correct. The error comes in with FQDN flag updates, which are the only changes that happen on a different goroutine than the processor itself. The previous behavior was almost correct: while holding a mutex, set the cache update time to zero, so the next call to expired() would return true, which would trigger a synchronous update of FQDN on the next loadData() call. Since the processor itself is locked to a single goroutine, no other callers will try using loadData() while that update is happening.

I think the core problem might be the callback issue I pointed out elsewhere: the FQDN callbacks overwrite each other, so only one copy of add_host_metadata receives the update, and the rest just have to wait until their cacheTTL expires.

@fearful-symmetry (Contributor, Author) replied:

So, you're right that the behavior is "almost" correct. When I was still trying to figure out a fix, one thing I experimented with was how many data streams it takes to reproduce the issue. With just one data stream installed, you don't hit the bug, as there's only one goroutine involved, like you said. The issue happens when we have multiple data streams, and thus multiple threads.

Not sure what you mean by "the FQDN callbacks overwrite each other". Do you mean how the cache expiry happens between threads?

Contributor (reviewer) replied:

No, I mean the issue mentioned in this comment -- when add_host_metadata registers its fqdn callback, it is just keyed by the string "add_host_metadata", so only one callback can be active at a time. That means whenever the FQDN flag changes, one lucky add_host_metadata instance will receive the callback and update its cache, while the rest will be ignored.

@fearful-symmetry (Contributor, Author) replied on Jun 12, 2023:

Ah, okay. I did change that. Now registering a duplicate key returns an error, and the processor registers under a key that includes a UUID.

I could go back and run some tests, but my understanding is that this is largely unrelated, and that under Agent only one processor instance will register itself globally during startup.
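
For illustration, a sketch of what that registration change could look like; the real signatures in libbeat/features and add_host_metadata may differ, and callbackRegistry/add are made-up names:

package sketch

import (
	"fmt"
	"sync"
)

type onChangeCallback func(new, old bool)

type callbackRegistry struct {
	mu        sync.Mutex
	callbacks map[string]onChangeCallback
}

// add refuses duplicate keys instead of silently overwriting a previously
// registered callback, so a second add_host_metadata instance can no longer
// knock out the first instance's FQDN-change notification.
func (r *callbackRegistry) add(key string, cb onChangeCallback) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.callbacks == nil {
		r.callbacks = map[string]onChangeCallback{}
	}
	if _, exists := r.callbacks[key]; exists {
		return fmt.Errorf("callback %q is already registered", key)
	}
	r.callbacks[key] = cb
	return nil
}

With the key built from the processor name plus a per-instance UUID, each add_host_metadata instance keeps its own entry, and a duplicate registration surfaces as an error instead of silently replacing an earlier callback.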

@elasticmachine (Collaborator) commented on Jun 9, 2023:

💚 Build Succeeded

Build stats

  • Start Time: 2023-06-13T21:37:09.842+0000

  • Duration: 67 min 0 sec

Test stats 🧪

Test Results: 0 failed, 27338 passed, 2013 skipped, 29351 total

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@fearful-symmetry (Contributor, Author) commented:

@faec I think that's all the changes you mentioned over Zoom. Also ran the panic-on-register test; no issues.

@faec (Contributor) left a review:

Looks good, despite the scary context. I wish I were more confident that there aren't other synchronization errors around this code, but I believe your change strictly improves the situation 😜

@fearful-symmetry (Contributor, Author) commented:

/test
