Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

identity/cache: don't panic in CachingIdentityAllocator.Close() #24694

Merged
merged 1 commit into from Apr 13, 2023

Conversation

lmb
Copy link
Contributor

@lmb lmb commented Apr 3, 2023

Closing a CachingIdentityAllocator currently panics if InitIdentityAllocator() has not been called. With the ongoing hiveification this causes a problem:

level=panic msg="Close() called without calling InitIdentityAllocator() first" subsys=identity-cache
panic: (*logrus.Entry) 0xc000666b60

The reason this happens is that we now create a CachingIdentityAllocator using hive, before newDaemon is invoked. Essentially:

  • Create a new CachingIdentityAllocator using hive
  • Call newDaemon(allocator)
    • Do a bunch of unrelated initialisation
    • Call allocator.InitIdentityAllocator

If any of the unrelated initialization fails, hive executes the stop hooks of all objects it has created. In the case of the allocator this invokes CachingIdentityAllocator.Close(). That function panics since we haven't reached InitIdentityAllocator in the initialization order yet.

Fix this by allowing to close an uninitialized allocator.

Fixes: d6503fa ("daemon: move circular initialization of policy.Repository to hive")

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 3, 2023
@lmb lmb added the release-note/misc This PR makes changes that have no direct user impact. label Apr 3, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 3, 2023
@lmb
Copy link
Contributor Author

lmb commented Apr 3, 2023

/test

1 similar comment
@lmb
Copy link
Contributor Author

lmb commented Apr 3, 2023

/test

@lmb lmb force-pushed the policy-trifecta-shutdown-crash branch from c97b3b1 to b04f5d9 Compare April 3, 2023 17:26
@lmb
Copy link
Contributor Author

lmb commented Apr 3, 2023

/test

@lmb
Copy link
Contributor Author

lmb commented Apr 4, 2023

ci-l4lb failure is a flake as far as I can tell: #24728

@lmb lmb marked this pull request as ready for review April 4, 2023 10:29
@lmb lmb requested a review from a team as a code owner April 4, 2023 10:29
@lmb lmb requested review from nebril and joestringer April 4, 2023 10:29
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that this is an indication that CachingIdentityAllocator is partially converted to hive, such that its Close() method is registered for shutdown despite the object never being initialized. I'd argue that it shouldn't be in the Hive's cleanup list if it was never initialized in the first place. Or alternatively this packages should be properly converted into a Cell so that both the initialization and cleanup are properly ordered and guaranteed (probably the right long-term approach, but I expect this is a significant additional effort over this proposal so I don't think it's reasonable to require that proposal to solve the current issue).

As far as I can tell, the previous panic here was primarily introduced as a warning sign that the code elsewhere is structured incorrectly such that the CachingIdentityAllocator is somehow available for referencing across other parts of the codebase. The danger is that potentially any methods could be called against this object, including the Close() method but also including other methods that will not behave correctly if the object is uninitialized.

Removing the panic is removing that safeguard, so there's a small potential that subsequent refactoring could lead to attempting to use the Uninitialized CachingIdentityAllocator. If the initialization logic is lined up to avoid this, we have no problem. If we ever break that assumption because the CachingIdentityAllocator reference is available for use in the agent prior to initialization, then we'll start seeing some other weird behaviour because the rest of this package would assume that the CachingIdentityAllocator is initialized.

That said, if we did introduce a bug like the above, then this existing check in the Close() method wouldn't guarantee to detect it anyway. Close is only called on shutdown, so if there was a race condition where this object is used prior to an InitIdentityAllocator() call, then we'd still hit some buggy behaviour during any regular agent startup.

Long term, the right solution is probably to have the InitIdentityAllocator() call happen in a Start() method, matching the Close() call from a correspondind Cell Stop() method. I anticipate that may require some significant and potentially risky code changes (though I'd be happy to be proven wrong).

In the mean time, this proposal as-is will resolve a pain point where a failure during agent startup will print misleading logs about the cause of the failure, since the panic occurs and suggests that the problem is in pkg/identity/cache/allocator rather than the actual code that returns an error during startup. I think that's a net benefit to the codebase even if we do have inconsistencies elsewhere that are driving the need for this code.

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to click Approve. 👍

@lmb
Copy link
Contributor Author

lmb commented Apr 11, 2023

Or alternatively this packages should be properly converted into a Cell so that both the initialization and cleanup are properly ordered and guaranteed (probably the right long-term approach, but I expect this is a significant additional effort over this proposal so I don't think it's reasonable to require that proposal to solve the current issue).

I started down this path a little bit, but it's very tricky to do I think. The problem is that there is a circular dependency in the "policy trifecta" as I called it. Getting rid of that is hard.

That said, I checked where the allocator is fully initialized:

cilium/daemon/cmd/daemon.go

Lines 1184 to 1194 in 2216111

if option.Config.DatapathMode != datapathOption.DatapathModeLBOnly {
// This needs to be done after the node addressing has been configured
// as the node address is required as suffix.
// well known identities have already been initialized above.
// Ignore the channel returned by this function, as we want the global
// identity allocator to run asynchronously.
realIdentityAllocator := d.identityAllocator
realIdentityAllocator.InitIdentityAllocator(clientset, nil)
d.bootstrapClusterMesh(nodeMngr)
}

clientset is already provided via hive, so maybe moving the initialization call into hive is not as hard as it seems? Do you know where "node addressing" is configured?

Closing a CachingIdentityAllocator currently panics if InitIdentityAllocator()
has not been called. With the ongoing hiveification this causes a problem:

    level=panic msg="Close() called without calling InitIdentityAllocator() first" subsys=identity-cache
    panic: (*logrus.Entry) 0xc000666b60

The reason this happens is that we now create a CachingIdentityAllocator
using hive, before newDaemon is invoked. Essentially:

* Create a new CachingIdentityAllocator using hive
* Call newDaemon(allocator)
  * Do a bunch of unrelated initialisation
  * Call allocator.InitIdentityAllocator

If any of the unrelated initialization fails, hive executes the stop hooks
of all objects it has created. In the case of the allocator this invokes
CachingIdentityAllocator.Close(). That function panics since we haven't
reached InitIdentityAllocator in the initialization order yet.

Fix this by allowing to close an uninitialized allocator. We still print an
error so that unit tests fail.

Fixes: d6503fa ("daemon: move circular initialization of policy.Repository to hive")
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
@lmb lmb force-pushed the policy-trifecta-shutdown-crash branch from b04f5d9 to 61a873b Compare April 12, 2023 15:25
@lmb
Copy link
Contributor Author

lmb commented Apr 12, 2023

Spoke with Jussi, fixing this correctly is going to be hard (TM). His suggestion was to change the Panic to an Error, which will also make our tests fail and avoids turning this into a silent error.

@lmb
Copy link
Contributor Author

lmb commented Apr 12, 2023

/test

@lmb lmb added kind/bug This is a bug in the Cilium logic. area/modularization labels Apr 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 12, 2023
@dylandreimerink dylandreimerink merged commit 640590e into cilium:master Apr 13, 2023
57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/modularization kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants