New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
identity/cache: don't panic in CachingIdentityAllocator.Close() #24694
identity/cache: don't panic in CachingIdentityAllocator.Close() #24694
Conversation
/test |
1 similar comment
/test |
c97b3b1
to
b04f5d9
Compare
/test |
ci-l4lb failure is a flake as far as I can tell: #24728 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that this is an indication that CachingIdentityAllocator
is partially converted to hive, such that its Close()
method is registered for shutdown despite the object never being initialized. I'd argue that it shouldn't be in the Hive's cleanup list if it was never initialized in the first place. Or alternatively this packages should be properly converted into a Cell so that both the initialization and cleanup are properly ordered and guaranteed (probably the right long-term approach, but I expect this is a significant additional effort over this proposal so I don't think it's reasonable to require that proposal to solve the current issue).
As far as I can tell, the previous panic here was primarily introduced as a warning sign that the code elsewhere is structured incorrectly such that the CachingIdentityAllocator
is somehow available for referencing across other parts of the codebase. The danger is that potentially any methods could be called against this object, including the Close()
method but also including other methods that will not behave correctly if the object is uninitialized.
Removing the panic is removing that safeguard, so there's a small potential that subsequent refactoring could lead to attempting to use the Uninitialized CachingIdentityAllocator. If the initialization logic is lined up to avoid this, we have no problem. If we ever break that assumption because the CachingIdentityAllocator
reference is available for use in the agent prior to initialization, then we'll start seeing some other weird behaviour because the rest of this package would assume that the CachingIdentityAllocator
is initialized.
That said, if we did introduce a bug like the above, then this existing check in the Close()
method wouldn't guarantee to detect it anyway. Close is only called on shutdown, so if there was a race condition where this object is used prior to an InitIdentityAllocator()
call, then we'd still hit some buggy behaviour during any regular agent startup.
Long term, the right solution is probably to have the InitIdentityAllocator()
call happen in a Start()
method, matching the Close()
call from a correspondind Cell Stop()
method. I anticipate that may require some significant and potentially risky code changes (though I'd be happy to be proven wrong).
In the mean time, this proposal as-is will resolve a pain point where a failure during agent startup will print misleading logs about the cause of the failure, since the panic occurs and suggests that the problem is in pkg/identity/cache/allocator
rather than the actual code that returns an error during startup. I think that's a net benefit to the codebase even if we do have inconsistencies elsewhere that are driving the need for this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to click Approve. 👍
I started down this path a little bit, but it's very tricky to do I think. The problem is that there is a circular dependency in the "policy trifecta" as I called it. Getting rid of that is hard. That said, I checked where the allocator is fully initialized: Lines 1184 to 1194 in 2216111
|
Closing a CachingIdentityAllocator currently panics if InitIdentityAllocator() has not been called. With the ongoing hiveification this causes a problem: level=panic msg="Close() called without calling InitIdentityAllocator() first" subsys=identity-cache panic: (*logrus.Entry) 0xc000666b60 The reason this happens is that we now create a CachingIdentityAllocator using hive, before newDaemon is invoked. Essentially: * Create a new CachingIdentityAllocator using hive * Call newDaemon(allocator) * Do a bunch of unrelated initialisation * Call allocator.InitIdentityAllocator If any of the unrelated initialization fails, hive executes the stop hooks of all objects it has created. In the case of the allocator this invokes CachingIdentityAllocator.Close(). That function panics since we haven't reached InitIdentityAllocator in the initialization order yet. Fix this by allowing to close an uninitialized allocator. We still print an error so that unit tests fail. Fixes: d6503fa ("daemon: move circular initialization of policy.Repository to hive") Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
b04f5d9
to
61a873b
Compare
Spoke with Jussi, fixing this correctly is going to be hard (TM). His suggestion was to change the Panic to an Error, which will also make our tests fail and avoids turning this into a silent error. |
/test |
Closing a CachingIdentityAllocator currently panics if InitIdentityAllocator() has not been called. With the ongoing hiveification this causes a problem:
The reason this happens is that we now create a CachingIdentityAllocator using hive, before newDaemon is invoked. Essentially:
If any of the unrelated initialization fails, hive executes the stop hooks of all objects it has created. In the case of the allocator this invokes CachingIdentityAllocator.Close(). That function panics since we haven't reached InitIdentityAllocator in the initialization order yet.
Fix this by allowing to close an uninitialized allocator.
Fixes: d6503fa ("daemon: move circular initialization of policy.Repository to hive")