Avoid an empty instanceID on EC2 #15012
Conversation
Force-pushed 68efba3 to 4d47cdf
test-me-please
update() is used both to update the status and everything else. When logging an error, add this information as well.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
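For illustration, the kind of logging change this commit describes might look like the sketch below. The update helper and its updateStatus flag are hypothetical stand-ins, not the actual Cilium code; only the logrus WithError/WithField pattern matches the surrounding diffs.

package main

import (
	"errors"

	log "github.com/sirupsen/logrus"
)

// update is a hypothetical stand-in for the PR's update() helper. The
// updateStatus flag records whether this call updates the status
// subresource or everything else.
func update(updateStatus bool) error {
	return errors.New("simulated update failure") // placeholder failure
}

func main() {
	for _, updateStatus := range []bool{true, false} {
		if err := update(updateStatus); err != nil {
			// Attach which kind of update failed, as the commit suggests.
			log.WithError(err).WithField("updateStatus", updateStatus).
				Warning("Unable to update CiliumNode resource")
		}
	}
}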
Force-pushed 4d47cdf to 32da1d4
LGTM, one tiny nit.
pkg/nodediscovery/nodediscovery.go (outdated)
@@ -16,6 +16,7 @@ package nodediscovery

 import (
 	"context"
+	goerrors "errors"
Nit: I would rename the k8s package instead of the standard library one, but it's a very small nit.
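Concretely, the nit is about which import keeps its natural name. A sketch of the suggested alternative, using the upstream Kubernetes errors package (shown for illustration; the exact conflicting import in this file is not visible here):

import (
	"context"
	"errors" // standard library keeps its canonical name

	// Alias the Kubernetes package instead of the standard library one.
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
)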
Just a small nit along with Tom's. Overall straightforward fix.
I am wondering whether it makes sense to make the retry logic a bit more specific for fetching instance metadata. The reason is that currently this will retry fetching the node from K8s when that likely already succeeded. The main motivation is to avoid hitting the apiserver unnecessarily. WDYT?
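A rough sketch of that idea, retrying only the node-local metadata lookup in an inner loop. fetchInstanceID is a hypothetical stand-in for the EC2 metadata fetch; this is not the code the PR merged.

package main

import (
	"errors"
	"fmt"
	"time"
)

// fetchInstanceID is a hypothetical stand-in for the node-local EC2
// metadata lookup.
func fetchInstanceID() (string, error) {
	return "i-0123456789abcdef0", nil
}

// fetchInstanceIDWithRetry retries only the metadata fetch, so a transient
// metadata failure does not force another Get() against the apiserver.
func fetchInstanceIDWithRetry(maxAttempts int) (string, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		id, err := fetchInstanceID()
		if err == nil && id != "" {
			return id, nil
		}
		if err == nil {
			err = errors.New("empty instanceID from metadata service")
		}
		lastErr = err
		time.Sleep(time.Second) // crude fixed backoff between attempts
	}
	return "", lastErr
}

func main() {
	id, err := fetchInstanceIDWithRetry(10)
	fmt.Println(id, err)
}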
pkg/nodediscovery/nodediscovery.go (outdated)
err = n.mutateNodeResource(nodeResource)
if err != nil {
	log.WithError(err).WithField("retryCount", retryCount).Warning("unable to mutate nodeResource")
	continue
}
We can inline the error here and capitalize the first word of the log msg to match the style of the rest of the log msgs.
Suggested change:

-err = n.mutateNodeResource(nodeResource)
-if err != nil {
-	log.WithError(err).WithField("retryCount", retryCount).Warning("unable to mutate nodeResource")
-	continue
-}
+if err := n.mutateNodeResource(nodeResource); err != nil {
+	log.WithError(err).WithField("retryCount", retryCount).Warning("Unable to mutate nodeResource")
+	continue
+}
Marking as requesting changes, as I think we need stronger guarantees that we have a non-empty instanceID before proceeding. If we aren't able to fetch it, I believe we should fatal.
However, I'm still trying to convince myself that that's the right thing to do. If we aren't able to fetch the instanceID from the AWS metadata service, which is node-local, I'm not sure we have much choice besides continuing to retry. What I'm trying to answer is: does fataling even help with that? Is that the right "alert" to the user? What can they reasonably do to remedy the situation?
If we proceed forward without fataling, then we will eventually run into issues with CiliumNode resource corruption, as the IPAM code will allocate all available ENIs from all instances into the local CiliumNode resource.
Personally, I prefer having a single retry loop. We are in a failure path at that point, and hitting the apiserver gives us a better chance of avoiding conflicts and succeeding at the update. That being said, I don't feel strongly about it, so if you prefer either a double retry loop or some state to skip the `Get()`, that works too.
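For illustration, the "state to skip the `Get()`" variant could look roughly like the sketch below: one loop, with a flag deciding when to go back to the apiserver. getNode, updateNode, and the re-Get condition are all hypothetical; the commit message further down notes that a variant of this option was considered but not taken.

package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("resource version conflict")

// getNode and updateNode are hypothetical stand-ins for the apiserver
// Get() and Update() calls on the CiliumNode resource.
func getNode() string { return "ciliumnode@rv2" }

func updateNode(node string, attempt int) error {
	if attempt == 0 {
		return errConflict // simulate a stale resource version on the first try
	}
	return nil
}

func main() {
	const maxRetryCount = 10
	var node string
	performGet := true
	for retryCount := 0; retryCount < maxRetryCount; retryCount++ {
		if performGet {
			node = getNode() // only hit the apiserver when we must
		}
		err := updateNode(node, retryCount)
		if err == nil {
			fmt.Println("updated", node)
			return
		}
		// Re-Get only when the failure means our cached copy is stale;
		// any other failure retries against the object we already have.
		performGet = errors.Is(err, errConflict)
	}
}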
Synced offline. Retracting my request for changes, as the code as-is already guarantees that we won't proceed with an empty instanceID.
Force-pushed 32da1d4 to b98cbe6
Pushed a new version to address feedback.
test-me-please
LGTM pending capitalization nit for the log msg
An empty instanceID will cause issues. Add a check to ensure that instanceID is not empty. If it is empty, retry. We also increase the number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there is a conflict. I did not do this since it would go against the explicit intention of the original code: cilium#11673.

Related: cilium#14991
Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
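As a self-contained illustration of the behavior this commit message describes, a minimal sketch follows. The names and the metadata stub are hypothetical, not the merged Cilium code; only the retry-on-empty rule and the retry count of 10 come from the commit message.

package main

import "fmt"

const maxRetryCount = 10 // raised from 5 to 10 by this commit

// fetchInstanceID is a hypothetical stand-in for filling in the node's
// instanceID from the EC2 metadata service; it returns "" until the
// service becomes available.
func fetchInstanceID(attempt int) string {
	if attempt < 2 {
		return "" // simulate the metadata service not being ready yet
	}
	return "i-0123456789abcdef0"
}

func main() {
	for retryCount := 0; retryCount < maxRetryCount; retryCount++ {
		instanceID := fetchInstanceID(retryCount)
		// The new check: never proceed with an empty instanceID; retry instead.
		if instanceID == "" {
			fmt.Printf("retry %d: empty instanceID, retrying\n", retryCount)
			continue
		}
		fmt.Printf("retry %d: proceeding with instanceID %s\n", retryCount, instanceID)
		return
	}
	fmt.Println("exhausted retries without a non-empty instanceID")
}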
Force-pushed b98cbe6 to cb0728c
Forgot about that :( Pushed new version that addresses it. Diff:

diff --git a/pkg/nodediscovery/nodediscovery.go b/pkg/nodediscovery/nodediscovery.go
index 368756c6a..4d0b7b0e8 100644
--- a/pkg/nodediscovery/nodediscovery.go
+++ b/pkg/nodediscovery/nodediscovery.go
@@ -290,9 +290,9 @@ func (n *NodeDiscovery) UpdateCiliumNodeResource() {
 	performGet := true
 	for retryCount := 0; retryCount < maxRetryCount; retryCount++ {
 		var nodeResource *ciliumv2.CiliumNode
-		var err error
 		performUpdate := true
 		if performGet {
+			var err error
 			nodeResource, err = ciliumClient.CiliumV2().CiliumNodes().Get(context.TODO(), nodeTypes.GetName(), metav1.GetOptions{})
 			if err != nil {
 				performUpdate = false
@@ -306,9 +306,8 @@ func (n *NodeDiscovery) UpdateCiliumNodeResource() {
 			}
 		}

-		err = n.mutateNodeResource(nodeResource)
-		if err != nil {
-			log.WithError(err).WithField("retryCount", retryCount).Warning("unable to mutate nodeResource")
+		if err := n.mutateNodeResource(nodeResource); err != nil {
+			log.WithError(err).WithField("retryCount", retryCount).Warning("Unable to mutate nodeResource")
 			continue
 		}
@@ -327,7 +326,7 @@ func (n *NodeDiscovery) UpdateCiliumNodeResource() {
 				return
 			}
 		} else {
-			if _, err = ciliumClient.CiliumV2().CiliumNodes().Create(context.TODO(), nodeResource, metav1.CreateOptions{}); err != nil {
+			if _, err := ciliumClient.CiliumV2().CiliumNodes().Create(context.TODO(), nodeResource, metav1.CreateOptions{}); err != nil {
 				if k8serrors.IsConflict(err) {
 					log.WithError(err).Warn("Unable to create CiliumNode resource, will retry")
 					continue
test-me-please
retest-gke
Please see commits.