Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid an empty instanceID on EC2 #15012

Merged
merged 2 commits into from
Feb 19, 2021
Merged

Conversation

kkourt
Copy link
Contributor

@kkourt kkourt commented Feb 17, 2021

Please see commits.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 17, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Feb 17, 2021
@kkourt kkourt added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Feb 17, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 17, 2021
@kkourt
Copy link
Contributor Author

kkourt commented Feb 17, 2021

test-me-please

@kkourt kkourt marked this pull request as ready for review February 17, 2021 15:39
@kkourt kkourt requested a review from a team February 17, 2021 15:39
@kkourt kkourt requested a review from a team as a code owner February 17, 2021 15:39
@kkourt kkourt requested a review from twpayne February 17, 2021 15:39
update() is used both to update the status or everything else. When
logging an error, add this information as well.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Copy link
Contributor

@twpayne twpayne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one tiny bit.

@@ -16,6 +16,7 @@ package nodediscovery

import (
"context"
goerrors "errors"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would rename the k8s package instead of the standard library one, but it's a very small not.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small nit along with Tom's. Overall straightforward fix.

I am wondering about whether it makes sense to have the retry logic be a bit more specific for fetching instance metadata. The reason is because currently this will retry fetching the node from K8s when it likely already succeeded. Main motivation is wanting to avoid hitting the apiserver unnecessarily. WDYT?

Comment on lines 302 to 312
err = n.mutateNodeResource(nodeResource)
if err != nil {
log.WithError(err).WithField("retryCount", retryCount).Warning("unable to mutate nodeResource")
continue
}
Copy link
Member

@christarazi christarazi Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can inline the error here and capitalize the first word of a log msg to conform on style with the rest of the log msgs.

Suggested change
err = n.mutateNodeResource(nodeResource)
if err != nil {
log.WithError(err).WithField("retryCount", retryCount).Warning("unable to mutate nodeResource")
continue
}
if err := n.mutateNodeResource(nodeResource); err != nil {
log.WithError(err).WithField("retryCount", retryCount).Warning("Unable to mutate nodeResource")
continue
}

@christarazi christarazi added area/eni Impacts ENI based IPAM. integration/cloud Related to integration with cloud environments such as AKS, EKS, GKE, etc. sig/ipam IP address management, including cloud IPAM labels Feb 17, 2021
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking requesting for changes as I think we need stronger guarantees that we have a non-empty instanceID before proceeding. If we aren't able to fetch it, I believe we should fatal.

However, I still am trying to convince myself that that's the right thing to do. If we aren't able to fetch the instanceID from the AWS metadata service which is node-local1, I'm not sure we have much choice besides continuing to retry. What I'm trying to answer is does fataling even help with that? Is that the right "alert" to the user? What can they reasonably do to remedy the situation?

If we proceed forward without fataling, then we will eventually run into issues with CiliumNode resource corruption as the IPAM code will allocate all available ENIs from all instances into the local CiliumNode resource.

@kkourt
Copy link
Contributor Author

kkourt commented Feb 17, 2021

I am wondering about whether it makes sense to have the retry logic be a bit more specific for fetching instance metadata. The reason is because currently this will retry fetching the node from K8s when it likely already succeeded. Main motivation is wanting to avoid hitting the apiserver unnecessarily. WDYT?

Personally, I prefer having a single retry loop. We are in a failure path at that point and hitting the apiserver gives us a better chance of avoiding conflicts and succeeding at the update. That being said, I don't feel strongly about it so if you prefer either a double retry loop or some state to skip the Get(), I'm fine with that.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced offline. Retracting my request for changes as the code as-is already guarantees that we won't proceed with an empty instanceID.

@kkourt
Copy link
Contributor Author

kkourt commented Feb 17, 2021

Pushed a new version to address feedback.

@kkourt
Copy link
Contributor Author

kkourt commented Feb 17, 2021

test-me-please

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending capitalization nit for the log msg

An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
cilium#11673.

Related: cilium#14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
@kkourt
Copy link
Contributor Author

kkourt commented Feb 18, 2021

pending capitalization nit for the log msg

Forgot about that :( Pushed new version that addresses it.

Diff:

diff --git a/pkg/nodediscovery/nodediscovery.go b/pkg/nodediscovery/nodediscovery.go
index 368756c6a..4d0b7b0e8 100644
--- a/pkg/nodediscovery/nodediscovery.go
+++ b/pkg/nodediscovery/nodediscovery.go
@@ -290,9 +290,9 @@ func (n *NodeDiscovery) UpdateCiliumNodeResource() {
 	performGet := true
 	for retryCount := 0; retryCount < maxRetryCount; retryCount++ {
 		var nodeResource *ciliumv2.CiliumNode
-		var err error
 		performUpdate := true
 		if performGet {
+			var err error
 			nodeResource, err = ciliumClient.CiliumV2().CiliumNodes().Get(context.TODO(), nodeTypes.GetName(), metav1.GetOptions{})
 			if err != nil {
 				performUpdate = false
@@ -306,9 +306,8 @@ func (n *NodeDiscovery) UpdateCiliumNodeResource() {
 			}
 		}
 
-		err = n.mutateNodeResource(nodeResource)
-		if err != nil {
-			log.WithError(err).WithField("retryCount", retryCount).Warning("unable to mutate nodeResource")
+		if err := n.mutateNodeResource(nodeResource); err != nil {
+			log.WithError(err).WithField("retryCount", retryCount).Warning("Unable to mutate nodeResource")
 			continue
 		}
 
@@ -327,7 +326,7 @@ func (n *NodeDiscovery) UpdateCiliumNodeResource() {
 				return
 			}
 		} else {
-			if _, err = ciliumClient.CiliumV2().CiliumNodes().Create(context.TODO(), nodeResource, metav1.CreateOptions{}); err != nil {
+			if _, err := ciliumClient.CiliumV2().CiliumNodes().Create(context.TODO(), nodeResource, metav1.CreateOptions{}); err != nil {
 				if k8serrors.IsConflict(err) {
 					log.WithError(err).Warn("Unable to create CiliumNode resource, will retry")
 					continue

@kkourt
Copy link
Contributor Author

kkourt commented Feb 18, 2021

test-me-please

@kkourt
Copy link
Contributor Author

kkourt commented Feb 18, 2021

retest-gke

@tgraf tgraf merged commit 91e68c2 into cilium:master Feb 19, 2021
1.10.0 automation moved this from In progress to Done Feb 19, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.5 Feb 19, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.5 Feb 20, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.5 Feb 22, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.5 Feb 22, 2021
This was referenced Mar 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/eni Impacts ENI based IPAM. integration/cloud Related to integration with cloud environments such as AKS, EKS, GKE, etc. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/ipam IP address management, including cloud IPAM
Projects
No open projects
1.9.5
Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

None yet

7 participants