Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ENI compatibility regression between 1.7 <-> 1.8 #14991

Merged
merged 2 commits into from
Feb 16, 2021

Conversation

tgraf
Copy link
Member

@tgraf tgraf commented Feb 16, 2021

No description provided.

In 1.8, some ENI specifics fields have been moved from spec.ENI to
spec.IPAM to make them available for Azure IPAM. The old fields still
need to be set to account for:

 * Downgrade of operator from 1.8 to 1.7 while agents remain on 1.8.

 * Race condition when the agents get upgraded from 1.7 to 1.8 before
   the operator gets upgraded to 1.8.

While this version mismatch exists, the operator will assume a value of
MinAllocate of 0 and a value of InstanceID of "". This leads to the
following incorrect behavior:

 * The operator will not allocate up to MinAllocate

 * The InstanceID "" value will result in incorrect attachments of ENIs
   to the Status.ENI field, i.e., ENIs of other nodes will get reported
   on a node. This in turn will result in IPs being used on nodes for
   ENIs which are not attached to that node.

Fixes: 04d2538 ("eni: No longer set Spec.ENI.PreAllocate")
Fixes: f0dfcb4 ("k8s: CiliumNode: Move generic IPAM parmeters into Spec.IPAM")

Spotted-by: André Martins <andre@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
Fix ForeachInterface to not return all ENIs if the instanceID is not
known. None of the callers expect this behavior and if the instanceID is
not known for some reason (for example due to the recent bug in 1.7 <->
1.8 compatibility handling), then all ENIs are iterated over instead of
just the ENI of the desired instance.

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. needs-backport/1.8 labels Feb 16, 2021
@tgraf tgraf requested review from a team as code owners February 16, 2021 09:33
@tgraf
Copy link
Member Author

tgraf commented Feb 16, 2021

test-me-please

@tgraf
Copy link
Member Author

tgraf commented Feb 16, 2021

GKE CI failure:

error: index-pack died of signal 15
fatal: index-pack failed

	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2042)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1761)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$400(CliGitAPIImpl.java:72)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:442)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:655)
	at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
	at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
	at hudson.remoting.UserRequest.perform(UserRequest.java:212)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:369)
	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to jenkins-node-gke-fixed-1
		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
		at hudson.remoting.Channel.call(Channel.java:957)
		at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
		at sun.reflect.GeneratedMethodAccessor468.invoke(Unknown Source)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:498)
		at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
		at com.sun.proxy.$Proxy89.execute(Unknown Source)
		at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1152)
		at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1192)
		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:124)
		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
		at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		... 4 more

@tgraf
Copy link
Member Author

tgraf commented Feb 16, 2021

1.12-kernel-4.9 failure:

Unable to restart unmanaged pods with 'kubectl -n kube-system delete pods coredns-867bf6789f-cw72b hubble-relay-8fdf656bb-kvnfk': Exitcode: 1 
Err: exit status 1
Stdout:
 	 pod "coredns-867bf6789f-cw72b" deleted
	 
Stderr:
 	 Error from server: etcdserver: request timed out

Both are unrelated to the change.

@tgraf
Copy link
Member Author

tgraf commented Feb 16, 2021

retest-gke

@tgraf
Copy link
Member Author

tgraf commented Feb 16, 2021

retest-4.9

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice catch

@christarazi christarazi added area/eni Impacts ENI based IPAM. sig/ipam IP address management, including cloud IPAM labels Feb 16, 2021
@tgraf tgraf merged commit ce0ae42 into cilium:master Feb 16, 2021
@tgraf tgraf deleted the pr/tgraf/fix-eni-regression branch February 16, 2021 22:25
@christarazi christarazi removed their assignment Feb 16, 2021
kkourt added a commit to kkourt/cilium that referenced this pull request Feb 18, 2021
An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
cilium#11673.

Related: cilium#14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
tgraf pushed a commit that referenced this pull request Feb 19, 2021
An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
#11673.

Related: #14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
nathanjsweet pushed a commit that referenced this pull request Feb 19, 2021
[ upstream commit 91e68c2 ]

An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
#11673.

Related: #14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
@tgraf
Copy link
Member Author

tgraf commented Feb 19, 2021

@christarazi I think it's just generic code that is intended to be used from ENI, Azure, and potentially GCP in the future. It looks like we could change it. The merged PR already did so for ENI.

jrajahalme pushed a commit that referenced this pull request Feb 20, 2021
[ upstream commit 91e68c2 ]

An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
#11673.

Related: #14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
aanm pushed a commit that referenced this pull request Feb 22, 2021
[ upstream commit 91e68c2 ]

An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
#11673.

Related: #14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
nathanjsweet pushed a commit that referenced this pull request Feb 22, 2021
[ upstream commit 91e68c2 ]

An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
#11673.

Related: #14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
christarazi pushed a commit that referenced this pull request Feb 22, 2021
[ upstream commit 91e68c2 ]

An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
#11673.

Related: #14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
lyveng pushed a commit to lyveng/cilium that referenced this pull request Mar 4, 2021
An empty instanceID will cause issues. Add a check to ensure that
instanceID is not empty. If it is empty, retry. We also increase the
number of retries from 5 to 10.

An option I considered was to also avoid the `Get()` operation if there
is a conflict. I did not do this since it would go against the explicit
intention of the original code:
cilium#11673.

Related: cilium#14991

Suggested-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/eni Impacts ENI based IPAM. kind/bug This is a bug in the Cilium logic. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/ipam IP address management, including cloud IPAM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants