[v1.13] endpoint: don't hold the endpoint lock while generating policy #26735

Merged
merged 7 commits into cilium:v1.13 on Aug 10, 2023

Conversation

@squeed squeed commented Jul 10, 2023

I manually backported this PR to generate the hotfix build; I might as well save the backporter some bother and submit it as well.

Once this PR is merged, you can update the PR labels via:

$ for pr in 26242; do contrib/backporting/set-labels.py $pr done 1.13; done

@squeed squeed requested a review from a team as a code owner July 10, 2023 11:54
@squeed squeed added kind/backports This PR provides functionality previously merged into master. backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. labels Jul 10, 2023
@squeed squeed changed the title v1.13 backports 2023-07-10 [v1.13] endpoint: don't hold the endpoint lock while generating policy Jul 10, 2023

squeed commented Jul 10, 2023

/test-backport-1.13

squeed commented Jul 10, 2023

Note: a hotfix build has been provided to the end-user who initially reported the issue, and they confirmed that it solved their problem.

@julianwiedmann

Travis seems to call out a genuine bug

pkg/endpoint/policy_test.go:78:91: too many arguments in call to policy.NewPolicyRepository

have (*testidentity.MockIdentityAllocator, "github.com/cilium/cilium/pkg/identity/cache".IdentityCache, nil, nil)

want ("github.com/cilium/cilium/pkg/identity/cache".IdentityAllocator, "github.com/cilium/cilium/pkg/identity/cache".IdentityCache, policy.CertificateManager)

@julianwiedmann

@squeed if this isn't a straight-forward cherry-pick, could you please ask one of the reviewers of the upstream PR to have a closer look? Thank you!

@squeed squeed force-pushed the pr/v1.13-backport-2023-07-10 branch from 9122c3e to ae337cb Compare July 12, 2023 07:36

squeed commented Jul 12, 2023

@julianwiedmann good catch, oops. Fixed that. The backport is, well, medium-trivial: it can't be automated only because of some method-signature changes in tests that don't affect behavior.

@squeed squeed force-pushed the pr/v1.13-backport-2023-07-10 branch 2 times, most recently from 0d1573e to ad6d081 Compare July 12, 2023 09:25

jibi commented Jul 13, 2023

/test-backport-1.13

Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed:

Test Name: K8sAgentIstioTest Istio Bookinfo Demo Tests bookinfo inter-service connectivity

Failure Output: FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.19/78/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.18-kernel-4.19' failed:

Test Name: K8sAgentIstioTest Istio Bookinfo Demo Tests bookinfo inter-service connectivity

Failure Output: FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.18-kernel-4.19/78/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.18-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.


squeed commented Jul 13, 2023

The test failures are not flakes. Specifically, the error-message checker is complaining that this error message was logged:

⚠️ Found "2023-07-13T10:20:45.249876176Z level=error msg=\"endpoint regeneration failed\" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2536 error=\"endpoint 2536 SecurityIdentity changed during policy regeneration\" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint" in logs 1 times

This error is not critical; we handle this case in the code, trigger regeneration again, and everything proceeds as expected. I'm going to consider ignoring this error in the tests.

I don't immediately understand why this is not an issue in main.

@squeed squeed force-pushed the pr/v1.13-backport-2023-07-10 branch from ad6d081 to b723eec Compare July 24, 2023 10:43

squeed commented Jul 24, 2023

OK, I've added an exception for this error message. I have never seen it happen in v1.14 and later; I'm investigating why that is the case.

Requested an endpoint reviewer.

@squeed squeed requested review from a team and tommyp1ckles and removed request for a team July 24, 2023 10:47

squeed commented Jul 26, 2023

I chatted with @joestringer a bit about this; it seems the error message comes from regenerating node endpoints (in v1.13, the node labels seem to arrive while the node's endpoint is first being regenerated). I'll see if there's a reasonable way to prevent the error from being output.


squeed commented Aug 9, 2023

I wasn't able to find a good way to filter that error message; it's just too far away from its context.

I think this PR as written is OK. I'll rebase.

[ upstream commit 3ca309d ]

This function is called deep within the policy generation hierarchy, and
is at risk of causing deadlocks.

Given that it's just reading a pointer to a never-mutated map, we can
safely stash this behind an atomic Pointer and remove the lock.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
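
To illustrate the pattern that commit describes, here is a minimal sketch with hypothetical names (the actual Cilium endpoint code differs): once the map is only ever replaced wholesale and never mutated in place, readers can go through a sync/atomic Pointer instead of taking the endpoint lock.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// namedPorts maps port names to numbers. It is replaced wholesale and never
// mutated in place, so readers do not need a lock.
type namedPorts map[string]uint16

type endpoint struct {
	// ports holds the current map; readers Load it, writers Store a new one.
	ports atomic.Pointer[namedPorts]
}

// getNamedPort reads the current map without holding any endpoint lock, so it
// is safe to call from deep inside policy generation.
func (e *endpoint) getNamedPort(name string) (uint16, bool) {
	m := e.ports.Load()
	if m == nil {
		return 0, false
	}
	p, ok := (*m)[name]
	return p, ok
}

// updatePorts installs a completely new map; readers that already loaded the
// old pointer keep a consistent view of it.
func (e *endpoint) updatePorts(next namedPorts) {
	e.ports.Store(&next)
}

func main() {
	ep := &endpoint{}
	ep.updatePorts(namedPorts{"http": 8080})
	if p, ok := ep.getNamedPort("http"); ok {
		fmt.Println("http ->", p)
	}
}
```

(atomic.Pointer requires Go 1.19 or newer.)
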
[ upstream commit e20b16d ]

It turns out that most of the endpoint identities, e.g. pod name /
namespace, are actually immutable. So, there's no need to grab a lock
before reading them.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
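
As a sketch of that observation (illustrative names only, not the real Endpoint struct): fields written once at construction and never again can be read without the endpoint mutex, while mutable state keeps using it.

```go
package main

import (
	"fmt"
	"sync"
)

type endpointInfo struct {
	mutex sync.RWMutex // still guards the mutable state below

	// Immutable after construction: lock-free reads are safe.
	k8sPodName   string
	k8sNamespace string

	// Mutable: reads and writes must hold mutex.
	state string
}

// PodNamespaceName reads only immutable fields, so it takes no lock and can
// be called from deep inside policy generation without deadlock risk.
func (e *endpointInfo) PodNamespaceName() string {
	return e.k8sNamespace + "/" + e.k8sPodName
}

// SetState touches mutable data and therefore still locks.
func (e *endpointInfo) SetState(s string) {
	e.mutex.Lock()
	defer e.mutex.Unlock()
	e.state = s
}

func main() {
	ep := &endpointInfo{k8sPodName: "details-v1", k8sNamespace: "default"}
	ep.SetState("ready")
	fmt.Println(ep.PodNamespaceName())
}
```
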
[ upstream commit b63115b ]

These methods are no longer used; remove them from the
EndpointInfoSource interface.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
[ upstream commit f048a6a ]

As preparation for other refactors of the policy engine, no longer hold
the endpoint lock while calculating policy. This is safe to do, since
the only input is the endpoint's security identity. Furthermore, if,
somehow, policy were to be calculated in parallel, we can reject an
update if its revision is too old.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
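
A minimal sketch of the "reject if too old" safeguard described above, using hypothetical types (the real endpoint/policy code is considerably richer): policy is computed without the endpoint lock held, the lock is taken only briefly to commit the result, and a result computed against a stale revision is rejected so the caller can retry.

```go
package main

import (
	"fmt"
	"sync"
)

type computedPolicy struct {
	forRevision uint64 // revision the computation was based on
	rules       []string
}

type endpoint struct {
	mutex          sync.Mutex
	policyRevision uint64
	desiredPolicy  *computedPolicy
}

// commitPolicy applies res only if no newer result landed in the meantime.
func (e *endpoint) commitPolicy(res *computedPolicy) error {
	e.mutex.Lock()
	defer e.mutex.Unlock()
	if res.forRevision < e.policyRevision {
		return fmt.Errorf("stale policy: computed for revision %d, endpoint already at %d",
			res.forRevision, e.policyRevision)
	}
	e.desiredPolicy = res
	e.policyRevision = res.forRevision
	return nil
}

func main() {
	ep := &endpoint{policyRevision: 5}

	// Computed without the endpoint lock held, based on revision 5: accepted.
	fresh := &computedPolicy{forRevision: 5, rules: []string{"allow world"}}
	fmt.Println(ep.commitPolicy(fresh))

	// A parallel computation based on an older revision: rejected.
	stale := &computedPolicy{forRevision: 4}
	fmt.Println(ep.commitPolicy(stale))
}
```
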
[ upstream commit 9623641 ]

This ensures the generated ID works like normally allocated IDs, i.e. that
its LabelArray is set.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
[ upstream commit 8e163a9 ]

This adds a small test that ensures incremental updates are never lost,
even in the face of significant identity churn.

It simulates a churning ipcache flinging identities into the policy
engine, and similarly recalculates policy constantly.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
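
The real test drives the actual SelectorCache and policy engine; the toy model below only shows the shape of such a churn test, with hypothetical names: one goroutine churns identity updates into a shared log while another keeps replaying the increments, and at the end no update may be missing. Save it as a _test.go file and run it with go test.

```go
package policytest

import (
	"sync"
	"testing"
)

func TestIncrementalUpdatesNotLost(t *testing.T) {
	const churn = 50000

	var mu sync.Mutex
	updates := make([]int, 0, churn)   // ordered incremental updates ("ipcache" events)
	view := make(map[int]struct{})     // the consumer's view of the identities
	applied := 0                       // how much of the update log has been replayed

	done := make(chan struct{})
	// Producer: churns identities into the shared update log.
	go func() {
		defer close(done)
		for id := 0; id < churn; id++ {
			mu.Lock()
			updates = append(updates, id)
			mu.Unlock()
		}
	}()

	// replay applies any not-yet-seen increments, mimicking a policy
	// recalculation that consumes incremental updates.
	replay := func() {
		mu.Lock()
		for _, id := range updates[applied:] {
			view[id] = struct{}{}
		}
		applied = len(updates)
		mu.Unlock()
	}

	// Consumer: recalculates constantly while the churn is ongoing.
	for churning := true; churning; {
		select {
		case <-done:
			churning = false
		default:
		}
		replay()
	}
	replay() // final catch-up after the producer has finished

	if len(view) != churn {
		t.Fatalf("lost incremental updates: have %d identities, want %d", len(view), churn)
	}
}
```
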
It is possible for an endpoint's identity to change during policy
calculation. If this happens, we need to error out and try again.
Unfortunately we don't have a good way to prevent this error message,
since it is logged far away from the context.

So, sadly, we have to ignore it.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed squeed force-pushed the pr/v1.13-backport-2023-07-10 branch from b723eec to 6016d7f Compare August 9, 2023 15:43

squeed commented Aug 9, 2023

/test


joestringer commented Aug 9, 2023

/test-backport-1.13

Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed:

Test Name: K8sDatapathServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Tests NodePort inside cluster (kube-proxy) with IPSec and externalTrafficPolicy=Local

Failure Output: FAIL: Request from k8s1 to service tftp://[fd04::12]:31030/hello failed

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.19/112/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed:

Test Name: K8sAgentIstioTest Istio Bookinfo Demo Tests bookinfo inter-service connectivity

Failure Output: FAIL: unable to download https://github.com/cilium/istio/releases/download/1.10.6-1/cilium-istioctl-1.10.6-1-linux-amd64.tar.gz

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.19/113/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.


nebril commented Aug 10, 2023

/test-1.21-4.19


nebril commented Aug 10, 2023

/test-1.21-4.19

@nebril nebril merged commit b5900f5 into cilium:v1.13 Aug 10, 2023
59 checks passed
Labels
backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master.