v1.13: CI: Conformance E2E: egress-gateway test fails with "Failed to get identity labels for endpoint, skipping update to egress policy." #31174

Closed
joestringer opened this issue Mar 5, 2024 · 6 comments
Labels
area/CI, ci/flake, feature/egress-gateway, stale

Comments

@joestringer
Member

CI failure

https://github.com/cilium/cilium/actions/runs/8153579410/job/22285262328#step:13:178
sysdump: too big

[=] Test [egress-gateway] [45/69]
W0305 09:18:34.360102    4895 warnings.go:70] unknown field "spec.excludedCIDRs"
W0305 09:18:34.367202    4895 warnings.go:70] unknown field "spec.excludedCIDRs"

  ℹ️  📜 Applying CiliumEgressGatewayPolicy 'cegp-sample-client' to namespace ''..
  ℹ️  📜 Applying CiliumEgressGatewayPolicy 'cegp-sample-echo' to namespace ''..
  [-] Scenario [egress-gateway/egress-gateway]
  🟥 Failed to ensure egress gateway policy map is properly populated: timeout while waiting for condition, last error: Could not find egress gateway policy entry matching {SourceIP:10.244.2.93 DestCIDR:0.0.0.0/0 EgressIP:0.0.0.0 GatewayIP:172.18.0.4}

...

[=] Test [check-log-errors] [69/69]
................
  [-] Scenario [check-log-errors/no-errors-in-logs]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-operator-6744bbb599-hgc54 (cilium-operator)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-s926v (cilium-agent)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-s926v (config)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-s926v (mount-cgroup)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-s926v (apply-sysctl-overwrites)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-s926v (mount-bpf-fs)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-s926v (clean-cilium-state)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-s926v (install-cni-binaries)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-drq7c (cilium-agent)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-drq7c (config)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-drq7c (mount-cgroup)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-drq7c (apply-sysctl-overwrites)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-drq7c (mount-bpf-fs)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-drq7c (clean-cilium-state)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-drq7c (install-cni-binaries)]
  [.] Action [check-log-errors/no-errors-in-logs/kind-kind/kube-system/cilium-mcp7s (cilium-agent)]
  ❌ Found 1 logs in kind-kind/kube-system/cilium-mcp7s (cilium-agent) matching list of errors that must be investigated:
level=error msg="Failed to get identity labels for endpoint, skipping update to egress policy." error="identity 26610 not found" k8sEndpointName=client-69748f45d8-lp8g4 k8sNamespace=cilium-test subsys=egressgateway (1 occurrences)
@joestringer added the area/CI and ci/flake labels Mar 5, 2024
@julianwiedmann added the feature/egress-gateway label Mar 6, 2024
@tommyp1ckles
Contributor

tommyp1ckles commented Mar 8, 2024

Hmm, looks like the entry for 10.244.2.93 (i.e. key 20 00 00 00 0a f4 02 5d) is missing from the egress gateway policy map on the first agent pod, so this must be coming from Cilium, likely related to that identity lookup error.

find . | grep egress_gw_policy | xargs cat
key: 20 00 00 00 0a f4 01 50  00 00 00 00  value: 00 00 00 00 ac 12 00 04
key: 20 00 00 00 0a f4 01 f3  00 00 00 00  value: 00 00 00 00 ac 12 00 04
key: 20 00 00 00 0a f4 02 27  00 00 00 00  value: 00 00 00 00 ac 12 00 04
key: 20 00 00 00 0a f4 02 f7  00 00 00 00  value: 00 00 00 00 ac 12 00 04
Found 4 elements
key: 20 00 00 00 0a f4 01 50  00 00 00 00  value: ac 12 00 04 ac 12 00 04
key: 20 00 00 00 0a f4 01 f3  00 00 00 00  value: ac 12 00 04 ac 12 00 04
key: 20 00 00 00 0a f4 02 27  00 00 00 00  value: ac 12 00 04 ac 12 00 04
key: 20 00 00 00 0a f4 02 5d  00 00 00 00  value: ac 12 00 04 ac 12 00 04
key: 20 00 00 00 0a f4 02 f7  00 00 00 00  value: ac 12 00 04 ac 12 00 04
Found 5 elements
key: 20 00 00 00 0a f4 01 50  00 00 00 00  value: 00 00 00 00 ac 12 00 04
key: 20 00 00 00 0a f4 01 f3  00 00 00 00  value: 00 00 00 00 ac 12 00 04
key: 20 00 00 00 0a f4 02 27  00 00 00 00  value: 00 00 00 00 ac 12 00 04
key: 20 00 00 00 0a f4 02 5d  00 00 00 00  value: 00 00 00 00 ac 12 00 04
key: 20 00 00 00 0a f4 02 f7  00 00 00 00  value: 00 00 00 00 ac 12 00 04
Found 5 elements
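
For anyone reading the dumps above, here is a minimal Python sketch for decoding one entry. The byte layout is an assumption inferred from the dumps and from the test's error message, not taken from the Cilium source: a 4-byte little-endian prefix length (32 source bits plus the destination prefix), then the source IPv4 address, then the destination IPv4 address; the value is assumed to hold the egress IP followed by the gateway IP. The function name `decode_policy_entry` is made up for illustration.

```python
import ipaddress
import struct


def decode_policy_entry(key_hex: str, value_hex: str) -> dict:
    """Decode one dumped egress gateway policy map entry (assumed layout)."""
    key = bytes.fromhex(key_hex)      # whitespace between bytes is ignored
    value = bytes.fromhex(value_hex)

    # Assumption: u32 prefix length in little-endian byte order.
    (prefix_len,) = struct.unpack_from("<I", key, 0)
    source_ip = ipaddress.IPv4Address(key[4:8])
    dest_ip = ipaddress.IPv4Address(key[8:12])

    # Assumption: the prefix length covers the full /32 source address
    # plus the destination prefix, so subtract 32 to get the latter.
    dest_prefix = prefix_len - 32

    return {
        "source_ip": str(source_ip),
        "dest_cidr": f"{dest_ip}/{dest_prefix}",
        "egress_ip": str(ipaddress.IPv4Address(value[0:4])),
        "gateway_ip": str(ipaddress.IPv4Address(value[4:8])),
    }


entry = decode_policy_entry(
    "20 00 00 00 0a f4 02 5d  00 00 00 00",
    "00 00 00 00 ac 12 00 04",
)
print(entry)
```

Under these assumptions the key ending in 0a f4 02 5d decodes to source 10.244.2.93 with destination 0.0.0.0/0 and gateway 172.18.0.4, which matches the entry the test reported as missing.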

update

Hmm, it's also in the ipcache:

10.244.2.93/32                identity=26610 encryptkey=0 tunnelendpoint=172.18.0.4

But there's an identity label update happening at almost exactly the same time. Perhaps there's a race condition between the egress-gateway plumbing and the ipcache/identity updates?

2024-03-05T09:07:59.613912716Z level=debug msg="Upserting IP into ipcache layer" identity="{26610 custom-resource false}" ipAddr=10.244.2.93 k8sNamespace=cilium-test k8sPodName=client-69748f45d8-lp8g4 key=0 namedPorts="map[]" subsys=ipcach
2024-03-05T09:07:59.613919118Z level=debug msg="Daemon notified of IP-Identity cache state change" identity="{26610 custom-resource false}" ipAddr="{10.244.2.93 ffffffff}" modification=Upsert subsys=datapath-ipcache

@tommyp1ckles
Contributor

Looking at: https://github.com/cilium/cilium/blob/v1.13/pkg/egressgateway/manager.go#L197

I'm wondering what prevents a race where the remote identity is inserted locally only after the endpoint, so that the identity lookup fails when the egress gateway manager reconciles.

(I looked at both v1.13 and main; the code has changed a lot since v1.13, but the same question remains.)

@julianwiedmann @joestringer any ideas?

@tommyp1ckles
Contributor

If there is indeed potential for a race condition, it seems like we'd see exactly what we saw in this test: a failed reconciliation of the egress gateway map data.
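
To make the suspected ordering problem concrete, here is a toy Python model. It is not Cilium code: `ToyManager`, its methods, and the `retry_on_identity_update` switch are all hypothetical, and the retry shown is only one possible mitigation, not necessarily what any actual fix does. It only illustrates that if the endpoint event is handled before the identity is known and the update is skipped with nothing to trigger it again, the policy map entry never appears.

```python
class ToyManager:
    """Hypothetical stand-in for an egress policy reconciler (not Cilium's)."""

    def __init__(self, retry_on_identity_update: bool):
        self.identities = {}     # identity ID -> labels
        self.pending = {}        # endpoint IP -> identity ID still unknown
        self.policy_map = set()  # endpoint IPs with a policy entry
        self.retry_on_identity_update = retry_on_identity_update

    def on_endpoint(self, ip: str, identity_id: int) -> None:
        labels = self.identities.get(identity_id)
        if labels is None:
            # Mirrors the logged behaviour: the update is skipped.
            if self.retry_on_identity_update:
                self.pending[ip] = identity_id
            return
        self.policy_map.add(ip)

    def on_identity(self, identity_id: int, labels: dict) -> None:
        self.identities[identity_id] = labels
        # Re-run any endpoint updates that were waiting for this identity.
        for ip, pending_id in list(self.pending.items()):
            if pending_id == identity_id:
                del self.pending[ip]
                self.on_endpoint(ip, identity_id)


for retry in (False, True):
    manager = ToyManager(retry_on_identity_update=retry)
    # Endpoint event arrives first, identity shortly after.
    manager.on_endpoint("10.244.2.93", 26610)
    manager.on_identity(26610, {"kind": "client"})
    print(retry, "10.244.2.93" in manager.policy_map)
```

In this model the entry is missing when nothing re-triggers the skipped update, and present once the identity event replays the pending endpoint.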

@julianwiedmann
Member

Sounds like what was fixed with #26457, which was not backported to v1.13.

It's awfully late to fix this in v1.13 now, but there's not much else to be done if CI is hitting it :/


This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions bot added the stale label May 10, 2024

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions bot closed this as not planned May 24, 2024