
v1.12 backports 2023-01-13 #23092

Closed · wants to merge 22 commits

Conversation

@ti-mo ti-mo (Contributor) commented Jan 13, 2023

Once this PR is merged, you can update the PR labels via:

$ for pr in 20619 21267 22620 22821 22794 22807 22745 22678 22930 22985 22884 22977 22972 23018 23037 22962 21857; do contrib/backporting/set-labels.py $pr done 1.12; done

PRs dropped from this round:

Manual fix added to skip check-missing-tags-in-tests.sh in v1.12 branch due to 7435adb.

@ti-mo ti-mo requested review from a team as code owners January 13, 2023 16:15
@ti-mo ti-mo requested a review from nbusseneau January 13, 2023 16:15
@ti-mo ti-mo added backport/1.12 This PR represents a backport for Cilium 1.12.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Jan 13, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jan 13, 2023
@ti-mo ti-mo (Contributor, Author) commented Jan 13, 2023

/test-backport-1.12

Job 'Cilium-PR-K8s-1.22-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from k8s1 to service http://[fd04::12]:30603 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from testclient-fglxk pod to service http://[fd04::11]:31546 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:


Test Name

K8sDatapathConfig Host firewall With VXLAN and endpoint routes

Failure Output

FAIL: Failed to reach 10.0.1.33:80 from testclient-host-255wm

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

@tklauser tklauser (Member) left a comment

Looks good for my change, thanks.

@aditighag aditighag (Member) left a comment

Backport changes for my PR look good. Thanks!

@chancez chancez (Contributor) commented Jan 13, 2023

PR created to backport:

@pippolo84 pippolo84 (Member) left a comment

LGTM for PR #22745, thanks! 💯

@julianwiedmann julianwiedmann (Member) left a comment

Looks good, thanks!

@julianwiedmann julianwiedmann (Member) commented

The error in 1.24-net-next

RTNETLINK answers: File exists

makes me think we need #21857 for v1.12.

@YutaroHayakawa YutaroHayakawa (Member) left a comment

My change looks good. Thanks!

[ upstream commit 1db1156 ]

With cilium/cilium-cli#962 in place in
cilium-cli v0.12.0 and the CI update to use that version in
cilium#20617, the connectivity tests
cover all functionality tested by the tests in l7_demos.go. Moreover,
the cilium-cli connectivity tests can be run against arbitrary clusters
with Cilium deployed, while this test is specific to the test setup
based on Vagrant VMs. Thus, drop this test.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Timo Beckers <timo@isovalent.com>
[ upstream commit baf7f34 ]

Fredrik Björkman has reported that the BPF L7 Ingress (and thus Gateway
API) does not work when Cilium's XDP is enabled.

A quick glimpse into the nodeport_lb4/6 revealed that in the case of an
L7 service the ctx_redirect_to_proxy_hairpin_ipv4/6() is called. The
latter invokes the ctx_redirect() to the cilium_host. Unfortunately,
the redirect does not work when called from the bpf_xdp, as the
cilium_host doesn't have any XDP prog attached.

Fix this by offloading the L7 handling to the bpf_host at the TC BPF
ingress.

Reported-by: Fredrik Björkman <fredrik.bjorkman1@ingka.ikea.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Timo Beckers <timo@isovalent.com>
@aanm aanm force-pushed the pr/v1.12-backport-2023-01-13 branch from ca91ae1 to 132a953 Compare January 18, 2023 00:53
@aanm aanm (Member) commented Jan 18, 2023

/test-backport-1.12

Job 'Cilium-PR-K8s-1.22-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from k8s1 to service http://[fd03::6e08]:10080 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from testclient-bmwjp pod to service http://[fd04::12]:30130 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.9 so I can create one.

dlapcevic and others added 9 commits January 18, 2023 11:33
[ upstream commit 91d52e4 ]

Avoid the operator crashing when trying to access a field of a nil CESTracker. This can happen rarely with two consecutive CEP deletions that belong to the same CES. While the first CES update is being processed, the second CEP deletion causes the CES in cache to become empty, and a CES update gets enqueued anyway, before the first CES update is finished. The first CES update then deletes the CES tracker, as the CES has 0 CEPs in it. Finally, the second CES update runs into the error.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>
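A minimal Go sketch of the guard this fix implies (names like cesManager and cesTracker are hypothetical, not the actual operator types): look up the tracker for the CES and return early when an earlier update has already removed it, instead of dereferencing a nil pointer.

```go
package main

import (
	"fmt"
	"sync"
)

// cesTracker and cesManager are hypothetical stand-ins for the operator's
// CiliumEndpointSlice bookkeeping.
type cesTracker struct {
	cepCount int
}

type cesManager struct {
	mu       sync.Mutex
	trackers map[string]*cesTracker
}

// processCESUpdate handles one queued CES update. A racing update for the
// same CES may find the tracker already deleted, so nil/absent means
// "nothing left to do" rather than a crash.
func (m *cesManager) processCESUpdate(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()

	t, ok := m.trackers[name]
	if !ok || t == nil {
		return // tracker was removed when the CES dropped to 0 CEPs
	}
	if t.cepCount == 0 {
		delete(m.trackers, name)
		return
	}
	fmt.Printf("syncing %s with %d CEPs\n", name, t.cepCount)
}

func main() {
	m := &cesManager{trackers: map[string]*cesTracker{}}
	m.processCESUpdate("ces-1") // racing update: tracker absent, no panic
}
```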
[ upstream commit a0a04db ]

When listing a table or chain with '-L', iptables will by default try to
perform a reverse IP lookup for each IP referenced in the ruleset that
is being listed.

This adds a useless dependency on DNS, which can lead to an increased
initialization time for the agent in case the DNS server is slow to
respond.

As the reverse lookup is not needed, switch to the '-S' command, which
does not perform any reverse lookup.

Signed-off-by: Gilberto Bertin <jibi@cilium.io>
Signed-off-by: Timo Beckers <timo@isovalent.com>
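As a rough illustration of the flag change (a minimal sketch, not the agent's actual iptables wrapper), the difference is simply which listing command gets executed:

```go
package main

import (
	"fmt"
	"os/exec"
)

// listRules dumps the rules of a table with 'iptables -S', which prints
// rules in iptables-save format and never performs reverse DNS lookups,
// unlike the default behaviour of 'iptables -L'.
func listRules(table string) (string, error) {
	out, err := exec.Command("iptables", "-t", table, "-S").CombinedOutput()
	return string(out), err
}

func main() {
	rules, err := listRules("filter")
	if err != nil {
		fmt.Println("iptables failed:", err)
		return
	}
	fmt.Print(rules)
}
```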
[ upstream commit 2c9c8c1 ]

When a service backend endpoint connects to itself via its service cluster IP
address, the traffic is hairpin'd using a loopback IP address. We skip policy
enforcement on egress, as a pod is allowed to connect to itself, and users don't
have to specify additional policies to allow the hairpin'd traffic. We have a
similar expectation on the ingress side.
The ingress side was broken because of a regression due to which replies were
dropped on the ingress policy enforcement path. The patch that introduced the
regression split the per-packet load-balancing logic into a separate tail
call, where (limited) LB state is stored in the packet context and restored later
while handling the rest of the packet processing, including conntrack and policy
enforcement.
When a service pod connects to itself via its clusterIP, post the forward
service translation, conntrack state update is done on the conntrack entry
with loopback address (the restored state). As a result, when a reply is
reverse SNAT'd and DNAT'd, the policy verdict for the `CT_NEW` entry is
denied, and the reply packet is dropped.  Prior to the regression, the
original conntrack entry would have the updated state, and loopback flag set.

Fix: When hairpin'd traffic is reverse translated, and sent for a local delivery
in case of bpf_lxc, we should directly redirect it to the endpoint, thereby
skipping the tail call with ingress policy enforcement altogether.

Here is the packet flow -

Request path: podA -> clusterIP --> svc xlate (dAddr=podA) --> SNAT using
loopback address (saddr=loopback IP) --> conntrack entry created (loopback IP,
backend IP) with loopback flag state

Reply path: cluster IP -> podA --> svc rev-xlate (saddr=podA address) --> SNAT
using loopback (daddr=loopback IP) --> CT_REPLY match --> local_delivery

Fixes: 7575ba0 ("datapath: Reduce from LXC complexity")
Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Timo Beckers <timo@isovalent.com>
[ upstream commit a064973 ]

The test is meant to check that a TCPv4 packet addressed to
FRONTEND_IP:FRONTEND_PORT gets DNATed to BACKEND_IP:BACKEND_PORT and
TXed back out.

But right now we actually assert that the destination port *doesn't* get
DNATed, and the test still works! The reason is that the .frag_off field
in the IPv4 header isn't built correctly (it has the wrong endianness),
so the LB code in lb4_xlate() believes that the packet doesn't have a
L4 header and consequently also doesn't apply L4 DNAT on the packet.

Fix both aspects.

Fixes: df1bc96 ("bpf: ported existing unit tests to new testing framework")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>
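A small Go illustration of the endianness point (a sketch, not the BPF unit test itself; the DF-only value 0x4000 is just an example): the IPv4 frag_off field carries flags plus a 13-bit offset in network byte order, so writing it in host order on a little-endian machine makes a non-fragmented packet look like a later fragment, which is exactly the case where no L4 header is expected.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	const fragOff = 0x4000 // DF flag set, fragment offset 0

	buf := make([]byte, 2)

	// Correct: network byte order, parsers see DF set and offset 0.
	binary.BigEndian.PutUint16(buf, fragOff)
	fmt.Printf("network order: %#x\n", binary.BigEndian.Uint16(buf))

	// Wrong on little-endian hosts: the bytes are swapped, so a parser
	// reads fragment offset 64, i.e. a non-first fragment with no L4 header.
	binary.LittleEndian.PutUint16(buf, fragOff)
	fmt.Printf("host order:    %#x\n", binary.BigEndian.Uint16(buf))
}
```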
[ upstream commit 5c522b2 ]

DSR NAT entries don't have the .host_local flag set, see e.g.
snat_v4_create_dsr().

This was presumably just copy&pasted from the other NAT entries in the
file. It currently doesn't make a difference for the test, but let's still
fix it.

Fixes: 7904650 ("ctmap: add support for GC of DSR orphaned entries")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>
[ upstream commit b006a84 ]

When the clustermesh-apiserver Pod is deployed on an IPv6 single-stack
cluster, etcd fails to start up with an error like this.

```
invalid value "https://127.0.0.1:2379,https://a:1:0:3::fc75:2379" for flag -listen-client-urls: URL address does not have the form "host:port": https://a:1:0:3::fc75:2379
```

This happens because we don't put brackets around the IPv6 address. Fix
the Helm template to handle this correctly.

Fixes: cilium#22952

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>
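For illustration (a minimal Go sketch of the general rule, not the Helm template itself), net.JoinHostPort shows the form etcd expects: IPv6 hosts must be bracketed before the port is appended.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	for _, host := range []string{"127.0.0.1", "a:1:0:3::fc75"} {
		// Naive concatenation: ambiguous for IPv6 literals, rejected by etcd.
		fmt.Printf("naive:  https://%s:2379\n", host)
		// JoinHostPort brackets IPv6 literals: "[a:1:0:3::fc75]:2379".
		fmt.Printf("joined: https://%s\n", net.JoinHostPort(host, "2379"))
	}
}
```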
This requirement was removed in 7435adb, but new test files are now
trickling down into 1.12 through backports. Skip checking tags here
as well.

Signed-off-by: Timo Beckers <timo@isovalent.com>
[ upstream commit 6554eff ]

Don't leave dangling IP routes behind; they can impact subsequent tests.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: André Martins <andre@cilium.io>
[ upstream commit 87a916c ]

This most likely just papers over bugs in other tests that fail to tear
down their custom routes.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm force-pushed the pr/v1.12-backport-2023-01-13 branch from 132a953 to 51a0e7c Compare January 18, 2023 10:33
@aanm aanm (Member) commented Jan 18, 2023

/test-backport-1.12

Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L4 policy Tests NodePort with L4 Policy

Failure Output

FAIL: Request from k8s1 to service tftp://[fd04::12]:31589/hello failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed:


Test Name

K8sKafkaPolicyTest Kafka Policy Tests KafkaPolicies

Failure Output

FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-4.9 so I can create one.

@ldelossa ldelossa (Contributor) commented

ConformanceGKE (ci-gke-1.12) — Connectivity test failed - Unrelated to changes:

Run while [ "$(gcloud container operations list --filter="status=RUNNING AND targetLink~cilium-cilium-3948015237" --format="value(name)")" ];do
ERROR: (gcloud.container.clusters.delete) Some requests did not succeed:
 - args: ["ResponseError: code=404, message=Not found: projects/***/zones/us-west2-a/clusters/cilium-cilium-3948015237.\nNo cluster named 'cilium-cilium-3948015237' in ***."]
   exit_code: 1

k8s-1.19-kernel-4.9 (test-1.19-4.9) — Build finished. - #22578

k8s-1.21-kernel-4.9 (test-1.21-4.9) — Build finished. - #22578

@christarazi christarazi (Member) commented

@ldelossa Are the underlying errors the same in the failure vs. the flake? The failure reports a level=error log message, which is typically something we should scrutinize more closely.

@ldelossa ldelossa (Contributor) commented

BPF checks / build datapath (pull_request) Failing after 6m - looks legitimate. The Go-driven BPF tests are complaining about a minimum kernel version.

That test is using an Ubuntu 22.04 image: https://github.com/cilium/cilium/actions/runs/3947994288/workflow#L75
That should be a newer kernel than 4.12.

The error message bpf_test.go:261: error while running setup prog: can't test program: BPF_PROG_TEST_RUN not supported (requires >= v4.12) is returned from cilium/ebpf.Program.Test().
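For reference, here is a minimal standalone probe (a sketch, not the Cilium test harness) that exercises BPF_PROG_TEST_RUN through cilium/ebpf the same way the failing test does; it assumes a cilium/ebpf version that still exposes Program.Test.

```go
package main

import (
	"fmt"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/asm"
)

func main() {
	// Trivial socket-filter program that just returns 0.
	prog, err := ebpf.NewProgram(&ebpf.ProgramSpec{
		Type: ebpf.SocketFilter,
		Instructions: asm.Instructions{
			asm.Mov.Imm(asm.R0, 0),
			asm.Return(),
		},
		License: "MIT",
	})
	if err != nil {
		log.Fatalf("loading program: %v", err)
	}
	defer prog.Close()

	// Program.Test issues BPF_PROG_TEST_RUN with a dummy Ethernet-sized
	// payload; on kernels without the feature this returns an error like
	// the one quoted above.
	ret, _, err := prog.Test(make([]byte, 14))
	if err != nil {
		log.Fatalf("BPF_PROG_TEST_RUN unavailable: %v", err)
	}
	fmt.Println("BPF_PROG_TEST_RUN supported, return value:", ret)
}
```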

@ldelossa ldelossa (Contributor) commented

@christarazi Those tests fail consistently with:

12:08:34 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-vt8gd': Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
	 
	 command terminated with exit code 

Which appears to be a known flake.

@joestringer joestringer (Member) commented

@ldelossa Where did you get the comment immediately above from? Are you sure that's the failure and not an intermediate report for a specific STEP in the test? I don't see FAIL:, and when I follow the links from the checks box below for the k8s-1.x-kernel-4.9 failures, I don't see the above as the cause.

@christarazi christarazi (Member) commented Jan 18, 2023

@ldelossa That was just a transient error that happened earlier. If you scroll down further, you should see

=== Test Finished at 2022-12-06T12:09:26Z====
12:09:26 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathConfig
FAIL: Found 1 io.cilium/app=operator logs matching list of errors that must be investigated:
level=error

Edit: @joestringer beat me to it. He's pointing out the same thing.

@ldelossa ldelossa (Contributor) commented

Aha, I see.

In that case:

k8s-1.21-kernel-4.9 (test-1.21-4.9) — Build finished. - Existing flake: #21279

Will re-run for the 1.19 error, since I do not see any issue for that Kafka-related failure.

@ldelossa ldelossa (Contributor) commented

/test-1.19-4.9

@ldelossa ldelossa (Contributor) commented

/test-1.19-4.9

Flake; I'll open a ticket to track that previous failure.

@ldelossa ldelossa (Contributor) commented

This backport needed some sugar on top: #23190

The original PR is opened from Timo's fork, which I can't push to, so we'll use the new PR above to merge in the backports.

@ti-mo ti-mo (Contributor, Author) commented Jan 20, 2023

Superseded by #23190.

@ldelossa For posterity, this PR had 'Allow edits by maintainers' enabled, so a gh pr checkout ti-mo:pr/v1.12-backport-2023-01-13 should've set things up for you to push to this branch without issues.

@ti-mo ti-mo closed this Jan 20, 2023
@ldelossa ldelossa (Contributor) commented

@ti-mo Yup, now I know, my bad. I should have just given it a try, but wasn't sure if you wanted pushes directly to your fork.

@ldelossa ldelossa mentioned this pull request Jan 23, 2023