
v1.12 backports 2023-01-13 #23092

Closed · wants to merge 22 commits

Conversation

@ti-mo ti-mo (Contributor) commented Jan 13, 2023

Once this PR is merged, you can update the PR labels via:

$ for pr in 20619 21267 22620 22821 22794 22807 22745 22678 22930 22985 22884 22977 22972 23018 23037 22962 21857; do contrib/backporting/set-labels.py $pr done 1.12; done

PRs dropped from this round:

Manual fix added to skip check-missing-tags-in-tests.sh in v1.12 branch due to 7435adb.

@ti-mo ti-mo requested review from a team as code owners January 13, 2023 16:15
@ti-mo ti-mo requested a review from nbusseneau January 13, 2023 16:15
@ti-mo ti-mo added backport/1.12 This PR represents a backport for Cilium 1.12.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Jan 13, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jan 13, 2023
@ti-mo ti-mo (Contributor, Author) commented Jan 13, 2023

/test-backport-1.12

Job 'Cilium-PR-K8s-1.22-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from k8s1 to service http://[fd04::12]:30603 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from testclient-fglxk pod to service http://[fd04::11]:31546 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:


Test Name

K8sDatapathConfig Host firewall With VXLAN and endpoint routes

Failure Output

FAIL: Failed to reach 10.0.1.33:80 from testclient-host-255wm

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

@tklauser tklauser (Member) left a comment

Looks good for my change, thanks.

@aditighag aditighag (Member) left a comment

Backport changes for my PR look good. Thanks!

@chancez chancez (Contributor) commented Jan 13, 2023

PR created to backport:

@pippolo84 pippolo84 (Member) left a comment

LGTM for PR #22745, thanks! 💯

@julianwiedmann julianwiedmann (Member) left a comment

Looks good, thanks!

@julianwiedmann julianwiedmann (Member) commented

The error in 1.24-net-next

RTNETLINK answers: File exists

makes me think we need #21857 for v1.12.

@YutaroHayakawa YutaroHayakawa (Member) left a comment

My change looks good. Thanks!

[ upstream commit 1db1156 ]

With cilium/cilium-cli#962 in place in
cilium-cli v0.12.0 and the CI update to use that version in
cilium#20617, the connectivity tests
cover all functionality tested by the tests in l7_demos.go. Moreover,
the cilium-cli connectivity tests can be run against arbitrary clusters
with Cilium deployed, while this test is specific to the test setup
based on Vagrant VMs. Thus, drop this test.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Timo Beckers <timo@isovalent.com>
[ upstream commit baf7f34 ]

Fredrik Björkman has reported that the BPF L7 Ingress (and thus Gateway
API) does not work when Cilium's XDP is enabled.

A quick glimpse into the nodeport_lb4/6 revealed that in the case of an
L7 service the ctx_redirect_to_proxy_hairpin_ipv4/6() is called. The
latter invokes the ctx_redirect() to the cilium_host. Unfortunately,
the redirect does not work when called from the bpf_xdp, as the
cilium_host doesn't have any XDP prog attached.

Fix this by offloading the L7 handling to the bpf_host at the TC BPF
ingress.

Reported-by: Fredrik Björkman <fredrik.bjorkman1@ingka.ikea.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Timo Beckers <timo@isovalent.com>
@aanm aanm force-pushed the pr/v1.12-backport-2023-01-13 branch from ca91ae1 to 132a953 Compare January 18, 2023 00:53
@aanm aanm (Member) commented Jan 18, 2023

/test-backport-1.12

Job 'Cilium-PR-K8s-1.22-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from k8s1 to service http://[fd03::6e08]:10080 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from testclient-bmwjp pod to service http://[fd04::12]:30130 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.9 so I can create one.

dlapcevic and others added 9 commits January 18, 2023 11:33
[ upstream commit 91d52e4 ]

Avoid the operator crashing when trying to access a field of a nil CESTracker. This can happen rarely with two consecutive CEP deletions that belong to the same CES. While the first CES update is being processed, the second CEP deletion causes the CES in cache to become empty, and a CES update gets enqueued anyway, before the first CES update is finished. The first CES update then deletes the CES tracker, as the CES has 0 CEPs in it. Finally, the second CES update runs into the error.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>
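A minimal Go sketch of the guard this fix implies (names like cesManager and cesTracker are hypothetical, not the actual operator types): look up the tracker for the CES and return early when an earlier update has already removed it, instead of dereferencing a nil pointer.

```go
package main

import (
	"fmt"
	"sync"
)

// cesTracker and cesManager are hypothetical stand-ins for the operator's
// CiliumEndpointSlice bookkeeping.
type cesTracker struct {
	cepCount int
}

type cesManager struct {
	mu       sync.Mutex
	trackers map[string]*cesTracker
}

// processCESUpdate handles one queued CES update. A racing update for the
// same CES may find the tracker already deleted, so nil/absent means
// "nothing left to do" rather than a crash.
func (m *cesManager) processCESUpdate(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()

	t, ok := m.trackers[name]
	if !ok || t == nil {
		return // tracker was removed when the CES dropped to 0 CEPs
	}
	if t.cepCount == 0 {
		delete(m.trackers, name)
		return
	}
	fmt.Printf("syncing %s with %d CEPs\n", name, t.cepCount)
}

func main() {
	m := &cesManager{trackers: map[string]*cesTracker{}}
	m.processCESUpdate("ces-1") // racing update: tracker absent, no panic
}
```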
[ upstream commit a0a04db ]

When listing a table or chain with '-L', iptables will by default try to
perform a reverse IP lookup for each IP referenced in the ruleset that
is being listed.

This adds a useless dependency on DNS, which can lead to an increased
initialization time for the agent in case the DNS server is slow to
respond.

As the reverse lookup is not needed, switch to the '-S' command, which
does not perform any reverse lookup.

Signed-off-by: Gilberto Bertin <jibi@cilium.io>
Signed-off-by: Timo Beckers <timo@isovalent.com>
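As a rough illustration of the flag change (a minimal sketch, not the agent's actual iptables wrapper), the difference is simply which listing command gets executed:

```go
package main

import (
	"fmt"
	"os/exec"
)

// listRules dumps the rules of a table with 'iptables -S', which prints
// rules in iptables-save format and never performs reverse DNS lookups,
// unlike the default behaviour of 'iptables -L'.
func listRules(table string) (string, error) {
	out, err := exec.Command("iptables", "-t", table, "-S").CombinedOutput()
	return string(out), err
}

func main() {
	rules, err := listRules("filter")
	if err != nil {
		fmt.Println("iptables failed:", err)
		return
	}
	fmt.Print(rules)
}
```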
[ upstream commit 2c9c8c1 ]

When a service backend endpoint connects to itself via its service cluster IP
address, the traffic is hairpin'd using a loopback IP address. We skip policy
enforcement on egress, as a pod is allowed to connect to itself, and users don't
have to specify additional policies to allow the hairpin'd traffic. We have a
similar expectation on the ingress side.
The ingress side was broken because of a regression due to which replies were
dropped on the ingress policy enforcement path. The patch that introduced the
regression split the per-packet load-balancing logic into a separate tail
call, where (limited) LB state is stored in the packet context and restored later
while handling the rest of the packet processing, including conntrack and policy
enforcement.
When a service pod connects to itself via its clusterIP, post the forward
service translation, conntrack state update is done on the conntrack entry
with loopback address (the restored state). As a result, when a reply is
reverse SNAT'd and DNAT'd, the policy verdict for the `CT_NEW` entry is
denied, and the reply packet is dropped.  Prior to the regression, the
original conntrack entry would have the updated state, and loopback flag set.

Fix: When hairpin'd traffic is reverse translated, and sent for a local delivery
in case of bpf_lxc, we should directly redirect it to the endpoint, thereby
skipping the tail call with ingress policy enforcement altogether.

Here is the packet flow -

Request path: podA -> clusterIP --> svc xlate (dAddr=podA) --> SNAT using
loopback address (saddr=loopback IP) --> conntrack entry created (loopback IP,
backend IP) with loopback flag state

Reply path: cluster IP -> podA --> svc rev-xlate (saddr=podA address) --> SNAT
using loopback (daddr=loopback IP) --> CT_REPLY match --> local_delivery

Fixes: 7575ba0 ("datapath: Reduce from LXC complexity")
Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Timo Beckers <timo@isovalent.com>
[ upstream commit a064973 ]

The test is meant to check that a TCPv4 packet addressed to
FRONTEND_IP:FRONTEND_PORT gets DNATed to BACKEND_IP:BACKEND_PORT and
TXed back out.

But right now we actually assert that the destination port *doesn't* get
DNATed, and the test still works! The reason is that the .frag_off field
in the IPv4 header isn't built correctly (it has the wrong endianness),
so the LB code in lb4_xlate() believes that the packet doesn't have a
L4 header and consequently also doesn't apply L4 DNAT on the packet.

Fix both aspects.

Fixes: df1bc96 ("bpf: ported existing unit tests to new testing framework")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>
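A small Go illustration of the endianness point (a sketch, not the BPF unit test itself; the DF-only value 0x4000 is just an example): the IPv4 frag_off field carries flags plus a 13-bit offset in network byte order, so writing it in host order on a little-endian machine makes a non-fragmented packet look like a later fragment, which is exactly the case where no L4 header is expected.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	const fragOff = 0x4000 // DF flag set, fragment offset 0

	buf := make([]byte, 2)

	// Correct: network byte order, parsers see DF set and offset 0.
	binary.BigEndian.PutUint16(buf, fragOff)
	fmt.Printf("network order: %#x\n", binary.BigEndian.Uint16(buf))

	// Wrong on little-endian hosts: the bytes are swapped, so a parser
	// reads fragment offset 64, i.e. a non-first fragment with no L4 header.
	binary.LittleEndian.PutUint16(buf, fragOff)
	fmt.Printf("host order:    %#x\n", binary.BigEndian.Uint16(buf))
}
```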
[ upstream commit 5c522b2 ]

DSR NAT entries don't have the .host_local flag set, see e.g.
snat_v4_create_dsr().

This was presumably just copy&pasted from the other NAT entries in the
file. It currently doesn't make a difference for the test, but let's still
fix it.

Fixes: 7904650 ("ctmap: add support for GC of DSR orphaned entries")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>
[ upstream commit b006a84 ]

When the clustermesh-apiserver Pod is deployed on an IPv6 single-stack
cluster, etcd fails to start up with an error like this.

```
invalid value "https://127.0.0.1:2379,https://a:1:0:3::fc75:2379" for flag -listen-client-urls: URL address does not have the form "host:port": https://a:1:0:3::fc75:2379
```

This happens because we don't put brackets around the IPv6 address. Fix
the Helm template to handle this correctly.

Fixes: cilium#22952

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>
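For illustration (a minimal Go sketch of the general rule, not the Helm template itself), net.JoinHostPort shows the form etcd expects: IPv6 hosts must be bracketed before the port is appended.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	for _, host := range []string{"127.0.0.1", "a:1:0:3::fc75"} {
		// Naive concatenation: ambiguous for IPv6 literals, rejected by etcd.
		fmt.Printf("naive:  https://%s:2379\n", host)
		// JoinHostPort brackets IPv6 literals: "[a:1:0:3::fc75]:2379".
		fmt.Printf("joined: https://%s\n", net.JoinHostPort(host, "2379"))
	}
}
```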
This requirement was removed in 7435adb, but new test files are now
trickling down into 1.12 through backports. Skip checking tags here
as well.

Signed-off-by: Timo Beckers <timo@isovalent.com>
[ upstream commit 6554eff ]

Don't leave dangling IP routes behind; they can impact subsequent tests.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: André Martins <andre@cilium.io>
[ upstream commit 87a916c ]

This most likely just papers over bugs in other tests that fail to tear
down their custom routes.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm force-pushed the pr/v1.12-backport-2023-01-13 branch from 132a953 to 51a0e7c Compare January 18, 2023 10:33
@aanm aanm (Member) commented Jan 18, 2023

/test-backport-1.12

Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) with L4 policy Tests NodePort with L4 Policy

Failure Output

FAIL: Request from k8s1 to service tftp://[fd04::12]:31589/hello failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed:


Test Name

K8sKafkaPolicyTest Kafka Policy Tests KafkaPolicies

Failure Output

FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-4.9 so I can create one.

@ldelossa ldelossa (Contributor) commented

ConformanceGKE (ci-gke-1.12) — Connectivity test failed - Unrelated to changes:

Run while [ "$(gcloud container operations list --filter="status=RUNNING AND targetLink~cilium-cilium-3948015237" --format="value(name)")" ];do
ERROR: (gcloud.container.clusters.delete) Some requests did not succeed:
 - args: ["ResponseError: code=404, message=Not found: projects/***/zones/us-west2-a/clusters/cilium-cilium-3948015237.\nNo cluster named 'cilium-cilium-3948015237' in ***."]
   exit_code: 1

k8s-1.19-kernel-4.9 (test-1.19-4.9) — Build finished. - #22578

k8s-1.21-kernel-4.9 (test-1.21-4.9) — Build finished. - #22578

@christarazi christarazi (Member) commented

@ldelossa Are the underlying errors the same in the failure vs. the flake? The failure reports a level=error log message, which is typically something we should scrutinize more closely.

@ldelossa ldelossa (Contributor) commented

BPF checks / build datapath (pull_request) Failing after 6m - looks legitimate. The Go-driven BPF tests are complaining about a minimum kernel version.

That test is using an Ubuntu 22.04 image: https://github.com/cilium/cilium/actions/runs/3947994288/workflow#L75
That should be a newer kernel than 4.12.

The error message bpf_test.go:261: error while running setup prog: can't test program: BPF_PROG_TEST_RUN not supported (requires >= v4.12) is returned from cilium/ebpf.Program.Test().
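For reference, here is a minimal standalone probe (a sketch, not the Cilium test harness) that exercises BPF_PROG_TEST_RUN through cilium/ebpf the same way the failing test does; it assumes a cilium/ebpf version that still exposes Program.Test.

```go
package main

import (
	"fmt"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/asm"
)

func main() {
	// Trivial socket-filter program that just returns 0.
	prog, err := ebpf.NewProgram(&ebpf.ProgramSpec{
		Type: ebpf.SocketFilter,
		Instructions: asm.Instructions{
			asm.Mov.Imm(asm.R0, 0),
			asm.Return(),
		},
		License: "MIT",
	})
	if err != nil {
		log.Fatalf("loading program: %v", err)
	}
	defer prog.Close()

	// Program.Test issues BPF_PROG_TEST_RUN with a dummy Ethernet-sized
	// payload; on kernels without the feature this returns an error like
	// the one quoted above.
	ret, _, err := prog.Test(make([]byte, 14))
	if err != nil {
		log.Fatalf("BPF_PROG_TEST_RUN unavailable: %v", err)
	}
	fmt.Println("BPF_PROG_TEST_RUN supported, return value:", ret)
}
```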

@ldelossa ldelossa (Contributor) commented

@christarazi Those tests fail consistently with:

12:08:34 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-vt8gd': Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
	 
	 command terminated with exit code 

Which appears to be a known flake.

@joestringer joestringer (Member) commented

@ldelossa Where did you get the comment immediately above from? Are you sure that's the failure and not an intermediate report for a specific STEP in the test? I don't see FAIL:, and when I follow the links from the checks box below for the k8s-1.x-kernel-4.9 failures, I don't see the above as the cause.

@christarazi christarazi (Member) commented Jan 18, 2023

@ldelossa That was just a transient error that happened earlier. If you scroll down further, you should see

=== Test Finished at 2022-12-06T12:09:26Z====
12:09:26 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathConfig
FAIL: Found 1 io.cilium/app=operator logs matching list of errors that must be investigated:
level=error

Edit: @joestringer beat me to it. He's pointing out the same thing.

@ldelossa ldelossa (Contributor) commented

Aha, I see.

In that case:

k8s-1.21-kernel-4.9 (test-1.21-4.9) — Build finished. - Existing flake: #21279

Will re-run for the 1.19 error, since I do not see any issue for that Kafka-related failure.

@ldelossa ldelossa (Contributor) commented

/test-1.19-4.9

@ldelossa ldelossa (Contributor) commented

/test-1.19-4.9

Flake; I'll open a ticket to track that previous failure.

@ldelossa ldelossa (Contributor) commented

This backport needed some sugar on top: #23190

The original PR is opened from Timo's fork, which I can't push to, so we'll use the new PR above to merge in the backports.

@ti-mo ti-mo (Contributor, Author) commented Jan 20, 2023

Superseded by #23190.

@ldelossa For posterity, this PR had 'Allow edits by maintainers' enabled, so a gh pr checkout ti-mo:pr/v1.12-backport-2023-01-13 should've set things up for you to push to this branch without issues.

@ti-mo ti-mo closed this Jan 20, 2023
@ldelossa ldelossa (Contributor) commented

@ti-mo Yup, now I know, my bad. I should have just given it a try, but wasn't sure if you wanted pushes directly to your fork.

@ldelossa ldelossa mentioned this pull request Jan 23, 2023