Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.13 backport 2023-01-05 #22948

Merged
merged 18 commits into from
Jan 13, 2023
Merged

v1.13 backport 2023-01-05 #22948

merged 18 commits into from
Jan 13, 2023

Conversation

borkmann
Copy link
Member

@borkmann borkmann commented Jan 5, 2023

Manual 1.13 backport of:

Once this PR is merged, you can update the PR labels via:

$ for pr in 22421 22814 22915 23012; do contrib/backporting/set-labels.py $pr done 1.13; done

borkmann and others added 16 commits January 5, 2023 12:25
[ upstream commit 6d8f4a8 ]

This work completes the standalone egress GW functionality by implementing
generic nat64 without having to configure specific services for remap:

On the Gateway node:

  # ./daemon/cilium-agent --enable-ipv4=true --enable-ipv6=true \
    --datapath-mode=lb-only --bpf-lb-algorithm=maglev \
    --bpf-lb-maglev-table-size=2039 --bpf-lb-acceleration=disabled \
    --devices=enp5s0 --disable-envoy-version-check=true \
    --enable-stateless-nat46x64=true

  # ./cilium/cilium service list
  ID   Frontend   Service Type   Backend
  #

  # tcpdump -i enp5s0 port 80 -n
  [...]
  12:06:05.823733 IP6 2a02:168:f656:0:1ac0:4dff:ff01:d5e6.54054 > 64:ff9b::8c52:7903.80: Flags [S], seq 2743947568, win 43200, options [mss 1440,sackOK,TS val 3724070640 ecr 0,nop,wscale 9], length 0
  12:06:05.823745 IP 192.168.2.11.54054 > 140.82.121.3.80: Flags [S], seq 2743947568, win 43200, options [mss 1440,sackOK,TS val 3724070640 ecr 0,nop,wscale 9], length 0
  12:06:05.830454 IP 140.82.121.3.80 > 192.168.2.11.54054: Flags [S.], seq 433570079, ack 2743947569, win 65535, options [mss 1436,sackOK,TS val 1565141015 ecr 3724070640,nop,wscale 10], length 0
  12:06:05.830479 IP6 64:ff9b::8c52:7903.80 > 2a02:168:f656:0:1ac0:4dff:ff01:d5e6.54054: Flags [S.], seq 433570079, ack 2743947569, win 65535, options [mss 1436,sackOK,TS val 1565141015 ecr 3724070640,nop,wscale 10], length 0
  [...]

The client only needs to add a next hop entry in its routing table, no
special source address selection is needed:

  # ip -6 r
  [...]
  64:ff9b::/96 via 2a02:168:f656:0:1ac0:4dff:ff01:c164 dev enp5s0 metric 1024 pref medium
  [...]

  # curl --verbose "http://[64:ff9b::101:101]"
  *   Trying 64:ff9b::101:101:80...
  * TCP_NODELAY set
  * Connected to 64:ff9b::101:101 (64:ff9b::101:101) port 80 (#0)
  > GET / HTTP/1.1
  > Host: [64:ff9b::101:101]
  > User-Agent: curl/7.68.0
  > Accept: */*
  >
  * Mark bundle as not supporting multiuse
  < HTTP/1.1 403 Forbidden
  < Date: Tue, 29 Nov 2022 13:30:30 GMT
  < Content-Type: text/plain; charset=UTF-8
  < Content-Length: 16
  < Connection: close
  < X-Frame-Options: SAMEORIGIN
  < Referrer-Policy: same-origin
  < Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
  < Expires: Thu, 01 Jan 1970 00:00:01 GMT
  < Server: cloudflare
  < CF-RAY: 771bb2c739d5cc3e-ZRH
  <
  * Closing connection 0
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 225b31c ]

[ Fixed merge conflict in test suite to use updated helm config. ]

Rename the agent setting --enable-stateless-nat46x64 into a more generic
--enable-nat46x64-gateway, given the plan is to enable both stateless and
statefull translation in this case as both can be supported in parallel.
This has not seen a stable release yet, so rename is fine.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
…atus

[ upstream commit 64d7518 ]

For cilium status dump we want to reflect the state of services vs
gateway support on NAT46/64. Therefore extend both into objects which
can be extended further if needed. Also for the GW we expose the current
fixed default prefix. In future, this may be a list of prefixes.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
…status

[ upstream commit 80d9cbc ]

Auto-generated API code in here.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
…tatus

[ upstream commit 94003f2 ]

Add status dump and separate both Service/Gateway. Also dump the prefix for
the gateway. It's hard coded here, but subsequent commit will move this into
defaults package. In future this can be made configurable and potentially we
might support multiple prefixes.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 4631570 ]

Refactor the code such that we can easily expose it as a config knob and
have the underlying code be generic.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 148f71a ]

This cannot be true if err == nil and can also be removed in
validateIPv6ClusterAllocCIDR.

Also, s/CIDR length must be/prefix length must be/.

Suggested-by: Timo Beckers <timo@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit d8c7d93 ]

For new code, preference is to use net/netip package.

Suggested-by: Timo Beckers <timo@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 3f616d4 ]

Move neighbor handling code into helper functions as we're going to
reuse them also in other locations.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit f8ca297 ]

Given original requests might originate from the same L2 network, we
need to record them in the neighbor table as otherwise the gateway node
may not be aware of them since they are not in the kernel's neighbor
table.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 1582b48 ]

Remove any ambiguity around CB_NAT_46X64 when its read from the tail call
CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS by setting it for the two locations
where its not yet done.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 3f125d8 ]

For the case when IPv4 replies comes from the remote destination and we
need to retransform the packet into an IPv6 one via reverse SNAT, what
is needed is fib lookup for the target ifindex as well as figuring out
the next hop.

While the code hits rev_nodeport_lb6(), we don't have a CT entry there
given these NAT gateway entries are not service-related. What happened
was that it hit upper stack for recirculation given we did not enter the
branch where we do fib lookup and push it back out.

This was also why for XDP some traffic could be observed via tcpdump
which is obviously a bug. The two snat_remap* cases in IPv4 code path
differ in that the first case the IPv4 reply from remote comes back in,
and given a service must have been configured, it'll find a rev_nodeport_lb6()
CT lookup, but for the generic stateful NAT64 GW gateway this is not the
case.

However, we still need to enter the fib_lookup(): While there is no
reverse DNAT in this case, we need to find the ifindex, and lookup the
next hop with fallback to our BPF neighbor table if the original client
was not seen in the regular neighbor table of the kernel.

This fixes the bug, and also addresses another one at the same time, that
is, if the original request came from a client in the same L2, we can now
figure out the recorded neighbor address via BPF map as fallback if it was
not in the neighbor cache.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 433bddc ]

This is just duplicated code from when the Ingress and Egress NAT paths
were split up. In today's code, nothing ever sets the NAT46x64_MODE_XLATE
flag before calling to CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS.

Fixes: 536ad0c ("bpf/nodeport: split the cilium call ipv6 nodeport nat")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 57a59c4 ]

Use the packet's actual IP header for the FIB lookup (instead of the CT
tuple). They should be identical, but this
(1) clarifies that we really use the content of the translated packet, and
(2) allows us to run through the FIB code without building the CT tuple.

To avoid a revalidation after the maybe_add_l2_hdr() call, we re-use the
information already stored in the FIB param struct.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 443b75c ]

The NAT64 reply path piggy-backs on the FIB code in IPv6 RevDNAT. But
there's no more need to extract the CT tuple, the FIB code just uses the
packet's IPv6 header now.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit d329654 ]

Re-sync with bpf/lib/common.h

Fixes: 7a0a2c8 ("bpf: Add stateless RFC8215 NAT46/64 for standalone lb")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann borkmann requested review from a team as code owners January 5, 2023 16:10
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Jan 5, 2023
@borkmann
Copy link
Member Author

borkmann commented Jan 5, 2023

/test-backport-1.13

Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed:

Click to show.

Test Name

K8sAgentPolicyTest Basic Test TLS policy

Failure Output

FAIL: Unable to retrieve all pods to restart unmanaged pods with 'kubectl -n kube-system get pods -o json': Exitcode: -1 

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed:

Click to show.

Test Name

K8sAgentIstioTest Istio Bookinfo Demo Tests bookinfo inter-service connectivity

Failure Output

FAIL: Found 1 io.cilium/app=operator logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.9 so I can create one.

@borkmann
Copy link
Member Author

borkmann commented Jan 6, 2023

k8s-1.19-kernel-4.9 pipeline is related to known issue #22849
k8s-1.20-kernel-4.9 pipeline is related to known issue #22578
k8s-1.22-kernel-4.9 pipeline is related to known issue #22850

Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bpf parts look good, thanks!

@borkmann
Copy link
Member Author

borkmann commented Jan 9, 2023

(Currently investigating the l4lb issue, can repro locally on tc BPF in master too. XDP works. Other than that, all ci green except ci-l4lb.)

@borkmann borkmann force-pushed the pr/v1.13-backport-2023-01-05 branch 2 times, most recently from fc99e32 to cb33581 Compare January 10, 2023 00:07
[ upstream commit f9af2df ]

CB_SRC_IDENTITY is defined as 0 in nodeport header. This is actually the
same as CB_SRC_LABEL defined in common.h. Clean this up to make it more
obvious.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit ad11484 ]

Remap CB_NAT_46X64 to CB_IFINDEX as it does collide in some cases with
CB_SRC_LABEL inside nodeport_lb6():

  [...]
  ctx_store_meta(ctx, CB_NAT_46X64, 0);
  ctx_store_meta(ctx, CB_SRC_LABEL, src_identity);
  ep_tail_call(ctx, CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS);
  return DROP_MISSED_TAIL_CALL;

The issue has been observed in case of IPv4 frontend and IPv6 backend
address. When the IPv6 reply from remote comes back to the gateway, then
src_identity is 2 (world), and therefore CB_NAT_46X64 becomes true later
on in the tail call even though it's not supposed to be. This has been
observed in tc BPF while XDP BPF kept functioning given in tc BPF we
extract identities in bpf_host:

  [...]
  case bpf_htons(ETH_P_IPV6):
     identity = resolve_srcid_ipv6(ctx, identity, from_host);
     ctx_store_meta(ctx, CB_SRC_LABEL, identity);
     [...]
  case bpf_htons(ETH_P_IP):
     identity = resolve_srcid_ipv4(ctx, identity, &ipcache_srcid, from_host);
     ctx_store_meta(ctx, CB_SRC_LABEL, identity);
     [...]

The CB_IFINDEX does not collide with NAT46x64 given in case of LB it's only
used in case of DSR to transfer IPv4/IPv6 service address info via tail call,
whereas NAT46x64 gateway only operates in SNAT mode. Reproducer:

  # ./daemon/cilium-agent --enable-ipv4=true --enable-ipv6=true \
      --datapath-mode=lb-only --bpf-lb-algorithm=maglev --bpf-lb-maglev-table-size=2039 \
      --bpf-lb-acceleration=disabled --devices=enp5s0 --disable-envoy-version-check=true \
      --enable-nat46x64-gateway=true

The --enable-nat46x64-gateway must be enabled for this to trigger.

  # ./cilium/cilium service list
  ID   Frontend     Service Type   Backend
  2    2.2.2.2:80   ExternalIPs    1 => [2a03:2880:f11c:8183:face:b00c:0:25de]:80 (active)

Before fix:

  # curl -4 --verbose http://2.2.2.2:80
  *   Trying 2.2.2.2:80...
  * TCP_NODELAY set
  < fail >

After fix:

  # curl -4 --verbose http://2.2.2.2:80
  *   Trying 2.2.2.2:80...
  * TCP_NODELAY set
  * Connected to 2.2.2.2 (2.2.2.2) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 2.2.2.2
  > User-Agent: curl/7.68.0
  > Accept: */*
  >
  * Mark bundle as not supporting multiuse
  < HTTP/1.1 301 Moved Permanently
  < Location: https://2.2.2.2/
  < Content-Type: text/plain
  < Server: proxygen-bolt
  [...]

Reported-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann
Copy link
Member Author

/ci-l4lb-1.13

@borkmann borkmann added the release-blocker/1.13 This issue will prevent the release of the next version of Cilium. label Jan 11, 2023
@borkmann
Copy link
Member Author

borkmann commented Jan 11, 2023

/test-backport-1.13

Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Failed to reach 192.168.56.11:80 from testclient-894nr

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.18-kernel-4.9' failed:

Click to show.

Test Name

K8sAgentIstioTest Istio Bookinfo Demo Tests bookinfo inter-service connectivity

Failure Output

FAIL: Cannot get reviewsV1 pods

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.18-kernel-4.9 so I can create one.

@borkmann
Copy link
Member Author

/test-upstream-k8s

@borkmann
Copy link
Member Author

gke:

Run gcloud container clusters create cilium-cilium-3895055065 \
Default change: During creation of nodepools or autoscaling configuration changes for cluster versions greater than 1.24.1-gke.800 a default location policy is applied. For Spot and PVM it defaults to ANY, and for all other VM kinds a BALANCED policy is used. To change the default values use the `--location-policy` flag.
Note: The Pod address range limits the maximum size of the cluster. Please refer to https://cloud.google.com/kubernetes-engine/docs/how-to/flexible-pod-cidr to learn how to optimize IP address allocation.
ERROR: (gcloud.container.clusters.create) ResponseError: code=409, message=Already exists: projects/***/zones/us-west2-a/clusters/cilium-cilium-3895055065.
Error: Process completed with exit code 1.

@borkmann
Copy link
Member Author

/test-runtime

@borkmann borkmann requested review from a team as code owners January 12, 2023 08:34
@borkmann borkmann force-pushed the pr/v1.13-backport-2023-01-05 branch 2 times, most recently from 5e9db42 to e0b1dbe Compare January 13, 2023 12:09
@borkmann
Copy link
Member Author

borkmann commented Jan 13, 2023

(L4LB suite requires a full backport/sync with master. Related GH, discussing with Martynas about needed PRs, then I'll give it a go. All other tests were green or known issues (above mentioned).)

@borkmann borkmann added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 13, 2023
@borkmann borkmann merged commit 51645bc into v1.13 Jan 13, 2023
@borkmann borkmann deleted the pr/v1.13-backport-2023-01-05 branch January 13, 2023 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.13 This issue will prevent the release of the next version of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants