
Separate tunnel map key and value structure #17106

Closed
wants to merge 1 commit

Conversation

vincentmli
Contributor

@vincentmli vincentmli commented Aug 9, 2021

The tunnel map currently uses the same endpoint_key structure for both
its key and its value. In the future we may add fields to the tunnel map
value structure for new features, such as extending Cilium VXLAN tunnel
connectivity to external VXLAN devices, which would need value fields
like the VNI and the VTEP MAC address. We also need to add a cilium
bpf tunnel command for users to maintain the tunnel map. Separating the
tunnel map key and value structures is a first step toward that goal.

Suggested-by: Joe Stringer joe@cilium.io
Signed-off-by: Vincent Li vincent.mc.li@gmail.com
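
To illustrate the direction of this change, here is a minimal sketch of the split. The field layout simply mirrors the existing endpoint_key and is an assumption for illustration only; the value structure name tunnel_endpoint_info comes from the review discussion below, and the vni/vtep_mac placeholders are possible future fields, not part of this PR.

/* Sketch only -- assumed layout, not copied from the PR.
 * union v6addr and __packed come from bpf/lib/common.h. */

/* Key: identifies the remote endpoint/prefix being looked up. */
struct endpoint_key {
	union {
		struct {
			__u32 ip4;
			__u32 pad1;
			__u32 pad2;
			__u32 pad3;
		};
		union v6addr ip6;
	};
	__u8  family;	/* ENDPOINT_KEY_IPV4 or ENDPOINT_KEY_IPV6 */
	__u8  key;	/* IPsec key index, discussed later in this thread */
	__u16 pad5;
} __packed;

/* Value: a dedicated structure, so fields needed by future features
 * (for example a VNI or a VTEP MAC for external VXLAN devices) can be
 * added without changing the key. */
struct tunnel_endpoint_info {
	union {
		struct {
			__u32 ip4;
			__u32 pad1;
			__u32 pad2;
			__u32 pad3;
		};
		union v6addr ip6;
	};
	__u8  family;
	__u8  key;	/* IPsec key index */
	__u16 pad5;
	/* future: __u32 vni; __u8 vtep_mac[6]; ... */
} __packed;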

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Thanks for contributing!

@maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Aug 9, 2021
@vincentmli
Contributor Author

test-me-please

@vincentmli
Contributor Author

@joestringer

Hi Joe, do you have any comments on this PR? Thanks!

@joestringer added the release-note/misc label (This PR makes changes that have no direct user impact.) on Aug 13, 2021
@maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Aug 13, 2021
@joestringer
Member

@vincentmli I didn't look at the details yet, but I've triggered the CI for you. If the tests are passing then I think it would make sense to click the "ready for review" button at the bottom of the page, and then GitHub will assign reviewers to take a look at the PR.

@joestringer
Member

@vincentmli Also, please take a look at the checkboxes and the release note section in the PR description - this is for you to fill out when you submit the PR.

@vincentmli
Contributor Author

@vincentmli Also, please take a look at the checkboxes and the release note section in the PR description - this is for you to fill out when you submit the PR.

@joestringer thanks, I read the release note section. Where or how do I add the release note label when I create the PR? I am new to this :)

@joestringer
Member

joestringer commented Aug 13, 2021

@vincentmli Either replace the text "<!-- Enter the release note text here if needed or remove this section! -->" or remove that part of the description (including the ```release-note bit)

@vincentmli
Contributor Author

@joestringer The ci-eks and ci-gke conformance tests failed; based on the error they seem unrelated to my PR, and I see other PRs hitting the same error. Also, the k8s-kernel test does not seem to have started.

@joestringer
Member

test-me-please

@vincentmli
Contributor Author

@joestringer The k8s-1.16-kernel-netnext (test-1.16-netnext) test failed. I looked at the details here: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1289/testReport/junit/Suite-k8s-1/16/K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_vxlan_Test_NodePort_with_netfilterCompatMode_true/. It appears the test checks a NodePort service across nodes with VXLAN and netfilterCompatMode set to true. I tried to run a similar test locally on my dev machine, setting netfilter-compatible-mode: "true" in the cilium-config ConfigMap and also via helm (helm template cilium cilium/cilium --version 1.10.3 --namespace kube-system --set kubeProxyReplacement=strict --set netfilterCompatMode=true > cilium-netfilter.yaml). What is odd is that I cannot see netfilter-compatible-mode enabled when the cilium agent starts up, and there is no netfilter-compatible-mode in the cilium agent log, but I do see netfilter-compatible-mode in the cilium-test.log from https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1289/testReport/junit/Suite-k8s-1/16/K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_vxlan_Test_NodePort_with_netfilterCompatMode_true/attachments/1ef46005_K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_vxlan_Test_NodePort_with_netfilterCompatMode%3Dtrue.zip:

2021-08-18T02:49:09.787695617Z level=info msg="  --netfilter-compatible-mode='true'" subsys=daemon

I even grepped the cilium source code and could not find anything matching "netfilterCompatMode". I have no connectivity issue on my local dev machine with cross-node NodePort service access from a client pod running with hostNetwork set to true in the host namespace. I wonder whether the passing k8s-1.19-kernel-5.4 (test-1.19-5.4) run also exercises the --netfilter-compatible-mode='true' test. Do you have a clue?

@vincentmli
Contributor Author

I even grepped the cilium source code and could not find anything matching "netfilterCompatMode". ...

My dev machine's cilium source was not synced with the upstream master branch and was missing the "netfilterCompatMode" flag, my bad :( I will sync my dev machine with upstream master and re-test locally.

@vincentmli
Contributor Author

I think I can reproduce the NodePort connection issue with my PR on my local dev machine, so this PR does have a functional impact. I am not sure why; I will dig into it.

With the upstream cilium v1.10.3 image and a NodePort service:

# kubectl get svc -o wide
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE     SELECTOR

nginxservice                        NodePort    10.107.115.135   <none>        80:32506/TCP   3m56s   app=nginx

# kubectl get no -o wide
NAME            STATUS   ROLES                  AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION         CONTAINER-RUNTIME
cilium-demo-1   Ready    control-plane,master   59d   v1.20.5   10.169.72.9    <none>        Ubuntu 20.04.2 LTS   5.8.1-050801-generic   docker://20.10.7
cilium-demo-2   Ready    <none>                 20d   v1.20.5   10.169.72.17   <none>        Ubuntu 20.04.2 LTS   5.8.1-050801-generic   docker://20.10.7
cilium-demo-3   Ready    <none>                 59d   v1.20.5   10.169.72.19   <none>        Ubuntu 20.04.2 LTS   5.8.1-050801-generic   docker://20.10.7

# kubectl get po -o wide
NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES

nginx-z8m75                                     1/1     Running   0          3m47s   10.0.2.187     cilium-demo-2   <none>           <none>

Connection to the node that hosts the nginx pod failed, which is strange:

[root@bigip-cilium-2:Active:Disconnected] config # curl -v http://10.169.72.17:32506/
*   Trying 10.169.72.17...
^C

Connections to the other two nodes, which do not host the nginx pod, succeeded!

[root@bigip-cilium-2:Active:Disconnected] config # curl -v http://10.169.72.9:32506/
*   Trying 10.169.72.9...
* Connected to 10.169.72.9 (10.169.72.9) port 32506 (#0)
> GET / HTTP/1.1
> Host: 10.169.72.9:32506
> User-Agent: curl/7.47.1
> Accept: */*
> 
< HTTP/1.1 200 OK

[root@bigip-cilium-2:Active:Disconnected] config # curl -v http://10.169.72.9:32506/
*   Trying 10.169.72.9...
* Connected to 10.169.72.9 (10.169.72.9) port 32506 (#0)
> GET / HTTP/1.1
> Host: 10.169.72.9:32506
> User-Agent: curl/7.47.1
> Accept: */*
> 
< HTTP/1.1 200 OK

With this PR's image, the NodePort test against all nodes failed:

[root@bigip-cilium-2:Active:Disconnected] config # curl -v http://10.169.72.17:32506/
*   Trying 10.169.72.17...
* connect to 10.169.72.17 port 32506 failed: Connection refused
* Failed to connect to 10.169.72.17 port 32506: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 10.169.72.17 port 32506: Connection refused
[root@bigip-cilium-2:Active:Disconnected] config # curl -v http://10.169.72.19:32506/
*   Trying 10.169.72.19...
* connect to 10.169.72.19 port 32506 failed: Connection refused
* Failed to connect to 10.169.72.19 port 32506: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 10.169.72.19 port 32506: Connection refused
[root@bigip-cilium-2:Active:Disconnected] config # curl -v http://10.169.72.9:32506/
*   Trying 10.169.72.9...
* connect to 10.169.72.9 port 32506 failed: Connection refused
* Failed to connect to 10.169.72.9 port 32506: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 10.169.72.9 port 32506: Connection refused

@vincentmli
Contributor Author

@joestringer It appears the cilium upstream master branch broke NodePort connectivity from an external client. I built an image from the upstream master branch without any changes, and NodePort service connectivity from the external client failed just like with this PR. Since this PR is based on the upstream master branch, it of course fails the same way. Should I file a bug report against the upstream master branch code?

@vincentmli
Contributor Author

@joestringer I rebased this PR on the v1.10.3 release code and rebuilt the image in my lab, and I do not see the NodePort connection issue from the external client. I assume the k8s-1.16-kernel-netnext (test-1.16-netnext) test pulls the cilium upstream master branch code with this PR applied and builds the image to test, which is why the NodePort connection failed?

@joestringer
Member

It appears the cilium upstream master branch broke NodePort connectivity from an external client. I built an image from the upstream master branch without any changes, and NodePort service connectivity from the external client failed just like with this PR. Since this PR is based on the upstream master branch, it of course fails the same way. Should I file a bug report against the upstream master branch code?

@vincentmli thanks for investigating. Definitely if you can reproduce this with master but 1.10.3 seems to work fine then it sounds like it's a regression, so it could help to file a dedicated issue to track & follow up on that.

If you're open to pursue that further, then you could even do a git bisect to pinpoint when the tree broke. For every commit on master, there should be a corresponding docker image quay.io/cilium/cilium-ci:. It should be possible to start bisecting from master back to some known good commit and test which commits are good vs. bad. Presumably the 'known good' commit would be where 'master' and 'v1.10' branches diverge. The test itself for git bisect to use could pull the specific docker image for that commit from Quay, replace the image, wait for cilium pods to be ready, run the check, and return whether it was successful or not.

@joestringer
Member

Regarding the netfilterCompatible mode, that is a new change recently introduced; adjusting your grep to git grep -i netfilter.*compat.* should help to find it. I know there is a filed CI issue #17060 already, so it would be good to try to understand whether it's the same error as that one, in which case it's clearly unrelated to this PR.

@joestringer
Member

I also see K8sPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath failing; there's a port allocation issue which sounds familiar. I think there may be a similar issue already filed, so it may be worth checking GH issues for that provided port is already allocated message.

@vincentmli
Contributor Author

I also see K8sPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath failing; there's a port allocation issue which sounds familiar. I think there may be a similar issue already filed, so it may be worth checking GH issues for that provided port is already allocated message.

Yes, similar to #13071.

@vincentmli
Contributor Author

It appears the cilium upstream master branch broke NodePort connectivity from an external client. I built an image from the upstream master branch without any changes, and NodePort service connectivity from the external client failed just like with this PR. Since this PR is based on the upstream master branch, it of course fails the same way. Should I file a bug report against the upstream master branch code?

@vincentmli thanks for investigating. Definitely if you can reproduce this with master but 1.10.3 seems to work fine then it sounds like it's a regression, so it could help to file a dedicated issue to track & follow up on that.

@joestringer I filed #17192 to track the issue. The cilium_ipcache map is not synced between nodes when a new pod is created, resulting in a cilium_ipcache lookup miss and no encapsulation between nodes, which then hits a FIB lookup failure following the code logic flow.

If you're open to pursue that further, then you could even do a git bisect to pinpoint when the tree broke. For every commit on master, there should be a corresponding docker image quay.io/cilium/cilium-ci:. It should be possible to start bisecting from master back to some known good commit and test which commits are good vs. bad. Presumably the 'known good' commit would be where 'master' and 'v1.10' branches diverge. The test itself for git bisect to use could pull the specific docker image for that commit from Quay, replace the image, wait for cilium pods to be ready, run the check, and return whether it was successful or not.

I think the change might be in the ipcache and identity area. I will try a bisect when I get time :)

@vincentmli
Contributor Author

@joestringer
git bisect turned out to be quick enough to find the commit for #17192. I started the bisect with the terms fixed (v1.10.3) and unfixed (master), then built a docker image in my lab at each step to test; it took about 8 steps to find the commit.

@vincentmli vincentmli marked this pull request as ready for review August 23, 2021 15:31
@vincentmli vincentmli requested review from a team and kkourt August 23, 2021 15:31
@vincentmli
Contributor Author

@kkourt @joestringer
Since the two failed tests do not seem related to this PR, I think this PR is ready for review. Do you have any comments on it? Thanks!

Contributor

@kkourt kkourt left a comment

Thanks for the PR!

I added some comments that relate to avoiding code duplication.

The intention of the PR makes sense to me assuming we would want to add information to the value parts in the future.

I'm somewhat confused as to why we used the same structure for the key and the value in the first place. I know this is not something introduced in this PR, but it might be a good opportunity to discuss and improve or at least document.

For example, the ->key part in endpoint_key does not seem to be used anywhere.

Using the following patch:

$ git diff
diff --git a/bpf/lib/common.h b/bpf/lib/common.h
index 91471e5fcf..293dac7a3e 100644
--- a/bpf/lib/common.h
+++ b/bpf/lib/common.h
@@ -227,8 +227,8 @@ struct endpoint_key {
                union v6addr    ip6;
        };
        __u8 family;
-       __u8 key;
        __u16 pad5;
+       __u8 pad6;
 } __packed;

And make build_all in bpf produces no failures.

@@ -17,6 +19,9 @@ const (

// MaxEntries is the maximum entries in the tunnel endpoint map
MaxEntries = 65536

TunnelEndpointIPv4 uint8 = 1
TunnelEndpointIPv6 uint8 = 2
Contributor

Is there an expectation that these values should match:

// Must be in sync with ENDPOINT_KEY_* in <bpf/lib/common.h>
const (
EndpointKeyIPv4 uint8 = 1
EndpointKeyIPv6 uint8 = 2
)
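
For reference, these are the C-side values that the comment in the snippet above says the Go constants must stay in sync with (shown here for convenience; the values are inferred from the Go constants, so treat the exact form of the definition as an assumption):

/* bpf/lib/common.h (values inferred from the Go constants above). */
#define ENDPOINT_KEY_IPV4 1
#define ENDPOINT_KEY_IPV6 2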

Contributor Author

The key in endpoint_key was originally used for IPsec encryption; see:

https://github.com/cilium/cilium/blob/master/bpf/lib/encap.h#L189
https://github.com/cilium/cilium/blob/master/bpf/lib/encap.h#L215

                                                                                 
        tunnel = map_lookup_elem(&TUNNEL_MAP, key);                              
        if (!tunnel)                                                             
                return DROP_NO_TUNNEL_ENDPOINT;                                  
                                                                                 
#ifdef ENABLE_IPSEC                                                              
        if (tunnel->key) {                                                       
                __u8 min_encrypt_key = get_min_encrypt_key(tunnel->key);         
                                                                                 
                return encap_and_redirect_ipsec(ctx, tunnel->ip4,                
                                                min_encrypt_key,                 
                                                seclabel);                       
        }                                                                        
#endif                                                                     

Since we now split the tunnel map key and value structures, we can remove the key from the endpoint_key structure and keep it in the tunnel_endpoint_info value structure. What do you think?

Yes, I can remove TunnelEndpointIPv4/TunnelEndpointIPv6 and just use EndpointKeyIPv4/EndpointKeyIPv6; they are the same.

Contributor Author

Ah, I guess your original suggestion is to remove key from endpoint_key since it is not used anymore due to this PR. Yes, I will remove it.

Contributor

Ah, I guess your original suggestion is to remove key from endpoint_key since it is not used anymore due to this PR. Yes, I will remove it.

I was just curious. Not sure if we should remove it, but I think we should at least try to understand the code because now it is not clear to me what the purpose of ->key is.

Contributor Author

Ah, I guess your original suggestion is to remove key from endpoint_key since it is not used anymore due to this PR. Yes, I will remove it.

I was just curious. Not sure if we should remove it, but I think we should at least try to understand the code because now it is not clear to me what the purpose of ->key is.

Oh, I see. ->key is used for IPsec encryption, see
https://github.com/cilium/cilium/blob/master/bpf/lib/encap.h#L189-L223
We can remove ->key from endpoint_key and keep it in tunnel_endpoint_info, since ->key is required for IPsec. Agree?

Contributor Author

I think we just need to find all of the locations where this structure is used as a lookup key into the map, and those locations should tell us which fields are being populated in order to look up the values into the map; anything that is not being set there can be removed from the key version of the structure. Everything else can remain in the value.

endpoint_key's ->key is only used in the EndpointKey String() method; it is not used anywhere else as part of a map lookup:

func (k EndpointKey) String() string {
	if ip := k.ToIP(); ip != nil {
		return net.JoinHostPort(ip.String(), fmt.Sprintf("%d", k.Key))
	}
	return "nil"
}

I noticed the k.Key usage above because I got a compile error about it after removing ->key from endpoint_key; I would have gotten compile errors in other places if ->key were used. Also, the return net.JoinHostPort(ip.String(), fmt.Sprintf("%d", k.Key)) above is kind of strange to me: net.JoinHostPort() is supposed to join a host and a port, and using k.Key as the port does not make much sense, but I don't know the history of this code and why it was done that way. I actually tried changing k.Key to k.Family after removing ->key and tested it in the lab, and there seems to be no issue, so I have two options:

1. Remove ->key and replace the k.Key above with k.Family.
2. Keep ->key in endpoint_key.

I prefer 2) unless someone knows the history of the code above and why k.Key is used as the port in the string generation; maybe k.Key has no particular meaning here.

Contributor Author

I found the following two commits:

commit 070214f089e78bf047ce671bd15d762174654ab6
Author: Tobias Klauser <tklauser@distanz.ch>
Date:   Wed Aug 26 17:25:02 2020 +0200

    Use net.JoinHostPort to construct network address strings

and

commit b6989723a7cc0ebc07345d78d28b63bf3825f7ba
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Tue Mar 12 10:18:40 2019 -0700

    cilium: ipsec, support rolling updates
    

@jrfastab @tobiaskohlbau
Is it safe to replace k.Key with k.Family here? We are considering removing ->key from the endpoint_key structure, since we are separating the tunnel map key and value structures.

Contributor

Most likely this was a false ping and you would like to ping @tklauser.

Contributor Author

@jrfastab I went through the code change in b6989723a7, and I believe it is safe to remove Key from the endpoint_key struct after this tunnel map key and value structure separation. @tklauser I also believe it is safe to replace k.Key with k.Family after reading 070214f0; please let me know otherwise. @joestringer @kkourt I will send a diff with Key removed from endpoint_key.

Contributor Author

@joestringer @kkourt I updated the PR with your comments addressed. Please let me know if I missed anything; if it looks OK, Joe, could you trigger the CI again to test?

pkg/maps/tunnel/tunnel.go (resolved)
vincentmli added a commit to f5devcentral/cilium that referenced this pull request Oct 25, 2021
In enterprise on-prem or hybrid cloud datacenters, traditional
load balancer devices like BIG-IP are used for north-south
load balancing to Kubernetes cluster pods, either through routing
mode or VXLAN tunnel mode. These VXLAN devices usually have a VNI
key based implementation (https://datatracker.ietf.org/doc/html/rfc7348).
Cilium's VXLAN implementation does not use a VNI key and VTEP
IP/MAC mapping to direct tunnel traffic. This feature enables
Cilium VXLAN Tunnel Endpoint (VTEP) integration. It adds the Cilium
agent option EnableVTEP to enable the feature, which is disabled by
default. The feature supports both Cilium tunnel and route mode.

1 In VXLAN tunnel mode, egress packets from a Cilium-managed pod
use the host namespace side MAC address as the destination MAC
address before encapsulation. When the egress packet arrives at
the VTEP device, that MAC address does not match the VTEP MAC
address of the VTEP device and the packet is dropped. Thus we need
to rewrite the inner packet destination MAC address to the remote
VTEP MAC address.

2 Cilium VXLAN uses the pod identity as the VXLAN tunnel key, which
does not match the pre-configured VTEP device VNI key.

This patch addresses the above two points. In the cilium#17106
discussion, we decided to pre-populate the IPCache map with the
VTEP devices' CIDR, VNI, MAC, and IP. When packets egress to VTEP
devices, the pre-populated VTEP device entry in the IPCache map is
used to encapsulate the packet.

One issue was observed when using eth_store_daddr() from
bpf/lib/eth.h to rewrite the inner packet destination MAC address:
it failed to pass the BPF verifier check with
"R1 invalid mem access 'inv'" (see the issue for full detail);
both issues are resolved by initializing vtep_mac to 0
(a sketch of this pattern follows the commit message below).

Example to enable this feature in the cilium-config ConfigMap:

enable-vtep: "true"
vtep-endpoint: "10.169.72.14    10.169.72.15"
vtep-cidr: "10.1.99.0/24        10.1.88.0/24"
vtep-mac: "52:54:00:3e:3f:c1   52:54:00:4e:01:a6"

VTEP devices must use the Cilium reserved world identity "2" as the VNI.

Suggested-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Vincent Li <v.li@f5.com>
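
For context, here is a minimal sketch of the zero-initialization pattern described in the commit message above. The helper name vtep_rewrite_dmac, its parameters, and the surrounding logic are illustrative assumptions rather than the actual patch; eth_store_daddr() and union macaddr are the existing helpers from Cilium's bpf/lib headers.

#include "lib/common.h"
#include "lib/eth.h"

/* Sketch only: rewrite the inner packet's destination MAC to the remote
 * VTEP MAC. The name and shape of this helper are hypothetical. */
static __always_inline int vtep_rewrite_dmac(struct __ctx_buff *ctx,
					     const __u8 *remote_vtep_mac)
{
	/* Zero-initialize the on-stack buffer first; without this the
	 * verifier rejected the program with "R1 invalid mem access 'inv'". */
	union macaddr vtep_mac = {};

	memcpy(vtep_mac.addr, remote_vtep_mac, sizeof(vtep_mac.addr));

	/* The destination MAC is the first 6 bytes of the Ethernet header,
	 * so write it at offset 0. */
	if (eth_store_daddr(ctx, vtep_mac.addr, 0) < 0)
		return DROP_WRITE_ERROR;

	return CTX_ACT_OK;
}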
vincentmli added a commit to f5devcentral/cilium that referenced this pull request Feb 7, 2022
joamaki pushed a commit that referenced this pull request Feb 8, 2022
@vincentmli vincentmli closed this Feb 9, 2022
Labels
release-note/misc This PR makes changes that have no direct user impact.