Separate tunnel map key and value structure #17106
Conversation
Force-pushed from 428040f to c249597
test-me-please
Hi Joe, do you have any comments on this PR? Thanks!
@vincentmli I didn't look at the details yet, but I've triggered the CI for you. If the tests are passing then I think it would make sense to click the "ready for review" button at the bottom of the page, and then GitHub will assign reviewers to take a look at the PR.
@vincentmli Also, please take a look at the checkboxes and the release note section in the PR description - this is for you to fill out when you submit the PR.
@joestringer thanks, I read the release note section. Where or how do I add the release note label when I create a PR? I am new to this :)
@vincentmli Either replace the text "
@joestringer The ci-eks and ci-gke conformance tests failed; based on the error, it seems unrelated to my PR, and I see other PRs hitting the same error. Also, the k8s-kernel test seems to have not started.
test-me-please
@joestringer The k8s-1.16-kernel-netnext (test-1.16-netnext) test failed. I looked at the detail here: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1289/testReport/junit/Suite-k8s-1/16/K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_vxlan_Test_NodePort_with_netfilterCompatMode_true/. It appears the test exercises a NodePort service across nodes with vxlan and netfilterCompatMode set to true. I tried to do a similar test locally on my dev machine: I used netfilter-compatible-mode: "true" in the cilium-config ConfigMap, and also tried helm with helm template cilium cilium/cilium --version 1.10.3 --namespace kube-system --set kubeProxyReplacement=strict --set netfilterCompatMode=true > cilium-netfilter.yaml. What's odd is that I can't see netfilter-compatible-mode enabled when the cilium agent starts up, and there is no netfilter-compatible-mode in the cilium agent log, but I do see netfilter-compatible-mode in the cilium-test.log from https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1289/testReport/junit/Suite-k8s-1/16/K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_vxlan_Test_NodePort_with_netfilterCompatMode_true/attachments/1ef46005_K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_vxlan_Test_NodePort_with_netfilterCompatMode%3Dtrue.zip
I even grep-searched the cilium source code and could not find anything matching "netfilterCompatMode". I have no connection issue on my local dev machine with cross-node NodePort service access from a client pod with hostNetwork set to true in the host namespace. I wonder whether the successful k8s-1.19-kernel-5.4 (test-1.19-5.4) run includes the --netfilter-compatible-mode='true' test. Do you have a clue?
My dev machine's cilium source was not synced with the upstream master branch and was missing the "netfilterCompatMode" flag, my bad :(. I will sync my dev machine with upstream master and re-test locally.
I think I can reproduce the NodePort connection issue with my PR on my local dev machine, so there is a functional impact with this PR, not sure why; I will dig into it with the upstream cilium v1.10.3 image and a NodePort service:
Connection to the node that hosts the nginx pod failed, strange.
Connection to the other two nodes that do not host the nginx pod succeeded!
With this PR's image, the test against every node's NodePort failed.
@joestringer It appears the cilium upstream master branch broke NodePort network connectivity from an external client. I built an image from the cilium upstream master branch without any changes, and NodePort service connectivity from an external client failed just like in this PR. Since this PR is based on the cilium upstream master branch, it of course failed the same way. Should I file a bug report for the cilium upstream master branch code?
@joestringer I rebased this PR on the v1.10.3 release code and rebuilt the image in the lab; I do not have the NodePort connection issue from an external client. I assume the k8s-1.16-kernel-netnext (test-1.16-netnext) test pulls the cilium upstream master branch code with this PR and builds the image to test, thus the NodePort connection failed?
@vincentmli thanks for investigating. Definitely, if you can reproduce this with master but 1.10.3 seems to work fine, then it sounds like a regression, so it could help to file a dedicated issue to track and follow up on that. If you're open to pursuing that further, you could even do a git bisect to pinpoint when the tree broke. For every commit on master, there should be a corresponding docker image quay.io/cilium/cilium-ci:<commit>. It should be possible to start bisecting from master back to some known good commit and test which commits are good vs. bad. Presumably the 'known good' commit would be where the 'master' and 'v1.10' branches diverge. The test itself for git bisect to use could pull the specific docker image for that commit from Quay, replace the image, wait for cilium pods to be ready, run the check, and return whether it was successful or not.
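To make that outline concrete, here is a rough, hypothetical sketch (not code from this thread) of the check that git bisect run could invoke. It swaps in the per-commit CI image, waits for the cilium pods to be ready, then probes the NodePort from outside. The NODE_ADDR/NODE_PORT environment variables and the "cilium-agent" container name are assumptions of the sketch.

// bisect_check.go: exit 0 = good commit, 1 = bad, 125 = skip (per git-bisect(1)).
package main

import (
	"net/http"
	"os"
	"os/exec"
	"strings"
	"time"
)

func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Derive the CI image tag for the commit that bisect checked out.
	out, err := exec.Command("git", "rev-parse", "HEAD").Output()
	if err != nil {
		os.Exit(125) // can't even resolve the commit: ask bisect to skip
	}
	image := "quay.io/cilium/cilium-ci:" + strings.TrimSpace(string(out))

	// Swap the agent image and wait for the DaemonSet rollout to finish.
	if run("kubectl", "-n", "kube-system", "set", "image",
		"daemonset/cilium", "cilium-agent="+image) != nil {
		os.Exit(125)
	}
	if run("kubectl", "-n", "kube-system", "rollout", "status",
		"daemonset/cilium", "--timeout=5m") != nil {
		os.Exit(125)
	}

	// The actual check: can an external client reach the NodePort?
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("http://" + os.Getenv("NODE_ADDR") + ":" + os.Getenv("NODE_PORT"))
	if err != nil {
		os.Exit(1) // bad commit: NodePort unreachable
	}
	resp.Body.Close()
	os.Exit(0) // good commit: NodePort reachable
}

It could then be driven with something like: git bisect start master <known-good>, followed by git bisect run go run ./bisect_check.go.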
Regarding the netfilterCompatMode, that is a new change recently introduced; adjusting your grep to
I also see K8sPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath failing, where there's a port allocation issue that sounds familiar. I think there may be a similar issue already filed, maybe worth checking for that.
Yes, similar to #13071.
@joestringer I filed #17192 to track the issue: the cilium_ipcache map is not synced between nodes when a new pod is created, resulting in a cilium_ipcache lookup miss and no encapsulation between nodes, which then hits the FIB lookup failure following the code logic flow.
I think the change might be around the ipcache and identity area, I will try to bisect when I get time :)
@joestringer
@kkourt @joestringer
Thanks for the PR!
I added some comments that relate to avoiding duplicated code.
The intention of the PR makes sense to me, assuming we would want to add information to the value parts in the future.
I'm somewhat confused as to why we used the same structure for the key and the value in the first place. I know this is not something introduced in this PR, but it might be a good opportunity to discuss and improve, or at least document.
For example, the ->key part in endpoint_key does not seem to be used anywhere.
Using the following patch:
$ git diff
diff --git a/bpf/lib/common.h b/bpf/lib/common.h
index 91471e5fcf..293dac7a3e 100644
--- a/bpf/lib/common.h
+++ b/bpf/lib/common.h
@@ -227,8 +227,8 @@ struct endpoint_key {
union v6addr ip6;
};
__u8 family;
- __u8 key;
__u16 pad5;
+ __u8 pad6;
} __packed;
And make build_all in bpf/ produces no failures.
pkg/maps/tunnel/tunnel.go (outdated)
@@ -17,6 +19,9 @@ const (
 	// MaxEntries is the maximum entries in the tunnel endpoint map
 	MaxEntries = 65536
+
+	TunnelEndpointIPv4 uint8 = 1
+	TunnelEndpointIPv6 uint8 = 2
Is there an expectation that these values should match:

Lines 14 to 18 in 18513db:

// Must be in sync with ENDPOINT_KEY_* in <bpf/lib/common.h>
const (
	EndpointKeyIPv4 uint8 = 1
	EndpointKeyIPv6 uint8 = 2
)
The key in endpoint_key was originally used for IPSEC encryption:
https://github.com/cilium/cilium/blob/master/bpf/lib/encap.h#L189
https://github.com/cilium/cilium/blob/master/bpf/lib/encap.h#L215
tunnel = map_lookup_elem(&TUNNEL_MAP, key);
if (!tunnel)
return DROP_NO_TUNNEL_ENDPOINT;
#ifdef ENABLE_IPSEC
if (tunnel->key) {
__u8 min_encrypt_key = get_min_encrypt_key(tunnel->key);
return encap_and_redirect_ipsec(ctx, tunnel->ip4,
min_encrypt_key,
seclabel);
}
#endif
Since we now split the tunnel map key and value structures, we can remove the key from the endpoint_key structure and keep it in the value tunnel_endpoint_info structure, what do you think?

Yes, I can remove TunnelEndpointIPv4/TunnelEndpointIPv6 and just use EndpointKeyIPv4/EndpointKeyIPv6, they are the same.
Ah, I guess your original suggestion is to remove key from endpoint_key since it is not used anymore due to this PR; yes, I will remove it.
> Ah, I guess your original suggestion is to remove key from endpoint_key since it is not used anymore due to this PR; yes, I will remove it.

I was just curious. Not sure if we should remove it, but I think we should at least try to understand the code, because right now it is not clear to me what the purpose of ->key is.
> > Ah, I guess your original suggestion is to remove key from endpoint_key since it is not used anymore due to this PR; yes, I will remove it.
>
> I was just curious. Not sure if we should remove it, but I think we should at least try to understand the code, because right now it is not clear to me what the purpose of ->key is.

Oh, I see. ->key is used for IPSEC encryption, see https://github.com/cilium/cilium/blob/master/bpf/lib/encap.h#L189-L223. We can remove the ->key from endpoint_key and keep it in tunnel_endpoint_info, since ->key is required for IPSEC. Agree?
I think we just need to find all of the locations where this structure is used as a lookup key into the map, and those locations should tell us which fields are being populated in order to look up the values in the map; anything that is not being set there can be removed from the key version of the structure. Everything else can remain in the value.
endpoint_key's ->key is only used in EndpointKey's String(); it is used nowhere else as part of a map lookup:
func (k EndpointKey) String() string {
if ip := k.ToIP(); ip != nil {
return net.JoinHostPort(ip.String(), fmt.Sprintf("%d", k.Key))
}
return "nil"
}
I noticed the above k.Key usage because I had a compile error about it after removing the ->key from endpoint_key; I would have compile errors in other places if the ->key were used elsewhere. And the above return net.JoinHostPort(ip.String(), fmt.Sprintf("%d", k.Key)) is kind of strange to me: net.JoinHostPort() is supposed to join a host and a port, and using k.Key as the port does not make much sense to me, but I don't know the history of this code and why it is done that way. I actually tried to change k.Key to k.Family after removing ->key and tested in the lab, seemingly with no issue, so I have two options (see the sketch after this list for option 1):
1. remove the ->key and replace the above k.Key with k.Family
2. keep ->key in endpoint_key
I prefer 2) unless someone knows the history of the above code and why k.Key is used as the port in the string generation; maybe k.Key has no meaning here at all.
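For reference, a minimal sketch of option 1, i.e. what I tested in the lab, assuming the Key field is removed from EndpointKey and Family is formatted in its place:

// Option 1 sketch: format the address family where k.Key used to be.
// Whether anything depends on the old "<ip>:<key>" output is exactly
// the open question about this code's history.
func (k EndpointKey) String() string {
	if ip := k.ToIP(); ip != nil {
		return net.JoinHostPort(ip.String(), fmt.Sprintf("%d", k.Family))
	}
	return "nil"
}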
I found the following two commits:

commit 070214f089e78bf047ce671bd15d762174654ab6
Author: Tobias Klauser <tklauser@distanz.ch>
Date: Wed Aug 26 17:25:02 2020 +0200

    Use net.JoinHostPort to construct network address strings

and

commit b6989723a7cc0ebc07345d78d28b63bf3825f7ba
Author: John Fastabend <john.fastabend@gmail.com>
Date: Tue Mar 12 10:18:40 2019 -0700

    cilium: ipsec, support rolling updates

@jrfastab @tobiaskohlbau is it safe to replace k.Key with k.Family here? We are considering removing the ->key from the endpoint_key structure, since we are separating the tunnel map key and value structures.
Most likely this was a false ping and you would like to ping @tklauser.
@jrfastab I went through the code change in b6989723a7; I believe it is safe to remove the Key from the endpoint_key struct after this tunnel map key and value structure separation. @tklauser I also believe it is safe to replace k.Key with k.Family after reading 070214f0, please let me know otherwise. @joestringer @kkourt, I will send a diff with Key removed from endpoint_key.
@joestringer @kkourt I updated the PR with your comments addressed. Please let me know if I missed anything; if it looks OK, Joe, could you trigger the CI again to test?
In enterprise on-prem or hybrid cloud datacenters, traditional load balancer devices like BIG-IP are used for north-south load balancing to Kubernetes cluster pods, either through routing mode or VXLAN tunnel mode. These VXLAN devices usually have a VNI-key-based implementation (https://datatracker.ietf.org/doc/html/rfc7348). Cilium's VXLAN implementation does not use a VNI key and VTEP IP/MAC mapping to direct tunnel traffic. This feature enables Cilium VXLAN Tunnel Endpoint (VTEP) integration. Add the Cilium agent option EnableVTEP to enable this feature; it is disabled by default. This feature supports Cilium tunnel and route mode.

1. In VXLAN tunnel mode, the egress packets from a Cilium-managed pod use the host-namespace-side MAC address as the destination MAC address before encapsulation; when the egress packet arrives at the VTEP device, that MAC address does not match the VTEP MAC address of the VTEP device and the packet is dropped. Thus we need to rewrite the inner packet's destination MAC address to the remote VTEP MAC address.

2. Cilium VXLAN uses the pod identity as the VXLAN tunnel key, which does not match the pre-configured VTEP device VNI key.

This patch addresses the above two points. In the #17106 discussion, we decided to pre-populate the IPCache map with the VTEP devices' CIDR, VNI, MAC, and IP. When packets egress to VTEP devices, the pre-populated VTEP device entry in the IPCache map is used to encapsulate the packet. One issue was observed when using eth_store_daddr() from bpf/lib/eth.h to rewrite the inner packet destination MAC address: it failed to pass the BPF verifier check with "R1 invalid mem access 'inv'" (see full detail in the issue); together with a related kernel issue, both are resolved by initializing vtep_mac to 0.

Example to enable this feature in the cilium-config ConfigMap:

enable-vtep: "true"
vtep-endpoint: "10.169.72.14 10.169.72.15"
vtep-cidr: "10.1.99.0/24 10.1.88.0/24"
vtep-mac: "52:54:00:3e:3f:c1 52:54:00:4e:01:a6"

VTEP devices must use the Cilium reserved world id "2" as the VNI.

Suggested-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Vincent Li <v.li@f5.com>
The tunnel map uses the same endpoint_key structure for both key and value. In the future we may add tunnel map value fields for new features, like extending Cilium VXLAN tunnel connectivity to external VXLAN devices, with value fields such as VNI and VTEP MAC, etc. We also need to add a cilium bpf tunnel command for users to maintain the tunnel map, so separating the tunnel map key and value structures is a first step toward that goal.

Suggested-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Vincent Li <vincent.mc.li@gmail.com>
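As a rough illustration of the commit's goal (hypothetical names, not necessarily the PR's final code), the Go side in pkg/maps/tunnel could separate the two roles along these lines, so the value can grow without disturbing the key layout:

package tunnel

import "github.com/cilium/cilium/pkg/bpf"

// TunnelKey is the lookup key: only the endpoint address and family
// that the datapath consults when looking up a tunnel endpoint. This
// assumes EndpointKey itself no longer carries the IPSEC key field.
type TunnelKey struct {
	bpf.EndpointKey
}

// TunnelValue is the stored value: the remote tunnel endpoint today,
// plus value-only fields (the IPSEC key now; VNI or VTEP MAC later)
// that no longer need to exist on the key side.
type TunnelValue struct {
	bpf.EndpointKey       // remote tunnel endpoint IP and family
	Key             uint8 // IPSEC encryption key index, moved off the key
	Pad             uint8 // explicit padding to keep the struct size stable
}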