
CI: external workloads workflow consistently fails in "Verify DNS on VM" step #2070

Closed
tklauser opened this issue Oct 27, 2023 · 4 comments · Fixed by #2079
@tklauser (Member)

The external workloads workflow seems to be failing consistently:

https://github.com/cilium/cilium-cli/actions/runs/6622170504

It looks like nslookup is timing out trying to reach the DNS server:

;; connection timed out; no servers could be reached

This seems to have started around 2023-10-21.
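
For context, the failing step amounts to resolving an in-cluster name from the external workload VM, roughly along the lines of the sketch below (the exact name queried by the CI step is an assumption here):

$ nslookup clustermesh-apiserver.kube-system.svc.cluster.local
;; connection timed out; no servers could be reached

On a healthy setup the lookup returns the in-cluster service IP instead of timing out.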

tklauser added the area/CI and ci/flake labels on Oct 27, 2023
@nbusseneau (Member)

External workloads on the main Cilium repo are working; perhaps something was changed there that should also have been changed in the CLI but wasn't?

michi-covalent added a commit that referenced this issue Nov 1, 2023
This reverts commit 0245001.

It looks like the external workloads test started failing after
upgrading Cilium to v1.14.3. Let's revert while we investigate. It's
unclear to me why #2057 didn't fail though.

Ref: #2070

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
@michi-covalent (Contributor)

This might be relevant. From the sysdump in https://github.com/cilium/cilium-cli/actions/runs/6725580157 (cilium-sysdump-out.zip.zip):

% grep 'level=error' ./logs-clustermesh-apiserver-694765f445-9z5d6-apiserver-20231101-221630.log
2023-11-01T22:13:59.507218421Z level=error msg="CEW: Invalid identity 1 in &{{cilium-cilium-cli-6725580157-classic-vm1 cilium-cilium-cli-6725580157-classic-vm1 [{InternalIP 10.168.0.5} {InternalIP fc00::10ca:1} {CiliumInternalIP 10.192.1.2} {CiliumInternalIP f00d::a05:0:0:9f6f}] 10.192.1.0/30 [] f00d::a05:0:0:0/96 [] <nil> <nil> <nil> <nil> 0 local 0 map[io.cilium.k8s.policy.cluster:cilium-cilium-cli-6725580157-classic-vm1 io.kubernetes.pod.name:cilium-cilium-cli-6725580157-classic-vm1 io.kubernetes.pod.namespace:default] map[] 1 }}" subsys=clustermesh-apiserver
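
On a live cluster rather than a sysdump, the same errors should be visible with something along these lines (deployment and container names assumed from the sysdump filename above):

% kubectl -n kube-system logs deploy/clustermesh-apiserver -c apiserver | grep 'level=error'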

tklauser pushed a commit that referenced this issue Nov 2, 2023
This reverts commit 0245001.

It looks like the external workloads test started failing after
upgrading Cilium to v1.14.3. Let's revert while we investigate. It's
unclear to me why #2057 didn't fail though.

Ref: #2070

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
@giorio94 (Member) commented Nov 2, 2023

> This might be relevant. From the sysdump in https://github.com/cilium/cilium-cli/actions/runs/6725580157 (cilium-sysdump-out.zip.zip):
>
> % grep 'level=error' ./logs-clustermesh-apiserver-694765f445-9z5d6-apiserver-20231101-221630.log
> 2023-11-01T22:13:59.507218421Z level=error msg="CEW: Invalid identity 1 in &{{cilium-cilium-cli-6725580157-classic-vm1 cilium-cilium-cli-6725580157-classic-vm1 [{InternalIP 10.168.0.5} {InternalIP fc00::10ca:1} {CiliumInternalIP 10.192.1.2} {CiliumInternalIP f00d::a05:0:0:9f6f}] 10.192.1.0/30 [] f00d::a05:0:0:0/96 [] <nil> <nil> <nil> <nil> 0 local 0 map[io.cilium.k8s.policy.cluster:cilium-cilium-cli-6725580157-classic-vm1 io.kubernetes.pod.name:cilium-cilium-cli-6725580157-classic-vm1 io.kubernetes.pod.namespace:default] map[] 1 }}" subsys=clustermesh-apiserver

I did a few tests, and this error message is also reported with earlier versions (both v1.13.x and v1.14.2), so I'd say it is not strictly related. That said, the cause is [1] (which overrides the identity created by the clustermesh-apiserver with the default identity used for nodes), but I'm not 100% sure whether it is a real bug or not (I guess it might impact policy enforcement towards external workloads).

But I'm still confused why we are seeing this error only in the cilium/cilium-cli workflows. I've tried comparing the agent parameters with those on cilium/cilium, and there seems to be no difference. I've also tried reproducing this locally, without any luck.

[1]: https://github.com/cilium/cilium/blob/4f08f6cc7029789cd6aae9f6d4e70ec9613c8442/pkg/nodediscovery/nodediscovery.go#L195-L218

@giorio94 (Member) commented Nov 2, 2023

Had a fresh look again, and this looks incredibly suspicious.

v1.14.2:

$ grep routing-mode cilium-configmap-* 
routing-mode: tunnel

v1.14.3:

$ grep routing-mode cilium-configmap-* 
routing-mode: native

The behavioral change was introduced in cilium/cilium@381e4ec15334.

$ helm template cilium install/kubernetes/cilium -n kube-system -s templates/cilium-configmap.yaml --set gke.enabled=true --set routingMode=tunnel | grep routing-mode

# Before:
  routing-mode: "native"
# After:
  routing-mode: "native"
  routing-mode: "tunnel"

Still unclear to me why it is not happening in cilium/cilium though.

Edit: this is the likely reason why we didn't hit it in cilium/cilium:

CILIUM_INSTALL_DEFAULTS="--cluster-name=${{ env.clusterName }} \
            --datapath-mode=tunnel \ <----
            --chart-directory=install/kubernetes/cilium \
            --helm-set=image.repository=quay.io/${{ env.QUAY_ORGANIZATION_DEV }}/cilium-ci \
            ...
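
So the analogous fix on the cilium-cli side is to force tunnel mode via the same flag when installing Cilium for the external workloads test, roughly (a sketch; any other install flags the workflow passes are omitted):

cilium install --datapath-mode=tunnel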

michi-covalent added a commit that referenced this issue Nov 18, 2023
Ref: #2070

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
michi-covalent added a commit that referenced this issue Nov 18, 2023
Ref: #2070

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
michi-covalent pushed a commit that referenced this issue Nov 18, 2023
cilium/cilium#27841 changed how the routing mode gets set for GKE, and
now it always gets set to "native". Use --datapath-mode flag to force
the tunnel mode for the external workload test since that's the only
configuration that's known to work [^1].

Fixes: #2070

[^1]: https://docs.cilium.io/en/latest/network/external-workloads/

Signed-off-by: renovate[bot] <bot@renovateapp.com>
Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
michi-covalent self-assigned this on Nov 19, 2023
michi-covalent pushed a commit that referenced this issue Nov 20, 2023
cilium/cilium#27841 changed how the routing mode gets set for GKE, and
now it always gets set to "native". Use --datapath-mode flag to force
the tunnel mode for the external workload test since that's the only
configuration that's known to work [^1].

Fixes: #2070

[^1]: https://docs.cilium.io/en/latest/network/external-workloads/

Signed-off-by: renovate[bot] <bot@renovateapp.com>
Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>