CI: K8sChaosTest Connectivity demo application Endpoint can still connect while Cilium is not running #13552

Closed
pchaigno opened this issue Oct 14, 2020 · 11 comments
Labels: area/CI, area/proxy, ci/flake, stale

pchaigno (Member) commented Oct 14, 2020

Stacktrace

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Kernel/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
cilium pre-flight checks failed
Expected
    <*errors.errorString | 0xc000523410>: {
        s: "Cilium validation failed: 4m0s timeout expired: Last polled error: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-v8b2c': Exitcode: 255 \nErr: exit status 255\nStdout:\n \t \nStderr:\n \t Error: Cannot get status/probe: Put \"http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe\": context deadline exceeded\n\t \n\t command terminated with exit code 255\n\t \n",
    }
to be nil
/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Kernel/src/github.com/cilium/cilium/test/k8sT/assertionHelpers.go:107
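
For context on the failure mode: the error above is the CLI reporting that an HTTP PUT to `/v1beta/status/probe` on the agent's health unix socket (`/var/run/cilium/health.sock`) did not answer before its deadline. Below is a minimal sketch of issuing the same probe with a timeout, just to show where "context deadline exceeded" comes from; this is not Cilium's health client, only the socket path and URL path are taken from the error message, and the 5s timeout is arbitrary:

```go
// Minimal sketch, not Cilium's health client: send the same probe request
// over the health unix socket with a deadline. If the endpoint hangs, the
// request fails with "context deadline exceeded", as in the error above.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	const sock = "/var/run/cilium/health.sock" // path taken from the error message

	client := &http.Client{
		Transport: &http.Transport{
			// Route every request to the unix socket instead of TCP.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", sock)
			},
		},
	}

	// Arbitrary 5s deadline for illustration; the CI check polls for 4m0s overall.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The host part of the URL is ignored because the dialer above overrides it.
	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		"http://localhost/v1beta/status/probe", nil)
	if err != nil {
		panic(err)
	}

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("probe failed:", err) // e.g. "... context deadline exceeded"
		return
	}
	defer resp.Body.Close()
	fmt.Println("probe status:", resp.Status)
}
```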

Standard Output

Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
⚠️  Number of "context deadline exceeded" in logs: 16
Number of "level=error" in logs: 0
⚠️  Number of "level=warning" in logs: 6
Number of "Cilium API handler panicked" in logs: 0
⚠️  Number of "Goroutine took lock for more than" in logs: 7
Top 3 errors/warnings:
Session affinity for host reachable services needs kernel 5.7.0 or newer to work properly when accessed from inside cluster: the same service endpoint will be selected from all network namespaces on the host.
BPF bandwidth manager needs kernel 5.0 or newer. Disabling the feature.
Unable to update ipcache map entry on pod add
Cilium pods: [cilium-htxg2 cilium-v8b2c]
Netpols loaded: 
CiliumNetworkPolicies loaded: 
Endpoint Policy Enforcement:
Pod                           Ingress   Egress
grafana-54dbdc987-hgv4n                 
prometheus-6ff848df8b-5klz7             
coredns-7964865f77-t6r8z                
Cilium agent 'cilium-htxg2': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 17 Failed 0
Cilium agent 'cilium-v8b2c': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 21 Failed 0

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Kernel/3445/testReport/junit/Suite-k8s-1/18/K8sChaosTest_Connectivity_demo_application_Endpoint_can_still_connect_while_Cilium_is_not_running/
0892951b_K8sChaosTest_Connectivity_demo_application_Endpoint_can_still_connect_while_Cilium_is_not_running.zip

This test failing then caused two subsequent tests to fail with "failed due to BeforeAll failure":

Suite-k8s-1.18.K8sChaosTest Restart with long lived connections TCP connection is not dropped when cilium restarts
Suite-k8s-1.18.K8sChaosTest Restart with long lived connections L3/L4 policies still work while Cilium is restarted
pchaigno added the area/CI and ci/flake labels Oct 14, 2020
pchaigno added this to To Do (1.8, 1.9 - Rare Flakes) in CI Force Oct 14, 2020
tklauser (Member) commented Nov 9, 2020

Hit during "K8sPolicyTest Multi-node policy test validates fromEntities policies with remote-node identity disabled Allows from all hosts with cnp fromEntities host policy" in the 1.7 backport #13950

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-K8s/3682/

pchaigno (Member, Author) commented Feb 3, 2021

Happened again in #14797:
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.19/110/testReport/junit/Suite-k8s-1/17/K8sChaosTest_Connectivity_demo_application_Endpoint_can_still_connect_while_Cilium_is_not_running/
198f744c_K8sChaosTest_Connectivity_demo_application_Endpoint_can_still_connect_while_Cilium_is_not_running.zip

The logs for the CrashLoopBackOff cilium-agent pod contain this fatal error:

2021-02-03T11:07:01.202509486Z level=fatal msg="Error while creating daemon" error="listen tcp :45113: bind: address already in use" subsys=daemon

Looks like an issue with the DNS proxy. /cc @jrajahalme
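
For reference on the fatal itself: "listen tcp :45113: bind: address already in use" is the error Go's net package returns when the requested TCP port is still held, for example by a listener left over from the previous agent/proxy instance. A minimal sketch that reproduces the error and shows how it can be detected; this is illustrative only, not the agent's proxy code, and port 45113 is simply copied from the log line above:

```go
// Minimal sketch, not the agent's proxy code: bind the same TCP port twice to
// reproduce "bind: address already in use" and show how it can be detected.
// Port 45113 is copied from the fatal log line above.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

func main() {
	// First listener takes the port, standing in for a stale listener that
	// survived from the previous agent/proxy instance.
	l1, err := net.Listen("tcp", ":45113")
	if err != nil {
		fmt.Println("first bind failed:", err)
		return
	}
	defer l1.Close()

	// Second bind fails exactly the way the restarting agent does.
	if _, err := net.Listen("tcp", ":45113"); errors.Is(err, syscall.EADDRINUSE) {
		fmt.Println("second bind failed:", err) // "... bind: address already in use"
	} else {
		fmt.Println("unexpected result:", err)
	}
}
```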

pchaigno added the area/proxy label Feb 3, 2021
stale bot commented Jun 23, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Jun 23, 2021
stale bot commented Jul 11, 2021

This issue has not seen any activity since it was marked stale. Closing.

stale bot closed this as completed Jul 11, 2021
CI Force automation moved this from To Do (1.8, 1.9 - Rare Flakes) to Fixed / Done Jul 11, 2021
CI Force automation moved this from Fixed / Done to In Progress (Cilium) Oct 13, 2021
errordeveloper (Contributor) commented:

Re-opening since it reoccurred in #17567.

pchaigno (Member, Author) commented:

> Re-opening since it reoccurred in #17567.
>
> * [job](https://jenkins.cilium.io/job/Cilium-PR-K8s-1.17-kernel-4.9/373/)

@errordeveloper The "K8sChaosTest Connectivity demo application Endpoint can still connect while Cilium is not running" test passed in this Jenkins job (see console at 14:41:03). Did you mean to link to something else?

errordeveloper (Contributor) commented:

@pchaigno I was looking at this:

15:43:34  • Failure in Spec Setup (BeforeEach) [150.139 seconds]
15:43:34  K8sChaosTest
15:43:34  /home/jenkins/workspace/Cilium-PR-K8s-1.17-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:478
15:43:34    Restart with long lived connections
15:43:34    /home/jenkins/workspace/Cilium-PR-K8s-1.17-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:478
15:43:34      TCP connection is not dropped when cilium restarts [BeforeEach]
15:43:34      /home/jenkins/workspace/Cilium-PR-K8s-1.17-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:514
15:43:34  
15:43:34      Netperf cannot be deployed
[2021-10-12T14:43:34.716Z]     Expected command: kubectl apply --force=false -f /home/jenkins/workspace/Cilium-PR-K8s-1.17-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/manifests/netperf-deployment.yaml 
[2021-10-12T14:43:34.716Z]     To succeed, but it failed:
[2021-10-12T14:43:34.716Z]     Exitcode: -1 
[2021-10-12T14:43:34.716Z]     Err: signal: killed
[2021-10-12T14:43:34.716Z]     Stdout:
[2021-10-12T14:43:34.716Z]      	 pod/netperf-server created
[2021-10-12T14:43:34.716Z]     	 pod/netperf-client created
[2021-10-12T14:43:34.716Z]     	 
[2021-10-12T14:43:34.716Z]     Stderr:
[2021-10-12T14:43:34.716Z]      	 
[2021-10-12T14:43:34.716Z]     
15:43:34  
15:43:34      /home/jenkins/workspace/Cilium-PR-K8s-1.17-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/Chaos.go:187
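
One note on reading that output: "Exitcode: -1" together with "Err: signal: killed" is how Go's `os/exec` reports a command that was killed by a signal; one common cause is a command timeout in the test harness killing the process. A minimal sketch reproducing that exact reporting; this is an assumed setup, not the actual test helper, with `sleep 30` standing in for the slow `kubectl apply` and an arbitrary 1s timeout:

```go
// Minimal sketch, not the actual test helper: a command killed because it
// outlived its context deadline is reported by os/exec with exit code -1 and
// the error "signal: killed", matching the Ginkgo output above.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	// `sleep 30` stands in for the slow `kubectl apply`.
	cmd := exec.CommandContext(ctx, "sleep", "30")
	err := cmd.Run() // the process is killed (SIGKILL) once the deadline passes

	// ExitCode() is -1 when the process was terminated by a signal.
	fmt.Println("Exitcode:", cmd.ProcessState.ExitCode())
	fmt.Println("Err:", err) // "signal: killed"
}
```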

I assumed it's to do with this issue, as 'TCP connection is not dropped when cilium restarts' is mentioned above... should this be a separate issue?

pchaigno (Member, Author) commented:

> should this be a separate issue?

I think so. Neither the test name nor the error message matches the present flake report.

pchaigno removed the stale label Nov 5, 2021
github-actions bot commented Jul 9, 2022

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label Jul 9, 2022
pchaigno closed this as completed Jul 9, 2022
CI Force automation moved this from In Progress (Cilium) to Fixed / Done Jul 9, 2022