
[CI]: K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master #6307

Closed
aanm opened this issue Nov 27, 2018 · 12 comments · Fixed by #12869
Labels
area/CI (Continuous Integration testing issue or flake), needs/triage (This issue requires triaging to establish severity and next steps.), priority/medium (This is considered important, but not urgent.)
Milestone
1.4-bugfix

Comments

@aanm
Member

aanm commented Nov 27, 2018

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:383
Cannot curl app1-service
Expected command: kubectl exec -n default app2-85b74b9c79-ztpb6 -- curl -s -D /dev/stderr --fail --connect-timeout 3 --max-time 8 http://app1-service/public -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" 
To succeed, but it failed:
Exitcode: 28 
Stdout:
 	 time-> DNS: '0.000000()', Connect: '0.000000',Transfer '0.000000', total '3.516777'
Stderr:
 	 command terminated with exit code 28
	 

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/src/github.com/cilium/cilium/test/k8sT/Updates.go:216

e413417a_K8sUpdates_Tests_upgrade_and_downgrade_from_a_Cilium_stable_image_to_master.zip

2nd time:

da2b440e_K8sUpdates_Tests_upgrade_and_downgrade_from_a_Cilium_stable_image_to_master.zip

@aanm added the priority/medium, area/CI and needs/triage labels Nov 27, 2018
@aanm added this to the 1.4-bugfix milestone Nov 27, 2018
@aanm added this to Needs triage in CI Failures via automation Nov 27, 2018
@aanm added this to Proposed in 1.4 via automation Nov 27, 2018
@aanm moved this from Needs triage to Daily / Always in CI Failures Nov 29, 2018
@aanm
Member Author

aanm commented Nov 29, 2018

Happened here as well: #6331

@tgraf
Member

tgraf commented Jan 28, 2019

Dup of #6730

@tgraf closed this as completed Jan 28, 2019
1.4 automation moved this from Proposed to Done Jan 28, 2019
@raybejjani
Contributor

master: https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/2763
test_results_master_2763_BDD-Test-PR.zip

I've seen this one other time, and it may be distinct from the test failing overall:

23:54:23  cmd: kubectl get pods -o wide --all-namespaces
23:54:23  Exitcode: 0 
23:54:23  Stdout:
23:54:23   	 NAMESPACE     NAME                                    READY     STATUS     RESTARTS   AGE       IP              NODE
23:54:23  	 kube-system   cilium-etcd-b245mtxhtk                  1/1       Running    0          34m       10.10.0.76      k8s1
23:54:23  	 kube-system   cilium-etcd-hf74w4zsdd                  1/1       Unknown    0          34m       10.10.1.41      k8s2
23:54:23  	 kube-system   cilium-etcd-operator-77d4ddf8c6-5skxb   1/1       Unknown    0          36m       192.168.36.12   k8s2
23:54:23  	 kube-system   cilium-etcd-whh4wxpk6v                  1/1       Unknown    0          34m       10.10.1.77      k8s2
23:54:23  	 kube-system   cilium-v9rl6                            1/1       NodeLost   0          27m       192.168.36.12   k8s2
23:54:23  	 kube-system   cilium-z5kwm                            0/1       Running    6          26m       192.168.36.11   k8s1
23:54:23  	 kube-system   etcd-k8s1                               1/1       Running    0          41m       192.168.36.11   k8s1
23:54:23  	 kube-system   etcd-operator-65476dd78f-2c8zz          1/1       Unknown    0          35m       10.10.1.251     k8s2
23:54:23  	 kube-system   kube-apiserver-k8s1                     1/1       Running    0          41m       192.168.36.11   k8s1
23:54:23  	 kube-system   kube-controller-manager-k8s1            1/1       Running    0          41m       192.168.36.11   k8s1
23:54:23  	 kube-system   kube-dns-f4d788bb7-rw7xm                3/3       Unknown    0          42m       10.10.1.253     k8s2
23:54:23  	 kube-system   kube-proxy-6b7zk                        1/1       Running    0          42m       192.168.36.11   k8s1
23:54:23  	 kube-system   kube-proxy-7wr62                        1/1       NodeLost   0          36m       192.168.36.12   k8s2
23:54:23  	 kube-system   kube-scheduler-k8s1                     1/1       Running    0          41m       192.168.36.11   k8s1

The output above shows k8s2 as completely lost. I'm not sure what to make of this, and the only logs available are test-specific (nothing for cilium-v9rl6 or for the kubelet).

@raybejjani reopened this Apr 23, 2019
@jrajahalme
Member

jrajahalme commented May 9, 2019

Failed on https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/12211/execution/node/132/log/. Looks like the traffic test started even though some endpoints were NOT ready?

STEP: Performing Cilium preflight check
STEP: Validate that endpoints are ready before making any connection
STEP: Making L7 requests between endpoints
=== Test Finished at 2019-05-09T20:06:56Z====
===================== TEST FAILED =====================
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE     NAME                                    READY     STATUS    RESTARTS   AGE       IP              NODE
	 default       app1-6f5f7bd649-4m944                   0/1       Running   0          3m        10.10.0.115     k8s1
	 default       app1-6f5f7bd649-8pnr4                   0/1       Running   0          3m        10.10.0.56      k8s1
	 default       app2-5c44ff87c-t9gcs                    1/1       Running   0          3m        10.10.0.228     k8s1
	 default       app3-579cbb5fcd-q8b7v                   1/1       Running   0          3m        10.10.0.125     k8s1
	 default       migrate-svc-client-7zht2                1/1       Running   0          3m        10.10.1.177     k8s2
	 default       migrate-svc-client-dvn82                1/1       Running   0          3m        10.10.1.136     k8s2
	 default       migrate-svc-client-tshpn                1/1       Running   0          3m        10.10.1.61      k8s2
	 default       migrate-svc-client-v7d6j                1/1       Running   0          3m        10.10.0.84      k8s1
	 default       migrate-svc-client-zzthh                1/1       Running   0          3m        10.10.0.175     k8s1
	 default       migrate-svc-server-7c2zv                1/1       Running   0          3m        10.10.1.138     k8s2
	 default       migrate-svc-server-d76hw                1/1       Running   0          3m        10.10.0.63      k8s1
	 default       migrate-svc-server-rx2xw                1/1       Running   0          3m        10.10.1.84      k8s2
	 kube-system   cilium-cl4ws                            1/1       Running   0          49s       192.168.36.11   k8s1
	 kube-system   cilium-etcd-mnbb98rjbf                  1/1       Running   0          4m        10.10.0.200     k8s1
	 kube-system   cilium-etcd-operator-77d4ddf8c6-595r7   1/1       Running   0          4m        192.168.36.12   k8s2
	 kube-system   cilium-etcd-qcjkndw7mk                  1/1       Running   0          3m        10.10.1.243     k8s2
	 kube-system   cilium-etcd-vm2wzkklc7                  1/1       Running   0          4m        10.10.1.51      k8s2
	 kube-system   cilium-n8d6f                            1/1       Running   0          49s       192.168.36.12   k8s2
	 kube-system   cilium-operator-5fb956fc65-ppntk        1/1       Running   0          1m        10.10.0.217     k8s1
	 kube-system   etcd-k8s1                               1/1       Running   0          27m       192.168.36.11   k8s1
	 kube-system   etcd-operator-65476dd78f-s9lmd          1/1       Running   0          4m        10.10.1.35      k8s2
	 kube-system   kube-apiserver-k8s1                     1/1       Running   0          27m       192.168.36.11   k8s1
	 kube-system   kube-controller-manager-k8s1            1/1       Running   0          27m       192.168.36.11   k8s1
	 kube-system   kube-dns-f4d788bb7-xjjmg                3/3       Running   0          4m        10.10.1.25      k8s2
	 kube-system   kube-proxy-v4xwj                        1/1       Running   0          28m       192.168.36.11   k8s1
	 kube-system   kube-proxy-z8c7z                        1/1       Running   0          19m       192.168.36.12   k8s2
	 kube-system   kube-scheduler-k8s1                     1/1       Running   0          27m       192.168.36.11   k8s1

@jrajahalme
Member

@aanm Do you know what the significance is of the pod listing's READY column reporting "0/1"? Apparently the cilium state for these endpoints is "ready" and the test is happy to proceed, but then fails. I.e., we wait for the cilium endpoint list to show "ready", but not for the k8s pod READY to be "n/n".

@aanm
Member Author

aanm commented May 9, 2019

@aanm Do you know what the significance is of the pod listing's READY column reporting "0/1"? Apparently the cilium state for these endpoints is "ready" and the test is happy to proceed, but then fails. I.e., we wait for the cilium endpoint list to show "ready", but not for the k8s pod READY to be "n/n".

@jrajahalme ebc77ec
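
To illustrate the distinction discussed above, the two readiness signals could be checked roughly like this (a minimal sketch with hypothetical pod names and label selectors, not the actual test helpers; the real change is in the commit aanm links above):

# Cilium's view: endpoint state as reported by the agent (what the test waited for).
# "cilium-cl4ws" stands in for whichever cilium agent pod runs on the node.
kubectl -n kube-system exec cilium-cl4ws -- cilium endpoint list

# Kubernetes' view: the pod READY column, driven by the kubelet's readiness probes
# (what the test did not wait for). "id=app1" is a placeholder label selector.
kubectl -n default get pods -l id=app1 -o wide

# Also waiting for the k8s READY condition could look like:
kubectl -n default wait --for=condition=Ready pod -l id=app1 --timeout=120s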

@stale

stale bot commented Jul 8, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Jul 8, 2019
@stale

stale bot commented Jul 23, 2019

This issue has not seen any activity since it was marked stale. Closing.

stale bot closed this as completed Jul 23, 2019
@jrajahalme reopened this Jul 28, 2020
stale bot removed the stale label Jul 28, 2020
@jrajahalme
Member

#12548 had the same failure, with the same symptom as in the description. CoreDNS was in CrashLoopBackOff with these logs:

2020-07-28T19:04:49.845705843Z .:53
2020-07-28T19:04:49.845742463Z 2020/07/28 19:04:49 [INFO] CoreDNS-1.2.2
2020-07-28T19:04:49.845746522Z 2020/07/28 19:04:49 [INFO] linux/amd64, go1.11, eb51e8b
2020-07-28T19:04:49.845749987Z CoreDNS-1.2.2
2020-07-28T19:04:49.845753393Z linux/amd64, go1.11, eb51e8b
2020-07-28T19:04:49.845756755Z 2020/07/28 19:04:49 [INFO] plugin/reload: Running configuration MD5 = ffcc993a37738c0d6dd423fdb6ad81b0
2020-07-28T19:04:56.028383249Z 10.0.0.49:48566 - [28/Jul/2020:19:04:56 +0000] 59749 "A IN app1-service.default.svc.cluster.local. udp 80 false 4096" NOERROR qr,aa,rd,ra 134 0.000136696s
2020-07-28T19:04:56.453196361Z 10.0.0.49:54759 - [28/Jul/2020:19:04:56 +0000] 2004 "A IN app1-service.default.svc.cluster.local. udp 80 false 4096" NOERROR qr,rd,ra 134 0.000084368s
2020-07-28T19:04:56.848148387Z 2020/07/28 19:04:56 [FATAL] plugin/loop: Seen "HINFO IN 2264449761600304678.7082390303545810211." more than twice, loop detected

jrajahalme added a commit that referenced this issue Aug 13, 2020
Update k8s 1.12 coredns deployment to image tag 1.2.6 to get the bug
fix for loop detection getting confused due to retries
(coredns/coredns#2391).

Fixes: #6307
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
aanm pushed a commit that referenced this issue Aug 13, 2020
Update k8s 1.12 coredns deployment to image tag 1.2.6 to get the bug
fix for loop detection getting confused due to retries
(coredns/coredns#2391).

Fixes: #6307
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
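
For reference, the fix in the commits above bumps the CoreDNS image used by the k8s 1.12 test manifests to 1.2.6, which contains the loop-detection fix from coredns/coredns#2391. A minimal sketch of that kind of bump, assuming the stock upstream CoreDNS Deployment and image path rather than the exact test manifest:

# Hypothetical example: bump the CoreDNS image to 1.2.6 (deployment name and image path are assumptions).
kubectl -n kube-system set image deployment/coredns coredns=k8s.gcr.io/coredns:1.2.6

# Wait for the rollout and confirm the plugin/loop FATAL no longer appears in the logs.
kubectl -n kube-system rollout status deployment/coredns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

Per the issue header, this issue was ultimately fixed by #12869.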