CI: RuntimeKVStoreTest: Stopping cilium during Runtime tests #11895
Another case in https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/608/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Consul_KVStore/
That build had the new
Also in #11901: https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/620/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Consul_KVStore/ Relevant Cilium logs:
A couple more cases:
Another case:
Related: #11895
Signed-off-by: Paul Chaignon <paul@cilium.io>
Test disabled by #11945.
These two tests started failing on master on May 22, a Friday. Failures happen more often for the Consul test than for the Etcd one.
As for most tests, at the end of RuntimeKVStoreTest, we validate that the logs don't contain any worrisome messages with:

    vm.ValidateNoErrorsOnLogs(CurrentGinkgoTestDescription().Duration)

CurrentGinkgoTestDescription().Duration includes the execution of all ginkgo.BeforeEach [1]. In RuntimeKVStoreTest, one of our ginkgo.BeforeEach stops the Cilium systemd service (because we run cilium-agent as a standalone binary in the test itself). Stopping Cilium can result in worrisome messages in the logs, e.g., if the compilation of BPF programs is terminated abruptly. This in turn makes the tests fail once in a while.

To fix this, we can replace CurrentGinkgoTestDescription().Duration with our own "time counter" that doesn't include any of the ginkgo.BeforeEach executions.

To validate this fix, I ran the whole RuntimeKVStoreTest with this change 60 times locally and 60 times in the CI (#12419). The tests passed all 120 times. Before applying the fix, the Consul test would fail ~1/30 times, both locally and in CI.

1 - https://github.com/onsi/ginkgo/blob/9c254cb251dc962dc20ca91d0279c870095cfcf9/internal/spec/spec.go#L132-L134

Fixes: #11895
Fixes: 5185789 ("Test: Checks for deadlocks panics in logs per each test.")
Related: #12419
Signed-off-by: Paul Chaignon <paul@cilium.io>
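To make the timing argument concrete, here is a minimal, stdlib-only Go sketch of the idea. It assumes the log check simply looks back over a duration and flags timestamped lines that match some patterns. `testTimer`, `startTestTimer`, `errorsSince`, `exampleScenario`, the `worrisome` patterns and the log lines are all hypothetical illustrations; this is not the code of `vm.ValidateNoErrorsOnLogs` or of the actual fix, and it does not call Ginkgo.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// logLine is a simplified, hypothetical log entry: a timestamp and a message.
type logLine struct {
	at  time.Time
	msg string
}

// worrisome holds example patterns that should fail a test (assumption, not
// the real list used by Cilium's test helpers).
var worrisome = []string{"level=error", "panic", "deadlock"}

// testTimer is the hypothetical "time counter": it is started at the
// beginning of the test body, after every BeforeEach has already run.
type testTimer struct{ start time.Time }

func startTestTimer(now time.Time) testTimer { return testTimer{start: now} }

// window is how far back the log check looks: only the test body.
func (t testTimer) window(now time.Time) time.Duration { return now.Sub(t.start) }

// errorsSince returns the worrisome messages logged within the last window.
func errorsSince(logs []logLine, now time.Time, window time.Duration) []string {
	cutoff := now.Add(-window)
	var found []string
	for _, l := range logs {
		if l.at.Before(cutoff) {
			continue // logged before the window, e.g. during BeforeEach
		}
		for _, p := range worrisome {
			if strings.Contains(l.msg, p) {
				found = append(found, l.msg)
				break
			}
		}
	}
	return found
}

// exampleScenario builds made-up logs: an error at t=1s while BeforeEach
// stops the service, the test body starting at t=10s, the check at t=30s.
func exampleScenario() ([]logLine, time.Time, testTimer) {
	base := time.Date(2020, 7, 1, 12, 0, 0, 0, time.UTC)
	logs := []logLine{
		{base.Add(1 * time.Second), `level=error msg="hypothetical error while stopping cilium"`},
		{base.Add(20 * time.Second), `level=info msg="hypothetical message from the test body"`},
	}
	return logs, base.Add(30 * time.Second), startTestTimer(base.Add(10 * time.Second))
}

func main() {
	logs, now, timer := exampleScenario()
	// Window that also covers BeforeEach (like the whole spec duration): 30s.
	fmt.Println(len(errorsSince(logs, now, 30*time.Second))) // 1
	// Window from the hypothetical test-body timer: 20s.
	fmt.Println(len(errorsSince(logs, now, timer.window(now)))) // 0
}
```

With the whole-spec window, the error logged while BeforeEach stops the service is counted; with the test-body window it falls before the cutoff and is ignored, which is the behaviour the fix is after.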
[ upstream commit e558100 ] Backport of the commit above; the commit message is identical.
RuntimeKVStoreTest Consul KVStore and RuntimeKVStoreTest Etcd KVStore fail often, with different error messages. The root cause, found in the logs, seems to be:

RuntimeKVStoreTest Consul KVStore usually fails with:
because the BPF compilation is interrupted when Cilium is stopped.
https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/4989/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Etcd_KVStore/
https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/4908/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Etcd_KVStore/
15d82b0c_RuntimeKVStoreTest_Etcd_KVStore.zip
2533d569_RuntimeKVStoreTest_Etcd_KVStore.zip
RuntimeKVStoreTest Etcd KVStore usually fails with:
because Cilium is no longer running.
https://jenkins.cilium.io/job/Ginkgo-CI-Tests-runtime-Pipeline/416/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Consul_KVStore/
https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/4989/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Consul_KVStore/
test_results_Ginkgo-CI-Tests-runtime-Pipeline_416_BDD-Test-PR.zip
test_results_master_4989_BDD-Test-PR-runtime.zip