
CI: RuntimeKVStoreTest: Stopping cilium during Runtime tests #11895

Closed
pchaigno opened this issue Jun 4, 2020 · 7 comments · Fixed by #12478
Assignees
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!
Projects

Comments

@pchaigno (Member) commented Jun 4, 2020:

RuntimeKVStoreTest Consul KVStore and RuntimeKVStoreTest Etcd KVStore are failing often with different error messages. The root cause, found in the logs, seems to be:

Jun 03 01:40:59 runtime systemd[1]: Stopping cilium...

RuntimeKVStoreTest Consul KVStore usually fails with:

/home/jenkins/workspace/Ginkgo-CI-Tests-runtime-Pipeline/runtime-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:421
Found 2 Cilium logs matching list of errors that must be investigated
/home/jenkins/workspace/Ginkgo-CI-Tests-runtime-Pipeline/runtime-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:419
Standard output
⚠️  Found a "JoinEP: " in logs
⚠️  Found a "JoinEP: " in logs

because the BPF compilation gets interrupted by Cilium stopping.
https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/4989/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Etcd_KVStore/
https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/4908/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Etcd_KVStore/
15d82b0c_RuntimeKVStoreTest_Etcd_KVStore.zip
2533d569_RuntimeKVStoreTest_Etcd_KVStore.zip


RuntimeKVStoreTest Etcd KVStore usually fails with:

Error getting names of endpoints from cilium
Expected
    <*errors.errorString | 0xc0011ce610>: {
        s: "`cilium endpoint list` was not successful",
    }
to be nil

because Cilium is not running anymore.
https://jenkins.cilium.io/job/Ginkgo-CI-Tests-runtime-Pipeline/416/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Consul_KVStore/
https://jenkins.cilium.io/job/cilium-ginkgo/job/cilium/job/master/4989/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Consul_KVStore/
test_results_Ginkgo-CI-Tests-runtime-Pipeline_416_BDD-Test-PR.zip
test_results_master_4989_BDD-Test-PR-runtime.zip

@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Jun 4, 2020
@pchaigno pchaigno self-assigned this Jun 4, 2020
@pchaigno pchaigno added this to To Do (1.8 - Daily Flakes) in CI Force Jun 4, 2020
@pchaigno pchaigno moved this from To Do (1.8 - Daily Flakes) to In Progress (Cilium) in CI Force Jun 4, 2020
@pchaigno (Member, Author) commented Jun 4, 2020:

Another case in https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/608/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Consul_KVStore/
test_results_Cilium-PR-Runtime-4.9_608_BDD-Test-PR.zip

That build had the new lscpu output, so we know it's not the CPU frequency issue from Packet that is causing this.

@jrajahalme (Member) commented:

Also in #11901: https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/620/testReport/junit/(root)/Suite-runtime/RuntimeKVStoreTest_Consul_KVStore/

Relevant Cilium logs:

Jun 04 22:12:35 runtime cilium-agent[12006]: level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=7.714184ms bpfWaitForELF="63.251µs" bpfWriteELF="767.092µs" buildDuration=12.665186ms containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=3702 error="Failed to load tc filter: Command execution failed for [tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj 3702_next/bpf_host.o sec to-host]: context canceled" identity=1 ipv4= ipv6= k8sPodName=/ mapSync="91.439µs" policyCalculation="221.979µs" prepareBuild="278.9µs" proxyConfiguration="80.909µs" proxyPolicyCalculation="221.385µs" proxyWaitForAck=0s reason="agent configuration update" subsys=endpoint waitingForCTClean="4.048µs" waitingForLock="24.347µs"
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Error regenerating endpoint: Failed to load tc filter: Command execution failed for [tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj 3702_next/bpf_host.o sec to-host]: context canceled" code=Failure containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=3702 endpointState=ready identity=1 ipv4= ipv6= k8sPodName=/ policyRevision=20 subsys=endpoint type=200
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="EventQueue event processing statistics" eventConsumeOffQueueWaitTime="554.604µs" eventEnqueueWaitTime="6.108µs" eventHandlingDuration=13.981773ms eventType="*endpoint.EndpointRegenerationEvent" name=endpoint-3702 subsys=eventqueue
Jun 04 22:12:35 runtime cilium-agent[12006]: level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=3702 error="Failed to load tc filter: Command execution failed for [tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj 3702_next/bpf_host.o sec to-host]: context canceled" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Process exited" cmd="ip [netns exec cilium-health cilium-health-responder --pidfile /var/run/cilium/state/health-endpoint.pid]" exitCode="<nil>" subsys=launcher
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="NAME cilium-agent STATUS S PID 12006 CPU: 1.28% MEM: 1.29% CMDLINE: /usr/bin/cilium-agent --kvstore consul --kvstore-opt consul.address=127.0.0.1:8500 --debug --pprof=true --log-system-load --tofqdns-enable-poller=true --exclude-local-address=172.17.0.1/32 --exclude-local-address=192.168.254.254/32 --exclude-local-address=fdff::ff/128 MEM-EXT: RSS: 103 VMS: 753 Data: 0 Stack: 0 Locked: 0 Swap: 0" endpointID=3702 subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=error msg="Command execution failed" cmd="[tc filter replace dev lxc_health ingress prio 1 handle 1 bpf da obj 2594_next/bpf_lxc.o sec from-container]" error="exit status 1" subsys=datapath-loader
Jun 04 22:12:35 runtime cilium-agent[12006]: level=warning msg="Cannot find device \"lxc_health\"" subsys=datapath-loader
Jun 04 22:12:35 runtime cilium-agent[12006]: level=warning msg="JoinEP: Failed to load program" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 error="Failed to load tc filter: exit status 1" file-path=2594_next/bpf_lxc.o identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=datapath-loader veth=lxc_health
Jun 04 22:12:35 runtime cilium-agent[12006]: level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 error="Failed to load tc filter: exit status 1" identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Reverting endpoint changes after BPF regeneration failed" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Reverting proxy redirect removals" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Finished reverting proxy redirect removals" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Reverting proxy redirect additions" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Finished reverting proxy redirect additions" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Reverting xDS network policy update" subsys=envoy-manager
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" subsys=xds xdsCachedVersion=228 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="updating resource in cache" subsys=xds xdsCachedVersion=228 xdsResourceName="f00d::a0f:0:0:1976" xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="committing cache transaction and notifying of new version" subsys=xds xdsCachedVersion=228 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" subsys=xds xdsCachedVersion=229 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="updating resource in cache" subsys=xds xdsCachedVersion=229 xdsResourceName=10.15.142.199 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="committing cache transaction and notifying of new version" subsys=xds xdsCachedVersion=229 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Finished reverting xDS network policy update" subsys=envoy-manager
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Finished reverting endpoint changes after BPF regeneration failed" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 file-path=2594_next_fail identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="removing directory" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 directory=2594_next_fail endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="removing directory" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 directory=2594_next endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Completed endpoint regeneration with no pending regeneration requests" code=OK containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 endpointState=ready identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ policyRevision=20 subsys=endpoint type=0
Jun 04 22:12:35 runtime cilium-agent[12006]: level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=16.39825ms bpfWaitForELF="19.134µs" bpfWriteELF="585.219µs" buildDuration=26.6262ms containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 error="Failed to load tc filter: exit status 1" identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ mapSync="66.944µs" policyCalculation="219.725µs" prepareBuild="289.595µs" proxyConfiguration="14.152µs" proxyPolicyCalculation="107.483µs" proxyWaitForAck=0s reason="agent configuration update" subsys=endpoint waitingForCTClean="7.815µs" waitingForLock="23.984µs"
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="Error regenerating endpoint: Failed to load tc filter: exit status 1" code=Failure containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 endpointState=ready identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ policyRevision=20 subsys=endpoint type=200
Jun 04 22:12:35 runtime cilium-agent[12006]: level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 error="Failed to load tc filter: exit status 1" identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint
Jun 04 22:12:35 runtime cilium-agent[12006]: level=debug msg="received signal that regeneration failed" containerID= datapathPolicyRevision=20 desiredPolicyRevision=21 endpointID=2594 identity=4 ipv4=10.15.142.199 ipv6="f00d::a0f:0:0:1976" k8sPodName=/ subsys=endpoint

@pchaigno pchaigno changed the title CI: Stopping cilium during Runtime tests CI: RuntimeKVStoreTest: Stopping cilium during Runtime tests Jun 8, 2020
pchaigno added a commit that referenced this issue Jun 8, 2020
Related: #11895
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno (Member, Author) commented Jun 8, 2020:

Test disabled by #11945.

nebril pushed a commit that referenced this issue Jun 8, 2020
Related: #11895
Signed-off-by: Paul Chaignon <paul@cilium.io>
aanm pushed a commit that referenced this issue Jun 9, 2020
[ upstream commit e158b63 ]

Related: #11895
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
aanm pushed a commit that referenced this issue Jun 10, 2020
[ upstream commit e158b63 ]

Related: #11895
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
@pchaigno (Member, Author) commented Jun 23, 2020:

These two tests started failing on master on May 22, a Friday.
We merged 50 pull requests into master that week.

Failures happen more often for the Consul test than for the Etcd one.

pchaigno added a commit that referenced this issue Jul 9, 2020
As for most tests, at the end of RuntimeKVStoreTest, we validate the logs
don't contain any worrisome messages with:

    vm.ValidateNoErrorsOnLogs(CurrentGinkgoTestDescription().Duration)

CurrentGinkgoTestDescription().Duration includes the execution of all
ginkgo.BeforeEach [1]. In RuntimeKVStoreTest, one of our
ginkgo.BeforeEach stops the Cilium systemd service (because we run
cilium-agent as a standalone binary in the test itself). Stopping Cilium
can result in worrisome messages in the logs e.g., if the compilation of
BPF programs is terminated abruptly. This in turn makes the tests fail
once in a while.

To fix this, we can replace CurrentGinkgoTestDescription().Duration with
our own "time counter" that doesn't include any of the
ginkgo.BeforeEach executions.

To validate this fix, I ran the whole RuntimeKVStoreTest with this
change 60 times locally and 60 times in the CI (#12419). The tests
passed all 120 times. Before applying the fix, the Consul test would
fail ~1/30 times, both locally and in CI.

1 - https://github.com/onsi/ginkgo/blob/9c254cb251dc962dc20ca91d0279c870095cfcf9/internal/spec/spec.go#L132-L134
Fixes: #11895
Fixes: 5185789 ("Test: Checks for deadlocks panics in logs per each test.")
Related: #12419
Signed-off-by: Paul Chaignon <paul@cilium.io>
CI Force automation moved this from In Progress (Cilium) to Fixed / Done Jul 10, 2020
nebril pushed a commit that referenced this issue Jul 10, 2020 (same commit message)
pchaigno added a commit that referenced this issue Jul 21, 2020 (same commit message)
[ upstream commit e558100 ]
rolinh pushed a commit that referenced this issue Jul 21, 2020 (same commit message)
[ upstream commit e558100 ]