Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf: improve DumpReliablyWithCallback #9972

Merged
merged 3 commits into from
Jan 31, 2020

Conversation

rolinh
Copy link
Member

@rolinh rolinh commented Jan 27, 2020

DumpReliablyWithCallback() should now be more resilient although it may still fail under heavy concurrent update situations, in which case it returns an appropriate error.

The upper bound limit was set to 4 * stats.MaxEntries as suggested by @borkmann, which is probably a reasonable default.

See also the discussion at cilium/ebpf#11

Fixes: #9811


This change is Reviewable

@rolinh rolinh requested a review from a team January 27, 2020 12:50
@coveralls
Copy link

coveralls commented Jan 27, 2020

Coverage Status

Coverage decreased (-0.01%) to 44.758% when pulling e18253a on pr/rolinh/replace-DumpReliablyWithCallback into 7b6d34d on master.

@rolinh rolinh added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Jan 27, 2020
@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@rolinh rolinh added kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Jan 27, 2020
@gandro
Copy link
Member

gandro commented Jan 27, 2020

test-me-please

@tgraf tgraf added the kind/bug This is a bug in the Cilium logic. label Jan 27, 2020
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tidyup, one main note below on the number of iterations. Did you mean to switch it to N iteration failures rather than N iterations total?

pkg/bpf/map_linux.go Outdated Show resolved Hide resolved
pkg/bpf/map_linux_test.go Outdated Show resolved Hide resolved
@rolinh
Copy link
Member Author

rolinh commented Jan 28, 2020

Nice tidyup, one main note below on the number of iterations. Did you mean to switch it to N iteration failures rather than N iterations total?

No, I didn't 🙈 Lemme fix that.

@rolinh rolinh force-pushed the pr/rolinh/replace-DumpReliablyWithCallback branch 2 times, most recently from 629cf2e to 676b21e Compare January 28, 2020 13:36
@rolinh
Copy link
Member Author

rolinh commented Jan 28, 2020

test-me-please

1 similar comment
@rolinh
Copy link
Member Author

rolinh commented Jan 28, 2020

test-me-please

@joestringer
Copy link
Member

joestringer commented Jan 28, 2020

Provisioning failure above: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/16703/

Re-running CI.

@aanm aanm added this to the 1.7 milestone Jan 28, 2020
@joestringer
Copy link
Member

test-me-please

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@joestringer
Copy link
Member

@rolinh
Copy link
Member Author

rolinh commented Jan 29, 2020

That's interesting. I had multiple runs locally and none failed. It's not clear what the error is in the CI though. Timeout? I'll try decreasing the map size.

@rolinh rolinh force-pushed the pr/rolinh/replace-DumpReliablyWithCallback branch from 676b21e to e3c598a Compare January 29, 2020 08:33
@rolinh
Copy link
Member Author

rolinh commented Jan 29, 2020

test-me-please

@aanm
Copy link
Member

aanm commented Jan 29, 2020

@rolinh the privileged unit tests failed again

@rolinh
Copy link
Member Author

rolinh commented Jan 29, 2020

Unfortunately, there isn't much useful information from the CI logs (or I didn't find it).

Thus, I ran the privileged tests on my dev machine 100+ times without hitting a failure (for i in $(seq 100); do sudo make tests-privileged || exit 1; done), arch linux with kernel 5.4.15-arch1-1.

Then I tried running the privileged tests on a dev VM (Ubuntu with kernel 4.9.160-0409160-generic) and there I managed to get a failure and a trace:

vagrant@runtime1:~/go/src/github.com/cilium/cilium$ sudo make tests-privileged
# cilium-map-migrate is a dependency of some unit tests.
make -C bpf cilium-map-migrate
make[1]: Entering directory '/home/vagrant/go/src/github.com/cilium/cilium/bpf'
make[1]: 'cilium-map-migrate' is up to date.
make[1]: Leaving directory '/home/vagrant/go/src/github.com/cilium/cilium/bpf'
go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/cilium/cmd -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/bpf -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux/config -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux/ipsec -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux/probes -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux/route -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/loader -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/fqdn/dnsproxy -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/maps/ctmap -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/maps/eppolicymap -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/maps/policymap -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/mtu -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;
=== RUN   Test
START: bpf_ipcache_get_test.go:58: BPFIPCacheGetSuite.TestGetLPMValue
PASS: bpf_ipcache_get_test.go:58: BPFIPCacheGetSuite.TestGetLPMValue	0.000s

START: bpf_ipcache_get_test.go:32: BPFIPCacheGetSuite.TestGetPrefix
PASS: bpf_ipcache_get_test.go:32: BPFIPCacheGetSuite.TestGetPrefix	0.002s

START: helpers_test.go:39: CMDHelpersSuite.TestExpandNestedJSON
PASS: helpers_test.go:39: CMDHelpersSuite.TestExpandNestedJSON	0.001s

START: helpers_test.go:576: CMDHelpersSuite.TestParsePolicyUpdateArgsHelper
PASS: helpers_test.go:576: CMDHelpersSuite.TestParsePolicyUpdateArgsHelper	0.000s

START: helpers_test.go:551: CMDHelpersSuite.TestParseTrafficString
PASS: helpers_test.go:551: CMDHelpersSuite.TestParseTrafficString	0.000s

OK: 5 passed
--- PASS: Test (0.00s)
PASS
ok  	github.com/cilium/cilium/cilium/cmd	0.033s
level=info msg="Detected mounted BPF filesystem at /sys/fs/bpf" subsys=bpf
=== RUN   Test
START: map_linux_test.go:250: BPFPrivilegedTestSuite.TestBasicManipulation
PASS: map_linux_test.go:250: BPFPrivilegedTestSuite.TestBasicManipulation	0.000s

START: map_linux_test.go:521: BPFPrivilegedTestSuite.TestCheckAndUpgrade
level=warning msg="Flags mismatch for BPF map" file-path=/sys/fs/bpf/tc/globals/cilium_test_upgrade new=0 old=1 subsys=bpf
level=warning msg="Removing map to allow for property upgrade (expect map data loss)" file-path=/sys/fs/bpf/tc/globals/cilium_test_upgrade subsys=bpf
PASS: map_linux_test.go:521: BPFPrivilegedTestSuite.TestCheckAndUpgrade	0.001s

START: map_linux_test.go:481: BPFPrivilegedTestSuite.TestDeleteAll
PASS: map_linux_test.go:481: BPFPrivilegedTestSuite.TestDeleteAll	0.000s

START: map_linux_test.go:319: BPFPrivilegedTestSuite.TestDump
PASS: map_linux_test.go:319: BPFPrivilegedTestSuite.TestDump	0.000s

START: map_linux_test.go:390: BPFPrivilegedTestSuite.TestDumpReliablyWithCallback
panic: test timed out after 6m0s

goroutine 57 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:1377 +0xdf
created by time.goFunc
	/usr/local/go/src/time/sleep.go:168 +0x44

goroutine 1 [chan receive, 6 minutes]:
testing.(*T).Run(0xc0002fe100, 0xd18416, 0x4, 0xd4e2e8, 0x4c44f6)
	/usr/local/go/src/testing/testing.go:961 +0x377
testing.runTests.func1(0xc0002fe000)
	/usr/local/go/src/testing/testing.go:1202 +0x78
testing.tRunner(0xc0002fe000, 0xc000621ca0)
	/usr/local/go/src/testing/testing.go:909 +0xc9
testing.runTests(0xc0003f6380, 0x139b6b0, 0x1, 0x1, 0x0)
	/usr/local/go/src/testing/testing.go:1200 +0x2a7
testing.(*M).Run(0xc0004ea900, 0x0)
	/usr/local/go/src/testing/testing.go:1117 +0x176
github.com/cilium/cilium/pkg/bpf.runTests(0xc0004ea900, 0x0, 0x0, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:88 +0x126
github.com/cilium/cilium/pkg/bpf.TestMain(0xc0004ea900)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:92 +0x2f
main.main()
	_testmain.go:42 +0x135

goroutine 83 [chan receive, 6 minutes]:
gopkg.in/check%2ev1.(*suiteRunner).runTest(...)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:819
gopkg.in/check%2ev1.(*suiteRunner).run(0xc000436080, 0x13d0ce0)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:624 +0x11d
gopkg.in/check%2ev1.Run(0xcbf880, 0x13d0ce0, 0xc00007c708, 0xd0)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/run.go:92 +0x4d
gopkg.in/check%2ev1.RunAll(0xc00008ef08, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/run.go:84 +0x96
gopkg.in/check%2ev1.TestingT(0xc0002fe100)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/run.go:72 +0x395
github.com/cilium/cilium/pkg/bpf.Test(0xc0002fe100)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:34 +0x2b
testing.tRunner(0xc0002fe100, 0xd4e2e8)
	/usr/local/go/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:960 +0x350

goroutine 84 [select, 6 minutes]:
gopkg.in/check%2ev1.(*resultTracker)._loopRoutine(0xc0002c2360)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:470 +0xc4
created by gopkg.in/check%2ev1.(*resultTracker).start
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:450 +0x3f

goroutine 89 [semacquire, 6 minutes]:
sync.runtime_Semacquire(0xc000349408)
	/usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc000349400)
	/usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback(0x13d0ce0, 0xc000308780)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:478 +0x478
reflect.Value.call(0xcbf880, 0x13d0ce0, 0x1213, 0xd182fa, 0x4, 0xc00008df30, 0x1, 0x1, 0x13d2be0, 0xc00008ddd0, ...)
	/usr/local/go/src/reflect/value.go:460 +0x5f6
reflect.Value.Call(0xcbf880, 0x13d0ce0, 0x1213, 0xc00008df30, 0x1, 0x1, 0x0, 0xe, 0x8)
	/usr/local/go/src/reflect/value.go:321 +0xb4
gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1(0xc000308780)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0x5e4
gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1(0xc000436080, 0xc000308780, 0xc0003f72a0)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xa2
created by gopkg.in/check%2ev1.(*suiteRunner).forkCall
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x1fb

goroutine 90 [runnable]:
github.com/cilium/cilium/pkg/bpf.(*Map).Update.func1(0xc000308870, 0xc000093ee8, 0xe44940, 0xc0006045ac, 0xe3ff80, 0xc0006045b0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:810 +0x132
github.com/cilium/cilium/pkg/bpf.(*Map).Update(0xc000308870, 0xe44940, 0xc0006045ac, 0xe3ff80, 0xc0006045b0, 0x0, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:826 +0x333
github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback.func2(0xc000349400, 0xc0002fa540, 0xc0002fa5a0, 0xc000308870, 0xc000308780)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:432 +0x1ff
created by github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:422 +0x3df

goroutine 72 [syscall, 5 minutes]:
syscall.Syscall(0x141, 0x4, 0xc000428d68, 0x20, 0xc116a0, 0xc0004d6180, 0xc0008798b0)
	/usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
github.com/cilium/cilium/pkg/bpf.GetNextKeyFromPointers(0x5, 0xc000428d68, 0x20, 0xc00088e6f0, 0x4)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/bpf_linux.go:218 +0x7a
github.com/cilium/cilium/pkg/bpf.(*Map).DumpReliablyWithCallback(0xc000308870, 0xc000428f40, 0xc000428f58, 0x0, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:697 +0x439
github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback.func3(0xc000349400, 0xc000000100, 0xc000308780, 0xc000308870, 0xc0002fa5a0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:468 +0x315
created by github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:447 +0x467
FAIL	github.com/cilium/cilium/pkg/bpf	360.125s
FAIL
Makefile:97: recipe for target 'tests-privileged' failed
make: *** [tests-privileged] Error 1

goroutine 90 and goroutine 72 are the interesting ones here. We can see that goroutine 90 was in the process of updating the map (github.com/cilium/cilium/pkg/bpf.(*Map).Update()) while goroutine 72 was doing the BPF_MAP_GET_NEXT_KEY syscall in github.com/cilium/cilium/pkg/bpf.GetNextKeyFromPointers().

Then another attempt, stuck again:

vagrant@runtime1:~/go/src/github.com/cilium/cilium$ sudo make tests-privileged
# cilium-map-migrate is a dependency of some unit tests.
make -C bpf cilium-map-migrate
make[1]: Entering directory '/home/vagrant/go/src/github.com/cilium/cilium/bpf'
make[1]: 'cilium-map-migrate' is up to date.
make[1]: Leaving directory '/home/vagrant/go/src/github.com/cilium/cilium/bpf'
go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/cilium/cmd -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/bpf -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux/config -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux/ipsec -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux/probes -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/linux/route -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/datapath/loader -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/fqdn/dnsproxy -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/maps/ctmap -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/maps/eppolicymap -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/maps/policymap -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;  go test -mod=vendor -ldflags "-X github.com/cilium/cilium/pkg/kvstore.consulDummyAddress=https://consul:8443 -X github.com/cilium/cilium/pkg/kvstore.etcdDummyAddress=http://etcd:4002 -X github.com/cilium/cilium/pkg/testutils.CiliumRootDir=/home/vagrant/go/src/github.com/cilium/cilium -X github.com/cilium/cilium/pkg/datapath.DatapathSHA=1234567890abcdef7890" github.com/cilium/cilium/pkg/mtu -test.v -timeout 360s -check.vv -tags=privileged_tests || exit 1;
=== RUN   Test
START: bpf_ipcache_get_test.go:58: BPFIPCacheGetSuite.TestGetLPMValue
PASS: bpf_ipcache_get_test.go:58: BPFIPCacheGetSuite.TestGetLPMValue	0.000s

START: bpf_ipcache_get_test.go:32: BPFIPCacheGetSuite.TestGetPrefix
PASS: bpf_ipcache_get_test.go:32: BPFIPCacheGetSuite.TestGetPrefix	0.001s

START: helpers_test.go:39: CMDHelpersSuite.TestExpandNestedJSON
PASS: helpers_test.go:39: CMDHelpersSuite.TestExpandNestedJSON	0.001s

START: helpers_test.go:576: CMDHelpersSuite.TestParsePolicyUpdateArgsHelper
PASS: helpers_test.go:576: CMDHelpersSuite.TestParsePolicyUpdateArgsHelper	0.000s

START: helpers_test.go:551: CMDHelpersSuite.TestParseTrafficString
PASS: helpers_test.go:551: CMDHelpersSuite.TestParseTrafficString	0.000s

OK: 5 passed
--- PASS: Test (0.00s)
PASS
ok  	github.com/cilium/cilium/cilium/cmd	0.038s
level=info msg="Detected mounted BPF filesystem at /sys/fs/bpf" subsys=bpf
=== RUN   Test
START: map_linux_test.go:250: BPFPrivilegedTestSuite.TestBasicManipulation
PASS: map_linux_test.go:250: BPFPrivilegedTestSuite.TestBasicManipulation	0.000s

START: map_linux_test.go:521: BPFPrivilegedTestSuite.TestCheckAndUpgrade
level=warning msg="Flags mismatch for BPF map" file-path=/sys/fs/bpf/tc/globals/cilium_test_upgrade new=0 old=1 subsys=bpf
level=warning msg="Removing map to allow for property upgrade (expect map data loss)" file-path=/sys/fs/bpf/tc/globals/cilium_test_upgrade subsys=bpf
PASS: map_linux_test.go:521: BPFPrivilegedTestSuite.TestCheckAndUpgrade	0.000s

START: map_linux_test.go:481: BPFPrivilegedTestSuite.TestDeleteAll
PASS: map_linux_test.go:481: BPFPrivilegedTestSuite.TestDeleteAll	0.000s

START: map_linux_test.go:319: BPFPrivilegedTestSuite.TestDump
PASS: map_linux_test.go:319: BPFPrivilegedTestSuite.TestDump	0.000s

START: map_linux_test.go:390: BPFPrivilegedTestSuite.TestDumpReliablyWithCallback
panic: test timed out after 6m0s

goroutine 38 [running]:
testing.(*M).startAlarm.func1()
	/usr/local/go/src/testing/testing.go:1377 +0xdf
created by time.goFunc
	/usr/local/go/src/time/sleep.go:168 +0x44

goroutine 1 [chan receive, 6 minutes]:
testing.(*T).Run(0xc000102600, 0xd18416, 0x4, 0xd4e2e8, 0x4c44f6)
	/usr/local/go/src/testing/testing.go:961 +0x377
testing.runTests.func1(0xc000102400)
	/usr/local/go/src/testing/testing.go:1202 +0x78
testing.tRunner(0xc000102400, 0xc000609ca0)
	/usr/local/go/src/testing/testing.go:909 +0xc9
testing.runTests(0xc0000d9ec0, 0x139b6b0, 0x1, 0x1, 0x0)
	/usr/local/go/src/testing/testing.go:1200 +0x2a7
testing.(*M).Run(0xc0004e1f00, 0x0)
	/usr/local/go/src/testing/testing.go:1117 +0x176
github.com/cilium/cilium/pkg/bpf.runTests(0xc0004e1f00, 0x0, 0x0, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:88 +0x126
github.com/cilium/cilium/pkg/bpf.TestMain(0xc0004e1f00)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:92 +0x2f
main.main()
	_testmain.go:42 +0x135

goroutine 98 [chan receive, 6 minutes]:
gopkg.in/check%2ev1.(*suiteRunner).runTest(...)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:819
gopkg.in/check%2ev1.(*suiteRunner).run(0xc00059fd80, 0x13d0ce0)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:624 +0x11d
gopkg.in/check%2ev1.Run(0xcbf880, 0x13d0ce0, 0xc000485708, 0xd0)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/run.go:92 +0x4d
gopkg.in/check%2ev1.RunAll(0xc00008cf08, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/run.go:84 +0x96
gopkg.in/check%2ev1.TestingT(0xc000102600)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/run.go:72 +0x395
github.com/cilium/cilium/pkg/bpf.Test(0xc000102600)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:34 +0x2b
testing.tRunner(0xc000102600, 0xd4e2e8)
	/usr/local/go/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:960 +0x350

goroutine 37 [syscall, 5 minutes]:
syscall.Syscall(0x141, 0x4, 0xc000092d68, 0x20, 0xc116a0, 0xc0003b4180, 0xc000398770)
	/usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
github.com/cilium/cilium/pkg/bpf.GetNextKeyFromPointers(0x5, 0xc000092d68, 0x20, 0xc0002df650, 0x4)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/bpf_linux.go:218 +0x7a
github.com/cilium/cilium/pkg/bpf.(*Map).DumpReliablyWithCallback(0xc0001d6960, 0xc000092f40, 0xc000092f58, 0x0, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:697 +0x439
github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback.func3(0xc00029c830, 0xc000000100, 0xc0001d6870, 0xc0001d6960, 0xc0000c0840)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:468 +0x315
created by github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:447 +0x467

goroutine 99 [select, 6 minutes]:
gopkg.in/check%2ev1.(*resultTracker)._loopRoutine(0xc0001927e0)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:470 +0xc4
created by gopkg.in/check%2ev1.(*resultTracker).start
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:450 +0x3f

goroutine 84 [semacquire, 6 minutes]:
sync.runtime_Semacquire(0xc00029c838)
	/usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc00029c830)
	/usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback(0x13d0ce0, 0xc0001d6870)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:478 +0x478
reflect.Value.call(0xcbf880, 0x13d0ce0, 0x1213, 0xd182fa, 0x4, 0xc00008df30, 0x1, 0x1, 0x13d2be0, 0xc00008ddd0, ...)
	/usr/local/go/src/reflect/value.go:460 +0x5f6
reflect.Value.Call(0xcbf880, 0x13d0ce0, 0x1213, 0xc00008df30, 0x1, 0x1, 0x0, 0xe, 0x8)
	/usr/local/go/src/reflect/value.go:321 +0xb4
gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1(0xc0001d6870)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0x5e4
gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1(0xc00059fd80, 0xc0001d6870, 0xc0003b6c40)
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xa2
created by gopkg.in/check%2ev1.(*suiteRunner).forkCall
	/home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x1fb

goroutine 85 [runnable]:
github.com/cilium/cilium/pkg/bpf.(*Map).deleteCacheEntry(0xc0001d6960, 0xe44940, 0xc000640890, 0x0, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:835 +0x214
github.com/cilium/cilium/pkg/bpf.(*Map).DeleteWithErrno(0xc0001d6960, 0xe44940, 0xc000640890, 0x0, 0x0, 0x0)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:881 +0x2ca
github.com/cilium/cilium/pkg/bpf.(*Map).Delete(...)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:885
github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback.func2(0xc00029c830, 0xc0000c07e0, 0xc0000c0840, 0xc0001d6960, 0xc0001d6870)
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:437 +0x10e
created by github.com/cilium/cilium/pkg/bpf.(*BPFPrivilegedTestSuite).TestDumpReliablyWithCallback
	/home/vagrant/go/src/github.com/cilium/cilium/pkg/bpf/map_linux_test.go:422 +0x3df
FAIL	github.com/cilium/cilium/pkg/bpf	360.062s
FAIL
Makefile:97: recipe for target 'tests-privileged' failed
make: *** [tests-privileged] Error 1

This time, we have goroutine 85 trying to delete a map entry (github.com/cilium/cilium/pkg/bpf.(*Map).deleteCacheEntry() while goroutine 37 is at the BPF_MAP_GET_NEXT_KEY syscall in github.com/cilium/cilium/pkg/bpf.GetNextKeyFromPointers() again.

I managed to reproduce this issue several times and the only constant I see is that the BPF_MAP_GET_NEXT_KEY syscall hangs. The other goroutines are not always at the same execution point when the test is interrupted by the timeout.
As of note, the tests also succeed quite often on the dev VM.

I'm probably missing something at this point. Could someone else try to reproduce the failure and have a look at it?

pkg/bpf/map_linux_test.go Outdated Show resolved Hide resolved
pkg/bpf/map_linux_test.go Outdated Show resolved Hide resolved
@rolinh
Copy link
Member Author

rolinh commented Jan 30, 2020

I tried to reproduce this with the latest 4.9 kernel (4.9.212 at the moment) and was unable even after 20+ runs whereas with kernel 4.9.160, the issue manifests every other test run.

This test is probably triggering the data race fixed by this kernel commit which was introduced in 4.9.177. In order to make sure, I should check if I can reproduce with kernel 4.9.176 and then with kernel 4.9.177.

Do we simply bump the VM kernel to 4.9.212 at this point?

brb added a commit to cilium/packer-ci-build that referenced this pull request Jan 30, 2020
This should resolve BPF_MAP_GET_NEXT_KEY getting stuck issue mentioned
in [1].

[1]: cilium/cilium#9972 (comment)

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb
Copy link
Member

brb commented Jan 30, 2020

Do we simply bump the VM kernel to 4.9.212 at this point?

It's coming cilium/packer-ci-build#184!

@rolinh
Copy link
Member Author

rolinh commented Jan 30, 2020

This test is probably triggering the data race fixed by this kernel commit which was introduced in 4.9.177. In order to make sure, I should check if I can reproduce with kernel 4.9.176 and then with kernel 4.9.177.

So I did that and the result is:

  • kernel 4.9.176 -> hit the issue very easily
  • kernel 4.9.177 -> 100 test run without triggering

This confirms the issue. Let's wait for our CI images to use the updated kernel (thanks @brb !) and re-run the CI tests.

pkg/bpf/map_linux_test.go Outdated Show resolved Hide resolved
@joestringer
Copy link
Member

Seems that we're waiting on #10001 ?

BPF_MAP_GET_NEXT_KEY returns ENOENT when all keys have been iterated.
This commit modifies GetNextKeyFromPointers() (and thus GetFirstKey())
to return io.EOF when the BPF_MAP_GET_NEXT_KEY syscall returns ENOENT in
order to signify that there are no more elements to iterate over.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
GetFirstKey()/GetNextKeyFromPointers() return io.EOF when there are no
more elements to iterate over the map. Any other error is actually an
error.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
DumpReliablyWithCallback() attempts to dump the map without acquiring a
lock. If the map is being updated at the same time it is being iterated
over its elements by the dump function, the iteration loop backtracks in
order to catch the change(s). However, an upper bound limit for
backtracking is required in order to avoid hanging by iterating
"forever" in a worst case scenario.

This commit improves this function by returning when io.EOF is returned
by GetNextKeyFromPointer or when the max lookup limit has been reached
(which is now set to 4 times the map capacity as opposed to only the
map's capacity as in the previous implementation). This should make the
function more reliable on concurrent map modifications.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
@rolinh rolinh force-pushed the pr/rolinh/replace-DumpReliablyWithCallback branch from 13c0c96 to e18253a Compare January 31, 2020 13:33
@rolinh
Copy link
Member Author

rolinh commented Jan 31, 2020

Rebased against master to pickup #10001.

@brb
Copy link
Member

brb commented Jan 31, 2020

test-me-please

@maintainer-s-little-helper
Copy link

Commit 8b56e2f064cd02add5f1197318d898f8964007e6 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/contributing/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jan 31, 2020
@rolinh rolinh force-pushed the pr/rolinh/replace-DumpReliablyWithCallback branch from 8b56e2f to e18253a Compare January 31, 2020 15:34
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jan 31, 2020
@rolinh
Copy link
Member Author

rolinh commented Jan 31, 2020

Sorry, I inadvertently clicked the "Update branch" button which merged master into this branch. Forced pushed to revert.

@aanm aanm merged commit e50037b into master Jan 31, 2020
@aanm aanm deleted the pr/rolinh/replace-DumpReliablyWithCallback branch January 31, 2020 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace DumpReliablyWithCallback with a more reliable function
8 participants