Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clustermesh-apiserver: fixed nil pointer dereference #18957

Merged
merged 1 commit into from
May 3, 2022

Conversation

abocim
Copy link
Contributor

@abocim abocim commented Feb 28, 2022

Fixes nil pointer dereference while clustermesh-apiserver is starting and new CEW with identity 0
is existing in etcd

In original implementation clustermesh-apiserver and etcd are running
in separate containers within one pod. So a new empty etcd is created while
clustermesh-apiserver is starting.

In our usecase we use an existing etcd which is shared with cilium-agent.
An error below has been occurring when a new CEW resource was already applied
into kubernetes and cilium-agent was already started on related external
workload machine while clustermesh-apiserver was not deployed yet.

$ ./clustermesh-apiserver --cluster-id=12 --cluster-name=uacl-test --k8s-kubeconfig-path ../../uacl/uacl-test.kubeconfig --kvstore-opt etcd.config=../../uacl/etcd.config
level=info msg="Started gops server" address="127.0.0.1:9892" subsys=clustermesh-apiserver
level=info msg="Starting clustermesh-apiserver..." cluster-id=12 cluster-name=uacl-test subsys=clustermesh-apiserver
level=info msg="Establishing connection to apiserver" host="https://uacl-test-api.test:31243" subsys=k8s
level=info msg="Connected to apiserver" subsys=k8s
level=info msg="Waiting until all Cilium CRDs are available" subsys=k8s
level=info msg="All Cilium CRDs have been found and are available" subsys=k8s
level=info msg="Initializing identity allocator" subsys=identity-cache
level=info msg="Creating etcd client" ConfigPath=../../uacl/etcd.config KeepAliveHeartbeat=15s KeepAliveTimeout=25s RateLimit=20 subsys=kvstore
level=info msg="Started health API" subsys=clustermesh-apiserver
level=info msg="Connecting to etcd server..." config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" subsys=kvstore
level=info msg="Got lease ID 320f7d7b1f23bc32" subsys=kvstore
level=info msg="Got lock lease ID 320f7d7b1f23bc34" subsys=kvstore
level=info msg="Initial etcd session established" config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" subsys=kvstore
level=info msg="Successfully verified version of etcd endpoint" config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" etcdEndpoint="https://uacl-test-api.test:30108" subsys=kvstore version=3.4.16
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1919aa3]

goroutine 213 [running]:
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).keyPath(0x0, {0x7f349c122c38, 0xc000a7e240})
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:276 +0x43
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).syncLocalKey(0x0, {0x2204f70, 0xc000050138}, {0x22051d8, 0xc000a7e240})
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:288 +0x87
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).UpdateKeySync(...)
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:375
main.(*VMManager).OnUpdate(0xc0005f6080, {0x21e9e10, 0xc000a7e0c0})
/home/abocim/go/src/github.com/cilium/cilium/clustermesh-apiserver/vmmanager.go:183 +0x3f4
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).onUpdate(...)
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:233
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).updateKey(0xc000128780, {0xc00025e85d, 0x15}, {0xc0000f6a00, 0xf3, 0x100})
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:414 +0x102
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).watcher(0xc000128780, 0xc000270a20)
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:482 +0x73c
created by github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).listAndStartWatcher
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:447 +0x89

Signed-off-by: Adam Bocim adam.bocim@seznam.cz

@abocim abocim requested a review from a team as a code owner February 28, 2022 07:27
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 28, 2022
@kaworu kaworu added area/clustermesh Relates to multi-cluster routing functionality in Cilium. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Feb 28, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 28, 2022
@kaworu kaworu requested review from a team and ldelossa and removed request for a team February 28, 2022 08:26
@kaworu
Copy link
Member

kaworu commented Feb 28, 2022

Thanks for the PR @abocim! Added @cilium/sig-clustermesh to review this PR as I lack context.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.3 Feb 28, 2022
@aanm aanm requested review from a team and tklauser and removed request for a team and ldelossa February 28, 2022 15:19
@abocim abocim force-pushed the ab-clustermesh-fixnilpointer branch from ee81dcc to f95e539 Compare March 4, 2022 11:55
@abocim abocim requested a review from a team as a code owner March 4, 2022 11:55
@abocim abocim force-pushed the ab-clustermesh-fixnilpointer branch from f95e539 to 3023735 Compare March 10, 2022 11:41
@aanm aanm added this to Needs backport from master in 1.11.4 Mar 26, 2022
@aanm aanm removed this from Needs backport from master in 1.11.3 Mar 26, 2022
@github-actions
Copy link

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Apr 10, 2022
@pchaigno pchaigno requested a review from aanm April 10, 2022 16:51
@pchaigno pchaigno removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Apr 10, 2022
@joestringer joestringer added this to Needs backport from master in 1.11.5 Apr 15, 2022
@joestringer joestringer removed this from Needs backport from master in 1.11.4 Apr 15, 2022
@jrajahalme
Copy link
Member

/test

@pchaigno
Copy link
Member

Ping @aanm 👋

@abocim You'll need to rebase the PR so we can trigger the end-to-end tests. Sorry for the trouble.

@pchaigno pchaigno added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Apr 19, 2022
@abocim abocim force-pushed the ab-clustermesh-fixnilpointer branch from 3023735 to 9c1d8fa Compare April 19, 2022 13:19
@abocim
Copy link
Contributor Author

abocim commented Apr 19, 2022

@abocim You'll need to rebase the PR so we can trigger the end-to-end tests. Sorry for the trouble.

Rebase done.

@jrajahalme
Copy link
Member

/ci-external-workloads

1 similar comment
@jrajahalme
Copy link
Member

/ci-external-workloads

@jrajahalme
Copy link
Member

/test

@abocim
Copy link
Contributor Author

abocim commented Apr 29, 2022

Hello, @jrajahalme, do you think the External workloads test has failed because of my PR? I think server can't find clustermesh-apiserver.kube-system.svc.cluster.local: NXDOMAIN error is related to some DNS problem, not to these PR changes. Maybe I should rebase once more.

@aanm
Copy link
Member

aanm commented Apr 29, 2022

@abocim can you rebase again against master? Sorry for keep asking for this but the CI is currently failing because it's not rebased :-)

... while clustermesh-apiserver is starting and new CEW with identity 0
is existing in etcd

In original implementation clustermesh-apiserver and etcd are running
in separate containers within one pod. So a new empty etcd is created while
clustermesh-apiserver is starting.

In our usecase we use an existing etcd which is shared with cilium-agent.
An error below has been occurring when a new CEW resource was already applied
into kubernetes and cilium-agent was already started on related external
workload machine while clustermesh-apiserver was not deployed yet.

$ ./clustermesh-apiserver --cluster-id=12 --cluster-name=uacl-test --k8s-kubeconfig-path ../../uacl/uacl-test.kubeconfig --kvstore-opt etcd.config=../../uacl/etcd.config
level=info msg="Started gops server" address="127.0.0.1:9892" subsys=clustermesh-apiserver
level=info msg="Starting clustermesh-apiserver..." cluster-id=12 cluster-name=uacl-test subsys=clustermesh-apiserver
level=info msg="Establishing connection to apiserver" host="https://uacl-test-api.test:31243" subsys=k8s
level=info msg="Connected to apiserver" subsys=k8s
level=info msg="Waiting until all Cilium CRDs are available" subsys=k8s
level=info msg="All Cilium CRDs have been found and are available" subsys=k8s
level=info msg="Initializing identity allocator" subsys=identity-cache
level=info msg="Creating etcd client" ConfigPath=../../uacl/etcd.config KeepAliveHeartbeat=15s KeepAliveTimeout=25s RateLimit=20 subsys=kvstore
level=info msg="Started health API" subsys=clustermesh-apiserver
level=info msg="Connecting to etcd server..." config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" subsys=kvstore
level=info msg="Got lease ID 320f7d7b1f23bc32" subsys=kvstore
level=info msg="Got lock lease ID 320f7d7b1f23bc34" subsys=kvstore
level=info msg="Initial etcd session established" config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" subsys=kvstore
level=info msg="Successfully verified version of etcd endpoint" config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" etcdEndpoint="https://uacl-test-api.test:30108" subsys=kvstore version=3.4.16
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1919aa3]

goroutine 213 [running]:
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).keyPath(0x0, {0x7f349c122c38, 0xc000a7e240})
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:276 +0x43
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).syncLocalKey(0x0, {0x2204f70, 0xc000050138}, {0x22051d8, 0xc000a7e240})
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:288 +0x87
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).UpdateKeySync(...)
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:375
main.(*VMManager).OnUpdate(0xc0005f6080, {0x21e9e10, 0xc000a7e0c0})
/home/abocim/go/src/github.com/cilium/cilium/clustermesh-apiserver/vmmanager.go:183 +0x3f4
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).onUpdate(...)
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:233
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).updateKey(0xc000128780, {0xc00025e85d, 0x15}, {0xc0000f6a00, 0xf3, 0x100})
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:414 +0x102
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).watcher(0xc000128780, 0xc000270a20)
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:482 +0x73c
created by github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).listAndStartWatcher
/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:447 +0x89

Signed-off-by: Adam Bocim <adam.bocim@seznam.cz>
@abocim abocim force-pushed the ab-clustermesh-fixnilpointer branch from 9c1d8fa to 5864ed6 Compare April 29, 2022 10:58
@aanm
Copy link
Member

aanm commented Apr 29, 2022

/test

Job 'Cilium-PR-K8s-GKE' hit: #17628 (95.15% similarity)

@jrajahalme
Copy link
Member

/ci-external-workloads

Copy link
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a classic case where a watcher calls into use of a returned value before the function starting the watcher returns. I also checked that the new syncKVStoreKey() properly does what UpdateKeySync() did before in this case.

External workload CI test also passed.

@jrajahalme
Copy link
Member

@aanm GKE flake is unrelated to this change.

@aanm
Copy link
Member

aanm commented May 3, 2022

Thank you very much for the PR @abocim!

@aanm aanm merged commit 90c0d62 into cilium:master May 3, 2022
@aanm aanm removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label May 3, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.5 May 9, 2022
@aanm aanm added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels May 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.11.5
Backport pending to v1.10
Development

Successfully merging this pull request may close these issues.

None yet

5 participants