segfault after node rebooted #914

Closed
kgtw opened this issue Apr 17, 2020 · 6 comments
Labels: bug, priority/P0 (Highest priority. Someone needs to actively work on this.)

@kgtw
Contributor

kgtw commented Apr 17, 2020

We use kured to reboot nodes for security patches. After a node was rebooted, the CNI pod consistently failed to start due to a segfault.

Kubernetes: v1.16.7
CNI Plugin: 0.7.5-1
AWS CNI: 1.5.5

====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
2020-04-17T10:55:47.126Z [INFO]	Starting L-IPAMD v1.5.5  ...
2020-04-17T10:55:47.161Z [INFO]	Testing communication with server
2020-04-17T10:55:47.162Z [INFO]	Running with Kubernetes cluster version: v1.16. git version: v1.16.7. git tree state: clean. commit: be3d344ed06bff7a4fc60656200a93c74f31f9a4. platform: linux/amd64
2020-04-17T10:55:47.162Z [INFO]	Communication with server successful
2020-04-17T10:55:47.162Z [INFO]	Starting Pod controller
2020-04-17T10:55:47.162Z [INFO]	Waiting for controller cache sync
2020-04-17T10:55:47.163Z [DEBUG]	Discovered region: eu-central-1
2020-04-17T10:55:47.164Z [DEBUG]	Found availability zone: eu-central-1a
2020-04-17T10:55:47.164Z [DEBUG]	Discovered the instance primary ip address: 172.21.82.254
2020-04-17T10:55:47.165Z [DEBUG]	Found instance-id: i-057d745cb8e27e423
2020-04-17T10:55:47.165Z [DEBUG]	Found instance-type: r5.2xlarge
2020-04-17T10:55:47.166Z [DEBUG]	Found primary interface's MAC address: 02:f5:dc:34:cd:30
2020-04-17T10:55:47.166Z [DEBUG]	Discovered 2 interfaces.
2020-04-17T10:55:47.167Z [DEBUG]	Found device-number: 1
2020-04-17T10:55:47.167Z [DEBUG]	Found account ID: 737873494095
2020-04-17T10:55:47.168Z [DEBUG]	Found eni: eni-0208359d78f41cd6f
2020-04-17T10:55:47.168Z [DEBUG]	Found device-number: 0
2020-04-17T10:55:47.169Z [DEBUG]	Found eni: eni-0b8b4a6f654608f8c
2020-04-17T10:55:47.169Z [DEBUG]	Found ENI eni-0b8b4a6f654608f8c is a primary ENI
2020-04-17T10:55:47.169Z [DEBUG]	Found security-group id: sg-04a968b307e8c8f8b
2020-04-17T10:55:47.170Z [DEBUG]	Found subnet-id: subnet-0c1dafc5ccb35c016
2020-04-17T10:55:47.170Z [DEBUG]	Found vpc-ipv4-cidr-block: 172.21.80.0/20
2020-04-17T10:55:47.171Z [DEBUG]	Found VPC CIDR: 172.21.80.0/20
2020-04-17T10:55:47.171Z [DEBUG]	Using WARM_IP_TARGET 10
2020-04-17T10:55:47.171Z [DEBUG]	Start node init
2020-04-17T10:55:47.171Z [DEBUG]	Total number of interfaces found: 2
2020-04-17T10:55:47.171Z [DEBUG]	Found ENI mac address : 02:13:9c:af:63:f4
2020-04-17T10:55:47.172Z [DEBUG]	Found ENI: eni-0208359d78f41cd6f, MAC 02:13:9c:af:63:f4, device 2
2020-04-17T10:55:47.173Z [DEBUG]	Found CIDR 172.21.80.0/22 for ENI 02:13:9c:af:63:f4
2020-04-17T10:55:47.173Z [DEBUG]	Found IP addresses [172.21.83.156] on ENI 02:13:9c:af:63:f4
2020-04-17T10:55:47.173Z [DEBUG]	Found ENI mac address : 02:f5:dc:34:cd:30
2020-04-17T10:55:47.174Z [DEBUG]	Using device number 0 for primary eni: eni-0b8b4a6f654608f8c
2020-04-17T10:55:47.174Z [DEBUG]	Found ENI: eni-0b8b4a6f654608f8c, MAC 02:f5:dc:34:cd:30, device 0
2020-04-17T10:55:47.174Z [DEBUG]	Found CIDR 172.21.80.0/22 for ENI 02:f5:dc:34:cd:30
2020-04-17T10:55:47.175Z [DEBUG]	Found IP addresses [172.21.82.254 172.21.81.119 172.21.83.51 172.21.83.19 172.21.82.178 172.21.81.221 172.21.80.29 172.21.80.158 172.21.82.94 172.21.83.197 172.21.80.228 172.21.83.38 172.21.81.102 172.21.83.40 172.21.83.75] on ENI 02:f5:dc:34:cd:30
2020-04-17T10:55:47.175Z [INFO]	Setting up host network...
2020-04-17T10:55:47.175Z [DEBUG]	Trying to find primary interface that has mac : 02:f5:dc:34:cd:30
2020-04-17T10:55:47.175Z [DEBUG]	Discovered interface: lo, mac:
2020-04-17T10:55:47.175Z [DEBUG]	Discovered interface: ens5, mac: 02:f5:dc:34:cd:30
2020-04-17T10:55:47.175Z [INFO]	Discovered primary interface: ens5
2020-04-17T10:55:47.175Z [DEBUG]	Setting RPF for primary interface: /proc/sys/net/ipv4/conf/ens5/rp_filter
2020-04-17T10:55:47.176Z [DEBUG]	Setup Host Network: iptables -N AWS-SNAT-CHAIN-0 -t nat
2020-04-17T10:55:47.178Z [DEBUG]	Setup Host Network: iptables -N AWS-SNAT-CHAIN-1 -t nat
2020-04-17T10:55:47.180Z [DEBUG]	Setup Host Network: iptables -A POSTROUTING -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-0
2020-04-17T10:55:47.180Z [DEBUG]	Setup Host Network: iptables -A AWS-SNAT-CHAIN-0 ! -d 172.21.80.0/20 -t nat -j AWS-SNAT-CHAIN-1
2020-04-17T10:55:47.180Z [DEBUG]	iptableRules: [nat/POSTROUTING rule first SNAT rules for non-VPC outbound traffic nat/AWS-SNAT-CHAIN-0 rule [0] AWS-SNAT-CHAIN nat/AWS-SNAT-CHAIN-1 rule last SNAT rule for non-VPC outbound traffic]
2020-04-17T10:55:47.180Z [DEBUG]	execute iptable rule : first SNAT rules for non-VPC outbound traffic
2020-04-17T10:55:47.182Z [DEBUG]	execute iptable rule : [0] AWS-SNAT-CHAIN
2020-04-17T10:55:47.184Z [DEBUG]	execute iptable rule : last SNAT rule for non-VPC outbound traffic
2020-04-17T10:55:47.186Z [DEBUG]	execute iptable rule : connmark for primary ENI
2020-04-17T10:55:47.187Z [DEBUG]	execute iptable rule : connmark restore for primary ENI
2020-04-17T10:55:47.188Z [DEBUG]	execute iptable rule : rule for primary address 172.21.82.254
2020-04-17T10:55:47.191Z [DEBUG]	Discovered ENI eni-0208359d78f41cd6f, trying to set it up
2020-04-17T10:55:47.262Z [INFO]	Synced successfully with APIServer
2020-04-17T10:55:47.262Z [INFO]	Add/Update for Pod node-problem-detector-6f6q9 on my node, namespace = node-problem-detector, IP =
2020-04-17T10:55:47.262Z [INFO]	Add/Update for Pod spire-agent-hjt5z on my node, namespace = spire, IP =
2020-04-17T10:55:47.262Z [INFO]	Add/Update for Pod fluentbit-2fgzs on my node, namespace = kube-logs, IP =
2020-04-17T10:55:47.262Z [INFO]	Add/Update for Pod kured-v9msv on my node, namespace = kured, IP =
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1407fb2]

goroutine 1 [running]:
github.com/aws/amazon-vpc-cni-k8s/pkg/awsutils.(*EC2InstanceMetadataCache).DescribeENI(0xc00023bb00, 0xc000156600, 0x15, 0x1c763c0, 0x1be056c, 0x1be056c, 0x75, 0xc00062bb88, 0x42bced)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/awsutils/awsutils.go:820 +0x332
github.com/aws/amazon-vpc-cni-k8s/ipamd.(*IPAMContext).getENIaddresses(0xc00065c000, 0xc000156600, 0x15, 0xc000058070, 0xc000058000, 0xc00041c000, 0x1be056c, 0x75, 0xc00062bbd0, 0x42bf11)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/ipamd/ipamd.go:754 +0x51
github.com/aws/amazon-vpc-cni-k8s/ipamd.(*IPAMContext).setupENI(0xc00065c000, 0xc000156600, 0x15, 0xc000156600, 0x15, 0xc00056c090, 0x11, 0x2, 0xc00066e310, 0xe, ...)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/ipamd/ipamd.go:709 +0x5a
github.com/aws/amazon-vpc-cni-k8s/ipamd.(*IPAMContext).nodeInit(0xc00065c000, 0x0, 0x0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/ipamd/ipamd.go:301 +0x945
github.com/aws/amazon-vpc-cni-k8s/ipamd.New(0x1c27220, 0xc000500150, 0xc00058a300, 0x2e, 0x1, 0x0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/ipamd/ipamd.go:242 +0x242
main._main(0x0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/main.go:61 +0x324
main.main()
	/go/src/github.com/aws/amazon-vpc-cni-k8s/main.go:38 +0x22
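
For readers unfamiliar with this class of Go panic, the following is a minimal sketch that reproduces the same runtime error message. The types `attachment` and `networkInterface` and both functions are hypothetical stand-ins invented for illustration; they are not the plugin's code or the AWS SDK's types, and the attachment scenario is an assumption rather than a confirmed root cause.

```go
package main

import "fmt"

// Hypothetical stand-in types; only the fields needed for the
// illustration are included.
type attachment struct {
	AttachmentId *string
}

type networkInterface struct {
	Attachment *attachment
}

// attachmentID dereferences the nested pointers without checking them,
// so a nil Attachment or nil AttachmentId causes a runtime panic.
func attachmentID(eni *networkInterface) string {
	return *eni.Attachment.AttachmentId
}

// tryAttachmentID converts the runtime panic into an error so the
// message can be inspected instead of crashing the process.
func tryAttachmentID(eni *networkInterface) (id string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return attachmentID(eni), nil
}

func main() {
	// An interface value with no attachment record (assumed scenario).
	_, err := tryAttachmentID(&networkInterface{})
	fmt.Println(err)
}
```

Recovering from the panic is used here only to surface the message; it is not a substitute for checking the pointers before dereferencing them.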
@jaypipes
Contributor

Hmm, interesting. The line referenced is this one:

https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.5.5/pkg/awsutils/awsutils.go#L820

It looks like we are assuming that a) result is a pointer to a properly constructed ec2.DescribeNetworkInterfacesOutput struct, b) result.NetworkInterfaces has at least one element, and c) result.NetworkInterfaces[0] has a non-nil Attachment field that has a non-nil AttachmentId field.

All of those assumptions should be asserted in the code. I'll push up a patch.

Thanks for the bug report @kgtw! :)
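
The three assumptions listed above could be checked explicitly along these lines. This is only a sketch: the struct definitions are simplified local stand-ins that mirror the relevant fields of the AWS SDK response, `firstAttachmentID` is a hypothetical helper name, and the actual patch is not reproduced in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the SDK response types (assumption: only the
// fields relevant to the crash are modeled here).
type NetworkInterfaceAttachment struct {
	AttachmentId *string
}

type NetworkInterface struct {
	Attachment *NetworkInterfaceAttachment
}

type DescribeNetworkInterfacesOutput struct {
	NetworkInterfaces []*NetworkInterface
}

// firstAttachmentID (hypothetical helper) returns the attachment ID of the
// first interface in the result, or an error if any link in the chain is
// missing, instead of dereferencing a nil pointer.
func firstAttachmentID(result *DescribeNetworkInterfacesOutput) (string, error) {
	if result == nil {
		return "", errors.New("nil DescribeNetworkInterfaces result")
	}
	if len(result.NetworkInterfaces) == 0 || result.NetworkInterfaces[0] == nil {
		return "", errors.New("no network interfaces in result")
	}
	att := result.NetworkInterfaces[0].Attachment
	if att == nil {
		return "", errors.New("network interface has no attachment")
	}
	if att.AttachmentId == nil {
		return "", errors.New("attachment has no attachment ID")
	}
	return *att.AttachmentId, nil
}

func main() {
	// An empty result now produces an error rather than a panic.
	_, err := firstAttachmentID(&DescribeNetworkInterfacesOutput{})
	fmt.Println("empty result:", err)

	// A fully populated result returns the ID (the value is made up).
	id := "attach-example"
	full := &DescribeNetworkInterfacesOutput{
		NetworkInterfaces: []*NetworkInterface{
			&NetworkInterface{Attachment: &NetworkInterfaceAttachment{AttachmentId: &id}},
		},
	}
	got, err := firstAttachmentID(full)
	fmt.Println("full result:", got, err)
}
```

Returning an error lets the caller decide whether to retry, skip the ENI, or fail startup, which is a design choice the thread does not settle.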

jaypipes self-assigned this Apr 17, 2020
jaypipes added the bug and priority/P0 (Highest priority. Someone needs to actively work on this.) labels Apr 17, 2020
jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Apr 17, 2020
Practice good code safety in the `EC2MetadataCache.DescribeENI()` method
by not assuming that the `DescribeNetworkInterfacesOutput` struct's
`NetworkInterfaces` field is non-empty or that the first
`NetworkInterface` struct in that collection has a non-nil `Attachment`
field.

Fixes Issue aws#914
jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Apr 17, 2020
Practice good code safety in the `EC2MetadataCache.getENIAttachmentID()`
method by not assuming that the `DescribeNetworkInterfacesOutput`
struct's `NetworkInterfaces` field is non-empty or that the first
`NetworkInterface` struct in that collection has a non-nil `Attachment`
field.

Fixes Issue aws#914. Note, however, that with aws#909 the source code
changed dramatically, and this patch will need to be written differently
for v1.5.x branches.
mogren pushed a commit that referenced this issue Apr 17, 2020
@mogren
Contributor

mogren commented Apr 17, 2020

Fix merged

mogren closed this as completed Apr 17, 2020
@jaypipes
Contributor

Note that the fix was merged to master (1.6 release series) and would need to be backported to the 1.5 release branch...

@kgtw
Contributor Author

kgtw commented Apr 18, 2020

Thanks for the quick response @jaypipes !

mogren pushed a commit to mogren/amazon-vpc-cni-k8s that referenced this issue Apr 20, 2020
mogren pushed a commit that referenced this issue Apr 20, 2020
@kgtw
Contributor Author

kgtw commented May 25, 2020

@jaypipes any chance of getting this backported to the 1.5 branch?

@jaypipes
Contributor

@kgtw apologies, I was on PTO. Will try to backport this into the 1.5 branch, sure thing!

jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue May 28, 2020
This is a manual backport of the changes in
5dfc31c since there were so many
changes in the awsutils.go file since 1.5.7 that trying to resolve
conflicts was an exercise in futility.

Issue aws#914
mogren pushed a commit that referenced this issue May 28, 2020