Flatcar fails to boot on AWS m4.* instance types #665

Closed
jepio opened this issue Mar 10, 2022 · 10 comments
Labels
kind/bug Something isn't working platform/AWS

Comments

Member

jepio commented Mar 10, 2022

Description

Flatcar Stable 3033.2.0 boots, 3033.2.1 does not. The network interface never acquires a DHCP address, so fetching instance metadata fails. The boot log shows roughly this:

[   70.884544] ignition[535]: GET http://169.254.169.254/2009-04-04/user-data: attempt #18
[   70.888100] ignition[535]: GET error: Get "http://169.254.169.254/2009-04-04/user-data": dial tcp 169.254.169.254:80: connect: network is unreachable
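
For anyone triaging the same symptom from a saved console log, the following is a minimal sketch of how the failing metadata requests could be picked out. It assumes the log lines have the shape shown above; the function name `find_metadata_failures` and the regular expression are illustrative and not part of Ignition or any Flatcar tooling.

```python
import re

# Matches Ignition lines of the form shown in the boot log above, e.g.
# [   70.888100] ignition[535]: GET error: Get "...": dial tcp ...: connect: <reason>
GET_ERROR = re.compile(
    r'^\[\s*(?P<ts>\d+\.\d+)\]\s+ignition\[\d+\]: GET error: .*connect: (?P<reason>.+)$'
)


def find_metadata_failures(console_log: str) -> list[tuple[float, str]]:
    """Return (kernel timestamp, reason) for each Ignition GET error line."""
    failures = []
    for line in console_log.splitlines():
        match = GET_ERROR.match(line.strip())
        if match:
            failures.append((float(match.group("ts")), match.group("reason")))
    return failures


# The two lines quoted in this issue, used as sample input.
sample = """\
[   70.884544] ignition[535]: GET http://169.254.169.254/2009-04-04/user-data: attempt #18
[   70.888100] ignition[535]: GET error: Get "http://169.254.169.254/2009-04-04/user-data": dial tcp 169.254.169.254:80: connect: network is unreachable
"""

for timestamp, reason in find_metadata_failures(sample):
    print(f"{timestamp:.6f}s: {reason}")
```

A repeated "network is unreachable" reason points at the interface never coming up, as opposed to a timeout from a reachable but unresponsive metadata service.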


Member Author

jepio commented Mar 10, 2022

Releases with the 5.15 kernel fail differently: they can't find the root block device. The xen-blkfront module is missing from the initramfs, which would explain that. There is also no sign of a DHCP lease being acquired.

Member

pothos commented Mar 10, 2022

This is the full list of modules that were present in Stable but are not in Alpha: glue_helper.ko.xz md4.ko.xz aoe.ko.xz brd.ko.xz drbd.ko.xz nbd.ko.xz rbd.ko.xz xen-blkfront.ko.xz zram.ko.xz pps_core.ko.xz ptp.ko.xz nfs_ssc.ko.xz libarc4.ko.xz lru_cache.ko.xz zsmalloc.ko.xz
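
A list like this can be produced by comparing the module file names shipped in two images. The sketch below is only an illustration: how the two file listings are obtained is not covered here, `missing_modules` is a hypothetical helper, and the sample paths are made up for the example. It compares by file name only, so a module that merely moved between directories is not reported.

```python
from pathlib import PurePosixPath


def missing_modules(old_paths, new_paths):
    """Return module file names present in old_paths but absent from new_paths."""
    old_names = {PurePosixPath(p).name for p in old_paths}
    new_names = {PurePosixPath(p).name for p in new_paths}
    return sorted(old_names - new_names)


# Illustrative listings only; real ones would come from each release's image.
stable_listing = [
    "kernel/drivers/block/xen-blkfront.ko.xz",
    "kernel/drivers/block/nbd.ko.xz",
    "kernel/fs/ext4/ext4.ko.xz",
]
alpha_listing = [
    "kernel/fs/ext4/ext4.ko.xz",
]

for name in missing_modules(stable_listing, alpha_listing):
    print(name)
```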

Member Author

jepio commented Mar 11, 2022

I've isolated this to a kernel change, here's the bisect log so far:

$ git bisect log
git bisect start
# good: [4e8c680af6d51ba9315e31bd4f7599e080561a2d] Linux 5.15.7
git bisect good 4e8c680af6d51ba9315e31bd4f7599e080561a2d
# bad: [efe3167e52a5833ec20ee6214be9b99b378564a8] Linux 5.15.27
git bisect bad efe3167e52a5833ec20ee6214be9b99b378564a8
# bad: [63dcc388662c3562de94d69bfa771ae4cd29b79f] Linux 5.15.16
git bisect bad 63dcc388662c3562de94d69bfa771ae4cd29b79f
# good: [57dcae4a8b93271c4e370920ea0dbb94a0215d30] Linux 5.15.10
git bisect good 57dcae4a8b93271c4e370920ea0dbb94a0215d30
# bad: [25960cafa06e6fcd830e6c792e6a7de68c1e25ed] Linux 5.15.12
git bisect bad 25960cafa06e6fcd830e6c792e6a7de68c1e25ed
# bad: [fb6ad5cb3b6745e7bffc5fe19b130f3594375634] Linux 5.15.11
git bisect bad fb6ad5cb3b6745e7bffc5fe19b130f3594375634

So something between 5.15.10 and 5.15.11 is responsible. I'm testing on the 5.15 kernel because I expect this to reveal what is wrong in Flatcar 3033.2.1.

Member Author

jepio commented Mar 14, 2022

Full bisect points towards d8888cdabedf353ab9b5a6af75f70bf341a3e7df (or torvalds/linux@83dbf89 upstream):

$ git bisect log
git bisect start
# good: [4e8c680af6d51ba9315e31bd4f7599e080561a2d] Linux 5.15.7
git bisect good 4e8c680af6d51ba9315e31bd4f7599e080561a2d
# bad: [efe3167e52a5833ec20ee6214be9b99b378564a8] Linux 5.15.27
git bisect bad efe3167e52a5833ec20ee6214be9b99b378564a8
# bad: [63dcc388662c3562de94d69bfa771ae4cd29b79f] Linux 5.15.16
git bisect bad 63dcc388662c3562de94d69bfa771ae4cd29b79f
# good: [57dcae4a8b93271c4e370920ea0dbb94a0215d30] Linux 5.15.10
git bisect good 57dcae4a8b93271c4e370920ea0dbb94a0215d30
# bad: [25960cafa06e6fcd830e6c792e6a7de68c1e25ed] Linux 5.15.12
git bisect bad 25960cafa06e6fcd830e6c792e6a7de68c1e25ed
# bad: [fb6ad5cb3b6745e7bffc5fe19b130f3594375634] Linux 5.15.11
git bisect bad fb6ad5cb3b6745e7bffc5fe19b130f3594375634
# good: [257b3bb16634fd936129fe2f57a91594a75b8751] drm/amd/pm: fix a potential gpu_metrics_table memory leak
git bisect good 257b3bb16634fd936129fe2f57a91594a75b8751
# bad: [bbdaa7a48f465a2ee76d65839caeda08af1ef3b2] btrfs: fix double free of anon_dev after failure to create subvolume
git bisect bad bbdaa7a48f465a2ee76d65839caeda08af1ef3b2
# good: [c8e8e6f4108e4c133b09f31f6cc7557ee6df3bb6] bpf, selftests: Fix racing issue in btf_skc_cls_ingress test
git bisect good c8e8e6f4108e4c133b09f31f6cc7557ee6df3bb6
# bad: [5cb5c3e1b184da9f49e46119a0e506519fc58185] usb: xhci: Extend support for runtime power management for AMD's Yellow carp.
git bisect bad 5cb5c3e1b184da9f49e46119a0e506519fc58185
# good: [e7a8a261bab07ec1ed5f5bb990aacc4de9c08eb4] tty: n_hdlc: make n_hdlc_tty_wakeup() asynchronous
git bisect good e7a8a261bab07ec1ed5f5bb990aacc4de9c08eb4
# good: [4df1af29930b03d61fb774bfaa5100dbdb964628] PCI/MSI: Clear PCI_MSIX_FLAGS_MASKALL on error
git bisect good 4df1af29930b03d61fb774bfaa5100dbdb964628
# bad: [d8888cdabedf353ab9b5a6af75f70bf341a3e7df] PCI/MSI: Mask MSI-X vectors only on success
git bisect bad d8888cdabedf353ab9b5a6af75f70bf341a3e7df
# first bad commit: [d8888cdabedf353ab9b5a6af75f70bf341a3e7df] PCI/MSI: Mask MSI-X vectors only on success
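
For readers who want to check a log like the one above mechanically, here is a small sketch that summarizes the comment lines `git bisect log` writes (`# good:`, `# bad:`, `# first bad commit:`). The helper name `summarize_bisect` is hypothetical, and the parser only assumes the line shapes visible in this log.

```python
import re

# Comment lines in the bisect log look like:
#   # good: [<40-hex sha>] <subject>
#   # bad: [<40-hex sha>] <subject>
#   # first bad commit: [<40-hex sha>] <subject>
MARK = re.compile(
    r'^# (?P<verdict>good|bad|first bad commit): '
    r'\[(?P<sha>[0-9a-f]{40})\] (?P<subject>.*)$'
)


def summarize_bisect(log_text):
    """Collect good/bad verdicts and the first bad commit from a bisect log."""
    summary = {"good": [], "bad": [], "first_bad": None}
    for line in log_text.splitlines():
        match = MARK.match(line.strip())
        if not match:
            continue  # skip the replayable "git bisect ..." command lines
        entry = (match.group("sha"), match.group("subject"))
        if match.group("verdict") == "first bad commit":
            summary["first_bad"] = entry
        else:
            summary[match.group("verdict")].append(entry)
    return summary


# The last lines of the log above, used as sample input.
sample_log = """\
# good: [4df1af29930b03d61fb774bfaa5100dbdb964628] PCI/MSI: Clear PCI_MSIX_FLAGS_MASKALL on error
git bisect good 4df1af29930b03d61fb774bfaa5100dbdb964628
# bad: [d8888cdabedf353ab9b5a6af75f70bf341a3e7df] PCI/MSI: Mask MSI-X vectors only on success
git bisect bad d8888cdabedf353ab9b5a6af75f70bf341a3e7df
# first bad commit: [d8888cdabedf353ab9b5a6af75f70bf341a3e7df] PCI/MSI: Mask MSI-X vectors only on success
"""

result = summarize_bisect(sample_log)
if result["first_bad"]:
    sha, subject = result["first_bad"]
    print(f"first bad commit: {sha[:12]} {subject}")
```

On a partial log such as the earlier one, `first_bad` stays `None`, and the last entries in the `good` and `bad` lists give the range still to be searched.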


wincus commented Mar 14, 2022

In case it helps, I have also observed this behavior on c4 instances, while t2 instances seem to be unaffected by this issue (they boot normally).

Member Author

jepio commented Mar 14, 2022

@wincus c4 instances would have the same issue because they use the same Intel VF for enhanced networking. t2 instances are unaffected because they use Xen networking.

I've already confirmed that reverting that commit fixes the issue in 5.15- and 5.10-based Flatcar releases, so we will be reverting it in all channels while we work upstream to figure out what the actual bug is.

Member Author

jepio commented Mar 14, 2022

At least one other report confirms the bisect: https://lore.kernel.org/lkml/Ydh5OCudJKz5Y7jc@arighi-desktop/.

Member Author

jepio commented Mar 16, 2022

The fixes are present in all branches (3033/3139/3165/main), and will be part of all releases that we do next week.

jepio closed this as completed Mar 16, 2022

wincus commented Mar 17, 2022

wow amazing work! thanks @jepio
