remove console=ttyS0 on metal #567

Closed
cgwalters opened this issue Jul 10, 2020 · 63 comments

@cgwalters
Member

Moving this from https://bugzilla.redhat.com/show_bug.cgi?id=1839923

I think we should likely remove the console=ttyS0 we're injecting on bare metal by default. On real hardware we don't expect anything connected to the serial ports by default, and attempting to write to them can greatly slow down the boot. A web search for "linux console serial slow" turns up things like https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763601;msg=17

Today for the Live ISO we are already doing this, but the bare metal image still enables it by default. One can work around this by using the new installer options to remove kargs at least.

Anyone who wants to turn it on can of course.

@dustymabe
Member

On real hardware we don't expect anything connected to the serial ports by default

I'm not sure that's true. In every enterprise environment I've worked in there were always either dedicated console servers attached to serial ports OR out of band management (aka lights out management) provided by the hardware that gives you access to the serial console. If your server crashes or becomes unresponsive the output of the serial console is one of the last chances you have to find some clues.

I think I'd much prefer to document in our FAQ that this problem can exist and show people how to change it if they are having trouble.

@dustymabe
Member

I think I'd much prefer to document in our FAQ that this problem can exist and show people how to change it if they are having trouble.

Though I'm definitely interested in other thoughts and opinions here. Should we bring it up during the next community meeting?

@jlebon
Member

jlebon commented Jul 10, 2020

Another approach which I've thought about a bit when working on #533 is to go the whole way and remove all the console kargs in the metal and ISO artifacts. The default behaviour of the kernel in that case is to simply find the first device that can act as a console. So e.g. if there's no VGA card, and ttyS1 is active, it'll automatically select that one. This also fixes the non-obvious behaviour that the first console argument wins "per type" (see #533 (comment)), which makes appending the console arg you want more awkward than it should be (i.e. you also need to delete the baked-in karg).

@cgwalters
Member Author

I'm not sure that's true. In every enterprise environment I've worked in there were always either dedicated console servers attached to serial ports OR out of band management (aka lights out management) provided by the hardware that gives you access to the serial console.

Right; I guess this boils down to: is the "default" having a LOM attached or not? One or the other case will need to configure things.

Another approach which I've thought about a bit when working on #533 is to go the whole way and remove all the console kargs in the metal and ISO artifacts. The default behaviour of the kernel in that case is to simply find the first device that can act as a console. So e.g. if there's no VGA card, and ttyS1 is active, it'll automatically select that one.

Hmm...how is "active" detected for serial consoles? It sounds like the desired behavior in a LOM scenario is that even if a VGA card is present, output still goes to the serial console too.

@jlebon
Member

jlebon commented Jul 10, 2020

It sounds like the desired behavior in a LOM scenario is that even if a VGA card is present, output still goes to the serial console too.

I'm not sure if that's possible in the generic case since different LOM systems could be using different port numbers, IIUC?

Hmm...how is "active" detected for serial consoles?

Here's what the docs say:

If no console device is specified, the first device found capable of acting as a system console will be used. At this time, the system first looks for a VGA card and then for a serial port. So if you don’t have a VGA card in your system the first serial port will automatically become the console.

So yeah, I guess it'd just be COM1, which wouldn't help if LOM is in COM2.

Hmm, was testing the fallback behaviour now with cosa run -c -- -vga none and removing all the console args, but it doesn't seem to work.

@dustymabe
Member

dustymabe commented Jul 10, 2020

I'm not sure that's true. In every enterprise environment I've worked in there were always either dedicated console servers attached to serial ports OR out of band management (aka lights out management) provided by the hardware that gives you access to the serial console.

Right; I guess this boils down to: is the "default" having a LOM attached or not? One or the other case will need to configure things.

Yeah. It's definitely a question of sane defaults. To me we would need to answer questions like:

  • How much does the 'slow serial' problem affect actual users today?
  • What do other OS platforms do when configuring the serial console?
  • If we were to change our current settings would that be worse than what the current defaults are?
    • This gets into your question of the default of having a serial console attached or not.

So we have to dig into how many people expect and depend on the current default behavior of the serial console getting kernel message output versus how many people experience slow-booting systems because we default to outputting on the serial console. In which of these cases is it OK to document the current behavior and the workaround versus changing the default?

Another approach which I've thought about a bit when working on #533 is to go the whole way and remove all the console kargs in the metal and ISO artifacts. The default behaviour of the kernel in that case is to simply find the first device that can act as a console. So e.g. if there's no VGA card, and ttyS1 is active, it'll automatically select that one.

Hmm...how is "active" detected for serial consoles? It sounds like the desired behavior in a LOM scenario is that even if a VGA card is present, output still goes to the serial console too.

Yeah. I speak for only one person, but my preferred LOM scenario is to have serial output even if a VGA is present. I actually much prefer serial console to VGA in any cases where I'm in a non-graphical environment (copy/paste anyone?).

@dustymabe dustymabe added the meeting topics for meetings label Jul 10, 2020
@cgwalters
Member Author

One thing I just found out the hard way while playing with a FSBCOS install that failed in the initramfs is that on this Lenovo T590 laptop (that doesn't have a serial device), systemd will fail to start emergency mode because it's trying to access the non-existent serial console or something.

It works to mount the boot partition after running coreos-installer and remove all of the console args.

@jamescassell
Collaborator

One thing I just found out the hard way while playing with a FSBCOS install that failed in the initramfs is that on this Lenovo T590 laptop (that doesn't have a serial device), systemd will fail to start emergency mode because it's trying to access the non-existent serial console or something.

It works to mount the boot partition after running coreos-installer and remove all of the console args.

Sounds like we need to delete the entry if there's no serial port attached, to avoid such a problem...

@jlebon
Member

jlebon commented Jul 14, 2020

One thing I just found out the hard way while playing with a FSBCOS install that failed in the initramfs is that on this Lenovo T590 laptop (that doesn't have a serial device), systemd will fail to start emergency mode because it's trying to access the non-existent serial console or something.

Yeah, I was bit by this as well when installing FCOS on my local server. I think it's because console=ttyS0 is last and so that's what gets /dev/console and what systemd tries to use. See also the docs we have about this.
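To make the /dev/console point concrete, here is a minimal Python sketch. It is not kernel or systemd code; it only models the documented rule from the kernel's serial-console guide that the last console= argument is the device used when /dev/console is opened. The function names are made up for illustration, and the parser assumes a simple space-separated command line with no quoting.

```python
from typing import List, Optional


def console_kargs(cmdline: str) -> List[str]:
    """Return the value of every console= argument, in command-line order."""
    return [arg.split("=", 1)[1] for arg in cmdline.split() if arg.startswith("console=")]


def dev_console_device(cmdline: str) -> Optional[str]:
    """Return the device expected to back /dev/console, or None.

    Models the documented rule that the last console= argument wins.
    None means no console= was given and the kernel picks its own default.
    """
    consoles = console_kargs(cmdline)
    if not consoles:
        return None
    # Drop options such as ",115200n8" and keep only the device name.
    return consoles[-1].split(",", 1)[0]


if __name__ == "__main__":
    # The default kargs discussed in this thread: ttyS0 is last, so it wins.
    print(dev_console_device("console=tty0 console=ttyS0,115200n8"))
```

With the kargs in the opposite order, the same function returns tty0, which is why the ordering matters on machines with no serial port.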

Sounds like we need to delete the entry if there's no serial port attached, to avoid such a problem...

One thing I mentioned was the possibility of matching the state of console kargs of the install boot in the installed system. (So e.g. if the user booted with console=ttyS1, we match that; if there are no console= kargs at all, then we have none either.)

Clearly that doesn't cover all use cases since you could be installing differently than how you plan to run it, but it seems like a better default heuristic overall (that you of course should be able to override). OTOH, magic handling like this can also make things more confusing.
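As a rough illustration of that heuristic, the Python sketch below plans which console kargs an installer could delete and append so that the installed system mirrors the install boot. This is only a model of the idea, not something coreos-installer does; `plan_console_kargs` and its parameters are hypothetical names, and the command lines are assumed to be plain space-separated strings.

```python
from typing import List, Tuple


def _console_args(cmdline: str) -> List[str]:
    """Return every full console=... argument, in command-line order."""
    return [arg for arg in cmdline.split() if arg.startswith("console=")]


def plan_console_kargs(live_cmdline: str, image_cmdline: str) -> Tuple[List[str], List[str]]:
    """Plan (to_delete, to_append) so the installed system matches the live boot.

    Every baked-in console= karg is deleted and the live boot's console= kargs
    are appended in their original order, so the same console ends up last.
    """
    to_delete = _console_args(image_cmdline)
    to_append = _console_args(live_cmdline)
    return to_delete, to_append


if __name__ == "__main__":
    live = "BOOT_IMAGE=/vmlinuz console=ttyS1,115200n8"   # how the user booted the installer
    image = "console=tty0 console=ttyS0,115200n8"         # kargs baked into the image
    print(plan_console_kargs(live, image))
```

If the live boot had no console= kargs at all, the append list is empty and the installed system ends up with none either.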

@spamcop

spamcop commented Jul 14, 2020

One thing: you cannot assume console=ttyS0 is the correct serial console. It also uses the default value 9600n8 (= slow). Depending on the server, the correct value could be, for example, console=ttyS1 or console=ttyS0,115200, or there could be no serial console at all. It's better to leave it up to the user to specify that, or to forward what they already specified on first boot, as jlebon wrote.

https://www.kernel.org/doc/html/latest/admin-guide/serial-console.html

@dustymabe
Member

As far as default speed value goes, our current default is console=tty0 console=ttyS0,115200n8, so it's at least not 9600n8 (slow).
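For a rough sense of scale between the two speeds, here is a small Python calculation of wire time only. It assumes standard asynchronous framing (one start bit, the data bits, optional parity, one stop bit), so "n8" costs 10 bits per byte, and it assumes the UART is the only bottleneck. The function name and the 100 KiB log size are illustrative assumptions, not measurements from any system in this thread.

```python
def seconds_to_send(num_bytes: int, baud: int, data_bits: int = 8,
                    parity_bits: int = 0, stop_bits: int = 1) -> float:
    """Estimate the time to push num_bytes through a UART at the given baud rate."""
    bits_per_byte = 1 + data_bits + parity_bits + stop_bits  # 1 start bit
    return num_bytes * bits_per_byte / baud


if __name__ == "__main__":
    log_size = 100 * 1024  # hypothetical amount of boot output, in bytes
    for baud in (9600, 115200):
        print(f"{baud:>6} baud: {seconds_to_send(log_size, baud):.1f} s")
```

Under these assumptions 115200n8 is 12 times faster than 9600n8, but console writes that block on a port with nothing behind it can still cost time for other reasons this estimate does not cover.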

@bgilbert bgilbert added jira for syncing to jira and removed meeting topics for meetings labels Jul 22, 2020
@dustymabe
Member

We discussed this in the meeting today. While we didn't get unanimous agreement the general consensus was:

13:17:43     dustymabe | #agreed We'll drop all console= kargs on new bare-metal installs, and
                       | implement a platform-specific default on other platforms.  Upgrades of
                       | existing machines won't be affected.  We'll document how to configure
                       | alternate consoles.

The platform-specific defaults on other platforms play into #110, I believe.

@ohadlevy

I hit this issue where ttyS0 was locking up my system; ttyS1 worked just fine. Any chance we can drop it first and fix the per-platform defaults in a follow-up release? Thanks.

@jlebon
Member

jlebon commented Oct 28, 2020

I hit this issue where ttyS0 was locking up my system; ttyS1 worked just fine. Any chance we can drop it first and fix the per-platform defaults in a follow-up release? Thanks.

Yeah, we still need to action this. Note for the time being, you can modify the kargs at install time using coreos-installer install --append-karg/--delete-karg.
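The intended effect of those two options on the installed system's command line can be sketched in Python as follows. This is a model of the outcome, not the coreos-installer implementation; `rewrite_kargs` is a hypothetical helper, and it assumes a deletion only matches a whole argument exactly (so the baud-rate suffix has to be included).

```python
from typing import List


def rewrite_kargs(cmdline: str, delete: List[str], append: List[str]) -> str:
    """Drop every argument listed in delete, then add the append list at the end."""
    kept = [arg for arg in cmdline.split() if arg not in delete]
    return " ".join(kept + append)


if __name__ == "__main__":
    default = "console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal"
    # Swap the baked-in ttyS0 console for ttyS1.
    print(rewrite_kargs(default,
                        delete=["console=ttyS0,115200n8"],
                        append=["console=ttyS1,115200n8"]))
```

Because appended arguments land at the end, the new console= becomes the last one on the command line in this model.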

@cgwalters
Member Author

I hit this issue where ttyS0 was locking up my system,

Yes, this was also part of a big downstream OpenShift/RHCOS customer issue. One middle ground here might be to remove the console=ttyS0 karg after a successful installation (e.g. after Ignition has run?)

@bgilbert
Contributor

bgilbert commented Jul 1, 2021

One middle ground here might be to remove the console=ttyS0 karg after a successful installation (e.g. after Ignition has run?)

That doesn't really help the UX issue though. The main situation where people need to interact with the console is when Ignition has failed.

@cgwalters
Member Author

cgwalters commented Jul 1, 2021

That doesn't really help the UX issue though. The main situation where people need to interact with the console is when Ignition has failed.

Right, I said remove it after Ignition has completed successfully. Or am I misunderstanding you?

EDIT: to clarify e.g. in the OpenShift case this would be part of the MCO firstboot which runs only after Ignition has completed.

Or, conceptually with the new Ignition kargs bits, it could be an Ignition fragment with shouldNotExist: ["console=ttyS0"] or so...except, hmm I think we still want to have the console for the firstboot OS update, just not after.

@cgwalters
Member Author

A bunch of console links:

One thing that came up there is that still today the kernel doesn't distinguish well between "a driver hit an unexpected timeout" and "the machine is about to die".

Related to that, one recommendation in this customer case was "set kernel.printk to 4 4 1 7" to ignore debug/informational messages. This one needs analysis - what messages would we lose etc.?
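As a starting point for that analysis, here is a small Python sketch of what the first field of kernel.printk does. It relies on the documented kernel behaviour that the first value is console_loglevel and that only messages with a numerically lower level reach the console. The helper name is made up, and the sketch only covers console output; messages are still recorded in the kernel log buffer.

```python
from typing import List, Tuple

# Kernel log levels, indexed by their numeric value (0 = most severe).
LEVELS = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def split_by_console_loglevel(printk: str) -> Tuple[List[str], List[str]]:
    """Return (shown, suppressed) level names for a kernel.printk value.

    Only the first field (console_loglevel) decides what reaches the console:
    a message is printed when its level is strictly lower than that value.
    """
    console_loglevel = int(printk.split()[0])
    return LEVELS[:console_loglevel], LEVELS[console_loglevel:]


if __name__ == "__main__":
    shown, suppressed = split_by_console_loglevel("4 4 1 7")
    print("shown on console:", shown)
    print("suppressed:", suppressed)
```

With "4 4 1 7" the console would keep emerg through err and drop warning and notice as well as info and debug, so the setting filters more than just debug/informational output.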

@bgilbert
Contributor

bgilbert commented Jul 1, 2021

Right, I said remove it after Ignition has completed successfully. Or am I misunderstanding you?

Yeah, I'm arguing that doesn't go far enough. When users hit an Ignition failure and try to debug it, they discover that output is being sent to a port they don't use, don't know about, and may not even have on their machine. Per #567 (comment), we should just drop console=ttyS0 by default.

@cgwalters
Member Author

Per #567 (comment), we should just drop console=ttyS0 by default.

OK. For users who want to keep the serial (LOM users?) and are booting from the ISO, it seems to me they would need to use coreos-installer iso kargs modify --append=console=ttyS0,115200n8 before booting the ISO or so, right?

I find it really hard to understand the "blast radius" of this across the platforms. I guess the main concern here would be metal and probably vSphere.

@pamoedom

pamoedom commented Jul 2, 2021

My 2 cents:

In our case (Installer QE), due to our external provider configuration, we need to change the kernel arguments from "console=ttyS0,115200n8" to "console=ttyS1,115200n8" in order to properly redirect the console output. Having the default value in place prevents us from using a simple MachineConfig (MC) solution like the following:

$ cat << EOF > 99-openshift-machineconfig-master-kargs.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-openshift-machineconfig-master-kargs
spec:
  kernelArguments:
    - 'console=ttyS1,115200n8'
EOF

To work around this, we need to push via iPXE a custom Ignition hook that changes the kernel arguments in the following manner:

#!ipxe

kernel $kernel initrd=main console=tty0 console=ttyS1,115200n8 coreos.inst.install_dev=/dev/sda coreos.inst.ignition_url=http://<IP>:8000/rhcos/ignitions/$cluster/master.ign coreos.live.rootfs_url=$rootfs ignition.config.url=http://<IP>:8000/rhcos/ignitions/$cluster/master-console-hook.ign ignition.firstboot ignition.platform.id=metal
initrd --name main $initramfs
boot

OR, when using TPM encryption, we use live iPXE mode to boot the console hook in memory to avoid tampering with the storage:

#!ipxe

kernel $kernel initrd=main console=tty0 console=ttyS1,115200n8 coreos.live.rootfs_url=$rootfs ignition.config.url=http://<IP>:8000/rhcos/ignitions/$cluster/master-console-hook.ign ignition.firstboot ignition.platform.id=metal
initrd --name main $initramfs
boot

Here is an example of the console hook if needed:

# cat /var/www/html/rhcos/ignitions/$cluster/master-console-hook.ign
{"ignition":{"version":"3.1.0"},"systemd":{"units":[{"contents":"[Unit]\nDescription=Run after install\nAfter=coreos-installer.service\nBefore=coreos-installer.target\n\n[Service]\nType=oneshot\nExecStart=/usr/bin/coreos-installer install /dev/sda --delete-karg console=ttyS0,115200n8 --append-karg console=ttyS1,115200n8 --ignition-url http://<IP>:8000/rhcos/ignitions/$cluster/master.ign --insecure-ignition\n\n[Install]\nRequiredBy=coreos-installer.target\n","enabled":true,"name":"post-install-hook.service"}]}}
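Since the one-line JSON above is hard to read, here is a hedged Python sketch that builds an equivalent config with the systemd unit written out as a multi-line string. It only restates the hook shown above; the `<IP>` and `$cluster` placeholders are kept as in the original, and `console_hook_config` is a hypothetical helper name, not part of any tool.

```python
import json

# The systemd unit from the hook above, written out line by line.
UNIT = """[Unit]
Description=Run after install
After=coreos-installer.service
Before=coreos-installer.target

[Service]
Type=oneshot
ExecStart=/usr/bin/coreos-installer install /dev/sda --delete-karg console=ttyS0,115200n8 --append-karg console=ttyS1,115200n8 --ignition-url http://<IP>:8000/rhcos/ignitions/$cluster/master.ign --insecure-ignition

[Install]
RequiredBy=coreos-installer.target
"""


def console_hook_config(unit_contents: str = UNIT) -> dict:
    """Return the Ignition config as a dict, with the unit embedded as a string."""
    return {
        "ignition": {"version": "3.1.0"},
        "systemd": {
            "units": [
                {
                    "contents": unit_contents,
                    "enabled": True,
                    "name": "post-install-hook.service",
                }
            ]
        },
    }


if __name__ == "__main__":
    # Compact separators give a single-line file like the one shown above.
    print(json.dumps(console_hook_config(), separators=(",", ":")))
```

Keeping the unit as readable text and letting `json.dumps` handle the escaping avoids hand-editing the `\n` sequences when the kargs need to change.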

In summary, if we remove the default "console=ttyS0,115200n8", we can avoid using Ignition hooks and simply push master/worker MCs with the kernel argument addition, right? To me that sounds like a good plan.

Best Regards.

@EmmanuelKasper

EmmanuelKasper commented Jul 7, 2021

@pamoedom
could it be that the need for removing ttyS0 simply arises from systemd starting a getty only on the first serial console found?
according to http://0pointer.de/blog/projects/serial-console.html

However, sometimes there's the need to manually configure a serial getty, for example, if more than one serial login prompt is needed or the kernel console should be redirected to a different terminal than the login prompt.

So maybe you could simply add a MachineConfig to create a getty for ttyS1, and you would get a serial console (maybe not owning /dev/console) on ttyS1? I agree it would not solve the slow boot problem.

@EmmanuelKasper

OK please ignore my comment above: according to https://www.kernel.org/doc/Documentation/admin-guide/serial-console.rst

Note that you can only define one console per device type (serial, video).

@EmmanuelKasper

Another use case where the presence of ttyS0 is problematic, is when you hook a serial console over IPMI, which defaults to ttyS1 on many systems.
As only one serial console is possible at a time (see link above), you need to modify the kargs manually on boot, or via a custom Ignition config, which, although documented, is not trivial if you're not a CoreOS specialist.

@jlebon
Member

jlebon commented Jul 9, 2021

Re-read the meeting notes for when this was discussed. It seems like there's a lot of hesitation because the proportion of systems relying on this versus those hindered by it isn't clear.

Should we have someone/a few folks just look at what the most prevalent setups are? E.g. let's look at Dell, HP, Intel, Cisco, etc... and see what their LOM systems expect.

As only one serial console is possible at a time (see link above), you need to modify the kargs manually on boot

To clarify, I guess this is something you probably have to do for most OSes if you're expecting ttyS1? (The adding ttyS1, not the removing ttyS0 part.)

@bgilbert
Contributor

Hmm. It seems unsound to assume that the first serial port isn't attached to some non-console device, and that the only possible consoles are VGA and the first serial port.

Also, we've had reports of systems that boot slowly, or fail to boot, if Linux serial console is enabled. I think those particular cases were fixed solely by dropping console=ttyS0, but I'm not confident that GRUB serial console will never cause similar issues.

@ktdreyer

This comment was marked as off-topic.

@bgilbert bgilbert added the meeting topics for meetings label Jun 14, 2022
@bgilbert
Contributor

Let's keep this ticket focused on Fedora CoreOS, please. Support questions for RHCOS, SNO, and the Assisted Installer are best pursued via other channels.

@bgilbert bgilbert removed the meeting topics for meetings label Jun 14, 2022
@bgilbert
Contributor

@jlebon convinced me that coreos-installer sugar, including for iso customize, makes more sense than Butane sugar here. I'm going to try writing up a design doc.

@newkit

newkit commented Jun 30, 2022

I am seeing these long boot times related to the console settings on Lenovo x3850 X6 servers. It would be great to see this in a production release soon, since the workaround of removing the console settings from the command line is tedious.

@bgilbert
Contributor

bgilbert commented Sep 8, 2022

No design doc, but coreos/coreos-installer#977 is a draft PR adding coreos-installer install --console and corresponding iso/pxe customize options. These allow specifying one or more consoles to be used by both the bootloader (GRUB) and the kernel.

@bgilbert
Contributor

coreos-installer 0.16.0 includes support for install --console and iso/pxe customize --dest-console.

@bgilbert
Contributor

The change has been announced to coreos-status. Please see the announcement message for details. The transition schedule is:

  • next: week of October 3
  • testing: week of November 28
  • stable: week of December 12

@bgilbert bgilbert self-assigned this Sep 17, 2022
@bgilbert bgilbert added the status/pending-next-release Fixed upstream. Waiting on a next release. label Sep 20, 2022
@jcpowermac

After reviewing all the communication I didn't see a reason for vSphere not to be included in the platform.json. I can for troubleshooting purposes add a serial device to a FCOS/RHCOS virtual machine(s) and send that output to either a serial concentrator or to a file on the datastore.

What made me discover this work was an issue I am trying to investigate for OCP 4.12 on vSphere:
https://issues.redhat.com/browse/OCPBUGS-1047
where serial output is invaluable.

@dustymabe
Member

The fix for this went into next stream release 37.20221003.1.0. Please try out the new release and report issues.

@bgilbert
Contributor

@jcpowermac Thanks for the report. coreos/fedora-coreos-config#2038 enables secondary serial console on VMware, and also on OpenStack and VirtualBox.

bgilbert added a commit to coreos/fedora-coreos-config that referenced this issue Oct 28, 2022
The FCOS platform docs for VMware/VirtualBox include instructions on
connecting to the serial console to get a console log, and some users
make use of this functionality:

    coreos/fedora-coreos-tracker#567 (comment)

Re-enable secondary serial consoles on those platforms, so the serial
console gets as much information as possible without interfering with the
graphical console.
@dustymabe
Member

The fix for this went into testing stream release 37.20221127.2.0. Please try out the new release and report issues.

@dustymabe dustymabe added status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. and removed status/pending-testing-release Fixed upstream. Waiting on a testing release. labels Nov 30, 2022
@dustymabe
Member

The fix for this went into stable stream release 37.20221127.3.0.

@dustymabe dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Dec 15, 2022