
NVMeoF / TCP boot support #2184

Merged: 3 commits into dracutdevs:master, Jun 26, 2023

Conversation

@mwilck (Contributor) commented Feb 3, 2023

Changes

Add support to dracut for booting from NVMe over Fabrics according to the NVM Express Boot Specification v1.0. Note that this is orthogonal to booting from NVMe-oF over Fibre Channel, which dracut already supports (using features exclusive to the FC transport). The NVMe-oF Boot Spec is designed to be transport-agnostic, although v1.0 covers only the TCP transport.

The spec defines an ACPI table called "NBFT", in which the system firmware stores information about the network configuration and NVMe subsystems used for booting (similar to the iBFT table for iSCSI). The code in this PR reads the NBFT table (parsing the JSON printed by nvme-cli), transforms it into ip= parameters that dracut's network setup code understands, and connects to the NVMe subsystems after network setup is complete. The code is modeled after the respective code for iBFT boot in the 95iscsi module.
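To make the transformation concrete, here is a minimal sketch of the idea (not the actual code in this PR). It assumes nvme show-nbft can emit JSON with -o json, and the jq paths into that JSON are illustrative placeholders, not the exact nvme-cli schema; the ifname=/ip= syntax on the last line is dracut's standard network configuration format.

```sh
#!/bin/sh
# Sketch: derive dracut network arguments from one NBFT HFI record.
# The jq field names below are illustrative, not the real schema.
command -v jq > /dev/null || exit 0
nbft_json=$(nvme show-nbft -o json 2> /dev/null) || exit 0

ipaddr=$(printf '%s' "$nbft_json" | jq -r '.hfi[0].ipaddr')
prefix=$(printf '%s' "$nbft_json" | jq -r '.hfi[0].subnet_mask_prefix')
gateway=$(printf '%s' "$nbft_json" | jq -r '.hfi[0].gateway')
mac=$(printf '%s' "$nbft_json" | jq -r '.hfi[0].mac_addr')

# ifname= pins a predictable interface name to the MAC found in the NBFT;
# ip=<client>:[<peer>]:<gateway>:<netmask>:<hostname>:<iface>:<autoconf>
echo "ifname=nbft0:${mac} ip=${ipaddr}::${gateway}:${prefix}::nbft0:none rd.neednet=1"
```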

This PR depends on linux-nvme/nvme-cli#1791, which implements the nvme-cli subcommands nvme show-nbft and nvme connect-nbft. Therefore this PR should not be merged before the syntax and semantics of the nvme subcommands have been consolidated (i.e., before the nvme-cli PR has been merged). In spite of that, I am submitting this PR as an RFC to make the community aware of the feature. While the nvme-cli command-line API is expected to change somewhat, the fact that the NBFT content is printed in JSON format will not change, and the structure of the JSON itself will probably change only slightly, if at all. If the API changes in the nvme-cli PR, this PR will be adapted.

Checklist

  • [x] I have tested it locally
  • [x] I have reviewed and updated any documentation if relevant
  • [ ] I am providing new code and test(s) for it

Notes

It is difficult to get one's hands on a sample setup for testing and playing around with boot over NVMeoF. A proof-of-concept setup using the Linux NVMe target (nvmet) and qemu is under preparation and will be published soon under https://github.com/timberland-sig.

@github-actions bot added the iscsi, modules and nvmf labels on Feb 3, 2023
@mwilck (Contributor, Author) commented Feb 3, 2023

Adding @aafeijoo-suse, @johnmeneghini, @igaw, @hreinecke for information.

@aafeijoo-suse (Member) left a comment

So far it looks good, as I've followed your development process and already reviewed it internally over the past months. Thanks for your work here.

Just two comments:

  • As you said, we must wait until nvme-cli consolidates the syntax of the commands used here.
  • It'd be great to add a new test for the nvmf module in the future (similar to TEST-30-ISCSI), but that shouldn't block this PR, as we don't have a test for the current implementation at this time, and the nvme-cli changes need to be available from our CI.

@LaszloGombos added the enhancement label on Feb 5, 2023
@github-actions bot added the network label on Feb 6, 2023
@mwilck (Contributor, Author) commented Feb 6, 2023

Updated with some minor IPv6 fixes (again, one for iSCSI, too).

I will squash these fixes before this is merged. For now I guess keeping them separate makes the review easier.

@mwilck mentioned this pull request on Feb 8, 2023
@mwilck (Contributor, Author) commented Feb 8, 2023

@LaszloGombos, I have created #2192 now, which separates out the non-nvmf related changes from this PR.
I will rebase this PR onto #2192 later.

@github-actions bot added the network-legacy label on Feb 9, 2023
@mwilck (Contributor, Author) commented Feb 9, 2023

I have rebased onto #2192 now and added another commit (91e1ee9), which enables retry logic for NVMe connects.

@aafeijoo-suse (Member) left a comment

Thanks Martin, you need to fix the shfmt issues in nvmf-autoconnect.sh and parse-nvmf-boot-connections.sh:

https://github.com/dracutdevs/dracut/actions/runs/4136435631/jobs/7150292822#step:4:56

Also, I assume the part of the commit 91e1ee9 message that says "Also, make sure that the initqueue script doesn't call exit()." is outdated, right?

I see that nvmf-autoconnect.sh never checks whether the connection is already established, right? What if it runs again in this case; have you tried that? I'm just pointing this out because initqueue --online without --onetime will run the nvmf-autoconnect.sh script for every network interface configured on the kernel command line, and initqueue --settled and --timeout can also run after the connection is set.

Other than that, the changes look good.

@mwilck (Contributor, Author) commented Feb 10, 2023

> Thanks Martin, you need to fix the shfmt issues in nvmf-autoconnect.sh and parse-nvmf-boot-connections.sh:
>
> https://github.com/dracutdevs/dracut/actions/runs/4136435631/jobs/7150292822#step:4:56

Argh. Why do I keep forgetting to run it? Sorry.

> Also, I assume the part of the commit 91e1ee9 message that says "Also, make sure that the initqueue script doesn't call exit()." is outdated, right?

> I see that nvmf-autoconnect.sh never checks whether the connection is already established, right?

No, that's currently impossible. It would be extremely hard and error-prone to get such a test right in a shell script. nvme connect-nbft does check this, though, and won't make another connection attempt for already-established connections. I have made a remark on the nvme-cli PR, asking for more graceful error handling in the special case that all NBFT-specified connections are already established.

> What if it runs again in this case; have you tried that?

Yes, it does nothing but print a (currently misleading) error message and exit with an error status. Anyway, note that the NBFT can include multiple connections, and it's possible that some are established while others are not. nvme connect-nbft does "the right thing" in this case (it tries to connect to the not-yet-connected targets and skips the others), but it isn't easy to reflect this in the exit status. Admittedly, test coverage for such scenarios is currently very limited.

> I'm just pointing this out because initqueue --online without --onetime will run the nvmf-autoconnect.sh script for every network interface configured on the kernel command line, and initqueue --settled and --timeout can also run after the connection is set.

If this happens we'll see a lot of warnings from nvme-cli. But we will work on fixing that on the nvme-cli side.

The thing is, the single attempt that the original code makes won't be sufficient in all cases, and it's better to see a few error messages than to fail booting. As this code evolves, we may be able to do this more elegantly in the future, without the need to retry blindly. But as I said, iSCSI does it the same way, and it has been around for more than a decade.
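The retry pattern under discussion looks roughly like this. This is a sketch of the hook registration only, not the exact PR code; the --online/--settled/--timeout/--onetime flags are existing initqueue mechanics, and the argument passed to nvmf-autoconnect.sh mirrors the timeout check visible later in this review.

```sh
# Without --onetime, the --online hook fires once per interface that
# comes up; --settled runs after udev settles; --timeout is the last
# resort, where connection to all possible subsystems is attempted.
/sbin/initqueue --online /sbin/nvmf-autoconnect.sh online
/sbin/initqueue --settled --onetime /sbin/nvmf-autoconnect.sh settled
/sbin/initqueue --timeout --onetime /sbin/nvmf-autoconnect.sh timeout
```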

@mwilck (Contributor, Author) commented Feb 10, 2023

Pushed shfmt fixes.

@aafeijoo-suse (Member) commented:

> If this happens we'll see a lot of warnings from nvme-cli. But we will work on fixing that on the nvme-cli side.

Thanks for clarifying.

> Pushed shfmt fixes.

Sorry to bother you again with this minor format requirement, but it's still failing for nvmf-autoconnect.sh:

https://github.com/dracutdevs/dracut/actions/runs/4143068641/jobs/7164486549#step:4:15

@johannbg (Collaborator) commented:

We probably want to have this in a separate new NVMeoF module. Also, @mwilck, I notice that you are poking the network-legacy module, which we are planning on dropping. So my question to you is: how much do you think it's being used in the wild, since downstream should have switched to iwd, connman, networkd, or nm by now?

@mwilck (Contributor, Author) commented Feb 13, 2023

> We probably want to have this in a separate new NVMeoF module

I don't think that's a good idea. It would cause the nvmf and nvmeof modules to possibly conflict. There is really just NVMe-oF (NVMe over Fabrics); Fibre Channel (which is what the current nvmf module supports) is just one of multiple NVMe-oF transports. By integrating NVMe-oF / TCP into the nvmf module, we are able to eliminate possible conflicts. My implementation in this PR still gives precedence to the NVMe-oF / FC boot method.

I agree that the name nvmf is not ideal for a transport-independent NVMe-oF module, but I wouldn't want to change the existing name.

> @mwilck, I notice that you are poking the network-legacy module

The only changes to 35legacy are those which you acked on #2192 already. Sorry for not making this clear.

> which we are planning on dropping, so my question to you is how much do you think it's being used in the wild, since downstream should have switched to iwd, connman, networkd, or nm by now

My distro (SUSE/openSUSE) is still using network-legacy, and will continue to do so for the lifetime of SLE15 at least (EOL currently planned for 2031). At least, that's what I expect. I designed the NVMeoF changes such that they just set cmdline parameters like ip=, so that they should (in principle, at least) work with any network backend.

@aafeijoo-suse (Member) commented:

> > @mwilck, I notice that you are poking the network-legacy module, which we are planning on dropping. So my question to you is: how much do you think it's being used in the wild, since downstream should have switched to iwd, connman, networkd, or nm by now?
>
> My distro (SUSE/openSUSE) is still using network-legacy, and will continue to do so for the lifetime of SLE15 at least (EOL currently planned for 2031). At least, that's what I expect. I designed the NVMeoF changes such that they just set cmdline parameters like ip=, so that they should (in principle, at least) work with any network backend.

Yes, that's a recurring discussion. We (SUSE) will not be removing the network-legacy module anytime soon. I think it's also the only option for any other non-systemd distro for now (because the network-manager module depends on systemd: #1756).

While the network-legacy module is still present upstream, I'd suggest keeping it up to date, even more so if the fixes are minor like the ones submitted in this PR.

@johannbg (Collaborator) commented:

> > > @mwilck, I notice that you are poking the network-legacy module, which we are planning on dropping. So my question to you is: how much do you think it's being used in the wild, since downstream should have switched to iwd, connman, networkd, or nm by now?
> >
> > My distro (SUSE/openSUSE) is still using network-legacy, and will continue to do so for the lifetime of SLE15 at least (EOL currently planned for 2031). At least, that's what I expect. I designed the NVMeoF changes such that they just set cmdline parameters like ip=, so that they should (in principle, at least) work with any network backend.
>
> Yes, that's a recurring discussion. We (SUSE) will not be removing the network-legacy module anytime soon. I think it's also the only option for any other non-systemd distro for now (because the network-manager module depends on systemd: #1756).
>
> While the network-legacy module is still present upstream, I'd suggest keeping it up to date, even more so if the fixes are minor like the ones submitted in this PR.

Yeah, sure. I just want to keep invested parties aware that it's eventually going away.
The reality is that it won't happen until we introduce iwd (and wifi support in the process), and iwd won't be introduced until a) I or someone else finish it (which won't happen for me until the Easter vacation at best, that is, if I get one) and b) it survives switch root, which requires upstream changes to iwd that I have not even begun to discuss with upstream iwd.

@johannbg (Collaborator) left a comment

@mwilck can you remove the relevant bits in this PR that got merged in #2192? Otherwise LGTM.

In NBFT setups, VLAN can be configured in the firmware.
Add the 8021q module in hostonly mode, even if VLAN is currently
not used, to be prepared for such a configuration change.

Since nvme-cli 2.0, the configuration of subsystems to connect to is
stored under `/etc/nvme`, in either `discovery.conf` or `config.json`.
Attempt discovery also if the latter exists but not the former.
Also, install "config.json" if it's present on the root FS.

As before, "rd.nvmf.discover=fc,auto" will force either file to be ignored,
and NBFT-defined targets take precedence if found.
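A sketch of the install-time behavior these two commit messages describe (the file paths come from the messages themselves; inst_simple and instmods are dracut's standard module-setup helpers, though whether the real module-setup.sh is structured exactly like this is an assumption):

```sh
# Copy whichever nvme-cli configuration exists into the initramfs;
# discovery.conf keeps precedence, config.json is used if only it exists.
if [ -f /etc/nvme/discovery.conf ]; then
    inst_simple /etc/nvme/discovery.conf
elif [ -f /etc/nvme/config.json ]; then
    inst_simple /etc/nvme/config.json
fi

# NBFT setups may have VLAN configured in firmware: include 8021q even
# in hostonly mode, where unused modules are normally omitted.
hostonly='' instmods 8021q
```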
@mwilck force-pushed the timberland_final branch 2 times, most recently from ac66c00 to f58e1d5, on May 26, 2023
@mwilck changed the title from "WIP: NVMeoF / TCP boot support" to "NVMeoF / TCP boot support" on May 26, 2023
@mwilck marked this pull request as ready for review on May 26, 2023
@mwilck (Contributor, Author) commented May 26, 2023

Merged and squashed timberland-sig#10, and rebased.

This is ready for review now.

Review threads on modules.d/95nvmf/nvmf-autoconnect.sh (two, resolved) and modules.d/95nvmf/module-setup.sh (outdated, resolved).
@mwilck (Contributor, Author) commented Jun 14, 2023

@aafeijoo-suse, your issues should be addressed now.

Add code to parse the NVMe-oF Boot Firmware Table (NBFT) according
to the NVM Express Boot Specification 1.0 [1]. The implementation in
dracut follows a similar general approach as iBFT support in the
iscsi module.

NBFT support requires two steps:

(1) Setting up the network and routing according to the
    HFI ("Host Fabric Interface") records in the NBFT,
(2) Establishing the actual NVMe-oF connection.

(1) is accomplished by reading the NBFT using JSON output from
the "nvme nbft show" command, and transforming it into command
line options ("ip=", "rd.neednet", etc.) understood by dracut's
network module and its backends. The resulting network setup code
is backend-agnostic. It has been tested with the "network-legacy"
and "network-manager" network backend modules. The network setup
code supports IPv4 and IPv6 with static, RA, or DHCP configurations,
802.1q VLANs, and simple routing / gateway setup.
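As an illustration of what step (1) produces, the generated arguments for a static IPv4 configuration on a firmware-defined VLAN might look like this (values invented; ip= and vlan= are standard dracut syntax):

```
ip=192.168.100.10::192.168.100.1:24::ens3.100:none
vlan=ens3.100:ens3
rd.neednet=1
```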

(2) is done using the "nvme connect-all" command [2] in the netroot handler,
which is invoked by networking backends when an interface gets fully
configured. This patch adds support for "netboot=nbft". The "nbftroot"
handler calls nvmf-autoconnect.sh, which contains the actual connect
logic. nvmf-autoconnect.sh itself is preserved, because there are
other NVMe-oF setups like NVMe over FC which don't depend on the
network.

The various ways to configure NVMe-oF are prioritized like this:

 1. FC autoconnect from kernel command line (rd.nvmf.discover=fc,auto)
 2. NBFT, if present
 3. discovery.conf or config.json, if present, and cmdline.d parameters,
    if present (rd.nvmf.discovery=...)
 4. FC autoconnect (without kernel command line)

The reason for this prioritization is that in the initial RAM fs, we try
to activate only those connections that are necessary to mount the root
file system. This avoids confusion, possible contradicting or ambiguous
configuration, and timeouts from unavailable targets.

Retry logic is implemented for enabling the NVMe-oF connections,
using the "settled" initqueue, the netroot handler, and eventually, the
"timeout" initqueue. This is similar to the retry logic of the iscsi module.
In the "timeout" case, connection to all possible NVMe-oF subsystems
is attempted.

Two new command line parameters are introduced to make it possible to
change the priorities above:

 - "rd.nvmf.nonbft" causes the NBFT to be ignored,
 - "rd.nvmf.nostatic" causes any statically configured NVMe-oF targets
   (config.json, discovery.conf, and cmdline.d) to be ignored.

These parameters may be helpful to skip attempts to set up broken
configurations.
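A condensed sketch of how the prioritization and the two new parameters fit together (the setup_* helper names are hypothetical; getarg and getargbool are dracut's standard command line helpers):

```sh
if [ "$(getarg rd.nvmf.discover)" = "fc,auto" ]; then
    setup_fc_autoconnect       # 1: FC autoconnect forced on the cmdline
elif has_nbft && ! getargbool 0 rd.nvmf.nonbft; then
    setup_nbft_connections     # 2: firmware-provided NBFT
elif { [ -f /etc/nvme/discovery.conf ] || [ -f /etc/nvme/config.json ] \
    || getarg rd.nvmf.discovery > /dev/null; } \
    && ! getargbool 0 rd.nvmf.nostatic; then
    setup_static_connections   # 3: static configuration
else
    setup_fc_autoconnect       # 4: FC autoconnect fallback
fi
```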

At initramfs build time, the nvmf module is now enabled if an NBFT
table is detected in the system.

[1] https://nvmexpress.org/wp-content/uploads/NVM-Express-Boot-Specification-2022.11.15-Ratified.pdf
[2] NBFT support in nvme-cli requires the latest upstream code (> v2.4).

Signed-off-by: Martin Wilck <mwilck@suse.com>
Co-authored-by: John Meneghini <jmeneghi@redhat.com>
Co-authored-by: Charles Rose <charles.rose@dell.com>
@aafeijoo-suse (Member) left a comment

LGTM. Thanks!

@LaszloGombos (Collaborator) left a comment

LGTM

It would be good to add the jq package to at least the Fedora test container. Even if we do not have a test ready for the nvmf dracut module, we can at least test to make sure that including this module does not interfere and does not have unexpected side effects.

@aafeijoo-suse requested review from lnykryn and johannbg and removed the request for danimo on June 19, 2023
@aafeijoo-suse dismissed johannbg's stale review on June 19, 2023:

Already addressed.

@lnykryn (Member) left a comment

LGTM

@pvalena (Contributor) left a comment

Just minor nitpicks, otherwise LGTM.

```sh
has_nbft() {
    local f found=
    for f in /sys/firmware/acpi/tables/NBFT*; do
        [ -f "$f" ] || continue
```
@pvalena (Contributor) commented:

This checks only for regular files. Is that intentional?

You could check for -e (existence) instead.

@mwilck (Contributor, Author) commented Jun 26, 2023

ok.

Actually, no. Technically, this check serves to avoid the case where no NBFT* file exists, in which case $f would equal the literal /sys/firmware/acpi/tables/NBFT* (as we don't use nullglob). But in the (extremely unlikely) case that a directory entry NBFT* existed in sysfs and was not a regular file, we'd definitely want to skip it, too.

@pvalena (Contributor) commented:

Right, folders, I didn't think of that. But it might be a symlink or another file-like type as well, for example. If only regular files are expected/accepted, I'm fine with that.

@mwilck (Contributor, Author) commented:

Yes, only regular files should be accepted.

Note that [ (aka test) dereferences symlinks for all flags except -h and -L. Thus if /sys/firmware/acpi/tables/NBFT* was a symlink to some existing regular file, this test wouldn't skip it, which looks wrong. But this case would indicate a severely broken kernel. I believe we can ignore it. In the worst case, we'd include the NBFT code in the initramfs on a broken system.

```sh
    done
    [[ $found ]]
}
```
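Putting the fragments above back together, the function reads roughly as follows; the found=$f assignment and the early break are inferred from the fragments shown here, and the comment summarizes the discussion:

```sh
has_nbft() {
    local f found=
    for f in /sys/firmware/acpi/tables/NBFT*; do
        # Without nullglob, an unmatched glob leaves $f as the literal
        # pattern; -f also skips any non-regular directory entry.
        [ -f "$f" ] || continue
        found=$f
        break
    done
    [[ $found ]]
}
```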

```sh
[[ $hostonly ]] || [[ $mount_needs ]] && {
    pushd . > /dev/null
```
@pvalena (Contributor) commented:

I know this is not part of the PR, but does anyone know why pushd . is here? I think this is basically a no-op.

@pvalena (Contributor) commented:

Is it just checking that for_each_host_dev_and_slaves does not run popd? I'd love to know the reasoning behind this implementation.

@mwilck (Contributor, Author) commented:

It is indeed not part of this PR, therefore I'd like to close this conversation.

AFAICS, the pushd/popd code (like much of the initial nvmf code) was copied from 95iscsi/module-setup.sh, where it had been added by @haraldh in commit b093aa2 ("beautified shell code") 10 years ago, without further explanation. Perhaps at that time for_each_host_dev_and_slaves could change the working directory? I don't know.

@pvalena (Contributor) commented:

Right, good to know the origin of it. Thanks for the response!

```sh
fi
if [ ! -f /sys/class/fc/fc_udev_device/nvme_discovery ] \
    && [ ! -f /etc/nvme/discovery.conf ] \
    && [ ! -f /etc/nvme/config.json ] && ! has_nbft; then
```
@pvalena (Contributor) commented:

Just a note: same usage of -f vs -e as stated above.

@mwilck (Contributor, Author) commented Jun 26, 2023

-f is correct here. We wouldn't want to account for non-regular files with these names.

[ "$RD_DEBUG" != yes ] || set -x

if [ "$1" = timeout ]; then
[ ! -f /sys/class/fc/fc_udev_device/nvme_discovery ] \
@pvalena (Contributor) commented:

Just a note: same usage of -f vs -e as stated above.

@mwilck (Contributor, Author) commented:

Again, -f is correct here. If this existed and was anything else but a regular file, we'd have to ignore it.

@pvalena (Contributor) commented:

Ack, thanks!

@johnmeneghini (Contributor) commented Jun 23, 2023 via email

@johnmeneghini (Contributor) commented:

Note that these changes have been tested on Fedora with the Fedora QEMU based NVMe/TCP boot POC at:

https://github.com/timberland-sig/rh-linux-poc

I've been testing these patches with Fedora all along. I am able to boot with NVMe/TCP and they appear to work fine with a number of different network configurations, including a dual network nvme/tcp multipathing configuration.

Note that Fedora uses NetworkManager, so the network-manager code path in dracut has been tested with these patches.

@aafeijoo-suse merged commit b490f6f into dracutdevs:master on Jun 26, 2023 (69 of 74 checks passed)
@aafeijoo-suse (Member) commented:

Thank you all for your work here.

@pvalena (Contributor) commented Jun 26, 2023

@johnmeneghini Of course, it's not about the command itself, but about the full logic:

```sh
pushd .
some_external_function
popd || exit
```

I.e., changing to the current directory and exiting if popd fails... what does this logic aim to capture/prevent/achieve?


EDIT: never mind, it was just copied from the iscsi code, so there are probably some underlying reasons. Not related to this PR.
