Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prototype support for macvtap interfaces #2217

Merged
merged 8 commits into from Mar 3, 2021

Conversation

upxe
Copy link

@upxe upxe commented Oct 24, 2020

Firecracker uses /dev/net/tun to open/create tap interfaces. On the host these are typically bridged, using the standard Linux bridge implementation.

Linux macvtap interfaces are similar to bridges but simpler: the kernel doesn't snoop frames to maintain a FIB to map MAC addresses to interfaces and doesn't need to use flooding, because the full topology is known to the kernel.

However, macvtap interfaces are exposed to userland as /dev/tap%d device nodes, which Firecracker doesn't support. This commit adds support by checking whether the host_dev_name in the network configuration looks like a path name, and opening the device node if it does.

Tested thus:

    # Create a dummy interface to host the macvtap interfaces, plus two macvtap interfaces in bridge mode
    sudo ip link add dummy0 type dummy
    sudo ip link add link dummy0 name vtap0 type macvtap mode bridge
    sudo ip link add link dummy0 name vtap1 type macvtap mode bridge
    sudo ip link set dummy0 up
    sudo ip link set vtap0 up
    sudo ip link set vtap1 up
    sudo chmod 660 /dev/tap*             # For proof-of-concept purposes only
    sudo chown root:$(id -g) /dev/tap*   # For proof-of-concept purposes only

Configure two VMs thus; look in /dev for device node names; use ip link to get tap MAC address etc:

    {
        ...
            "network-interfaces" [
                {
                    ...
                    "host_dev_name: "/dev/tapXX",
                    "guest_mac": "<see host's ip link output>"
                }
            ]
        ...
    }

Start the VMs, give them IP addresses on the same subnet; they can then ping each other etc.

You must use the correct MAC address in the VM configuration, i.e. the MAC address that the host kernel has assigned to the tap interface.

Reason for This PR

macvtap (see #1933) allows VMs on a host to be networked without using the Linux bridge code; it is simpler and not prone to ARP spoofing.

The implementation in the commit proves that macvtap-based networking works; I'm after feedback please as to whether macvtap support would be considered for merging. If so then I'd look at configuring properly rather than leeching off the existing /dev/net/tun configuration code.

Alternatively, a more flexible approach would be to allow Firecracker to adopt a file descriptor as a network interface and read/write frames to it. (In other words, in this case, open /dev/tapXX as e.g. FD 3, then exec() firecracker and have it treat FD 3 as the tap interface). This would let me do other useful things on the host, but would require a jailer change to keep the relevant FD[s] open.

Description of Changes

Extend Tap::open_named() in virtio/net/tap.rs to support opening /dev/tapXX directly, rather than using /dev/net/tun to open/create tap interfaces. This allows Firecracker to use tap interfaces create via macvtap on the host.

  • This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

PR Checklist

I haven't looked at these as this is an RFC; if macvtap (or passing in FDs) is acceptable then I'll submit a PR that meets them.

[Reviewer TODO: Verify that these criteria are met. Request changes if not]

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • Any API changes are reflected in firecracker/swagger.yaml.
  • Any user-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.

@serban300
Copy link
Contributor

Hi @upxe ,

Thanks for the contribution ! This might be very useful. We would like to do some performance tests first, and we can discuss next steps after. In the meanwhile, just dropping a few questions:

  • is this compatible with the snapshotting functionality ? For example right now when we restore from a snapshot we just recreate a tap with the same name as the old one. Do you see any complications here when using a macvtap instead ?
  • currently we have some unit tests that simulate frame tx/rx on the tap. Is this possible using the macvtap ?

@upxe
Copy link
Author

upxe commented Oct 27, 2020

  • is this compatible with the snapshotting functionality ? For example right now when we restore from a snapshot we just recreate a tap with the same name as the old one. Do you see any complications here when using a macvtap instead ?

I think that this depends on how you name your tap devices. The default (on Debian at any rate) isn't very useful: the XX in /tap/XX is the interface index, entirely unrelated to whatever name you gave the interface, and that appears in the ip link output. You might be able to convince udev (or whatever you use to populate /dev) to give you a better name, but just opening the same device node by name is probably a bad idea for these interfaces.

However, /sys/devices/virtual/net/<interface name>/macvtap/*/dev tells you what device node to open (or, indeed, to create in the jailer chroot). So: shouldn't be a problem to figure out what device to create or open, but you couldn't use the same code path as the existing tap device does.

This all means that we would probably need to distinguish between tap and macvtap interfaces at the configuration level, rather than relying on the interface name looking like /dev/something. That's probably a good idea anyway though.

  • currently we have some unit tests that simulate frame tx/rx on the tap. Is this possible using the macvtap ?

Haven't look into this yet - I raised the PR to see whether there was interest in principle in supporting this before diving in too deep. In theory, shouldn't be a problem, because once you've opened the device it's more-or-less a normal tap device. I shall have a look, but have had a few problems running the test suite - the whole unit test + integration test run (tools/devtool test) takes ages on my laptop, so I'm running cargo test within a shell in the Docker container. That seems to work, but a few tests fail in the current master.

@gbionescu
Copy link

Hi @upxe.
I just wanted to do a quick follow-up on this PR. Recently, I ran a quick test on this PR using iperf3 and noticed a slight decrease in throughput when reported to the master Firecracker version, but also saw that the overall CPU usage in the host is lower, which is something that we would like to look into more.

I will redo the tests and come back with the exact test scenario and numbers.

@pclesr
Copy link

pclesr commented Nov 18, 2020

Being able to specify the absolute path to the taped is extremely useful, as it allows unique names .

In an embedded environment, there is the desire to put each VM into a network namespace. Using MacVTap makes the configuration easier, as the Macvtap can be attached directly to the veth device in the network namespace without having to make another bridge. This simplifies the setup considerably. The environment is also sandboxed (like with the jailer) with other namespaces like mount, pid, etc to help isolate the environment.

@acatangiu acatangiu added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Nov 25, 2020
@acatangiu
Copy link
Contributor

Also asked for in #1933

@gbionescu
Copy link

Hi @upxe. I think that we can start working towards merging this, but the first step would be to point the PR towards a feature branch. I've created one called feature/macvtap - please update the PR when you have the time.

@upxe upxe changed the base branch from master to feature/macvtap November 28, 2020 10:52
@upxe
Copy link
Author

upxe commented Nov 28, 2020

I've updated the PR; unless you're happy with the current approach (which was just a PoC, and I think is a bit too simple-minded), it would be useful to have a steer on how you want to proceed, e.g.

  • Whether you want to support both the current /dev/net/tun approach and macvtap interfaces, or see the macvtap approach replacing the existing one.

  • Whether checking for an absolute path is OK to use the macvtap implementation, or whether it look up the interface index given the interface name and see whether a suitable /dev/tapN exists. (Would probably play better with snapshotting.)

Apart from the above, I'm aware at the moment that I need to:

  • Add jailer support.
  • Ensure that snapshotting works. This will have to be aware of MAC addresses - the current code appears to persist the MAC address, in which case snapshot restoration will also have to set the host MAC address of the macvtap interface. I think that that's fine though.
  • Look into unit testing.
  • Consider doc changes.
  • Sign commits.

@gbionescu
Copy link

@upxe, indeed, all the items that you mentioned will have to be added, but I haven't looked into what the best approach would be to steer it towards the final implementation, given all the implications that it may have with snapshotting and other areas such as isolation between microVMs. This is why, as a first step, we should start working towards merging it into a feature branch.

Related to your first two bullets, preferably we would like to have both /dev/net/tun and macvtap available. Having absolute paths inside the snapshots would probably be a bad idea, correct.

Also, I would also test it through the Python framework, not only using rust unit tests. The snapshotting feature would also have to be tested through Python tests when macvtap snapshots are restored.

@dianpopa dianpopa added Status: Author and removed Status: Awaiting review Indicates that a pull request is ready to be reviewed labels Dec 15, 2020
@georgepisaltu
Copy link
Contributor

Hi @upxe!

Do you still plan on contributing to this PR?

@upxe
Copy link
Author

upxe commented Jan 6, 2021

Do you still plan on contributing to this PR?

I do; the day job intervened in December. Might get a chance to look at it this weekend.

btw, I had (thought that I'd) put it onto a feature branch on 28 Nov so was a little confused by @gc-plp's comment about "merging towards a feature branch" - if I haven't interpreted that correctly then please let me know.

Thanks.

@gbionescu
Copy link

@upxe, the PR is pointed to the correct branch now, but before merging we would need to have the items that you mentioned here, addressed.

@casibbald
Copy link

@upxe Been going through this and this PR is a great addition to FC, thank you so much. It happens to be functionality that is a blocker on a concept we are looking at.
Day job permitting, how are you progressing with the additional requirements to get the PR ready for final review and merging?

@upxe
Copy link
Author

upxe commented Jan 23, 2021

I'm looking at Jailer integration and there's a snag: macvtap interfaces are (AFAICT - shall investigate further) accessed via per-interface device nodes in /dev, rather than via /dev/net/tun. These individual device nodes aren't in the jailed firecracker's view of /dev, and the Jailer can't create then (like it does for /dev/kvm and /dev/net/tun) because it doesn't know which interfaces the VM will use.

Options that I can think of (unless I can somehow make the interfaces accessible via /dev/net/tun):

  1. Put a read-only sysfs mount into the chroot (or bind-bound /sys/devices/virtual/net). Simplest and most flexible, but exposes rather more of the host to the VM than is ideal.

  2. Jailer finds all macvtap interfaces (via sysfs) and creates device nodes in the chroot for all such interfaces that the firecracker uid/gid would be able to open. Potentially exposes unnecessary interfaces to the VM, but can be avoided by assigning a unique UID/GID to each VM, which is probably a good idea anyway. The macvtap interfaces must exist before running the Jailer.

  3. Make a jailer user declare via command-line args the names of any macvtap interfaces that the VM will use, so that the Jailer can create the relevant device nodes. Puts minimal device nodes into the chroot, at the expense of some managerial/orchestration complexity.

Do the maintainers have any views on those? I'm minded to prefer option 2, unless the host may have lots of interfaces, in which case 1.

@casibbald
Copy link

casibbald commented Jan 23, 2021 via email

@gbionescu
Copy link

gbionescu commented Jan 25, 2021

@upxe I think that the question to ask here would be, what happens if a sandbox escape occurs and an attacker is able to access the jailer environment? Which of the three options would ensure that the attacker doesn't disturb a neighboring VM?

From this angle, option 3 seems to be ensure a minimal exposure of host devices inside the jailer. As you mentioned, there is some additional complexity passed on to the orchestrator, but a piece of extra logic would still have to be added in order to connect a microVM through a macvtap interface (e.g. creating the actual interface before launching the jailer).

What's your opinion here?

@gbionescu
Copy link

Hi @upxe
Have you had any chance to continue the work on this PR?

@upxe
Copy link
Author

upxe commented Feb 13, 2021

I'm happy to go with a command line argument in the jailer to have it create some additional device nodes.

The last remaining issue for now (haven't considered save/restore yet) is that I'm currently using sysfs to map interface names to device nodes, and sysfs isn't available in the jailed root. But if we're creating device nodes in the jailer then we just name them appropriately and have firecracker look for them if /sys isn't mounted.

If the host has a macvtap interface called foo that the user wants to use in a VM then we end up with:

  1. User runs the jailer with e.g. --macvtap foo.
  2. Jailer looks in host sysfs for the foo interface's major/minor, and creates /dev/net/foo in the jailed root.
  3. When firecracker wants to open a network interface "foo":
    1. If /dev/net/foo exists then open it and treat as a macvtap interface.
    2. Else if /sys exists (i.e. no jailer) then see whether it's a macvtap interface, act as per jailer (and the existing code on this branch).
    3. Else (no sys, no /dev/net/foo) use /dev/net/tun as per normal.

Incidentally, we can distinguish tap from mavctap via the ethtool GDRVINFO ioctl, but AFAICT we'd still need sysfs to figure out the major/minor numbers for the device node.

@raduiliescu raduiliescu linked an issue Feb 15, 2021 that may be closed by this pull request
@dianpopa dianpopa self-requested a review February 15, 2021 09:55
@dianpopa dianpopa added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Feb 16, 2021
@dianpopa dianpopa force-pushed the macvtap branch 2 times, most recently from e421ba8 to bb07771 Compare March 1, 2021 18:36
@dianpopa dianpopa requested a review from a team March 1, 2021 18:44
Comment on lines 292 to 295
// Since namespaces are shared by default when creating a new process using fork or clone,
// unshare() is used to disassociate the current process from the default shared namespace.
SyscallReturnCode(unsafe { libc::unshare(libc::CLONE_NEWNS) })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would explicitly add in the comment that this unshares the mount namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

.into_empty_result()
.map_err(Error::MountSysfs)?;

// Unmount the default sysfs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// Unmount the default sysfs.
// Unmount the current sysfs since it's describing the previous namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

georgepisaltu
georgepisaltu previously approved these changes Mar 2, 2021
src/utils/src/net/macvtap.rs Outdated Show resolved Hide resolved
src/utils/src/net/macvtap.rs Outdated Show resolved Hide resolved
src/devices/src/virtio/net/tap.rs Outdated Show resolved Hide resolved
src/utils/src/net/macvtap.rs Outdated Show resolved Hide resolved
src/utils/src/net/macvtap.rs Outdated Show resolved Hide resolved
upxe and others added 8 commits March 2, 2021 18:09
Utilities to detect macvtap network interfaces.

We're about to need them in both jailer and firecracker, so move
from the latter into a separate module.

Needs unit tests

Signed-off-by: Ametros <github@ametros.net>
Firecracker network interfaces use /dev/net/tun to create tap interfaces.
On the host these are typically bridged, using the standard Linux bridge
implementation.

Linux macvtap interfaces are similar to bridges but simpler: the kernel
doesn't snoop frames to maintain a FIB to map MAC addresses to interfaces,
because the full topology is known to the kernel.

However, macvtap interfaces are exposed to userland as /dev/tap%d device
nodes, which Firecracker doesn't support. This commit adds support by
checking whether the `host_dev_name` looks like a path name, and opening
the device node if it does.

Tested thus:

    # Create a dummy interface to host the macvtap interfaces, plus two
    # macvtap interfaces in bridge mode
    sudo ip link add dummy0 type dummy
    sudo ip link add link dummy0 name vtap0 type macvtap mode bridge
    sudo ip link add link dummy0 name vtap1 type macvtap mode bridge
    sudo ip link set dummy0 up
    sudo ip link set vtap0 up
    sudo ip link set vtap1 up
    sudo chmod 666 /dev/tap*   # For proof-of-concept purposes only

    # Configure two VMs thus; use ip link to get tap d
    {
        ...
        {
            "network-interfaces" [
                {
                    ...
                    "host_dev_name: "/dev/tapX",
                    "guest_mac": "<see ip link output>",
                }
            ]
        }
        ...
    }

You must use the correct MAC address in the VM, i.e. the MAC address that the
host kernel has assigned to the tap interface.

To create a Tap instance given an interface name X, see whether
/sys/devices/virtual/net/<X>/macvtap is a directory.

If so then the interface is a macvtap interface, so look in that
directory for a device name and open /dev/<that name>.

Otherwise use the original open_named() code to open a tap interface
via /dev/net/tun.

Signed-off-by: Ametros <github@ametros.net>
macvtap interfaces are opened by device node, and (AFAICT) there's
no way to discover the device major/minor numbers without sysfs,
which isn't mounted in the jailer chroot.

This commit adds a --macvtap=foo argument to the jailer; for each such
argument, look up the device major/minor numbers for macvtap interface
foo and create /dev/net/foo in the jailer chroot.

Signed-off-by: Ametros <github@ametros.net>
Some other changes were made that make possible
creating a set up for testing the macvtap support.

Signed-off-by: Diana Popa <dpopa@amazon.com>
Signed-off-by: Diana Popa <dpopa@amazon.com>
Signed-off-by: Diana Popa <dpopa@amazon.com>
Signed-off-by: Diana Popa <dpopa@amazon.com>
This commit should be revised once merging to master
since we want to add unit test for the functionality.

Signed-off-by: Diana Popa <dpopa@amazon.com>
Copy link
Contributor

@dianpopa dianpopa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gc-plp @georgepisaltu @acatangiu @luminitavoicu I addressed your feedback, PTAL! Thanks

src/utils/src/net/macvtap.rs Outdated Show resolved Hide resolved
src/utils/src/net/macvtap.rs Outdated Show resolved Hide resolved
src/utils/src/net/macvtap.rs Outdated Show resolved Hide resolved
src/utils/src/net/macvtap.rs Outdated Show resolved Hide resolved
@dianpopa dianpopa added Status: Awaiting review Indicates that a pull request is ready to be reviewed and removed Status: Author labels Mar 2, 2021
@gbionescu gbionescu merged commit 0a56fbf into firecracker-microvm:feature/macvtap Mar 3, 2021
@majek
Copy link

majek commented Sep 2, 2022

Rebase please - the branch conflicts with main.

@upxe
Copy link
Author

upxe commented Sep 19, 2022

done on upxe:macvtap; don't know how to update this pull request because it's showing as merged + closed

@dzenc
Copy link

dzenc commented Dec 2, 2023

Despite the fact that this is marked merged + closed, it doesn't seem to be in there? Did this "die on the vine" somewhere? This seems like a lot of work, and a lot of value to be dropped. (I would like to make use of it!)

What can I do to help?

@roypat
Copy link
Contributor

roypat commented Dec 4, 2023

Hi @dzenc,

Despite the fact that this is marked merged + closed, it doesn't seem to be in there? Did this "die on the vine" somewhere? This seems like a lot of work, and a lot of value to be dropped. (I would like to make use of it!)

This PR only got merged to a feature branch, which never made it to main, which is why you cannot find it anywhere. We are tracking macVtap support in #1933, but have not continued working on it so far due to limited requests for this feature.

What can I do to help?

If you want, you can comment on #1933 with your usecase, to see if it can gain some traction again :)

@dzenc
Copy link

dzenc commented Dec 7, 2023

Comment added to #1933. Thanks for listening! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Network] Add support for MacVTap interfaces