Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable networking for hypervisor based container runtimes #237

Merged
merged 5 commits into from
Dec 13, 2016

Conversation

sameo
Copy link

@sameo sameo commented Dec 6, 2016

This pull request adds networking support for hypervisor based OCI compatible container runtimes.
This has been tested with Clear Containers, and together with the cri-o branch this PR allows us to fully run CRI-O with KVM based containers.

Problem statement

Hypervisor based container runtimes prepare their virtual machine (kernel arguments, QEMU process, monitoring process) for a pod when receiving the sandbox container creation command.
Part of the VM preparation typically involves bridging the networking namespace existing interfaces with hypervisor supported interfaces (e.g. TAP). There typically is one bridge per networking namespace veth pair where the QEMU TAP interface and the veth peer are bridged together.
This means we need the sandbox networking namespace to be running and configured before CRI-O asks us to create the sandbox container.

With the current code, the CNI plugin sets the networking namespace after creating the sandbox container which means we always end up with dangling veth peers and disconnected containers within any given pod.

Proposed solution

This PR attempts to fix the above problem by creating and configuring the sandbox networking namespace before creating the sandbox container. This is done in 2 steps:

  1. For each pod, create a persistent namespace under /var/run/netns. This is done by the CNI ns package. The created namespace is removed and cleaned up at pod stop and removal time.

  2. Change the sandbox creation and networking configuration order in the code. We now create the networking namespace, call the default CNI plugin on it, add the networking namespace path to the container OCI configuration, and then create the sandbox container.

@sameo
Copy link
Author

sameo commented Dec 6, 2016

@laijs @feiskyer Your feedback here would be greatly appreciated.

@sameo
Copy link
Author

sameo commented Dec 6, 2016

Adding @mcastelino and @jjlakis as well.

@runcom
Copy link
Member

runcom commented Dec 6, 2016

/cc @rajatchopra

func netNsGet(nspath string) (*sandboxNetNs, error) {
netNS, err := ns.GetNS(nspath)
if err != nil {
return &sandboxNetNs{}, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return nil, err

return &sandboxNetNs{}, err
}

sNetNs := &sandboxNetNs{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return the struct directly

@@ -38,6 +61,59 @@ func (s *sandbox) removeContainer(c *oci.Container) {
s.containers.Delete(c.Name())
}

func (s *sandbox) netNs() ns.NetNS {
return s.netns.ns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this check for s.netns to be nil as well? sa netNsPath does below

@sameo sameo force-pushed the topic/sandbox_netns branch 2 times, most recently from 79813d9 to 7c1d9e7 Compare December 6, 2016 14:03
@mrunalp
Copy link
Member

mrunalp commented Dec 6, 2016

The approach sounds good to me. We were planning to move out network/ipc namespace fds anyways so we can have pause-less pod for containers.


func (s *sandbox) netNsRemove() error {
if s.netns == nil {
return fmt.Errorf("No networking namespace")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: errors that aren't being logged should start with lowercase (everywhere in this PR).

@mikebrow
Copy link
Contributor

mikebrow commented Dec 6, 2016

Reviewed the code LGTM.

@mrunalp
Copy link
Member

mrunalp commented Dec 7, 2016

I see some lint issues in the build that you can reproduce locally by running make lint

@laijs
Copy link
Contributor

laijs commented Dec 7, 2016

@gao-feng

@laijs
Copy link
Contributor

laijs commented Dec 7, 2016

qemu supports net-devices hotplug, does this feature help in your problem?

@gao-feng
Copy link
Contributor

gao-feng commented Dec 7, 2016

Seems you add a pre-requirement for oci-runtime manager, the manager shouldn't ask runtime to create netns itself.

In oci spec, prestart hook is used to setup network. https://github.com/opencontainers/runtime-spec/blob/master/config.md#prestart, If the problem you met can be resolved if cri-o uses hooks too?

@sameo
Copy link
Author

sameo commented Dec 7, 2016

@mrunalp Strangely enough, I could not reproduce those errors locally...Anyway, this is fixed now.

@sameo
Copy link
Author

sameo commented Dec 7, 2016

@laijs

qemu supports net-devices hotplug, does this feature help in your problem?

We thought about that and saw that you guys are using that feature with runV, iiuc. It could help but we believe this PR provides a better long term approach for the following reasons:

  1. We don't want to rely on a qemu specific feature for that purpose, trying to be hypervisor agnostic as much as possible.
  2. It feels more natural to prepare the pod network before creating it. The current pod creation order seems to be driven by the fact that we simply can't create the pod and container networking namespace before starting the pod process. With this PR we now have a choice as to when and how we create the whole pod network.
  3. As @mrunalp explained, the current approach relies on pods running a pause container which could be avoided by building permanent and process agnostic namespaces (networking and others).
  4. Eventually we'd like to avoid having to scan the networking namespace for interfaces and getting the whole networking setup through the OCI config file (either from additional annotations or through a future OCI spec extension). We can only do that if the networking setup is built before create.

@sameo
Copy link
Author

sameo commented Dec 7, 2016

@gao-feng

Seems you add a pre-requirement for oci-runtime manager, the manager shouldn't ask runtime to create netns itself.

I'm not sure I fully understand your point here. My interpretation of the Linux namespace OCI specification is:

  1. If a namespace type is missing, the container inherit the runtime's one
  2. If a namespace type is present but the namespace path is not provided, the runtime must create the container a new namespace to join.
  3. If a namespace type is present and a namespace path is provided, the runtime must not create a new namespace and simply make the container join an existing one.

That PR makes ocid move from case 2 to case 3, it's not getting out of spec. Am I missing something?

If the problem you met can be resolved if cri-o uses hooks too?

Not really because those are called after the container is created.

@mcastelino
Copy link

@gao-feng The OCI spec today does have the provision to make the container join an existing namespace.

However that does not address the scanning issue, which is the need to scan the network namespace, to discover the interfaces, routes... and connect/propagate them to the virtual machine.

In an ideal solution, if that information was provided to the runtime (in the case of the oci hooks, on hook return) and in the case of cri at pod creation time, this would avoid the need for a hypervisor runtime to scan the network namespace in order to discover the network configuration of the POD.

@gnawux
Copy link

gnawux commented Dec 8, 2016

@sameo

We don't want to rely on a qemu specific feature for that purpose, trying to be hypervisor agnostic as much as possible.

Xen support NIC hotplug as well, it's hypervisor agnostic indeed.

@gnawux
Copy link

gnawux commented Dec 8, 2016

From the runV perspective:

  1. We like the idea overall, i.e. we don't like to employ a the netns listener and monitor the ns.
  2. The current runV support the listener pattern because of the OCI/CNI/docker compatibility. @sameo are you working on the OCI extension side (i.e. to make it OCI officially)?

}

if ns.Path == "" {
return "", fmt.Errorf("empty networking namespace")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seams it breaks the case when the pod uses the host netns.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were several host network related issues, they should be fixed now.
Thanks for pointing that out.

type sandboxNetNs struct {
sync.Mutex
ns ns.NetNS
closed bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could closed be restored back while ocid is restarted?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When ocid tries to restore a sandbox, we have 2 possibilities:

  1. The sandbox has been stopped and the netns closed.
  2. The sandbox is still READY and the netns is still opened.

I understand your concern is related to 1, and this is how we handle it with this PR: When trying to get the netns for a restored sandbox we set the sandbox netns to nil if it was stopped before stopping ocid:

        // We add a netNS only if we can load a permanent one.
	// Otherwise, the sandbox will live in the host namespace.
	netNsPath, err := configNetNsPath(m)
	if err == nil {
		netNS, nsErr := netNsGet(netNsPath)
		// If we can't load the networking namespace
		// because it's closed, we just set the sb netns
		// pointer to nil. Otherwise we return an error.
		if nsErr != nil && nsErr != errSandboxClosedNetNS {
			return nsErr
		}

		sb.netns = netNS
	}

Having a close netns is actually equivalent to not having one. Please let me know if that makes sense to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That makes sense.

@feiskyer
Copy link
Contributor

feiskyer commented Dec 8, 2016

LGTM with the approach, just concern about restoring sandboxNetNs.closed after ocid restarted.

@sameo sameo force-pushed the topic/sandbox_netns branch 3 times, most recently from b60233e to 5e8e754 Compare December 8, 2016 23:52
@sameo
Copy link
Author

sameo commented Dec 10, 2016

@mrunalp I suppose I will have to also provide a fix for that PR to work with any runC that will not contain this fix. I'm thinking about creating an additional symlink to the persistent networking namespace, although that's slightly on the hackish side.

@mrunalp
Copy link
Member

mrunalp commented Dec 10, 2016

@sameo Nice! Up to you if you want to create the symlink work around patch in the short term. I am not too worried till we get to runc 1.0. We could maintain the version that we test with in the wiki for cri-o.

@sameo
Copy link
Author

sameo commented Dec 12, 2016

@mrunalp I pushed a commit that creates an additional symlink for working around the runC issue.
I'd appreciate if you could give it a go and let me know if you still see the same issues.

@mrunalp
Copy link
Member

mrunalp commented Dec 12, 2016

@sameo Some tests were failing for me presumably because of the other issue about annotations? Could you rebase? I will retest.

Samuel Ortiz added 5 commits December 12, 2016 19:48
We will need it for our persistent networking
namespace work.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Because they need to prepare the hypervisor networking interfaces
and have them match the ones created in the pod networking
namespace (typically to bridge TAP and veth interfaces), hypervisor
based container runtimes need the sandbox pod networking namespace
to be set up before it's created. They can then prepare and start
the hypervisor interfaces when creating the pod virtual machine.

In order to do so, we need to create per pod persitent networking
namespaces that we pass to the CNI plugin. This patch leverages
the CNI ns package to create such namespaces under /var/run/netns,
and assign them to all pod containers.
The persitent namespace is removed when either the pod is stopped
or removed.

Since the StopPodSandbox() API can be called multiple times from
kubelet, we track the pod networking namespace state (closed or
not) so that we don't get a containernetworking/ns package error
when calling its Close() routine multiple times as well.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order for hypervisor based container runtimes to be able to
fully prepare their pod virtual machines networking interfaces,
this patch sets the pod networking namespace before creating the
sandbox container.

Once the sandbox networking namespace is prepared, the runtime
can scan the networking namespace interfaces and build the pod VM
matching interfaces (typically TAP interfaces) at pod sandbox
creation time. Not doing so means those runtimes would have to
rely on all hypervisors to support networking interfaces hotplug.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
With the networking namespace code added, we were reaching a
gocyclo complexitiy of 52. By moving the container creation and
starting code path out, we're back to reasonable levels.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to workaround a bug introduced with runc commit bc84f833,
we create a symbolic link to our permanent networking namespace so
that runC realizes that this is not the host namespace.

Although this bug is now fixed upstream (See commit f33de5ab4), this
patch works with pre rc3 runC versions.
We may want to revert that patch once runC 1.0.0 is released.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
@sameo
Copy link
Author

sameo commented Dec 12, 2016

@mrunalp Rebased now.
Tested here with runC commit 47ea5c75ebeb40a317d2cfa95f9c3536c00c1eea and it passes.

@mrunalp
Copy link
Member

mrunalp commented Dec 12, 2016

@sameo Thanks. The tests pass for me now :)

@sameo
Copy link
Author

sameo commented Dec 12, 2016

@mrunalp Ah, nice :-) Let me know if there's anything else I should work on for this PR.

@mrunalp
Copy link
Member

mrunalp commented Dec 13, 2016

LGTM

@mrunalp mrunalp merged commit bd585c2 into cri-o:master Dec 13, 2016
egernst pushed a commit to egernst/cri-o that referenced this pull request Nov 26, 2018
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 16, 2019
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 16, 2019
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 16, 2019
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 16, 2019
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 16, 2019
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 16, 2019
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 16, 2019
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 17, 2019
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 18, 2019
This includes:
    change the options ManageNetworkNSLifecycle to ManageNSLifecycle
    add a new set of files: internal/lib/sandbox/namespaces* that takes care of namespace related functionality
    create a generic NamespaceIface interface for interacting with all three kinds of namespaces
    refactor some of runPodSandbox to reduce cyclomatic complexity
    use pinns for managing
    remove symlink methods:
        the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 19, 2019
This includes:
    change the options ManageNetworkNSLifecycle to ManageNSLifecycle
    add a new set of files: internal/lib/sandbox/namespaces* that takes care of namespace related functionality
    create a generic NamespaceIface interface for interacting with all three kinds of namespaces
    refactor some of runPodSandbox to reduce cyclomatic complexity
    use pinns for managing
    remove symlink methods:
        the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander added a commit to haircommander/cri-o that referenced this pull request Dec 20, 2019
This includes:
    change the options ManageNetworkNSLifecycle to ManageNSLifecycle
    add a new set of files: internal/lib/sandbox/namespaces* that takes care of namespace related functionality
    create a generic NamespaceIface interface for interacting with all three kinds of namespaces
    refactor some of runPodSandbox to reduce cyclomatic complexity
    use pinns for managing
    remove symlink methods:
        the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants