Enable networking for hypervisor based container runtimes #237
Conversation
Adding @mcastelino and @jjlakis as well.
/cc @rajatchopra
```go
func netNsGet(nspath string) (*sandboxNetNs, error) {
	netNS, err := ns.GetNS(nspath)
	if err != nil {
		return &sandboxNetNs{}, err
	}
```

Review comment on `return &sandboxNetNs{}, err`: return `nil, err` instead.
```go
	sNetNs := &sandboxNetNs{
```

Review comment: return the struct directly instead of assigning it to `sNetNs` first.
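Putting both review suggestions together, a minimal self-contained sketch might look like the following; `getNS` is a hypothetical stub standing in for `ns.GetNS` from the CNI ns package, and the types are simplified stand-ins:

```go
package main

import (
	"errors"
	"fmt"
)

// netNS stands in for ns.NetNS from the CNI ns package.
type netNS struct{ path string }

// getNS is a hypothetical stub for ns.GetNS: it fails on an empty path.
func getNS(nspath string) (*netNS, error) {
	if nspath == "" {
		return nil, errors.New("failed to open netns: empty path")
	}
	return &netNS{path: nspath}, nil
}

type sandboxNetNs struct {
	ns *netNS
}

// netNsGet follows both review suggestions: return nil (not an empty
// struct) on error, and return the struct directly on success.
func netNsGet(nspath string) (*sandboxNetNs, error) {
	n, err := getNS(nspath)
	if err != nil {
		return nil, err
	}
	return &sandboxNetNs{ns: n}, nil
}

func main() {
	if sNs, err := netNsGet(""); err != nil {
		fmt.Println("on error the struct is nil:", sNs == nil)
	}
}
```

Returning `nil` on error lets callers distinguish "no namespace" from "empty namespace handle" with a plain nil check.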
```go
@@ -38,6 +61,59 @@ func (s *sandbox) removeContainer(c *oci.Container) {
	s.containers.Delete(c.Name())
}

func (s *sandbox) netNs() ns.NetNS {
	return s.netns.ns
```
Should this check for `s.netns` being nil as well, as `netNsPath` does below?
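The nil guard being asked for could look like this self-contained sketch, where `netNS`, `sandboxNetNs`, and `sandbox` are simplified stand-ins for the real types:

```go
package main

import "fmt"

// netNS stands in for ns.NetNS, reduced to its path.
type netNS struct{ path string }

func (n *netNS) Path() string { return n.path }

type sandboxNetNs struct{ ns *netNS }

type sandbox struct{ netns *sandboxNetNs }

// netNs guards against a nil s.netns, as the review asks,
// mirroring what netNsPath does.
func (s *sandbox) netNs() *netNS {
	if s.netns == nil {
		return nil
	}
	return s.netns.ns
}

// netNsPath returns "" for a sandbox living in the host namespace.
func (s *sandbox) netNsPath() string {
	if s.netns == nil {
		return ""
	}
	return s.netns.ns.Path()
}

func main() {
	hostSb := &sandbox{} // no private netns: host networking
	fmt.Println(hostSb.netNs() == nil, hostSb.netNsPath() == "")
}
```

A nil `s.netns` is a legitimate state here (the sandbox lives in the host namespace), so both accessors need to tolerate it.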
Force-pushed from 79813d9 to 7c1d9e7.
The approach sounds good to me. We were planning to move the network/ipc namespace fds out anyway, so we can have pause-less pods for containers.
```go
func (s *sandbox) netNsRemove() error {
	if s.netns == nil {
		return fmt.Errorf("No networking namespace")
```

nit: errors that aren't being logged should start with lowercase (everywhere in this PR).
Force-pushed from 7c1d9e7 to 495c8e1.
Reviewed the code. LGTM.
I see some lint issues in the build that you can reproduce locally.
qemu supports net-device hotplug; does this feature help with your problem?
It seems you add a pre-requirement for the OCI runtime manager; the manager shouldn't ask the runtime to create the netns itself. In the OCI spec, the prestart hook is used to set up networking: https://github.com/opencontainers/runtime-spec/blob/master/config.md#prestart. Could the problem you met be resolved if cri-o used hooks too?
Force-pushed from 495c8e1 to d4d0674.
@mrunalp Strangely enough, I could not reproduce those errors locally... Anyway, this is fixed now.
We thought about that and saw that you guys are using that feature with runV, IIUC. It could help, but we believe this PR provides a better long-term approach for the following reasons:
I'm not sure I fully understand your point here. My interpretation of the Linux namespace OCI specification is:
This PR moves ocid from case 2 to case 3; it's not getting out of spec. Am I missing something?
Not really, because those are called after the container is created.
@gao-feng The OCI spec today does have the provision to make the container join an existing namespace. However, that does not address the scanning issue: the need to scan the network namespace to discover the interfaces, routes, etc., and connect/propagate them to the virtual machine. In an ideal solution, if that information was provided to the runtime (on hook return in the case of the OCI hooks, and at pod creation time in the case of CRI), this would avoid the need for a hypervisor runtime to scan the network namespace in order to discover the network configuration of the pod.
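The scan described above can be illustrated with the standard library alone. In a real runtime the same enumeration would run from inside the pod's namespace (for example within the CNI ns package's `Do` callback); this hedged sketch just inspects the current namespace:

```go
package main

import (
	"fmt"
	"net"
)

// scanInterfaces lists the interfaces visible in the current network
// namespace. A hypervisor runtime would run the same scan inside the
// pod netns to discover the veth ends it must connect to the VM's
// TAP devices.
func scanInterfaces() ([]net.Interface, error) {
	return net.Interfaces()
}

func main() {
	ifaces, err := scanInterfaces()
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	for _, iface := range ifaces {
		fmt.Printf("%s: mtu=%d flags=%s\n", iface.Name, iface.MTU, iface.Flags)
	}
}
```

Routes would need a netlink query on top of this; the point is that the runtime has to actively discover the configuration unless it is handed over explicitly.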
Xen supports NIC hotplug as well; it's hypervisor agnostic indeed.
From the runV perspective:
```go
	}

	if ns.Path == "" {
		return "", fmt.Errorf("empty networking namespace")
```
It seems this breaks the case when the pod uses the host netns.
There were several host network related issues, they should be fixed now.
Thanks for pointing that out.
```go
type sandboxNetNs struct {
	sync.Mutex
	ns     ns.NetNS
	closed bool
```
Could `closed` be restored back when ocid is restarted?
When ocid tries to restore a sandbox, we have 2 possibilities:
1. The sandbox has been stopped and the netns closed.
2. The sandbox is still READY and the netns is still open.

I understand your concern is related to 1, and this is how we handle it with this PR: when trying to get the netns for a restored sandbox, we set the sandbox netns to nil if it was stopped before ocid was stopped:
```go
// We add a netNS only if we can load a permanent one.
// Otherwise, the sandbox will live in the host namespace.
netNsPath, err := configNetNsPath(m)
if err == nil {
	netNS, nsErr := netNsGet(netNsPath)
	// If we can't load the networking namespace
	// because it's closed, we just set the sb netns
	// pointer to nil. Otherwise we return an error.
	if nsErr != nil && nsErr != errSandboxClosedNetNS {
		return nsErr
	}
	sb.netns = netNS
}
```
Having a closed netns is actually equivalent to not having one. Please let me know if that makes sense to you.
Thanks. That makes sense.
LGTM with the approach, just a concern about restoring.
Force-pushed from b60233e to 5e8e754.
Force-pushed from aec684f to 497ad00.
@mrunalp I suppose I will also have to provide a fix for this PR to work with any runC that does not contain this fix. I'm thinking about creating an additional symlink to the persistent networking namespace, although that's slightly on the hackish side.
@sameo Nice! Up to you if you want to create the symlink workaround patch in the short term. I am not too worried until we get to runc 1.0. We could maintain the version that we test with in the cri-o wiki.
Force-pushed from 497ad00 to 28e08d3.
@mrunalp I pushed a commit that creates an additional symlink for working around the runC issue.
@sameo Some tests were failing for me, presumably because of the other issue about annotations? Could you rebase? I will retest.
We will need it for our persistent networking namespace work. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Because they need to prepare the hypervisor networking interfaces and have them match the ones created in the pod networking namespace (typically to bridge TAP and veth interfaces), hypervisor based container runtimes need the sandbox pod networking namespace to be set up before it's created. They can then prepare and start the hypervisor interfaces when creating the pod virtual machine. In order to do so, we need to create per-pod persistent networking namespaces that we pass to the CNI plugin. This patch leverages the CNI ns package to create such namespaces under /var/run/netns, and assign them to all pod containers. The persistent namespace is removed when the pod is either stopped or removed. Since the StopPodSandbox() API can be called multiple times from kubelet, we track the pod networking namespace state (closed or not) so that we don't get a containernetworking/ns package error when calling its Close() routine multiple times as well. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
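The closed-state tracking mentioned in the commit message above can be sketched with a stand-in close function (`closeFn` replaces the real `ns.NetNS.Close`); this is an illustration of the idea, not the actual cri-o code:

```go
package main

import (
	"fmt"
	"sync"
)

// sandboxNetNs tracks whether the namespace was already closed, so
// that repeated StopPodSandbox() calls don't invoke Close() twice on
// the underlying handle (which the ns package reports as an error).
type sandboxNetNs struct {
	sync.Mutex
	closed  bool
	closeFn func() error // stand-in for ns.NetNS.Close
}

func (n *sandboxNetNs) close() error {
	n.Lock()
	defer n.Unlock()
	if n.closed {
		// Already closed: treat the repeated call as a no-op.
		return nil
	}
	if err := n.closeFn(); err != nil {
		return err
	}
	n.closed = true
	return nil
}

func main() {
	calls := 0
	netns := &sandboxNetNs{closeFn: func() error { calls++; return nil }}
	netns.close()
	netns.close() // second stop: no second Close() on the handle
	fmt.Println("underlying Close() calls:", calls)
}
```

The mutex matters because kubelet may issue concurrent stop requests; the flag alone would race.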
In order for hypervisor based container runtimes to be able to fully prepare their pod virtual machines networking interfaces, this patch sets the pod networking namespace before creating the sandbox container. Once the sandbox networking namespace is prepared, the runtime can scan the networking namespace interfaces and build the pod VM matching interfaces (typically TAP interfaces) at pod sandbox creation time. Not doing so means those runtimes would have to rely on all hypervisors to support networking interfaces hotplug. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
With the networking namespace code added, we were reaching a gocyclo complexity of 52. By moving the container creation and starting code path out, we're back to reasonable levels. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to work around a bug introduced with runc commit bc84f833, we create a symbolic link to our permanent networking namespace so that runC realizes that this is not the host namespace. Although this bug is now fixed upstream (see commit f33de5ab4), this patch works with pre-rc3 runC versions. We may want to revert this patch once runC 1.0.0 is released. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Force-pushed from 28e08d3 to 0df8200.
@mrunalp Rebased now.
@sameo Thanks. The tests pass for me now :)
@mrunalp Ah, nice :-) Let me know if there's anything else I should work on for this PR.
LGTM
snap: fix build errors
the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily Signed-off-by: Peter Hunt <pehunt@redhat.com>
This includes:
- change the option ManageNetworkNSLifecycle to ManageNSLifecycle
- add a new set of files, internal/lib/sandbox/namespaces*, that take care of namespace related functionality
- create a generic NamespaceIface interface for interacting with all three kinds of namespaces
- refactor some of runPodSandbox to reduce cyclomatic complexity
- use pinns for managing
- remove symlink methods: the symlink functionality was used as a crutch for a bug in runc (cri-o#237). that bug has since been fixed (runc#1149), so remove this patch as it clutters the code unnecessarily

Signed-off-by: Peter Hunt <pehunt@redhat.com>
This pull request adds networking support for hypervisor based OCI compatible container runtimes.
This has been tested with Clear Containers, and together with the cri-o branch, this PR allows us to fully run CRI-O with KVM based containers.
Problem statement
Hypervisor based container runtimes prepare their virtual machine (kernel arguments, QEMU process, monitoring process) for a pod when receiving the sandbox container creation command.
Part of the VM preparation typically involves bridging the existing networking namespace interfaces with hypervisor supported interfaces (e.g. TAP). There is typically one bridge per veth pair in the networking namespace, where the QEMU TAP interface and the veth peer are bridged together.
This means we need the sandbox networking namespace to be running and configured before CRI-O asks us to create the sandbox container.
With the current code, the CNI plugin sets the networking namespace after creating the sandbox container which means we always end up with dangling veth peers and disconnected containers within any given pod.
Proposed solution
This PR attempts to fix the above problem by creating and configuring the sandbox networking namespace before creating the sandbox container. This is done in 2 steps:
For each pod, create a persistent namespace under
/var/run/netns
. This is done by the CNI ns package. The created namespace is removed and cleaned up at pod stop and removal time.Change the sandbox creation and networking configuration order in the code. We now create the networking namespace, call the default CNI plugin on it, add the networking namespace path to the container OCI configuration, and then create the sandbox container.