Skip to content
This repository has been archived by the owner on Dec 13, 2018. It is now read-only.

A container can join pid namespace of another container. #609

Closed
wants to merge 6 commits into from

Conversation

dqminh
Copy link
Contributor

@dqminh dqminh commented Jun 1, 2015

Fix #604

This supports setting configs.NEWPID to an existing PID namespace for a container's init process. It leverages the same C code infrastructure for execin to clone a new init process with the wanted PID namespace.

Edit: Because of some sharing in code between init and execin, execin also supports joining userns of the init process now. Init process cant join mnt/userns of another init process yet.

@@ -60,6 +60,62 @@ static int clone_parent(jmp_buf * env)
return child;
}

void nsinit_pid()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we combine this logic into nsexec() itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tried that but the end result wasnt as clean. Since spawning an init process with a known pid namespace and exec a new process inside another container are two very different operations , i think that having two functions describes the intent better.

@avagin
Copy link
Contributor

avagin commented Jun 2, 2015

The same hack is required for mount and user namespaces, because a multithreaded process can't change them.
http://man7.org/linux/man-pages/man2/setns.2.html
'''
A process may not be reassociated with a new mount namespace if it is multithreaded.
'''

What do you think about creating a directory with namespaces in container's root directory. We can bind-mount all required namespaces there before starting the container and then bind-mount container's namespace files to use them for executing processes inside CT. In this case we can use the same code of nsexec.c for starting the first and other processes.

@LK4D4
Copy link
Contributor

LK4D4 commented Jun 2, 2015

I agree with @avagin that bind-mounting namespace can be right way to do such stuff, because nsexec.c supposed to be very small layer only for setting namespaces and expanding it can lead us to support pretty big C code in golang runtime, which probably will hit some limitations.
I'm not against merging this though if this feature is needed by someone right now :)

@dqminh
Copy link
Contributor Author

dqminh commented Jun 3, 2015

@LK4D4 There's a few related issues in docker moby/moby#13453 and moby/moby#10163. Overall i think it's better to do this properly (if this is not the best way) since 1.7 is already in RC now so we have time :)

@avagin I dont understand the part about bind-mounting namespaces. Do you have an example for that ? My current understanding of that is that we have a special directory that contains namespace descriptors that we can detect when starting the container and use them to setns with some code from nsexec.c ?

@mrunalp
Copy link
Contributor

mrunalp commented Jun 3, 2015

@dqminh The suggestion is to bind mount /proc/self/ns/* files into some directory for each container (Just like how ip netns add works. It bind mounts net namespace fds under /var/run/netns). After that is done, new processes could use those paths to join the container. This could also be used for joining subset of other container's namespaces.

@avagin could correct me if I am wrong.

@mrunalp
Copy link
Contributor

mrunalp commented Jun 3, 2015

@LK4D4
Copy link
Contributor

LK4D4 commented Jun 3, 2015

Also, you can destroy original namespace, so bind-mounting is like "copy" namespace.

@mrunalp
Copy link
Contributor

mrunalp commented Jun 3, 2015

Right, as long the mounted namespace fds exist, the original container can just die without affecting those that joined those namespace fds.

@mrunalp
Copy link
Contributor

mrunalp commented Jun 3, 2015

Not sure how that will behave will pid namespaces though since pid 1 going down brings down the other processes.

@dqminh
Copy link
Contributor Author

dqminh commented Jun 3, 2015

The suggestion is to bind mount /proc/self/ns/* files into some directory for each container (Just like how ip netns add works

Ah i see. But that would not change the current implementation right, since all libcontainer needs to know is still the paths to the namespace descriptors.

Not sure how that will behave will pid namespaces though since pid 1 going down brings down the other processes.

i think pid 1 going down will kill the rest of the processes, and you cant join using the open file descriptor anymore (http://man7.org/linux/man-pages/man7/pid_namespaces.7.html).

@avagin
Copy link
Contributor

avagin commented Jun 3, 2015

Ah i see. But that would not change the current implementation right, since all libcontainer needs to know is still the paths to the namespace descriptors.

@dqminh I don't understand what you mean. Could you elaborate?

@LK4D4
Copy link
Contributor

LK4D4 commented Jun 3, 2015

@dqminh for main process we use Cloneflags for creating namespaces. New idea is to create namespaces separately and then pass them even for first process.

@avagin
Copy link
Contributor

avagin commented Jun 3, 2015

@LK4D4 @dqminh I don't suggest to create namespaces separately. For example, It's impossible for pidns. For the init process we will mount all existent namespaces there and set clone flags for others.

@LK4D4
Copy link
Contributor

LK4D4 commented Jun 3, 2015

@avagin What point of bind-mounting then? Just to collect them in one directory?
I think then we can pass just space-separated namespace list to nsexec.c where it will be generated from config.Namespaces for init and from state.NamespacePaths for other.

@dqminh
Copy link
Contributor Author

dqminh commented Jun 3, 2015

Right, we cant really seperate the creation of namespaces and processes at least for pid namespace. It always requires an anchor process.

The current way joining namespaces works for init process is that container has config.Namespaces which keys values mapping of namespace to their descriptor path. The value can either be empty ( which then the key is used in CloneFlags ), or points to a valid namespace descriptor ( which we use to join namespace in joinExistingNamespaces ). So from libcontainer, the only requirements for init to join different namespaces is to set the key-value pair in config.Namespaces

This works for net ( the only current usecase afaik ), but doesnt work for pid ( and probably mount/userns as @avagin pointed out ). This patch adds another intermediate clone to put the init process in correct pid namespaces.

My understanding is that even if we bind-mount namespaces, we will still pass the mount destination as value to config.Namespaces and perform setns before exec-ing the target process, so it will not be different from the current way.

I think then we can pass just space-separated namespace list to nsexec.c where it will be generated from config.Namespaces for init and from state.NamespacePaths for other.

Hmm yes, that is another way to do it so we can reuse that blocks of code between init and execin. It also helps when we want to join another namespaces besides pid. Probably joinExistingNamespaces can be removed in that case.

@avagin
Copy link
Contributor

avagin commented Jun 3, 2015

@avagin What point of bind-mounting then? Just to collect them in one directory?

  • collect in one directory
  • be independent from the init process. If we have a container without
    pidns, we will able to enter into it even when the first process died. For
    container with pidns, this approach avoids a problem when the init process
    died and another process is created with the same pid.

I'm not sure that we need to do this. It's just an idea how it can be fixed.

I think the we can pass just space-separated namespace list to nsexec.c
where it will be generated from config.Namespaces for init and from
state.NamespacePaths for other.

Yes, we can.

@dqminh dqminh force-pushed the pidns branch 2 times, most recently from 3d824b5 to 629eca4 Compare June 10, 2015 10:53
@dqminh
Copy link
Contributor Author

dqminh commented Jun 10, 2015

@LK4D4 @avagin i added some code to pass a list of namespaces to be joined to both init process and execin process. So they behave the same now on setns part.

All setns now is done in C layer (joinExistingNamespaces is not necessary)

fd = openat(tfd, namespaces[i], O_RDONLY);
char *ns, *saveptr;
while ((ns = strtok_r(nspaths, ",", &saveptr))) {
fd = open(ns, O_RDONLY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you enter into a mount namespace, you can't be sure that you will able to enter in other namespaces, because nspaths may be inaccessible from this mount namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I think the order of nspaths has to be set from the caller ( which we are not doing now for init, but we do for exec ). wdyt ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a caller can do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. The order of namespaces is now enforced in orderNamespacePaths. I also tried to add NEWUSER there but some tests failed. Added a note on NEWUSER and will try again in another patch only for NEWUSER.

@dqminh
Copy link
Contributor Author

dqminh commented Jun 15, 2015

@avagin I pushed a new patch that unifies the ordering of namespaces in init and exec, and a few more changes. PTAL.

Also i notice that joining mount namespace for init process will not work right now because it's still trying to setupRootfs. I can write a new PR to fix that ( if we want a init process to join mnt namespace of another container ? )

if len(nsMaps) > 0 {
nsPaths, err := orderNamespacePaths(nsMaps)
if err != nil {
return nil, newSystemError(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orderNamepsapcePaths() returns SystemError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

configs.NEWNS,
}
// For now, only join user namespace if this is an exec in process and the
// container supports user namespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have explanation about this restriction.

@avagin
Copy link
Contributor

avagin commented Jun 23, 2015

LGTM (except a few minor comments).

@hqhq
Copy link
Contributor

hqhq commented Jul 8, 2015

@dqminh Can you backport this PR to runc? Thanks.

@dqminh
Copy link
Contributor Author

dqminh commented Jul 8, 2015

@hqhq Thanks for reminding me !. Things are a bit hectic on my side this week, but i should be able get it done by the weekend (or sooner if possible)

An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.  When a shared-pidns container exits,
the original container will keep running.

This removes joinExistingNamespaces in Go layer because all namespaces will be
setns in C layer.

This also changes setns process to requires a list of namespaces to be joined
rather than requires only the pid of init process and deduct the namespace from
it.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
if nsinit specifies `--id` option, use that id as cgroups' name instead of
the directory name. It's helpful where we want to start multiple containers
from the same directory for example.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
- check if namespace is supported by the current kernel
- check that the pathname doesnt contain comma `,`. This is the character we
  used to join paths together and pass it over as a single env variable so we
  can split it out in C layer.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
Doing this helps with ordering of NEWNS, since we dont have to worry about
set mount namespace too early leadings to old /proc becomes inaccessible.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
execin process can join user namespace of the init process.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
execin process can join user namespace of the parent container only when
the init process has joined user namespace explicitly.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
@LK4D4
Copy link
Contributor

LK4D4 commented Jul 14, 2015

ported to runc

@LK4D4 LK4D4 closed this Jul 14, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
6 participants