[proposal] Shim API v2 #2426

Closed
crosbymichael opened this issue Jun 27, 2018 · 60 comments · Fixed by #2434
@crosbymichael
Member

crosbymichael commented Jun 27, 2018

Shim API for Runtimes

Authors:

VM-based runtimes have more internal state and more abstract actions.
A CLI approach introduces issues with state management.
This proposal introduces a shim API for solving these state issues at the shim layer in containerd.
The goal is to provide an API that various runtimes can implement to add support in containerd while
still having control of state and abstract actions.

v2 Task (shim) API

Previous Shim API

syntax = "proto3";

package containerd.runtime.linux.shim.v1;

service Shim {
	rpc State(StateRequest) returns (StateResponse);
	rpc Create(CreateTaskRequest) returns (CreateTaskResponse);
	rpc Start(StartRequest) returns (StartResponse);
	rpc Delete(google.protobuf.Empty) returns (DeleteResponse);
	rpc DeleteProcess(DeleteProcessRequest) returns (DeleteResponse);
	rpc ListPids(ListPidsRequest) returns (ListPidsResponse);
	rpc Pause(google.protobuf.Empty) returns (google.protobuf.Empty);
	rpc Resume(google.protobuf.Empty) returns (google.protobuf.Empty);
	rpc Checkpoint(CheckpointTaskRequest) returns (google.protobuf.Empty);
	rpc Kill(KillRequest) returns (google.protobuf.Empty);
	rpc Exec(ExecProcessRequest) returns (google.protobuf.Empty);
	rpc ResizePty(ResizePtyRequest) returns (google.protobuf.Empty);
	rpc CloseIO(CloseIORequest) returns (google.protobuf.Empty);
	rpc ShimInfo(google.protobuf.Empty) returns (ShimInfoResponse);
	rpc Update(UpdateTaskRequest) returns (google.protobuf.Empty);
	rpc Wait(WaitRequest) returns (WaitResponse);
}

New API

syntax = "proto3";

package containerd.task.v2;

service Task {
        rpc State(StateRequest) returns (StateResponse);
        rpc Create(CreateTaskRequest) returns (CreateTaskResponse);
        rpc Start(StartRequest) returns (StartResponse);
        rpc Delete(DeleteRequest) returns (DeleteResponse);
        rpc Pids(PidsRequest) returns (PidsResponse);
        rpc Pause(google.protobuf.Empty) returns (google.protobuf.Empty);
        rpc Resume(google.protobuf.Empty) returns (google.protobuf.Empty);
        rpc Checkpoint(CheckpointTaskRequest) returns (google.protobuf.Empty);
        rpc Kill(KillRequest) returns (google.protobuf.Empty);
        rpc Exec(ExecProcessRequest) returns (google.protobuf.Empty);
        rpc ResizePty(ResizePtyRequest) returns (google.protobuf.Empty);
        rpc CloseIO(CloseIORequest) returns (google.protobuf.Empty);
        rpc Update(UpdateTaskRequest) returns (google.protobuf.Empty);
        rpc Wait(WaitRequest) returns (WaitResponse);
        rpc Stats(StatsRequest) returns (StatsResponse);
}

Shim Inputs

Bundle

The OCI bundle is still the main source of configuration for shims.
The shim should not write to any location on disk other than the bundle.
The bundle can be used as a workspace for the shim, holding any additional state.

├── io.containerd.runtime.v2
│   └── default
│       └── redis
│           ├── config.json
│           └── rootfs/

Configuration

Configuration for shims can be passed via Opts or via defaults defined in
containerd's /etc/containerd/config.toml.

Shim Outputs

gRPC

The shim gRPC service is the main interface for interacting with the shim.
The shim is also expected to write a shim.pid file for containerd to read in case
containerd can no longer reach the shim via the gRPC API.
This pid is used to SIGKILL the shim in case of a forceful shutdown.
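
For illustration, here is a minimal sketch of a shim writing its shim.pid file into the bundle directory on startup. This is a sketch under the proposal above, not the actual containerd implementation:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writePidFile records the shim's own pid inside the bundle so containerd
// can still SIGKILL the process if the RPC endpoint stops responding.
// The "shim.pid" name and location follow the proposal above.
func writePidFile(bundle string) error {
	path := filepath.Join(bundle, "shim.pid")
	return os.WriteFile(path, []byte(fmt.Sprintf("%d\n", os.Getpid())), 0644)
}

func main() {
	bundle := "." // in this sketch the bundle path is passed as the first argument
	if len(os.Args) > 1 {
		bundle = os.Args[1]
	}
	if err := writePidFile(bundle); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// ... serve the Task API here ...
}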

UX

> ctr run --runtime io.containerd.runtime.v2.process
> ctr run --runtime io.containerd.runtime.v2.gvisor
> ctr run --runtime io.containerd.runtime.v2.kata
> ctr run --runtime io.containerd.runtime.v2.oci

The existing runtime will continue to work for upgrades where containers are running under v1 shims.

@jstarks
Contributor

jstarks commented Jun 27, 2018

Right now I think the shim uses ttrpc. Would you expect this to move to standard gRPC?

@Random-Liu
Member

Random-Liu commented Jun 27, 2018

LGTM overall other than 3 questions:

  1. Should we define the events behavior? E.g. is publishing TaskExit required?
  2. For K8s, rpc Stats(StatsRequest) returns (StatsResponse) will be called every 10 seconds for each container. For 100 containers, I'm not sure whether the rpc will introduce significant memory overhead. I'm less concerned about the stats collection itself, because there is not much difference from today. We can start with this and see whether there is a significant regression. If there is, we may need to revisit this api and find a more efficient way to expose stats.
  3. I found that there is another case where containerd may talk to runc directly https://github.com/containerd/containerd/blob/master/runtime/linux/runtime.go#L471. I think the cleanup is still required. Is it possible to abstract it away? Or do we decide to live with it and make sure all runtimes at least implement runc delete?

@crosbymichael
Member Author

@jstarks we talked about this at our DockerCon meetup. We want to support both ttrpc and grpc. One thing that we need to do on the Windows side is get ttrpc working over named pipes, but this should be pretty effortless.

@crosbymichael
Member Author

crosbymichael commented Jun 27, 2018

@Random-Liu

  1. Ya, we'll keep updating this proposal with more details as we hit them. There are only 1-2 events that the shim publishes today; I'll document all of them, or change this if we find a better way.
  2. Doing work costs CPU/memory, plus we have no choice, as some VM-based runtimes will not have cgroups.
  3. We are working on abstracting everything into the shim API. There will be no split of responsibility between the shim and the daemon.

@jstarks
Contributor

jstarks commented Jun 27, 2018

@crosbymichael @jterry75 already has named pipes working with ttrpc. I think he can prepare a PR soon.

@crosbymichael
Member Author

@jstarks perfect!

@jterry75
Contributor

I like the proposal. A few questions:

  1. Can we remove the use of unix signals, since they don't make sense when running these shims on Windows? Today the shim uses SIGCHLD, SIGTERM, SIGINT, SIGUSR1. It would be better if we could call a gRPC method that is intended to "signal" the shim itself.
  2. Today the linux runtime is responsible for unmounting the rootfs, for example in terminate. It will be hard to have a generic runtime above this shim interface with assumptions like this. For example, on Windows we won't have a rootfs on the host at all, even though we do have the bundle path. No mounting/unmounting under the bundle path should take place above the shim API.

Thanks!

@crosbymichael
Member Author

  1. If you are referring to signaling the shim itself, the only time we use that is SIGKILL when it's not working, or other debug handlers. I think for debugging we can add some debug APIs, and @ehazlett and I already chatted about this some. As for removing signals altogether, we will still need a way to nuke the shim if we cannot talk to a broken shim over gRPC.
  2. Ya, we are trying to abstract out anything that the runtime code in the containerd daemon is doing and move that to the shim. terminate is a tricky one because on Linux, we still have to clean up the rootfs. We will have to work on a clean solution for this.

@crosbymichael
Member Author

We have an ongoing branch for this development here:
https://github.com/crosbymichael/containerd/tree/shimv2

@crosbymichael
Member Author

One thing we have to figure out is how to handle OOM events. Should we use the events API for this or a gRPC streaming endpoint for OOM events? @Random-Liu @dmcgowan ?
Maybe:

service Task {
    rpc OOM(OOMRequest) returns (stream OOMResponse);
}

@Random-Liu
Member

@crosbymichael It seems that ttrpc doesn't support streaming yet. https://github.com/containerd/ttrpc/blob/fa6c68143601db58b3636ee9948aad2fe08ed1ea/services.go#L36

The current event api + containerd binary approach works for me as well. At least we understand its performance better. :)
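
For context, here is a rough sketch of the "containerd binary" publishing path mentioned above, where the shim forks the containerd binary and feeds it the encoded event on stdin. The subcommand and flag names are assumptions for illustration, not a confirmed CLI:

// Package shim sketches how a shim could publish an event by exec'ing the
// containerd binary, similar in spirit to how the v1 shim reports exit events.
package shim

import (
	"bytes"
	"context"
	"os/exec"
)

// publishEvent hands an already-encoded event payload to containerd.
// The "publish" subcommand and its flags are assumptions for illustration.
func publishEvent(ctx context.Context, address, namespace, topic string, payload []byte) error {
	cmd := exec.CommandContext(ctx, "containerd",
		"--address", address,
		"publish",
		"--topic", topic,
		"--namespace", namespace,
	)
	cmd.Stdin = bytes.NewReader(payload) // e.g. a marshaled TaskExit event
	return cmd.Run()
}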

@dmcgowan
Member

@crosbymichael how many different event types are there?

I think streaming is always better. Does the callback end up connecting to containerd through gRPC and sending the event? I wonder how expensive implementing streaming is... any ideas @stevvooe ?

@crosbymichael
Member Author

@dmcgowan the shim only emits exit events right now; OOM would be the second type the shim emits

@crosbymichael
Member Author

@jterry75 for your second question the other day, I just worked on a solution for it: crosbymichael@5c678b6#diff-cd9adfde13a79bd4e2b11470dbabf55fR36

Basically we need the shim binary to have two commands: one that serves the RPC API and one that can clean up/delete when a previous shim has died. This way the runtime can unmount or clean up any internal state from a crashed shim.
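
As a sketch of that idea (the subcommand names and helpers here are hypothetical, not the final containerd-shim interface):

package main

import (
	"fmt"
	"os"
)

func main() {
	// Default invocation: serve the Task API for a live container.
	// "delete" invocation: run by containerd to clean up after a shim that
	// died before it could delete its container.
	if len(os.Args) > 2 && os.Args[1] == "delete" {
		bundle := os.Args[2] // bundle path of the dead shim, in this sketch
		if err := cleanupBundle(bundle); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		return
	}
	serveTaskAPI()
}

// cleanupBundle would unmount the rootfs and remove any runtime state left
// behind by a crashed shim (hypothetical helper).
func cleanupBundle(bundle string) error { return nil }

// serveTaskAPI would register the Task service and serve it over a ttrpc or
// gRPC socket (hypothetical helper).
func serveTaskAPI() {}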

@jterry75
Contributor

jterry75 commented Jul 2, 2018

@crosbymichael - Yes, this is exactly what I was thinking: shift the responsibility of partial cleanup one layer lower, to the shim.

As for my earlier question about signals: it's not just debugging, see here. Today my port of containerd-shim to Windows only handles the debugging case via events, the way the Docker daemon does it, but I would like debugging/reaping, etc. to be done via API calls rather than signals in all of these cases.

-Justin

@sboeuf

sboeuf commented Jul 4, 2018

Hi @crosbymichael, I have a few questions to clarify how this API is going to be used.

  1. If we think about the runc case, does that mean we need to implement a daemon running a gRPC server, and this server would implement the Task API? If yes, I guess this daemon would leverage libcontainer directly, right?
  2. Don't you think it might be nice to introduce the concept of a pod at this API level? This would allow for clearer integration with the upper stack.
  3. How are the standard inputs/outputs handled? Do we still need a containerd-shim process?
  4. What do we expect from Pids()? The list of process IDs from the container perspective or from the host perspective?

I might be completely off topic here, let me know if that's the case :)

@Random-Liu
Member

Random-Liu commented Jul 4, 2018

@sboeuf The new Task API is just the new shim API, if I understand correctly. We can implement different shims for Linux containers, Windows, Kata Containers and gVisor.

  1. If we think about the runc case, does that mean we need to implement a daemon running a gRPC server, and this server would implement the Task API?

You need to implement your own containerd-shim, which is a per container process.

If that is a daemon, we haven't thought about it yet /cc @crosbymichael.

If yes, I guess this daemon would leverage libcontainer directly, right?

It doesn't matter. As long as the new shim api is implemented, containerd should be able to use it. In theory, it is possible to use libcontainer directly to implement the shim, but I guess for linux container, using runc is still a better solution because it is very stable and widely used now.

  1. Don't you think it might be nice to introduce the concept of a pod at this API level? This would allow for clearer integration with the upper stack.

That would be ideal, but I'm not sure whether we can do that now. OCI runtime spec is still the standard runtime config, and there is no pod concept there yet. Unless we introduce another pod level spec in containerd.

  1. How are the standard inputs/outputs handled? Do we still need a containerd-shim process?

You implement the new shim api, so you can do input/output however you want as long as you expose required pipes on the host.

  1. What do we expect from Pids()? The list of process IDs from the container perspective or from the host perspective?

It doesn't matter. As long as you keep it consistent across the shim api implementation: always internal or always host. Containerd and containerd/cri will never assume the pid exists on the host.

@sboeuf

sboeuf commented Jul 5, 2018

@Random-Liu thanks for those answers, I think they clarify a lot. Here are more questions based on your replies:

You need to implement your own containerd-shim, which is a per container process.

Now I understand, but I am wondering how this would work with Kata Containers, for instance. This proposal tries to address the issue of VM-based container runtimes, and it would work perfectly in a simple case like Docker, where we would get only one container per VM (per pod).
But in the case of CRI containerd calling into containerd, then calling into the Kata Containers Task API, that would not fit properly. CRI containerd would provide some annotation to specify that we need to run a container as a pod POD1; this would call into containerd, which would call into the Kata runtime shim implementation process. This would start the VM after initializing all the handlers it needs.
Now, let's say the CRI asks for a new container. CRI containerd will ask for this container with an annotation saying this should be a regular container, part of the existing pod POD1, and eventually will call into containerd. The problem is, containerd has no notion of a pod here, and it would start a new Kata runtime shim implementation process, which has no way to access the data initialized by the other process (the one which started the VM/pod). It could if it looked into the filesystem for an existing pod, but we would lose the whole benefit of this proposal. And more than that, we would have one containerd-shim process per container, where we could have one per pod, which would save some memory footprint on the host.

It doesn't matter. As long as the new shim api is implemented, containerd should be able to use it. In theory, it is possible to use libcontainer directly to implement the shim, but I guess for linux container, using runc is still a better solution because it is very stable and widely used now.

Yes I understand, particularly because there would be no direct benefit for the runc case.

You implement the new shim api, so you can do input/output however you want as long as you expose required pipes on the host.

Where do they come from? How do you know which pipes the container should expose?

It doesn't matter. As long as you keep it consistent across the shim api implementation: always internal or always host. Containerd and containerd/cri will never assume the pid exists on the host.

Ok, if the consumer of this API will not assume the PIDs exist on the host, we should definitely provide the PIDs as they're seen by the container. I am saying this because I guess that Pids() will be used to show what's inside the container, like a ps, right?

As a side note, @crosbymichael is presenting to the Kata Architecture Committee meeting on Monday, July 9th at 8 AM PST. Feel free to join if you want to talk with the Kata community about it :)

@cpuguy83
Member

cpuguy83 commented Jul 5, 2018

Forgive my ignorance here, but couldn't this be handled by ExecProcess which (I think?) can have an envelope describing things like the image to run the process with or other options? (just spitballing).

In the API definition, ExecProcess just uses a google.protobuf.Any for the process spec, so conceivably this can pass whatever is necessary.

@sboeuf

sboeuf commented Jul 9, 2018

@cpuguy83 ExecProcess handles a different use case. It starts a new process in an existing container; therefore, there is no such thing as an image description, since the container is already running.

And even if we were able to pass any kind of information through ExecProcess, we would need containerd to be aware of the difference between pods and containers.

@cpuguy83
Member

cpuguy83 commented Jul 9, 2018

You haven't explained why containerd needs to be pod aware. In my mind saying containerd needs to be pod aware is akin to saying Linux needs to be pod aware.

A pod is a scheduling concept to ensure two workloads land on the same node and have certain APIs between each member of a pod (primarily 127.0.0.1 being the same in each member of a pod).

cri(-containerd) is pod aware and can make use of everything containerd has to offer (and then some), which is really way more than what Docker offers even.

Or is the issue more that this is a failure of the OCI spec to abstract more than a traditional linux container?

@sboeuf

sboeuf commented Jul 9, 2018

@cpuguy83 Ok, let me try to be clearer (I hope I will be :)).
CRI-containerd is pod aware, yes. And it passes some annotations to the OCI runtime (kata-runtime for instance) in order to let it know what kind of container it is (regular container or pod), the difference being whether to create a VM (case of a pod) or a container inside an existing VM (case of a regular container). Based on this, you can see that the highest unit of granularity for a VM-based runtime is the pod, not the container.

Now, if we take a look at the proposal, which is to avoid having state stored in the case of VM-based container runtimes, we would need to provide a pod-level shim API. Otherwise, we would end up (from containerd) starting shims at the container level, but we could not share knowledge between containers of the same pod, because we would have several shim entities running.

@sboeuf

sboeuf commented Jul 9, 2018

@crosbymichael to follow up after the discussion during the Kata Containers architecture committee meeting, here are a few thoughts:

What would it take to make containerd "pod aware"?
I think the important thing here would be to make sure that by default, if nothing is specified, containerd behaves exactly the way it does right now. But if it finds some flags/annotations specified by the upper layer (CRI-containerd), it would have this knowledge about pod vs. container.

How should things be handled differently?
If the containerd layer detects a pod case, it would start a first shim to handle the whole pod. When a container (tagged as a regular container) is to be created, and it is tagged as running inside a pod, containerd should simply reuse the existing shim instead of starting a new one. The existing shim implementation would then decide, based on the annotations it already knows about, which action to take at pod or container granularity.

I am suggesting this because it would make a lot of sense for a VM-based shim implementation to handle everything at the pod level (which is the VM level), and I truly believe this would not require modifying containerd code too much.
Also, I understand there might be some constraints/blockers from your perspective, and I'd like to hear about them, in order to see if we could reach a point where it fits all use cases properly :)

@Random-Liu
Member

Random-Liu commented Jul 10, 2018

Where do they come from? How do you know which pipes the container should expose?

@sboeuf It's part of the shim api https://github.com/containerd/containerd/blob/master/runtime/shim/v1/shim.proto#L60-L62.

And more than that, we would have one containerd-shim process per container, where we could have one per pod, which would save some memory footprint on the host.

@sboeuf From what I can see, the shim API is pretty generic; every request has an id associated with it. That means that from the API perspective, it doesn't matter whether there is a daemon, a per-container shim, or a per-pod shim behind it.

The only thing that matters is how the shim is started, stopped and cleaned up, which we need to resolve for shim v2 api anyway. A simple idea is that we can define a protocol to start/stop the shim process, e.g.

  1. containerd-shim start [OCI Spec] should return the shim socket and an id.
  2. containerd-shim stop [ID] should stop/cleanup the shim process. This also solves the cleanup issue.

Then for the 3 scenarios above:

  1. Daemon: containerd-shim start just returns the daemon socket. containerd-shim stop does nothing.
  2. Pod-shim: containerd-shim start starts the shim if it is a sandbox, and reuses the existing shim if it is a container. (It only needs to checkpoint the sandbox id; all other state lives in the shim process.)
  3. Container-shim: Does what it does today.

In this way, the logic doesn't need to reside in containerd, everything can be done in the shim binary. And it provides enough flexibility for people to innovate. :)
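
To make scenario 2 concrete, here is a rough sketch of what a pod-level shim's start command could do. The annotation key and helper functions are assumptions for illustration, not containerd or Kata code:

package main

import "fmt"

// startShim decides whether to launch a new shim or reuse an existing one.
// A container carrying a sandbox-id annotation that points at another id is
// treated as a regular container inside an existing pod, so the address of
// the sandbox's shim is returned instead of starting a second shim.
func startShim(id string, annotations map[string]string) (string, error) {
	if sandboxID, ok := annotations["sandbox-id"]; ok && sandboxID != id {
		return lookupShimAddress(sandboxID) // reuse the pod's shim
	}
	return launchNewShim(id) // sandbox or standalone container: new shim
}

// lookupShimAddress would read the address recorded when the sandbox shim
// started (hypothetical helper).
func lookupShimAddress(sandboxID string) (string, error) {
	return fmt.Sprintf("unix:///run/shims/%s.sock", sandboxID), nil
}

// launchNewShim would fork/exec the shim binary and return its listening
// socket (hypothetical helper).
func launchNewShim(id string) (string, error) {
	return fmt.Sprintf("unix:///run/shims/%s.sock", id), nil
}

func main() {
	addr, _ := startShim("redis", map[string]string{"sandbox-id": "pod1"})
	fmt.Println("connect to", addr) // connect to unix:///run/shims/pod1.sock
}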

@crosbymichael Do you see any problem with this proposal?

@sboeuf

sboeuf commented Jul 10, 2018

@Random-Liu the scenarios you're describing look perfectly good to me! But I think containerd still needs to be aware at some point, otherwise how can it decide to start (or not) a new shim?
The shim does not start itself, meaning that containerd must have this information. But I agree that once the shim has been started, it's the shim implementation's responsibility to know what to do based on annotations. That's why I was mentioning a very limited impact on containerd.
Do you think we can completely get rid of the notion of pod from containerd's perspective?

@Random-Liu
Member

Random-Liu commented Jul 10, 2018

But I think containerd still needs to be aware at some point, otherwise how can it decide to start (or not) a new shim?

It is decided in the containerd-shim binary, in the containerd-shim start command. As for how containerd-shim start is implemented, it doesn't matter to containerd. (Think about how runc create vs. kata create works with containerd)

Do you think we can completely get rid of the notion of pod from containerd's perspective?

We are not getting rid of the pod notion from containerd. We are just trying to contain the pod notion in plugins, and actually other containerd features should also be contained in plugins.
It would be good if we can contain the pod notion in the cri plugin and specific shim "plugins". However, if it turns out to be impossible, we may need to come back and revisit whether we should add the pod notion into the containerd core. :P

@sboeuf

sboeuf commented Jul 10, 2018

Sorry, by "getting rid of", I meant "pod agnostic". I agree it's better if we can keep containerd pod agnostic.

I think I am misunderstanding something here, and that's why I need some clarification about the sequence. With this new API, what happens when containerd creates a new container? How does it start talking to the gRPC server implemented by containerd-kata-shim?
Does it spawn the binary first? Or does it expect to find the gRPC server already running somewhere?

@Random-Liu
Member

Random-Liu commented Jul 10, 2018

I don't think we need a stop command because the new API already has a Shutdown rpc for doing this.

I think the previous cleanup logic is to call runc delete when containerd-shim exits unexpectedly (See cleanupAfterDeadShim), which I think is very useful.

I believe Shutdown rpc still assumes the shim is running, which doesn't work in this case.

@crosbymichael
Member Author

@Random-Liu ya, we already have a shim delete command in v2 to replace that.

@sboeuf

sboeuf commented Jul 10, 2018

@Random-Liu ah yes, good point, this would be an extra safety measure.

@sboeuf

sboeuf commented Jul 10, 2018

@crosbymichael what's the full picture of this v2?

  • shim start
  • shim delete
  • rpc calls for the rest of the lifecycle management

Is that it, or am I missing some parts?

@crosbymichael
Member Author

@sboeuf yes, that is it

@sboeuf

sboeuf commented Jul 10, 2018

@crosbymichael thanks for the clarification 👍

@Random-Liu
Member

@Random-Liu ya, we already have a shim delete command in v2 to replace that.

Yeah, that is good enough. I didn't know we already had a delete command; that's just what I meant. :)

@crosbymichael
Member Author

@Random-Liu @sboeuf please take a look at this commit:

3c3b72c

@sboeuf

sboeuf commented Jul 11, 2018

@crosbymichael will do !

@sboeuf

sboeuf commented Jul 11, 2018

@crosbymichael 2 comments, but LGTM

@Random-Liu
Member

/cc @resouer

@bergwolf
Contributor

@Random-Liu @sboeuf @crosbymichael With this containerd-shim start/stop design, it seems that the shim v2 gRPC requests should also be extended to indicate which container each request targets, so that we can use the same gRPC server to handle different containers of the same sandbox in the kata-shim. WDYT?

Or do you expect the kata-shim to start a new gRPC server with a new gRPC address for each container created?

@Random-Liu
Member

Random-Liu commented Jul 12, 2018 via email

@bergwolf
Contributor

@Random-Liu hmm, it already has a container id and an exec id in the exec request. The design looks good to me. Thanks!

BTW, one implication of the process-related requests is that exec_id has to be unique within a sandbox, because we do not pass a container ID along with it. I checked the containerd code and found that it is a random string without a collision check, but I guess the chance of collision is so small that we can safely ignore it (at least for now).

@crosbymichael
Member Author

It's up to the runtime to error if there is a collision with an exec_id, especially with this model going forward. Containerd keeps very minimal state about running containers and it's up to the shims to hold and validate state on the system.
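
As an illustration of that split of responsibility, here is a minimal sketch of a shim-side exec registry that fails on a duplicate exec_id (hypothetical types, not containerd code):

package main

import (
	"fmt"
	"sync"
)

// execRegistry tracks exec ids per container so the shim itself can validate
// state and reject collisions, as described above (illustrative types only).
type execRegistry struct {
	mu    sync.Mutex
	execs map[string]map[string]struct{} // container id -> set of exec ids
}

func (r *execRegistry) add(containerID, execID string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.execs[containerID][execID]; ok {
		// Collision: fail the Exec request rather than silently reuse the id.
		return fmt.Errorf("exec id %q already exists in container %q", execID, containerID)
	}
	if r.execs == nil {
		r.execs = make(map[string]map[string]struct{})
	}
	if r.execs[containerID] == nil {
		r.execs[containerID] = make(map[string]struct{})
	}
	r.execs[containerID][execID] = struct{}{}
	return nil
}

func main() {
	var r execRegistry
	fmt.Println(r.add("redis", "healthcheck")) // <nil>
	fmt.Println(r.add("redis", "healthcheck")) // collision error
}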

@crosbymichael
Member Author

@Random-Liu I just added any missing IDs and the container id on Exec in my last commit on my branch

@sboeuf

sboeuf commented Jul 12, 2018

As long as we get the container_id through Exec() (which seems to be the case), the exec_id will be unique per container, but does not need to be unique per sandbox. And I agree it is the shim implementation's responsibility to enforce this.

@bergwolf
Contributor

Containerd keeps very minimal state about running containers and it's up to the shims to hold and validate state on the system.

Then would it make sense to let the shim create and return an exec id instead? Because in the current API, by the time a shim finds a collision, it can do nothing but fail the operation, and containerd would have to retry with a different exec_id, because the exec_id is created by containerd rather than passed in by the upper layers. It's just a minor issue though; probably not worth the engineering effort since the chance of collision is negligible.

@sboeuf Besides the Exec() operation, there are other process APIs like ResizePty(), CloseIO(), etc., and they do not take a container_id parameter. That's why I think exec_id needs to be unique within a sandbox.

@crosbymichael
Member Author

exec_ids come from the user, not containerd, so you have to fail if they collide. Users usually have this figured out, as the ids are for specific things: health checks, or random ids. It's not up to us.

@bergwolf I'll have to update the methods that work on exec'd processes to take both.

@sboeuf

sboeuf commented Jul 12, 2018

@bergwolf ah yes good point, they're tied to a process, so it'd be cool to have the container_id for those too.

@bergwolf
Contributor

@crosbymichael Ah, you are right. I was looking at the cri plugin code actually and exec_id is generated as a random string there.

@crosbymichael
Member Author

OK, I updated my PR, fixing the last RPCs with exec and container ids

@cpuguy83
Member

This looks great. Thanks for explaining the issue @sboeuf!
