Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

worker: add Destroy functionality #4784

Merged
merged 9 commits into from Dec 12, 2019
Merged

Conversation

cirocosta
Copy link
Member

@cirocosta cirocosta commented Nov 18, 2019

Why do we need this PR?

As part of #4783, this PR adds the
functionality required to get garden's Destroy working with containerd.

Changes proposed in this pull request

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • PR acceptance performed
  • Documentation reviewed
  • Release notes reviewed
  • New config flags added?

worker/backend/libcontainerd/client.go Show resolved Hide resolved
worker/backend/backend.go Show resolved Hide resolved
worker/backend/backend.go Show resolved Hide resolved
@cirocosta cirocosta changed the title worker: add initial Destroy functionality worker: add Destroy functionality Nov 18, 2019
@deniseyu
Copy link
Contributor

deniseyu commented Nov 26, 2019

End of Tuesday notes:

  • We learned that in order to destroy a container with a Task inside it, we first have to shut down the Task. This is something that Guardian did for us behind the scenes, but with containerd we will have to do it ourselves. To minimize surprise, it would be good to bring across the same behaviours (e.g. which termination signal is used, how long are the timeouts, etc). Here is an example of how a container must be gracefully shut down using containerd's API: https://containerd.io/docs/getting-started/

  • The task shutdown API calls are asynchronous, because containerd daemon can take any length of time to actually remove the container. To handle this we'll need to Wait on an exit code or some other signal.

  • There is a Container.Stop method in the Garden API which we will eventually want to connect up to the task shutdown logic that we're currently implementing inside the Destroy. We'll also want to refactor that out into its own method.

  • For tomorrow: let's experiment and see if we can bring up a Container without any task defined. Containerd blocks deletion when there's a running task, but what if there's no task at all?? Update for academic curiosity we did this, and found that it is indeed possible to create a container with no task running, which is also subsequently deletable without having to do any task operations

@deniseyu
Copy link
Contributor

Here's a useful script to destroy all containers across all containerd namespaces:

https://gist.github.com/deniseyu/55c573026a83e3e1b0011201bcf7b3fc

💥

@deniseyu
Copy link
Contributor

deniseyu commented Nov 28, 2019

End of Thursday notes (we should probably rename our branch 😭)

  • We forgot that before deleting a container, we have to call Task.Delete. I've added that to the case <- ctx.Done() branch of the Task.Kill logic.

  • That broke integration tests, which should now be fixed (NOTE: you have to un-comment out the Task.Delete in case <- exitStatus -- the real integration tests run fast and don't hit the timeout. It's just commented out now to remind Next Week Denise to add unit tests). We are very close to having the lifecycle of a container implemented!

  • I made a failed attempt to unit-test channels. HALP ciro

Still need to do:

  • Properly test the non-timeout branch of Task.Kill. The rest of the unit tests currently pass, but only because the context timeout is being triggered after I hacked some faked methods to return a channel that will never return -- this isn't the normal path and should probably not be the default behavior for "happy path" tests.

  • Think about parameterizing the context cancellation deadline. It's a pain to test a 10s timeout (unless anyone knows a Go equivalent for timecop?)
    Ciro mentioned that perhaps the namespace can hold onto it - still needs some investigation.

  • We may want to dig deeper into exit codes, and what non-zero exit codes might imply. But maybe we don't care and just trust that things are dandy if no error is returned from containerd.

  • Consider abstracting the Task.Wait + Task.Kill + Task.delete logic into a named method that can become our implementation of Garden's Container.Stop interface. Open to debate on what philosophically constitutes a Container.Stop in the containerd world though

@deniseyu deniseyu marked this pull request as ready for review December 5, 2019 19:02
@deniseyu deniseyu requested a review from a team December 5, 2019 19:02
@deniseyu
Copy link
Contributor

deniseyu commented Dec 5, 2019

Going to split out Lookup into a different task, because writing that integration test would be a good opportunity to build out Handle(). We also set up a Dockerfile for running containerd integration tests which will be a separate PR.

@deniseyu deniseyu force-pushed the containerd-monday branch 2 times, most recently from d132766 to f65bb43 Compare December 5, 2019 19:16
@deniseyu
Copy link
Contributor

deniseyu commented Dec 5, 2019

End of Thursday notes:

@pivotal-bin-ju and I finished building out the local integration test tooling and we're able to run and develop the tests in Docker now, woo. We thought about adding a test for Lookup, but decided that since Destroy implicitly calls Lookup, for now there's not much value in adding an integration test for it - this may change if we want to do more interesting things as part of Lookup.

Next week we'll work on filling out the rest of the Garden client and container and process interfaces! Maybe once we get the process stuff working (can refer to Alex's spike branch) we'll be able to see a basic task running in the UI.

Ciro S. Costa and others added 8 commits December 10, 2019 14:14
- still need to actually invoke containerd's Destroy and add integration
tests
- add typed validation error for missing inputs

Signed-off-by: Denise Yu <dyu@pivotal.io>
Signed-off-by: Denise Yu <dyu@pivotal.io>
- backend: lookup calls client's `GetContainer`, which wraps
  containerd's LoadContainer

ps.: this functionality still lacks an integration test.

Signed-off-by: Denise Yu <dyu@pivotal.io>
Co-authored-by: Ciro S. Costa <cscosta@pivotal.io>
need to push in order to test integration tests on linux

SQUASH ME LATER

Signed-off-by: Denise Yu <dyu@pivotal.io>
Signed-off-by: Denise Yu <dyu@pivotal.io>
- add task fakes
- start implementing finding & killing task, a prereq
  for removing the container

Signed-off-by: Denise Yu <dyu@pivotal.io>
- This enables us to test the "happy path" where a task
  can be deleted before the context timeout triggers. This
  means that containerd was able to gracefully shut down
  a running task with SIGTERM.

Signed-off-by: Bin Ju <bju@pivotal.io>
Reset task kill timeout to 10s

Signed-off-by: Bin Ju <bju@pivotal.io>
Signed-off-by: Denise Yu <dyu@pivotal.io>
worker/backend/backend.go Show resolved Hide resolved
worker/backend/backend.go Outdated Show resolved Hide resolved
return InputValidationError{}
}

container, err := b.client.GetContainer(ctx, handle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk what containerd does to lookup containers but I feel like we should include a timeout for this context as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, that's true - we can add this in the next PR where the client timeout is configurable, we'll just decorate every context at the start of each method with the timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added it as a to-do item here: #4783 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task runs don't have timeouts, but pretty much every other operation does :)

worker/backend/backend.go Show resolved Hide resolved
Signed-off-by: Bin Ju <bju@pivotal.io>
_, err = task.Delete(ctx) // todo: we're swallowing exitcodes in both these forks, do we care?
return err
case <-ctx.Done():
err = task.Kill(ctx, syscall.SIGKILL) // should return GRPC DeadlineExceeded error type, wrapped up
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ctx has already expired. So task.Kill may or may not be a noop. Does it make more sense to have another ctx?


select {
case <-exitStatus:
_, err = task.Delete(ctx) // todo: we're swallowing exitcodes in both these forks, do we care?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an edge case where the task ran to completion prior to being killed and that result gets swallowed currently.

@@ -148,7 +216,20 @@ func (b *Backend) BulkMetrics(handles []string) (metrics map[string]garden.Conta
//
// Errors:
// * Container not found.
func (b *Backend) Lookup(handle string) (container garden.Container, err error) { return }
func (b *Backend) Lookup(handle string) (garden.Container, error) {
ctx := namespaces.WithNamespace(context.Background(), b.namespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for this to be a context.TODO as we intend it to be replaced ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

}

return b.client.Destroy(ctx, handle)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about to refactor the logic as below (move killTasks into else branch, to have less return route path):

	task, err := container.Task(ctxWithTimeout, nil)
	
	if err != nil { //error occurs
		if !errdefs.IsNotFound(err) { //a real error but the task can not be found
			return ClientError { InnerError: err }
		}
                // do nothing if we could not find the task
	} else {
               // kill the task
		err = killTasks(ctxWithTimeout, task)
		if err != nil {
			return ClientError{ InnerError: err }
		}
	}
        // no matter the task exists or is killed, the container should be destroyed anyway. 
	err = b.client.Destroy(ctxWithTimeout, handle)
	if err != nil {
		return ClientError{ InnerError: err }
	}

const maxTaskKillWaitTime = 10 * time.Second

if handle == "" {
return InputValidationError{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better if the handle check is at the first line of the function.?

}

select {
case <-exitStatus:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should not ignore the exitCode. From the comment of the source code:

// ExitStatus encapsulates a process's exit status.
// It is used by `Wait()` to return either a process exit code or an error

It may not return the error immediately after we SIGTERM the task.

_, err = task.Delete(ctx) // todo: we're swallowing exitcodes in both these forks, do we care?
return err
case <-ctx.Done():
err = task.Kill(ctx, syscall.SIGKILL) // should return GRPC DeadlineExceeded error type, wrapped up
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need another wait after we SIGKILL. the task.Kill will return the status of GRPC call.


func (s *BackendSuite) TestLookupGetContainer() {
s.client.GetContainerReturns(new(libcontainerdfakes.FakeContainer), nil)
container, err := s.backend.Lookup("non-existent-handle")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-existent-handle could be changed to some existing container

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this test is bad right now -- related to the incomplete implementation of Lookup. Will be sorted out later 😂

s.Equal(testNamespace, namespace)
}

func (s *BackendSuite) TestDestroyWaitError() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestDestroyWaitError -> TestDestroyTaskWaitError


err := s.backend.Destroy("some-handle")

s.Equal(2, fakeTask.KillCallCount())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For correctness, assert that syscall.SIGKILL was passed to fakeTask.Kill

container containerd.Container, err error,
)

// Destroy stops any running tasks on a container and remove the container.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: removes

Copy link
Contributor

@xtreme-sameer-vohra xtreme-sameer-vohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some inline comments

@deniseyu
Copy link
Contributor

deniseyu commented Dec 11, 2019

thanks @xtreme-sameer-vohra for the inline comments! We implemented your suggestions but we're going to fold them into the top of this branch: #4881 because in the last 2 days we restructured the backend and I'm a coward for merge conflicts 🏋️‍♀️

@deniseyu deniseyu dismissed xtreme-sameer-vohra’s stale review December 12, 2019 14:54

Implemented changes in branch at #4881

Copy link
Contributor

@xtreme-sameer-vohra xtreme-sameer-vohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted with Denise and Bin, suggestions for this PR will be incorporated in https://github.com/concourse/concourse/pull/4881/files

@xtreme-sameer-vohra xtreme-sameer-vohra merged commit 00a65bd into master Dec 12, 2019
@xtreme-sameer-vohra xtreme-sameer-vohra deleted the containerd-monday branch December 12, 2019 15:28
@jamieklassen jamieklassen mentioned this pull request Dec 12, 2019
20 tasks
@jamieklassen jamieklassen added this to the v5.8.0 milestone Dec 13, 2019
@jamieklassen jamieklassen added the release/documented Documentation and release notes have been updated. label Dec 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release/documented Documentation and release notes have been updated.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants