client: add synchronize between userCloseFunc and rpc call #87
Conversation
When the containerd runtime plugin exits abnormally, the ttrpc connection is closed and userCloseFunc is called to clean up the resources created by the containerd shim. The current rpc call also returns an error. But these two steps are asynchronous: after the rpc call returns an error, an upper application such as k8s may restart the container, but the start may fail because cleanup has not finished and some resources have not yet been released. These leaked resources then cause the in-place update of the pod to fail again. Fixes containerd#88
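A minimal Go sketch of the ordering this patch enforces. The field names userCloseFunc and userCloseWaitCh mirror the patch, but the surrounding client, run, and call are illustrative stand-ins, not the real ttrpc implementation:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// client is a stand-in for the ttrpc client; only the pieces relevant to
// the shutdown ordering are modeled here.
type client struct {
	userCloseFunc   func()        // cleanup supplied by the user
	userCloseWaitCh chan struct{} // closed once userCloseFunc has returned
	errCh           chan error    // delivers the connection error to callers
}

// run stands in for the receive loop: when the connection dies, it runs the
// user cleanup first and only then releases callers waiting on the channel.
func (c *client) run() {
	defer func() {
		c.userCloseFunc()
		close(c.userCloseWaitCh)
	}()
	// Pretend the shim died and the receive loop observed the broken connection.
	c.errCh <- errors.New("ttrpc: connection closed")
}

// call stands in for an rpc call that fails because the connection died.
// It does not surface the error until the user cleanup has finished, so the
// caller can never act on the failure while resources are still held.
func (c *client) call() error {
	err := <-c.errCh
	<-c.userCloseWaitCh
	return err
}

func main() {
	c := &client{
		userCloseFunc: func() {
			time.Sleep(50 * time.Millisecond) // slow cleanup
			fmt.Println("cleanup finished")
		},
		userCloseWaitCh: make(chan struct{}),
		errCh:           make(chan error, 1),
	}
	go c.run()
	// Always prints after "cleanup finished", never before.
	fmt.Println("call returned:", c.call())
}
```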
```diff
@@ -263,7 +265,6 @@ func (c *Client) run() {
 	defer func() {
 		c.conn.Close()
 		c.userCloseFunc()
 		close(c.userCloseWaitCh)
```
This line needs to be moved too?
Also, can we have a unit test?
Yes. I think the UserOnCloseWait function may not be necessary, because userCloseFunc() and the rpc call are already synchronous in this situation.
I think the problem mentioned in #68 can also be fixed by the current patch: the rpc call will not return an error until userCloseFunc() finishes, and then the Task.Delete goroutine works.
If you have to wait for the closeFunc, you can call UserOnCloseWait. But when the shim is killed unexpectedly, userCloseFunc() is called in the async defer. If we put userCloseFunc() in Close(), no one will call it. Hope it can help.
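For reference, a hedged sketch of that call pattern. NewClient, WithOnClose, Close, and UserOnCloseWait are real ttrpc client APIs; the socket path and the cleanup body are made-up placeholders:

```go
package main

import (
	"context"
	"log"
	"net"

	"github.com/containerd/ttrpc"
)

func main() {
	// Hypothetical socket path; in containerd this would be the shim's socket.
	conn, err := net.Dial("unix", "/run/example/shim.sock")
	if err != nil {
		log.Fatal(err)
	}

	client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() {
		// User-supplied cleanup; runs when the connection is closed,
		// including when the peer dies unexpectedly.
		log.Println("releasing resources")
	}))

	// ... issue calls ...

	client.Close()
	// Block until the WithOnClose callback above has finished.
	if err := client.UserOnCloseWait(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```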
I didn't find a suitable place to add UserOnCloseWait. I see that UserOnCloseWait is called in the shim's Delete, but that function will not be called when the shim is killed unexpectedly.
When the task is killed unexpectedly, the cleanup will be called, and there is no need to sync with it. UserOnCloseWait is only called when a task is deleted manually, because that path needs the sync.
From #88:

> …and userCloseFunc will be called to clean up the resources created by the containerd shim. The current rpc call also returns an error. But these two steps are asynchronous: after the rpc call returns an error, an upper application such as k8s may restart the container, but the start may fail because cleanup has not finished and some resources have not yet been released. These leaked resources then cause the in-place update of the pod to fail again.

If cleanupAfterDeadShim doesn't finish, I think the task record is still there, and recreating with the same ID will fail. Please provide more information about your case; it will help. Thanks.
If cleanupAfterDeadShim doesn't finish, the restart will fail because the bundle path still exists. k8s will continuously restart the container until it succeeds. That works, but we think it may be better to solve this problem in ttrpc.
I don't think it is a problem, because the bundle data is still there and the record is still there. And one more thing: there is no way to restart a container in kubernetes; it can only recreate one.