
Conversation

attlee-wang commented Sep 10, 2024

My service is a multi-goroutine daemon, and the nri-plugin is one of its modules.
In the nri-plugin I did not use WithOnClose() to set an exit callback; instead I rebuild a new stub on exit. I found that when a stub exception occurred, the daemon's main process actually exited as well. Exiting the whole daemon because of an exception in one goroutine is very unfriendly.
After checking, I found the problem is in the following code:

// Handle a lost connection.
func (stub *stub) connClosed() {
	stub.Lock()
	stub.close()
	stub.Unlock()
	if stub.onClose != nil {
		stub.onClose()
		return
	}

	os.Exit(0) // exits the process
}

os.Exit(0) exits the whole process. For the nri stub package, runtime.Goexit() is more graceful: it exits only the current goroutine, not the daemon process.
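A standalone sketch of the difference (not nri code, just an illustration):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()  // deferred calls still run on Goexit
		runtime.Goexit() // unwinds only this goroutine
		fmt.Println("never reached")
	}()
	wg.Wait()
	// Had the goroutine called os.Exit(0) instead, this line would never run.
	fmt.Println("daemon main keeps running")
}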

Signed-off-by: attlee-wang <wangdong102425@gmail.com>
attlee-wang (Author)

@klihub @mikebrow @fuweid can anyone help review this? Thanks.

klihub (Member) commented Sep 10, 2024

My service is a multi-goroutine daemon, and the nri-plugin is one of its modules. In the nri-plugin I did not use WithOnClose() to set an exit callback; instead I rebuild a new stub on exit.

@attlee-wang Not directly related to this PR, but it might be useful for you: ever since PR #91 was merged, it has been possible to restart a stub with a Stop() followed by a Start(). See the sketch below.
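A minimal sketch of that restart pattern (restartStub is just an illustrative helper, not part of the stub API):

package nriexample

import (
	"context"

	"github.com/containerd/nri/pkg/stub"
)

// restartStub tears down the stub's current connection and
// re-establishes it, instead of rebuilding a whole new stub.
func restartStub(ctx context.Context, s stub.Stub) error {
	s.Stop()            // disconnect and stop serving
	s.Wait()            // wait for the stub to fully stop
	return s.Start(ctx) // reconnect to the runtime and re-register
}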

I found that when a stub exception occurred, the daemon's main process actually exited as well. Exiting the whole daemon because of an exception in one goroutine is very unfriendly. After checking, I found the problem is in the following code:
...

This behavior has been discussed and criticized a few times before, and it is fair to say not without reason. I have a few questions/comments related to this.

  1. This original, and admittedly a bit harshly/hastily implemented, behavior was partly in place to make pre-connected/runtime-launched plugins exit once they lose their pre-connected transport (a socketpair() which cannot be externally re-established). It would be nice, and also fairly easy, to retain this behavior. We could either record in the original connection setup/adoption code the fact that we took a socketpair and then use that in the connection-lost handler, or we could simply re-check the same env var and use its existence for the same purpose.
  2. Can't you work around this immediate problem in your current code by simply creating your stub with a WithOnClose() handler to prevent the stub from internally doing os.Exit(0)? Your handler could either do nothing or, for instance using a channel, signal your upper-layer logic to recreate (or Stop()/Start()) the stub, if you are neither running it with stub.Run() nor waiting for its termination with stub.Wait()? See the sketch after this list.
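A sketch of that WithOnClose() workaround (the helper and channel names are illustrative; stub.New() and stub.WithOnClose() are the actual API):

package nriexample

import (
	"github.com/containerd/nri/pkg/stub"
)

// newStubWithLostSignal creates a stub whose onClose handler only signals
// the upper layer over a channel instead of letting the process exit.
func newStubWithLostSignal(plugin interface{}) (stub.Stub, <-chan struct{}, error) {
	lost := make(chan struct{}, 1)

	s, err := stub.New(plugin,
		stub.WithOnClose(func() {
			// Non-blocking notify; the upper layer can then recreate
			// (or Stop()/Start()) the stub when it sees the signal.
			select {
			case lost <- struct{}{}:
			default:
			}
		}),
	)
	if err != nil {
		return nil, nil, err
	}
	return s, lost, nil
}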

 	}
 
-	os.Exit(0)
+	runtime.Goexit() // The exit code can be determined when the goroutine exits
klihub (Member) commented Sep 10, 2024


I'm not sure why runtime.Goexit() would be better here than doing nothing (other than maybe recording the fact that we lost the connection) and simply returning instead. If the stub has been started by stub.Start() or stub.Run(), it should be sitting in ttrpc.Server.Serve(), which should then (soon) return once we return from here. That in turn should cause stub.Run() to return, if that is what the caller used to start the stub, or otherwise stub.Wait() to return, if that is what the caller uses to wait for the stub to stop.
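Roughly, that return-only variant would look like this (a sketch against the snippet quoted above, not a tested change):

// Handle a lost connection by simply returning.
func (stub *stub) connClosed() {
	stub.Lock()
	stub.close()
	stub.Unlock()
	if stub.onClose != nil {
		stub.onClose()
	}
	// No os.Exit() or runtime.Goexit(): returning lets ttrpc.Server.Serve()
	// unwind, which in turn makes stub.Run() or stub.Wait() return.
}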

I think what we could do here is only call os.Exit() if this is a pre-connected plugin. So,

  • either record in connect() the fact that we've taken a socketpair from the environment and use that here to decide whether to os.Exit(), or
  • simply check here for the presence of the same env var and use its existence to decide to exit.

klihub (Member)

Or I think something like this could work, too:

diff --git a/pkg/stub/stub.go b/pkg/stub/stub.go
index 88fd2e1..b68344f 100644
--- a/pkg/stub/stub.go
+++ b/pkg/stub/stub.go
@@ -502,6 +502,15 @@ func (stub *stub) connect() error {
                        return fmt.Errorf("invalid socket (%d) in environment: %w", fd, err)
                }
 
+               userOnClose := stub.onClose
+               stub.onClose = func() {
+                       if userOnClose != nil {
+                               userOnClose()
+                       }
+                       log.Infof(noCtx, "Connection from environment lost, exiting...")
+                       os.Exit(0)
+               }
+
                return nil
        } 
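This keeps the exit-on-disconnect behavior only for pre-connected plugins whose socket came from the environment, while still running any user-supplied onClose handler first.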

klihub (Member) commented Sep 11, 2024

ping @attlee-wang

attlee-wang (Author) commented Sep 24, 2024

  Can't you work around this immediate problem in your current code by simply creating your stub with a WithOnClose() handler to prevent the stub from internally doing os.Exit(0)? Your handler could either do nothing or, for instance using a channel, signal your upper-layer logic to recreate (or Stop()/Start()) the stub, if you are neither running it with stub.Run() nor waiting for its termination with stub.Wait()?

@klihub It's not that I refuse to use WithOnClose(). I looked at the stub.go package and saw that stub.Stop() exists, so I assumed stub.Stop() could be used. After all, WithOnClose() is just an optional function.

samuelkarp (Member)

I think #173 resolves the same problem, but does so in a way that better exposes the error to the plugin. Thanks for your contribution, @attlee-wang.

samuelkarp closed this May 23, 2025