stub: support restart after stub stopped #91

zhaodiaoer · 2024-07-04T09:03:50Z

In some scenarios such as runtime restart or the occurrence of nri request timeout, the ttrpc connections between the plugin and the runtime will be actively closed by the runtime, even the underlying network connection will be closed together. After this, the plugin must need to re-register to the adaptation side, but now the stub object cannot be reused for this; if the running plugin wants to reconnect to the runtime, the only way is to create a new stub for the plugin.

This commit introduce a new stub option WithAutoRestartDelay. The plugin developer can use this to build a reuseable stub, which will wait for a confiurable time duration and restart automatically after the ttrpc connection is lost, then plugin will be auto-reregistered to the adaptation.

zhaodiaoer · 2024-07-04T09:35:06Z

@klihub @fuweid @mikebrow can anyone help to review this ? thanks.

klihub · 2024-07-04T10:14:26Z

@klihub @fuweid @mikebrow can anyone help to review this ? thanks.

My first question is that do we really want to initiate a reconnection from the stub itself ? Instead of letting some higher-level application logic decide if, and when, attempting a reconnection is both desirable and safe ?

We have the WithOnClose() stub option which should allow applications to detect when a connection to the runtime is gone and then trigger at will a reconnection on their own accord.

zhaodiaoer · 2024-07-04T14:07:15Z

@klihub @fuweid @mikebrow can anyone help to review this ? thanks.

My first question is that do we really want to initiate a reconnection from the stub itself ? Instead of letting some higher-level application logic decide if, and when, attempting a reconnection is both desirable and safe ?

We have the WithOnClose() stub option which should allow applications to detect when a connection to the runtime is gone and then trigger at will a reconnection on their own accord.

I quite approve of the idea of leaving the reconnection work to the higher-level application to handle. However, in the actual attempt, I discovered the following problems that are difficult to solve by the higher-level application logic. Most of them come from the current internal logic of the stub:

When the initial stub.conn is closed, it is not nil and no any logic will reset it to nil. Therefore, stub.connect() will do nothing. The multiplex required for reconnection will be constructed base the old stub.conn (the ttrpc connection stage comes after multiplex construct). Here, it is required that stub.conn is a stdnet.Conn implementation that is still usable after being closed. Maybe there are some cunning ways to achieve this, but it is inconsistent with the semantics defined by the stdnet.Conn interface.
The handler provided by WithOnClose() cannot take stub object, so it cannot operate stub.conn directly, only can send some notify to other functions to do reconnection work.
Last minor problem: just like the stub.conn variable mentioned in the 1st problem, once stub.started is set to true, there is no chance to reset it to false, new multiplex and new ttrpc connections will not be created.

To summarize the problems described above, most of the current connection handling logic is defined within the stub, and there are not enough exposed public methods to implement custom reconnection logic. If my understanding is incorrect, or if there are currently any proposals for optimizing the design of the plugin <-> runtime connection management, please let me know. I am willing to implement the code for these designs. But before that, if a convenient reconnection handling option can be provided to plugin developers for easy use, it might also be a good choice.

klihub · 2024-07-04T15:35:25Z

When the initial stub.conn is closed, it is not nil and no any logic will reset it to nil. Therefore, stub.connect() will do nothing

Sorry, I was not specific enough. I did not mean to try reusing/reconnecting the already disconnected stub. I meant to decide/check whether and when re-establishing the connection is safe, then create a new stub and connect that one...

zhaodiaoer · 2024-07-05T02:47:32Z

When the initial stub.conn is closed, it is not nil and no any logic will reset it to nil. Therefore, stub.connect() will do nothing

Sorry, I was not specific enough. I did not mean to try reusing/reconnecting the already disconnected stub. I meant to decide/check whether and when re-establishing the connection is safe, then create a new stub and connect that one...

oh i got you, In fact, currently we do achieve the purpose of reconnection and re-registration plugin by rebuilding the stub object.

I think a rather crucial point here is what the long-term positioning of the stub is like. Does it only cover the lifecycle of a single connection or the lifecycle of the plugin? If it is an external type of plugin, it is a relatively common requirement for its operation cycle to cover multiple connections. Rebuilding the entire stub object every time the connection is disconnected is a bit counterintuitive.

klihub · 2024-07-05T12:39:23Z

When the initial stub.conn is closed, it is not nil and no any logic will reset it to nil. Therefore, stub.connect() will do nothing

Sorry, I was not specific enough. I did not mean to try reusing/reconnecting the already disconnected stub. I meant to decide/check whether and when re-establishing the connection is safe, then create a new stub and connect that one...

oh i got you, In fact, currently we do achieve the purpose of reconnection and re-registration plugin by rebuilding the stub object.

I think a rather crucial point here is what the long-term positioning of the stub is like. Does it only cover the lifecycle of a single connection or the lifecycle of the plugin?

Well, (be)for(e) answering that question there is a related question to consider: 'what is a plugin ?' or more precisely 'can the plugins lifecycle be different from the lifecycle of the connection of the plugin ?'.

These might sound very philosophical at first, outright ridiculous even, before you start to consider things from the runtime's perspective. On the client side, sure, there is a clear difference between the lifecycles of the plugin and its connection. On the runtime side however, there isn't a clear distinction. Once a connection is closed/lost, the runtime considers the plugin gone forever. If the plugin was an internal one, we would need to restart it to get it connected, up, and running again. That creates a new connection, a new process, and a new struct for the plugin. Clearly a new instance from all aspects. If it was an external plugin, even if it does not exit but reconnects from within the same process, a new connection and a new internal struct is created in the runtime. Again, a new instance from the runtime's point of view. So from the runtime's perspective the lifecycles of the connection and the plugin are the same...

If it is an external type of plugin, it is a relatively common requirement for its operation cycle to cover multiple connections. Rebuilding the entire stub object every time the connection is disconnected is a bit counterintuitive.

That is a fair and valid point. I'd still argue that we should make the stub restartable and reconnectable by the plugin, instead of trying to make the stub restart and reconnect itself.

Considering that it is completely unpredictable in what context a 'connection lost' event comes and in what state the plugin itself is (as opposed to the stub) at that moment, only the application can know when it is safe to reconnect and, for instance, what kind of locking or other synchronization need to be taken care of before this can be attempted.

So IMO, we should reset/clear enough of the internal state of the stub that it should be possible to safely restart/reconnect it. And maybe this should always be done explicitly... perhaps only by an explicit call to Stop() ?

So then for a reconnectable plugin

high level plugin logic would need to install an OnClose() handler to notice lost connections
arrange it so that the stub is Stop()ped once connection is lost (not necessarily from the handler)
Stop() would cause the stub to reset enough of its internal state
eventually the plugin would call Start() on the stub again to reconnect, once this is safe to do

zhaodiaoer · 2024-07-07T07:24:11Z

So then for a reconnectable plugin

high level plugin logic would need to install an OnClose() handler to notice lost connections

arrange it so that the stub is Stop()ped once connection is lost (not necessarily from the handler)

Stop() would cause the stub to reset enough of its internal state

eventually the plugin would call Start() on the stub again to reconnect, once this is safe to do

Hi! Klihub, thank you very much for your explanation! After these discussion, I think I understand your concern, that is, it's okay to make the stub on the client side restartable, but it's unsafe to let it restart "automatically".

At present, should I first modify the current patch to achieve “Stop() would cause the stub to reset enough of its internal state”, I think this is a good start.

klihub · 2024-07-08T06:40:20Z

So then for a reconnectable plugin

high level plugin logic would need to install an OnClose() handler to notice lost connections

arrange it so that the stub is Stop()ped once connection is lost (not necessarily from the handler)

Stop() would cause the stub to reset enough of its internal state

eventually the plugin would call Start() on the stub again to reconnect, once this is safe to do

Hi! Klihub, thank you very much for your explanation! After these discussion, I think I understand your concern, that is, it's okay to make the stub on the client side restartable, but it's unsafe to let it restart "automatically".

At present, should I first modify the current patch to achieve “Stop() would cause the stub to reset enough of its internal state”, I think this is a good start.

Yes, I also think that would be the best to start with.

zhaodiaoer · 2024-07-16T07:35:57Z

So then for a reconnectable plugin

high level plugin logic would need to install an OnClose() handler to notice lost connections

arrange it so that the stub is Stop()ped once connection is lost (not necessarily from the handler)

Stop() would cause the stub to reset enough of its internal state

eventually the plugin would call Start() on the stub again to reconnect, once this is safe to do

Hi! Klihub, thank you very much for your explanation! After these discussion, I think I understand your concern, that is, it's okay to make the stub on the client side restartable, but it's unsafe to let it restart "automatically".
At present, should I first modify the current patch to achieve “Stop() would cause the stub to reset enough of its internal state”, I think this is a good start.

Yes, I also think that would be the best to start with.

I modified the implementation according to the conclusion of the previous communication. There are some key modifications:

make stub.close() can be called many times, and all variables that need to be reset when reconnect to runtime will be handled.
stub.Run(ctx) no longer exits when the plugin service exits. instead, it runs in blocking manner and exits only when plugin service returns a critical error or the given context done. Before the Run() returned, the stub.Start() can be called repeatedly.
fix the potential deadlock issue in stub.Wait().

Now, there are two ways in which the high level plugin logic uses the stub:

Call stub.Run(ctx) and give appropriate context, users can receive connection lost event via onClose() and decide whether to end the Run() or restart plugin via call Start(). simply, stub.Start() can be put in onClose() to achieve the effect of auto restart.
Call stub.Start(ctx) and call stub.Wait() explicitly, after Wait() returned, users can decide what to do next stop.

But at present, I still have some things that I haven't thought through: During the execution of the stub.Run(ctx) function, which error reports from the plugin service should be regarded as critical error and cause the stub.Run(ctx) function to return? Currently, my approach is to tolerate all errors defined by the ttrpc layer as recoverable. I don't know if this is reasonable.

zhaodiaoer · 2024-07-18T02:50:07Z

@klihub can you give me more opinion, thanks.

pkg/stub/stub.go

klihub · 2024-07-18T08:50:01Z

So then for a reconnectable plugin

high level plugin logic would need to install an OnClose() handler to notice lost connections

arrange it so that the stub is Stop()ped once connection is lost (not necessarily from the handler)

Stop() would cause the stub to reset enough of its internal state

eventually the plugin would call Start() on the stub again to reconnect, once this is safe to do

Hi! Klihub, thank you very much for your explanation! After these discussion, I think I understand your concern, that is, it's okay to make the stub on the client side restartable, but it's unsafe to let it restart "automatically".
At present, should I first modify the current patch to achieve “Stop() would cause the stub to reset enough of its internal state”, I think this is a good start.

Yes, I also think that would be the best to start with.

I modified the implementation according to the conclusion of the previous communication. There are some key modifications:

make stub.close() can be called many times, and all variables that need to be reset when reconnect to runtime will be handled.

stub.Run(ctx) no longer exits when the plugin service exits. instead, it runs in blocking manner and exits only when plugin service returns a critical error or the given context done. Before the Run() returned, the stub.Start() can be called repeatedly.

I don't think this is a good idea and it is not what we discussed earlier. The plugin should be considered running only when it is serving requests. Once it stops serving requests, it should not be considered running and Run() should return.

zhaodiaoer · 2024-07-18T09:29:52Z

So then for a reconnectable plugin

high level plugin logic would need to install an OnClose() handler to notice lost connections

arrange it so that the stub is Stop()ped once connection is lost (not necessarily from the handler)

Stop() would cause the stub to reset enough of its internal state

eventually the plugin would call Start() on the stub again to reconnect, once this is safe to do

Hi! Klihub, thank you very much for your explanation! After these discussion, I think I understand your concern, that is, it's okay to make the stub on the client side restartable, but it's unsafe to let it restart "automatically".
At present, should I first modify the current patch to achieve “Stop() would cause the stub to reset enough of its internal state”, I think this is a good start.

Yes, I also think that would be the best to start with.

I modified the implementation according to the conclusion of the previous communication. There are some key modifications:

make stub.close() can be called many times, and all variables that need to be reset when reconnect to runtime will be handled.

stub.Run(ctx) no longer exits when the plugin service exits. instead, it runs in blocking manner and exits only when plugin service returns a critical error or the given context done. Before the Run() returned, the stub.Start() can be called repeatedly.

I don't think this is a good idea and it is not what we discussed earlier. The plugin should be considered running only when it is serving requests. Once it stops serving requests, it should not be considered running and Run() should return.

Frankly speaking, this is also the aspect that I have spent a considerable amount of time on but still haven't figured out how to handle better. The positioning of Run() is quite similar to that of Start(), except that unlike Start(), Run() can obtain the error returned by the service. However, I don't think this should be the key point to differentiate the two. Therefore, some modifications have been made to enhance the strength of Run() in terms of the attributes for managing the running cycle.

Regarding the restart requirement I initially proposed, the goal can now be achieved by using Start() and Wait() in combination. The modification of Run() is not necessary and this part of the modification can be reverted.

zhaodiaoer · 2024-07-18T10:03:56Z

I don't think this is a good idea and it is not what we discussed earlier. The plugin should be considered running only when it is serving requests. Once it stops serving requests, it should not be considered running and Run() should return.

@klihub How to optimize Run() is indeed worthy of long-term consideration. I have rolled back all the modifications about Run(). Thanks for your information.

pkg/stub/stub.go

klihub

@zhaodiaoer This looks now good to me, apart from a bit unclear godoc comment about Run(). I gave this a brief practical test this with a modified template plugin which internally Stop()ped itself periodically and otherwise sat in a Run() loop. It seemed to work fine, although I did not try to torture it much.

Could you still fix/update that unclear comment ?

zhaodiaoer · 2024-07-19T02:26:03Z

@zhaodiaoer This looks now good to me, apart from a bit unclear godoc comment about Run(). I gave this a brief practical test this with a modified template plugin which internally Stop()ped itself periodically and otherwise sat in a Run() loop. It seemed to work fine, although I did not try to torture it much.

Could you still fix/update that unclear comment ?

sure, let me do that

klihub

LGTM.

zhaodiaoer · 2024-07-22T07:09:16Z

kindly ping @fuweid

pkg/stub/stub.go

klihub · 2024-07-24T14:01:26Z

@fuweid If you have some spare cycles, PTAL.

fuweid · 2024-07-30T14:02:04Z

pkg/stub/stub.go

@@ -401,6 +401,7 @@ func (stub *stub) Start(ctx context.Context) (retErr error) {

 	log.Infof(ctx, "Started plugin %s...", stub.Name())

+	stub.started = true


Line 398 is still working? We can fix it in the follow-up.

if err = <-stub.cfgErrC; err != nil { return err }

I think before we need to let stub.Configure() may been called more than once, and these code are must run after stub.register(), we can let it remain as it is for the time being, the currently implementation of stub's restart is depend that stub.register() is executed every time after restart .

pkg/stub/stub.go

In some scenarios such as runtime restart or the occurrence of nri request timeout, the ttrpc connections between the plugin and the runtime will be actively closed by the runtime, even the underlying network connection will be closed together. After this, the plugin must need to re-register to the adaptation side, but now the stub object cannot be reused for this; if the running plugin wants to reconnect to the runtime, the only way is to create a new stub for the plugin. This commit has split the lifecycle of the stub and the ttrpc connection to better support the development of the external type of plugins. The plugin developer can build stub once and use it to connect to adaptation side many times, just need re-call Start() function. Signed-off-by: Lei Liu <liulei.pt@bytedance.com>

fuweid

LGTM

Thanks

mikebrow

LGTM

zhaodiaoer force-pushed the stub_support_restart branch from 491312c to 8af88db Compare July 4, 2024 09:16

zhaodiaoer force-pushed the stub_support_restart branch from 8af88db to 2d6ea6f Compare July 16, 2024 06:45

zhaodiaoer changed the title ~~stub: support restart after ttrpc connection lost~~ stub: support restart after stub stopped Jul 16, 2024

klihub reviewed Jul 18, 2024

View reviewed changes

pkg/stub/stub.go Outdated Show resolved Hide resolved

klihub reviewed Jul 18, 2024

View reviewed changes

pkg/stub/stub.go Outdated Show resolved Hide resolved

klihub reviewed Jul 18, 2024

View reviewed changes

pkg/stub/stub.go Outdated Show resolved Hide resolved

zhaodiaoer force-pushed the stub_support_restart branch 3 times, most recently from d52bb52 to e18aee5 Compare July 18, 2024 10:00

klihub reviewed Jul 18, 2024

View reviewed changes

pkg/stub/stub.go Outdated Show resolved Hide resolved

klihub self-requested a review July 18, 2024 15:22

klihub requested changes Jul 18, 2024

View reviewed changes

zhaodiaoer force-pushed the stub_support_restart branch from e18aee5 to 1e59c1b Compare July 19, 2024 02:29

zhaodiaoer requested a review from klihub July 19, 2024 03:56

klihub approved these changes Jul 19, 2024

View reviewed changes

klihub requested a review from fuweid July 19, 2024 07:12

mikebrow reviewed Jul 22, 2024

View reviewed changes

pkg/stub/stub.go Outdated Show resolved Hide resolved

zhaodiaoer force-pushed the stub_support_restart branch from 1e59c1b to 316daf5 Compare July 25, 2024 03:02

fuweid reviewed Jul 30, 2024

View reviewed changes

mikebrow added the v1.0 required label Aug 1, 2024

mikebrow added this to the 1.0 milestone Aug 1, 2024

zhaodiaoer force-pushed the stub_support_restart branch 2 times, most recently from c2ba51c to 4e66aa2 Compare August 4, 2024 13:47

zhaodiaoer force-pushed the stub_support_restart branch from 4e66aa2 to 242661f Compare August 6, 2024 12:36

fuweid approved these changes Aug 6, 2024

View reviewed changes

fuweid requested a review from mikebrow August 6, 2024 15:23

mikebrow approved these changes Aug 6, 2024

View reviewed changes

mikebrow merged commit 5fd8cf6 into containerd:main Aug 6, 2024
8 checks passed

klihub mentioned this pull request Sep 10, 2024

fix: Prevent calling the main process exit without WithOnClose() #108

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stub: support restart after stub stopped #91

stub: support restart after stub stopped #91

zhaodiaoer commented Jul 4, 2024

zhaodiaoer commented Jul 4, 2024

klihub commented Jul 4, 2024 •

edited

Loading

zhaodiaoer commented Jul 4, 2024

klihub commented Jul 4, 2024 •

edited

Loading

zhaodiaoer commented Jul 5, 2024 •

edited

Loading

klihub commented Jul 5, 2024 •

edited

Loading

zhaodiaoer commented Jul 7, 2024

klihub commented Jul 8, 2024

zhaodiaoer commented Jul 16, 2024

zhaodiaoer commented Jul 18, 2024

klihub commented Jul 18, 2024

zhaodiaoer commented Jul 18, 2024

zhaodiaoer commented Jul 18, 2024

klihub left a comment •

edited

Loading

zhaodiaoer commented Jul 19, 2024

klihub left a comment

zhaodiaoer commented Jul 22, 2024

klihub commented Jul 24, 2024

fuweid Jul 30, 2024

zhaodiaoer Aug 4, 2024

fuweid left a comment

mikebrow left a comment

		@@ -401,6 +401,7 @@ func (stub *stub) Start(ctx context.Context) (retErr error) {

		log.Infof(ctx, "Started plugin %s...", stub.Name())

		stub.started = true

stub: support restart after stub stopped #91

stub: support restart after stub stopped #91

Conversation

zhaodiaoer commented Jul 4, 2024

zhaodiaoer commented Jul 4, 2024

klihub commented Jul 4, 2024 • edited Loading

zhaodiaoer commented Jul 4, 2024

klihub commented Jul 4, 2024 • edited Loading

zhaodiaoer commented Jul 5, 2024 • edited Loading

klihub commented Jul 5, 2024 • edited Loading

zhaodiaoer commented Jul 7, 2024

klihub commented Jul 8, 2024

zhaodiaoer commented Jul 16, 2024

zhaodiaoer commented Jul 18, 2024

klihub commented Jul 18, 2024

zhaodiaoer commented Jul 18, 2024

zhaodiaoer commented Jul 18, 2024

klihub left a comment • edited Loading

Choose a reason for hiding this comment

zhaodiaoer commented Jul 19, 2024

klihub left a comment

Choose a reason for hiding this comment

zhaodiaoer commented Jul 22, 2024

klihub commented Jul 24, 2024

fuweid Jul 30, 2024

Choose a reason for hiding this comment

zhaodiaoer Aug 4, 2024

Choose a reason for hiding this comment

fuweid left a comment

Choose a reason for hiding this comment

mikebrow left a comment

Choose a reason for hiding this comment

klihub commented Jul 4, 2024 •

edited

Loading

klihub commented Jul 4, 2024 •

edited

Loading

zhaodiaoer commented Jul 5, 2024 •

edited

Loading

klihub commented Jul 5, 2024 •

edited

Loading

klihub left a comment •

edited

Loading