-
Notifications
You must be signed in to change notification settings - Fork 112
pss: Limit by time the retries of messages in the outbox #1861
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I would prefer an exponential backoff solution instead.
Using the solution outlined in this PR, we are triggering the forwarding code without any delay for 10 minutes. I think we should skip triggering the code in exponential intervals up to a certain limit.
Furthermore, I believe that the problem outlined in #1804 could just as well be solved by capabilities, if we imagine that on fail to forward we make an audit pass to see if there are peers that match precisely the full address, which do not have pss capability, and if so the message is not re-enqueued.
Finally, we could push handling the problem to the calling code, if we had a way of signalling that the message actually failed. It's tempting to think that synchronous send is a better bet after all (this has been raised by @nonsense @zelig too). We could also imagine that message queue handler is a separate component that wraps the send, leaving the option to use queue (with receipt and channel for failed messages), but also having the option of direct send and just getting the synchronous result immediately.
Furthermore, I'm worried about the performance of pss as such with push sync using it directly, and especially if we are using synchronous sends (I've already experienced serious lags in message sending because of this using test clusters). My intuition tells me that push sync should be separate client code to a kademlia forwarding service, and where we have the possiblity to do some resource prioritization between chunks and messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share the same opinion about exponential backoff with @nolash. That would be better, even with a jitter, to avoid repetitive retry bursts.
pss/outbox/outbox.go
Outdated
@@ -151,3 +184,7 @@ func (o *Outbox) requeue(slot int) { | |||
func (o *Outbox) len() int { | |||
return cap(o.slots) - len(o.slots) | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CurrentSize() and len() have the same function bodies and CurrentSize is used only in tests. CurrenSize can be removed.
Guys, we are comparing here with the current code. The current code retries the forwarding without delay ad infinitum. This PR tries to at least, put a limit to those retries because it was something that popped up during some tests. This solution is compatible to a more sophisticated solution using exponential backoff, but even we do that in the future, we would like to put a max time in outbox limit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. However I am unsure whether this is actually needed if the conclusion is that we change our minds about using asynchronous sends after all, and do not want to provide an outbox to keep it as an option. That said, now that the code is written it would be silly not to include it, even if it gets taken out again later.
In general I would like to see less rash patchwork decisions like this one; the urgency of duct taping push sync before devcon should not have dictated decisions in how pss is designed.
t.Errorf("Expected one message in outbox, instead got %v", numMessages) | ||
} | ||
|
||
mockClock.Set(now.Add(duration / 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you not need the same retry as below to verify here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, bellow I'm waiting for something to happen, in this case I should wait for something not to happen. It is hard to do that because in the case everything works fine I should wait the whole time, delaying the tests execution. In the last case, even though there are 10 iterations waiting, in the normal case, it will take only 10 milliseconds to finish.
What I can do is wait a couple of iterations.
Whenever a message forwarding fails, the time since first added to the outbox is checked and if certain time has passed (10 minutes) the message is not requeued.
Additionally a new counter
pss.forward.expired
is increased each time a message is expired.Tillina clock is used for testability.
This PR solves issue #1804.