pss: Limit by time the retries of messages in the outbox #1861

kortatu · 2019-10-04T09:58:23Z

Whenever a message forwarding fails, the time since first added to the outbox is checked and if certain time has passed (10 minutes) the message is not requeued.
Additionally a new counter pss.forward.expired is increased each time a message is expired.
Tillina clock is used for testability.
This PR solves issue #1804.

nolash

To be honest, I would prefer an exponential backoff solution instead.

Using the solution outlined in this PR, we are triggering the forwarding code without any delay for 10 minutes. I think we should skip triggering the code in exponential intervals up to a certain limit.

Furthermore, I believe that the problem outlined in #1804 could just as well be solved by capabilities, if we imagine that on fail to forward we make an audit pass to see if there are peers that match precisely the full address, which do not have pss capability, and if so the message is not re-enqueued.

Finally, we could push handling the problem to the calling code, if we had a way of signalling that the message actually failed. It's tempting to think that synchronous send is a better bet after all (this has been raised by @nonsense @zelig too). We could also imagine that message queue handler is a separate component that wraps the send, leaving the option to use queue (with receipt and channel for failed messages), but also having the option of direct send and just getting the synchronous result immediately.

Furthermore, I'm worried about the performance of pss as such with push sync using it directly, and especially if we are using synchronous sends (I've already experienced serious lags in message sending because of this using test clusters). My intuition tells me that push sync should be separate client code to a kademlia forwarding service, and where we have the possiblity to do some resource prioritization between chunks and messages.

janos

I share the same opinion about exponential backoff with @nolash. That would be better, even with a jitter, to avoid repetitive retry bursts.

janos · 2019-10-15T12:50:59Z

pss/outbox/outbox.go

@@ -151,3 +184,7 @@ func (o *Outbox) requeue(slot int) {
 func (o *Outbox) len() int {
 	return cap(o.slots) - len(o.slots)
 }
+


CurrentSize() and len() have the same function bodies and CurrentSize is used only in tests. CurrenSize can be removed.

kortatu · 2019-10-15T14:26:42Z

To be honest, I would prefer an exponential backoff solution instead.

Using the solution outlined in this PR, we are triggering the forwarding code without any delay for 10 minutes. I think we should skip triggering the code in exponential intervals up to a certain limit.

Furthermore, I believe that the problem outlined in #1804 could just as well be solved by capabilities, if we imagine that on fail to forward we make an audit pass to see if there are peers that match precisely the full address, which do not have pss capability, and if so the message is not re-enqueued.

Finally, we could push handling the problem to the calling code, if we had a way of signalling that the message actually failed. It's tempting to think that synchronous send is a better bet after all (this has been raised by @nonsense @zelig too). We could also imagine that message queue handler is a separate component that wraps the send, leaving the option to use queue (with receipt and channel for failed messages), but also having the option of direct send and just getting the synchronous result immediately.

Furthermore, I'm worried about the performance of pss as such with push sync using it directly, and especially if we are using synchronous sends (I've already experienced serious lags in message sending because of this using test clusters). My intuition tells me that push sync should be separate client code to a kademlia forwarding service, and where we have the possiblity to do some resource prioritization between chunks and messages.

Guys, we are comparing here with the current code. The current code retries the forwarding without delay ad infinitum. This PR tries to at least, put a limit to those retries because it was something that popped up during some tests.

This solution is compatible to a more sophisticated solution using exponential backoff, but even we do that in the future, we would like to put a max time in outbox limit.

nolash

I see your point. However I am unsure whether this is actually needed if the conclusion is that we change our minds about using asynchronous sends after all, and do not want to provide an outbox to keep it as an option. That said, now that the code is written it would be silly not to include it, even if it gets taken out again later.

In general I would like to see less rash patchwork decisions like this one; the urgency of duct taping push sync before devcon should not have dictated decisions in how pss is designed.

nolash · 2019-10-15T23:49:40Z

pss/outbox/outbox_test.go

+		t.Errorf("Expected one message in outbox, instead got %v", numMessages)
+	}
+
+	mockClock.Set(now.Add(duration / 2))


Would you not need the same retry as below to verify here?

Well, bellow I'm waiting for something to happen, in this case I should wait for something not to happen. It is hard to do that because in the case everything works fine I should wait the whole time, delaying the tests execution. In the last case, even though there are 10 iterations waiting, in the normal case, it will take only 10 milliseconds to finish.
What I can do is wait a couple of iterations.

kortatu self-assigned this Oct 4, 2019

kortatu added pss builds on open PR Builds on a PR that is not yet merged. It is blocked until then and the diff won't make sense yet. ready for review labels Oct 4, 2019

kortatu mentioned this pull request Oct 4, 2019

Pushsync improvements #1857

Closed

janos mentioned this pull request Oct 4, 2019

pss: Limit the number of workers forwarding messages in outbox #1860

Merged

kortatu added ready for review and removed ready for review labels Oct 4, 2019

kortatu added 2 commits October 15, 2019 11:41

pss: limit retry time for messages in outbox

eda51fd

pss: Fixed worker freed

814e428

kortatu force-pushed the issue-1804 branch from 613eed8 to 814e428 Compare October 15, 2019 10:17

kortatu added ready for review and removed builds on open PR Builds on a PR that is not yet merged. It is blocked until then and the diff won't make sense yet. labels Oct 15, 2019

kortatu requested review from nonsense and janos October 15, 2019 10:21

nolash reviewed Oct 15, 2019

View reviewed changes

janos reviewed Oct 15, 2019

View reviewed changes

Removed CurrentSize method. len() method promoted to public Len()

228a77a

janos approved these changes Oct 15, 2019

View reviewed changes

nolash approved these changes Oct 15, 2019

View reviewed changes

Waiting in tests to be sure that the message is not expired

5de68ca

nonsense merged commit cdcbd0a into ethersphere:master Oct 16, 2019

kortatu deleted the issue-1804 branch October 16, 2019 09:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pss: Limit by time the retries of messages in the outbox #1861

pss: Limit by time the retries of messages in the outbox #1861

kortatu commented Oct 4, 2019 •

edited

Loading

nolash left a comment •

edited

Loading

janos left a comment

janos Oct 15, 2019

kortatu commented Oct 15, 2019

nolash left a comment

nolash Oct 15, 2019

kortatu Oct 16, 2019

pss: Limit by time the retries of messages in the outbox #1861

pss: Limit by time the retries of messages in the outbox #1861

Conversation

kortatu commented Oct 4, 2019 • edited Loading

nolash left a comment • edited Loading

Choose a reason for hiding this comment

janos left a comment

Choose a reason for hiding this comment

janos Oct 15, 2019

Choose a reason for hiding this comment

kortatu commented Oct 15, 2019

nolash left a comment

Choose a reason for hiding this comment

nolash Oct 15, 2019

Choose a reason for hiding this comment

kortatu Oct 16, 2019

Choose a reason for hiding this comment

kortatu commented Oct 4, 2019 •

edited

Loading

nolash left a comment •

edited

Loading