Skip to content
This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

pss: Limit by time the retries of messages in the outbox #1861

Merged
merged 4 commits into from
Oct 16, 2019

Conversation

kortatu
Copy link
Contributor

@kortatu kortatu commented Oct 4, 2019

Whenever a message forwarding fails, the time since first added to the outbox is checked and if certain time has passed (10 minutes) the message is not requeued.
Additionally a new counter pss.forward.expired is increased each time a message is expired.
Tillina clock is used for testability.
This PR solves issue #1804.

@kortatu kortatu self-assigned this Oct 4, 2019
@kortatu kortatu added pss builds on open PR Builds on a PR that is not yet merged. It is blocked until then and the diff won't make sense yet. ready for review labels Oct 4, 2019
@kortatu kortatu mentioned this pull request Oct 4, 2019
@kortatu kortatu added ready for review and removed builds on open PR Builds on a PR that is not yet merged. It is blocked until then and the diff won't make sense yet. labels Oct 15, 2019
Copy link
Contributor

@nolash nolash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I would prefer an exponential backoff solution instead.

Using the solution outlined in this PR, we are triggering the forwarding code without any delay for 10 minutes. I think we should skip triggering the code in exponential intervals up to a certain limit.

Furthermore, I believe that the problem outlined in #1804 could just as well be solved by capabilities, if we imagine that on fail to forward we make an audit pass to see if there are peers that match precisely the full address, which do not have pss capability, and if so the message is not re-enqueued.

Finally, we could push handling the problem to the calling code, if we had a way of signalling that the message actually failed. It's tempting to think that synchronous send is a better bet after all (this has been raised by @nonsense @zelig too). We could also imagine that message queue handler is a separate component that wraps the send, leaving the option to use queue (with receipt and channel for failed messages), but also having the option of direct send and just getting the synchronous result immediately.

Furthermore, I'm worried about the performance of pss as such with push sync using it directly, and especially if we are using synchronous sends (I've already experienced serious lags in message sending because of this using test clusters). My intuition tells me that push sync should be separate client code to a kademlia forwarding service, and where we have the possiblity to do some resource prioritization between chunks and messages.

Copy link
Member

@janos janos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share the same opinion about exponential backoff with @nolash. That would be better, even with a jitter, to avoid repetitive retry bursts.

@@ -151,3 +184,7 @@ func (o *Outbox) requeue(slot int) {
func (o *Outbox) len() int {
return cap(o.slots) - len(o.slots)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CurrentSize() and len() have the same function bodies and CurrentSize is used only in tests. CurrenSize can be removed.

@kortatu
Copy link
Contributor Author

kortatu commented Oct 15, 2019

To be honest, I would prefer an exponential backoff solution instead.

Using the solution outlined in this PR, we are triggering the forwarding code without any delay for 10 minutes. I think we should skip triggering the code in exponential intervals up to a certain limit.

Furthermore, I believe that the problem outlined in #1804 could just as well be solved by capabilities, if we imagine that on fail to forward we make an audit pass to see if there are peers that match precisely the full address, which do not have pss capability, and if so the message is not re-enqueued.

Finally, we could push handling the problem to the calling code, if we had a way of signalling that the message actually failed. It's tempting to think that synchronous send is a better bet after all (this has been raised by @nonsense @zelig too). We could also imagine that message queue handler is a separate component that wraps the send, leaving the option to use queue (with receipt and channel for failed messages), but also having the option of direct send and just getting the synchronous result immediately.

Furthermore, I'm worried about the performance of pss as such with push sync using it directly, and especially if we are using synchronous sends (I've already experienced serious lags in message sending because of this using test clusters). My intuition tells me that push sync should be separate client code to a kademlia forwarding service, and where we have the possiblity to do some resource prioritization between chunks and messages.

Guys, we are comparing here with the current code. The current code retries the forwarding without delay ad infinitum. This PR tries to at least, put a limit to those retries because it was something that popped up during some tests.

This solution is compatible to a more sophisticated solution using exponential backoff, but even we do that in the future, we would like to put a max time in outbox limit.

Copy link
Contributor

@nolash nolash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. However I am unsure whether this is actually needed if the conclusion is that we change our minds about using asynchronous sends after all, and do not want to provide an outbox to keep it as an option. That said, now that the code is written it would be silly not to include it, even if it gets taken out again later.

In general I would like to see less rash patchwork decisions like this one; the urgency of duct taping push sync before devcon should not have dictated decisions in how pss is designed.

t.Errorf("Expected one message in outbox, instead got %v", numMessages)
}

mockClock.Set(now.Add(duration / 2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you not need the same retry as below to verify here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, bellow I'm waiting for something to happen, in this case I should wait for something not to happen. It is hard to do that because in the case everything works fine I should wait the whole time, delaying the tests execution. In the last case, even though there are 10 iterations waiting, in the normal case, it will take only 10 milliseconds to finish.
What I can do is wait a couple of iterations.

@nonsense nonsense merged commit cdcbd0a into ethersphere:master Oct 16, 2019
@kortatu kortatu deleted the issue-1804 branch October 16, 2019 09:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants