
refactor(pushsync): Don't account for failed retries due to accounting #1662

Merged 7 commits on May 12, 2021

Conversation

@aloknerurkar (Contributor) commented May 6, 2021

#1638

  • During pushsync retries, if we fail to reserve the amount with a peer, we don't count the attempt and try again with a different peer. Previously, if we failed to reserve the amount, we did not even try to push anything.
  • Added a FailedRequestCache in pushsync to remember failed traces, so that if we fail to push {peer, chunk} multiple times, we will not try the same peer for that chunk again and will try a different peer instead (a sketch of such a cache follows this list).
  • Added metrics in the pushsync code to track how many attempts were made, etc.
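
For illustration, a minimal sketch of what such a failed-request cache could look like. It assumes the hashicorp/golang-lru package; failedRequestCache, maxAttempts, and the key format are illustrative names, not the merged implementation:

package pushsync

import (
	"fmt"
	"sync"

	lru "github.com/hashicorp/golang-lru"

	"github.com/ethersphere/bee/pkg/swarm"
)

// failedRequestCache counts push failures per {peer, chunk} trace so
// that retries can skip peers that keep failing. Illustrative only.
type failedRequestCache struct {
	mtx   sync.Mutex
	cache *lru.Cache
}

const maxAttempts = 2 // illustrative threshold

func newFailedRequestCache(size int) (*failedRequestCache, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &failedRequestCache{cache: c}, nil
}

func traceKey(peer, chunk swarm.Address) string {
	return fmt.Sprintf("%s/%s", peer, chunk)
}

// RecordFailure bumps the failure count for the {peer, chunk} trace.
func (f *failedRequestCache) RecordFailure(peer, chunk swarm.Address) {
	f.mtx.Lock()
	defer f.mtx.Unlock()
	k := traceKey(peer, chunk)
	if v, ok := f.cache.Get(k); ok {
		f.cache.Add(k, v.(int)+1)
		return
	}
	f.cache.Add(k, 1)
}

// Useful reports whether this peer is still worth trying for this chunk.
func (f *failedRequestCache) Useful(peer, chunk swarm.Address) bool {
	f.mtx.Lock()
	defer f.mtx.Unlock()
	v, ok := f.cache.Get(traceKey(peer, chunk))
	return !ok || v.(int) < maxAttempts
}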


@aloknerurkar (Contributor, Author) commented:

Current changes include:

  1. Rebasing @acud's changes on the latest master and fixing conflicts
  2. Adding logic to discount failed retry attempts (for both originator and forwarder)
  3. Adding a test to ensure the above behavior

@aloknerurkar aloknerurkar self-assigned this May 6, 2021
@aloknerurkar aloknerurkar added the "ready for review" label May 10, 2021
	}
	if err != nil {
		logger.Debugf("could not push to peer %s: %v", peer.String(), err)
		resultC <- &pushResult{err: err}
zelig (Viktor Trón), Member, commented:

this error does not seem to be read anywhere.
Consider using the errgroup package.

		Stamp: stamp,
	}); err != nil {
		_ = streamer.Reset()
		return nil, true, fmt.Errorf("chunk %s deliver to peer %s: %w", ch.Address().String(), peer.String(), err)
zelig (Viktor Trón), Member, commented:

the String() call is not needed on %s arguments
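
For context, a minimal, self-contained illustration of the fmt rule behind this remark; addr here is a hypothetical stand-in for swarm.Address:

package main

import "fmt"

// addr is a hypothetical stand-in for a type with a String method,
// such as swarm.Address.
type addr struct{ b []byte }

func (a addr) String() string { return fmt.Sprintf("%x", a.b) }

func main() {
	a := addr{b: []byte{0x60, 0x00}}
	// %s invokes String() via the fmt.Stringer interface, so the
	// explicit call is redundant:
	fmt.Printf("%s\n", a)          // 6000
	fmt.Printf("%s\n", a.String()) // 6000, same output
}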

pkg/pushsync/pushsync.go (discussion resolved)
@acud (Member) left a comment:

Reviewed 2 of 2 files at r2.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @aloknerurkar, @metacertain, and @zelig)


pkg/pushsync/pushsync.go, line 354 at r2 (raw file):

Previously, zelig (Viktor Trón) wrote…

this error does not seem to be read anywhere.
Consider using the errgroup package.

errgroup is not needed since there is no concurrency here


pkg/pushsync/pushsync.go, line 365 at r2 (raw file):

		select {
		case r := <-resultC:
			if r.receipt != nil {

check if r.err != nil, log the error if so, and continue to the next retry


pkg/pushsync/pushsync.go, line 396 at r2 (raw file):

	streamer, err := ps.streamer.NewStream(ctx, peer, nil, protocolName, protocolVersion, streamName)
	if err != nil {
		return nil, true, fmt.Errorf("new stream for peer %s: %w", peer.String(), err)

this should be nil, false, err since there was no attempt yet (nothing has been sent)


pkg/pushsync/pushsync.go, line 440 at r2 (raw file):

Previously, zelig (Viktor Trón) wrote…

if you use errgroup or a separate errs channel, you won't need this struct

errgroup won't help, since it results in an error if one or more errors occurred; but since subsequent retries might resolve the problem, we would not like to error out. I would personally refrain from adding an errgroup here. Also, errgroup helps when having concurrent requests, which is not the case here, since we wait for one request to expire before trying the next, i.e. there are no parallel requests here (see the sketch below).
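
To make the shape of this argument concrete, here is a self-contained sketch of the sequential retry loop implied by the snippets quoted in this thread. All names (push, pushPeer, the peer strings) are stand-ins, not the merged code:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pushResult mirrors the struct quoted in the review snippets.
type pushResult struct {
	receipt   *string
	attempted bool
	err       error
}

// pushPeer simulates one delivery attempt; the real method talks to a peer.
func pushPeer(ctx context.Context, peer string) (*string, bool, error) {
	if peer != "peer-2" {
		return nil, true, errors.New("peer not reachable")
	}
	r := "receipt"
	return &r, true, nil
}

// push shows why errgroup is unnecessary: exactly one request is in
// flight at a time, and we only move to the next peer after the current
// attempt resolves (or the context expires).
func push(ctx context.Context, peers []string) (*string, error) {
	for _, peer := range peers {
		resultC := make(chan *pushResult, 1)
		go func(peer string) {
			receipt, attempted, err := pushPeer(ctx, peer)
			resultC <- &pushResult{receipt: receipt, attempted: attempted, err: err}
		}(peer)

		select {
		case r := <-resultC:
			if r.receipt != nil {
				return r.receipt, nil // success, stop retrying
			}
			if r.err != nil {
				fmt.Printf("could not push to peer %s: %v\n", peer, r.err)
			}
			// fall through and retry with the next peer
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, errors.New("no peer succeeded")
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	receipt, err := push(ctx, []string{"peer-1", "peer-2"})
	fmt.Println(receipt != nil, err)
}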

@aloknerurkar (Contributor, Author) commented:


pkg/pushsync/pushsync.go, line 396 at r2 (raw file):

Previously, acud (acud) wrote…

this should be nil, false, err since there was no attempt yet (nothing has been sent)

If NewStream returns an error, we are not able to connect to the peer, right? I was assuming this might take some time to return, in which case we can call it an attempt. No?

In a subsequent change, we will filter out peers which return this error, which should mitigate the issue (by not considering these peers), in my view. If we don't count this as an attempt, this node will not be blocked for future requests and we will retry it.

@aloknerurkar (Contributor, Author) left a comment:

Reviewable status: 1 of 2 files reviewed, 5 unresolved discussions (waiting on @acud, @metacertain, and @zelig)


pkg/pushsync/pushsync.go, line 354 at r2 (raw file):

Previously, acud (acud) wrote…

errgroup is not needed since there is no concurrency here

Agree with @acud. We only do one request at a time, and we retry only if it fails.


pkg/pushsync/pushsync.go, line 365 at r2 (raw file):

Previously, acud (acud) wrote…

check if r.err != nil, log the error if so, and continue to the next retry

This error is returned from pushPeer. The error is already logged there once.


pkg/pushsync/pushsync.go, line 407 at r2 (raw file):

Previously, zelig (Viktor Trón) wrote…

the String() call is not needed on %s arguments

Done.

@aloknerurkar aloknerurkar changed the title refactor: pushsync to not account for failed retries due to accounting refactor(pushsync): Don't account for failed retries due to accounting May 10, 2021
@acud (Member) left a comment:

Reviewable status: 1 of 2 files reviewed, 4 unresolved discussions (waiting on @acud, @aloknerurkar, @metacertain, and @zelig)


pkg/pushsync/pushsync.go, line 365 at r2 (raw file):

Previously, aloknerurkar (Alok Nerurkar) wrote…

This error is returned from pushPeer. The error is already logged there once.

OK 👍 Note that here the happy path is nested; usually you would check for an error first and, if there is one, handle it, so that the happy path sits at the root of the method/block (a small illustration follows).
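
A small, self-contained illustration of the idiom being described (the names are hypothetical):

package main

import (
	"errors"
	"fmt"
)

// process keeps the happy path at the root of the function: the error
// is checked and returned first, instead of nesting the success case
// inside an if block.
func process(v int, err error) (string, error) {
	if err != nil {
		return "", fmt.Errorf("process: %w", err)
	}
	// happy path, not nested
	return fmt.Sprintf("value: %d", v), nil
}

func main() {
	s, _ := process(42, nil)
	fmt.Println(s) // value: 42
	_, err := process(0, errors.New("boom"))
	fmt.Println(err) // process: boom
}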


pkg/pushsync/pushsync.go, line 396 at r2 (raw file):
If NewStream returns an error, most probably we are no longer connected to this peer, or some other problem exists. It should not take too long to return here.

If we don't consider this an attempt, we will not consider this node to be blocked for future requests and retry this.

As far as I remember, it was agreed that the boolean returned from the method signifies whether there was an actual successful attempt to communicate the chunk. It should not be true unless the chunk actually got sent successfully, which is not the case here. In any case, a failed attempt to send to the peer (i.e. if we did not manage to get around to sending the chunk) must always result in a failed attempt logged for the trace in the LRU cache, and the peer added to the skip list. A sketch of this return convention follows.
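
A self-contained sketch of the return convention under discussion. openStream is a hypothetical stand-in for ps.streamer.NewStream, and the error messages mirror the snippets quoted above. Note the thread later settles on counting the NewStream failure as an attempt (see the closing comment), so the false branch here reflects acud's position at this point:

package main

import (
	"errors"
	"fmt"
)

var errNotReachable = errors.New("peer not reachable")

// openStream is a hypothetical stand-in for ps.streamer.NewStream;
// it simulates a connection failure.
func openStream(peer string) (func(chunk string) error, error) {
	return nil, errNotReachable
}

// pushPeer returns (receipt, attempted, err). Per this review comment,
// attempted must be false when the failure happens before anything is
// sent (the stream could not be opened) and true once delivery was
// actually tried.
func pushPeer(peer, chunk string) (receipt string, attempted bool, err error) {
	send, err := openStream(peer)
	if err != nil {
		// nothing has been sent yet: not an attempt
		return "", false, fmt.Errorf("new stream for peer %s: %w", peer, err)
	}
	if err := send(chunk); err != nil {
		// delivery was tried and failed: this counts as an attempt
		return "", true, fmt.Errorf("chunk %s deliver to peer %s: %w", chunk, peer, err)
	}
	return "receipt", true, nil
}

func main() {
	_, attempted, err := pushPeer("6000", "abcd")
	fmt.Println(attempted, err) // false new stream for peer 6000: peer not reachable
}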


pkg/pushsync/pushsync.go, line 407 at r2 (raw file):

Previously, aloknerurkar (Alok Nerurkar) wrote…

Done.

That actually doesn't work with chunk addresses in all cases; a manual test to confirm would be nice.

@aloknerurkar (Contributor, Author) left a comment:

Reviewable status: 1 of 2 files reviewed, 4 unresolved discussions (waiting on @acud, @aloknerurkar, @metacertain, and @zelig)


pkg/pushsync/pushsync.go, line 365 at r2 (raw file):

Previously, acud (acud) wrote…

OK 👍 Note that here the happy path is nested; usually you would check for an error first and, if there is one, handle it, so that the happy path sits at the root of the method/block.

There will be an error check once the next PR is merged.

The comment at the bottom was meant to clarify that we will be continuing the loop. We will revisit this once the next PR is merged, and will add more details if required.


pkg/pushsync/pushsync.go, line 407 at r2 (raw file):

Previously, acud (acud) wrote…

That actually doesn't work with chunk addresses in all cases; a manual test to confirm would be nice.

DEBU[2021-05-11T13:25:40+05:30] could not push to peer 6000000000000000000000000000000000000000000000000000000000000000: new stream for peer 6000000000000000000000000000000000000000000000000000000000000000: peer not reachable

This does work in the test. Not sure what you mean by 'all' cases?

@acud (Member) left a comment:
:lgtm:

Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @metacertain and @zelig)


pkg/pushsync/pushsync.go, line 407 at r2 (raw file):

Previously, aloknerurkar (Alok Nerurkar) wrote…

DEBU[2021-05-11T13:25:40+05:30] could not push to peer 6000000000000000000000000000000000000000000000000000000000000000: new stream for peer 6000000000000000000000000000000000000000000000000000000000000000: peer not reachable

This does work in the test. Not sure what you mean by 'all' cases?

We had some formatting problems when using %s with a bare swarm.Address argument; I think this happens with the logger for some reason.


@acud acud removed the request for review from metacertain May 12, 2021 05:25
@aloknerurkar (Contributor, Author) commented:

Just for documenting:

We have decided to classify the NewStream error as an attempt for now. The idea is that it should not happen often, and if it does, we will know via the new metrics. Once running in a production environment, the metrics should tell us how the cache is being utilized. Once we have a better understanding, we can tweak these things in the future. Will check this in as-is for now. (A sketch of the kind of metric definitions involved follows.)
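
For reference, a minimal sketch of what such metric definitions might look like, assuming the Prometheus client library; the metric names here are hypothetical, not the ones added in this PR:

package pushsync

import "github.com/prometheus/client_golang/prometheus"

// metrics is an illustrative subset of counters one might register to
// observe retry behaviour and cache utilization.
type metrics struct {
	TotalSendAttempts prometheus.Counter
	FailedCacheHits   prometheus.Counter
}

func newMetrics() metrics {
	return metrics{
		TotalSendAttempts: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: "bee",
			Subsystem: "pushsync",
			Name:      "total_send_attempts",
			Help:      "Total number of attempts to push chunks to peers.",
		}),
		FailedCacheHits: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: "bee",
			Subsystem: "pushsync",
			Name:      "failed_request_cache_hits",
			Help:      "Times a {peer, chunk} pair was skipped due to earlier failures.",
		}),
	}
}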
