
fix: retrieval rewrite #3428

Merged: 9 commits into master from retrieval-rewrite on Jan 28, 2023

Conversation

istae
Member

@istae istae commented Oct 13, 2022

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

This PR simplifies the retrieval logic while preserving both preemptive and on-error retries.
Existing tests passed without modification.

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

ethersphere/bee-backlog#36

Screenshots (if appropriate):



@istae istae requested review from a team, aloknerurkar and notanatol and removed request for a team October 13, 2022 15:08
@istae
Member Author

istae commented Oct 17, 2022

@metacertain writes:

So the definition of a requestRound in the current retrieval is: continue with a skiplist whose overdraft entries have been removed by a skiplist reset, where the time between the start of two requestRounds is at least 600 milliseconds (with 1024 rounds this covers at least around 10 minutes).
This currently lets an ultra-light node trying to retrieve all the chunks of a 200 MB file keep trying for up to 10 minutes to find a peer with free accounting balance for each of these chunks, so sooner or later they will all be retrieved.
The effective file-size limit for an ultra-light node comes from this 10-minute window and the per-second free bandwidth: 600 * 7 KB/s from each peer (say with 64 peers, that means 600 * 448 KB ≈ 270 MB).

without this, we would try to retrieve too many chunks at the same time from our peers, and many of the chunks would fail because all accounting balances are saturated with pending requests after a short while (especially as an ultra light node)

The current logic is: up to 1024 times, we start with a reset skiplist that contains only peers with real failures (either the request failed to send, or it errored later).
Each of these times, we select up to 32 next-closest addresses and try to get accounting to enable the request, etc.

This means the maximum number of next-closest peer selections is 1024 * 32, but we give up much sooner if we run out of peers without a failure or overdraft entry, or once we have enough real failures from successfully sent requests.

We give up (as an originator) when 32 successfully sent requests end in an actual failure, in any of the rounds.
We give up (as a forwarder) when 1 successfully sent request ends in an actual failure (and there are no rounds; just one batch of 32 next-closest addresses is selected).
We also give up if we run out of next-closest peers and nobody on the skiplist is an overdraft entry.

What I see in the rewrite attempt is that it never does a skiplist reset, so overdraft entries are never removed and never retried when we might later get free accounting balance.
Even with that fixed, you still need the request rounds so that a number of peers are tried before each skiplist reset; otherwise only a single peer is repeatedly waited on in case of an accounting block.

@istae
Copy link
Member Author

istae commented Oct 17, 2022

cont..

Actually, this 1024 limit is completely artificial and could just as well be an infinite loop of request rounds: until the user explicitly cancels a request, it keeps trying no matter how long it takes.

That would give us unbounded download size on an ultra-light node.

We would not stay in the loop forever anyway, unless there are still peers yet to be tried (because of accounting blocks).

@istae
Member Author

istae commented Oct 18, 2022

The latest commit should address @metacertain's comments

@notanatol
Contributor

should we have a short mob review session for this one?

@istae istae requested a review from mrekucci January 26, 2023 12:00

lastTime := time.Now().Unix()
resultC := make(chan retrievalResult, 1)
retryC := make(chan struct{}, 1)
Contributor
Why is this a buffered channel?

The retry function has a default case, so it is non-blocking. If this channel is buffered, might we end up starting more retrieval requests than intended? For example, if the preemptive ticker and an error from a previous request call retry at the same time, we should start only one new request, right? In which case this should not be buffered?

Contributor
@mrekucci mrekucci left a comment


Aside from the comment, LGTM.

}
}()

defer func() {
Contributor
No need for a second defer call; the error check can be put at the beginning of the defer above.

Member

@janos janos left a comment

LGTM, I am also questioning the buffered channel.

@istae istae merged commit e167428 into master Jan 28, 2023
@istae istae deleted the retrieval-rewrite branch January 28, 2023 09:50
@istae istae added this to the 1.12.0 milestone Feb 15, 2023