Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Retrial behavior in retrieval #1780

Merged
merged 14 commits into from
Jun 11, 2021
Merged

feat: Retrial behavior in retrieval #1780

merged 14 commits into from
Jun 11, 2021

Conversation

metacertain
Copy link
Member

@metacertain metacertain commented May 19, 2021

This pr attempts to mirror some of the push sync behaviour changes introduced in #1662 and to enable retrying until a chunk is attempted to be retrieved from at least one peer


This change is Reviewable

@metacertain metacertain changed the title feat: Retrieval behaviour in retrieval feat: Retryal behaviour in retrieval May 19, 2021
@metacertain metacertain changed the title feat: Retryal behaviour in retrieval feat: Retrial behavior in retrieval May 19, 2021
@metacertain metacertain force-pushed the retrieval_behaviour branch 3 times, most recently from bef39b9 to 58d2295 Compare May 19, 2021 17:44
Copy link
Member

@janos janos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@aloknerurkar aloknerurkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @metacertain)

retrieveChunkTimeout = 10 * time.Second

retrieveRetryIntervalDuration = 5 * time.Second
)

func (s *Service) RetrieveChunk(ctx context.Context, addr swarm.Address) (swarm.Chunk, error) {
func (s *Service) RetrieveChunk(ctx context.Context, addr swarm.Address, origin bool) (swarm.Chunk, error) {
s.metrics.RequestCounter.Inc()

v, err, _ := s.singleflight.Do(addr.String(), func() (interface{}, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a chunk is already retrieved with origin false, does that singleflight here not imply that we would never try with maxPeers=8 if a origin true call happens at the same time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, the origin flag should be part of the singleflight key.

@metacertain metacertain force-pushed the retrieval_behaviour branch 7 times, most recently from b406e1e to ff99f01 Compare May 27, 2021 12:15
@Eknir
Copy link
Contributor

Eknir commented May 27, 2021

Let's merge this after the upcoming release (0.6.2)

@metacertain metacertain force-pushed the retrieval_behaviour branch 4 times, most recently from ae5f4c8 to e18113d Compare June 4, 2021 09:48
@metacertain metacertain force-pushed the retrieval_behaviour branch 3 times, most recently from aa06a82 to 150d81f Compare June 7, 2021 15:00
Copy link
Contributor

@aloknerurkar aloknerurkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 6 files reviewed, 2 unresolved discussions (waiting on @aloknerurkar, @metacertain, and @ralph-pichler)


pkg/retrieval/retrieval.go, line 111 at r2 (raw file):

	}

	v, err, _ := s.singleflight.Do(flightRoute, func() (interface{}, error) {

Problem with this singleFlight pattern is that the parent context cancellation does not cancel the singleFlight requests which are being shared, resulting in goroutine leaks. This was inspected during #1891 and we decided to use @janos 's lib which adds context semantics correctly to singleFlight. I would recommend using that here.


pkg/retrieval/retrieval.go, line 187 at r2 (raw file):

				requestAttempt++
				timeNow := time.Now().Unix()
				if timeNow > lastTime {

When will this not be true?

@metacertain
Copy link
Member Author

metacertain commented Jun 9, 2021

When will this not be true?

if we are in the same second still
we wait 600 miliseconds only so we should get into the next second within 2 rounds of requestAttempts

if !peer.IsZero() {
logger.Debugf("retrieval: failed to get chunk %s from peer %s: %v", addr, peer, err)
}
for requestAttempt < 5 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should stay consistent and have this as a variable at the top of the file


requestAttempt++
timeNow := time.Now().Unix()
if timeNow > lastTime {
Copy link
Member

@istae istae Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would always be true, no? What is this checking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are in the same second still
we wait 600 miliseconds only so we should get into the next second within 2 rounds of requestAttempts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, can we add a test case that covers the area starting at requestAttempt++ and down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave the judgement to you if this is necessary

Copy link
Contributor

@aloknerurkar aloknerurkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r4.
Reviewable status: 3 of 6 files reviewed, 2 unresolved discussions (waiting on @aloknerurkar, @esadakar, and @ralph-pichler)

@aloknerurkar
Copy link
Contributor


pkg/retrieval/retrieval.go, line 188 at r3 (raw file):

Previously, metacertain wrote…

if we are in the same second still
we wait 600 miliseconds only so we should get into the next second within 2 rounds of requestAttempts

It would be good if you can add a comment here explaining this.

@metacertain metacertain merged commit 18bdc86 into master Jun 11, 2021
@metacertain metacertain deleted the retrieval_behaviour branch June 11, 2021 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants