This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

simple fetchers #1492

Merged
acud merged 17 commits into master from simple-fetchers on Jun 18, 2019

Conversation

@nonsense (Contributor) commented Jun 17, 2019

Related discussion: #1344


This PR is a subset of the simplify-fetchers branch, split out for easier review. It mostly addresses the issues described in #1309.

It includes:

  1. A refactor of NetStore, where we replace FetchFunc with two new functions:
	NetStore.RemoteGet = delivery.RequestFromPeers
	NetStore.RemoteFetch

A good place to start reviewing this PR is these two functions and the places where they are used.

  2. swarm/network/fetcher.go has been removed and replaced with singleflight.Group. Every time NetStore.Get looks for a chunk that is not in our LocalStore, it uses the singleflight.Group to suppress duplicate retrievals for the same chunk (see the sketch after this list).

Each in-flight request for a chunk adds a Fetcher value to a map, so that NetStore.Get can receive a signal when the chunk it requested has been delivered. The signal that a chunk has been delivered and stored to the LocalStore is the closing of the Fetcher.Delivered channel.

NetStore.Put is responsible for removing fetchers and closing the delivered channel. If a fetcher item has been added to the fetchers map, then that chunk is expected to arrive, and there is an interested party waiting for the delivered signal.

We need to add a timeout to the fetchers, because the fact that a chunk has been requested and an actor is waiting for it does not guarantee that it will ever be delivered.

  3. handleOfferedHashesMsg uses NetStore.Has to determine whether it needs a chunk (while syncing). If a chunk is needed, a Fetcher is also created, so that we can keep track of it while the current batch is being delivered.

It is possible for a chunk to be both requested through a retrieve request and delivered through syncing independently, but there will always be only one Fetcher and one Delivered channel for it, so that all interested parties are notified when the chunk is delivered and stored.

  4. The part of RequestFromPeers that would "first try to request explicitly from peers that are known to have offered the chunk" has been removed. We no longer control the flow through values in the context.Context.

  5. RequestTimeout has been split into RequestTimeout and FailedPeerSkipDelay, as these have different meanings. All timeouts are now placed in the timeouts package and have documentation (a hypothetical sketch of this package follows the NICE TO HAVE list below).

  6. Added LNetStore (we probably need a better name) - a NetStore wrapper needed by the LazyChunkReader.

  7. Found a bug where we request the same chunk from multiple peers via the OfferedHashes/WantedHashes protocol. In a future PR we will address it the following way: if we have already requested a chunk, we have a fetcher for it, so subsequent OfferedHashes for that chunk won't have an effect.

  8. Fixed a bug where the interval passed to SubscriptionPull was treated as exclusive, meaning we lost one chunk between batches.

  9. Fixed a bug with chunk deliveries described in "Fix tracing for chunk delivery" #1292 - chunks are delivered, but the fetcher continues to make requests for the same chunk.
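
For illustration, here is a minimal, self-contained sketch of the flow described in points 1-3 above. NetStore, Fetcher, Fetcher.Delivered, GetOrCreateFetcher and RemoteGet are names from this PR; the Chunk/Address types, the timeout value and all function bodies are illustrative assumptions, not the actual swarm implementation:

	package netstore

	import (
		"context"
		"errors"
		"sync"
		"time"

		"golang.org/x/sync/singleflight"
	)

	type Address string

	type Chunk struct {
		Addr Address
		Data []byte
	}

	// Fetcher signals the delivery of one chunk to every interested party.
	type Fetcher struct {
		Delivered chan struct{} // closed by Put once the chunk is stored
		once      sync.Once
	}

	type NetStore struct {
		mu       sync.Mutex
		local    map[Address]Chunk    // stand-in for the LocalStore
		fetchers map[Address]*Fetcher // one Fetcher per in-flight chunk
		sf       singleflight.Group   // suppresses duplicate retrievals

		// RemoteGet issues the actual retrieve request,
		// e.g. delivery.RequestFromPeers in this PR.
		RemoteGet func(ctx context.Context, addr Address) error
	}

	func New(remoteGet func(ctx context.Context, addr Address) error) *NetStore {
		return &NetStore{
			local:     make(map[Address]Chunk),
			fetchers:  make(map[Address]*Fetcher),
			RemoteGet: remoteGet,
		}
	}

	// GetOrCreateFetcher returns the Fetcher for addr, creating one if needed.
	func (n *NetStore) GetOrCreateFetcher(addr Address) (fi *Fetcher, created bool) {
		n.mu.Lock()
		defer n.mu.Unlock()
		if fi, ok := n.fetchers[addr]; ok {
			return fi, false
		}
		fi = &Fetcher{Delivered: make(chan struct{})}
		n.fetchers[addr] = fi
		return fi, true
	}

	// Get returns a chunk from the local store, or fetches it from the network.
	// Concurrent Gets for the same address are collapsed by singleflight.
	func (n *NetStore) Get(ctx context.Context, addr Address) (Chunk, error) {
		n.mu.Lock()
		c, ok := n.local[addr]
		n.mu.Unlock()
		if ok {
			return c, nil
		}
		_, err, _ := n.sf.Do(string(addr), func() (interface{}, error) {
			fi, _ := n.GetOrCreateFetcher(addr)
			if err := n.RemoteGet(ctx, addr); err != nil {
				return nil, err
			}
			select {
			case <-fi.Delivered: // closed by Put when the chunk arrives
				return nil, nil
			case <-time.After(30 * time.Second): // fetchers must not wait forever
				return nil, errors.New("fetcher timed out")
			case <-ctx.Done():
				return nil, ctx.Err()
			}
		})
		if err != nil {
			return Chunk{}, err
		}
		n.mu.Lock()
		defer n.mu.Unlock()
		return n.local[addr], nil
	}

	// Put stores a delivered chunk, closes its Fetcher's Delivered channel to
	// notify all waiting parties, and removes the Fetcher from the map.
	func (n *NetStore) Put(c Chunk) {
		n.mu.Lock()
		defer n.mu.Unlock()
		n.local[c.Addr] = c
		if fi, ok := n.fetchers[c.Addr]; ok {
			fi.once.Do(func() { close(fi.Delivered) })
			delete(n.fetchers, c.Addr)
		}
	}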


TODO TESTS:

  • fix broken unit tests, and remove irrelevant unit tests.
  • review and fix streamer_test.go tests
    • review and fix TestStreamerDownstream*
  • TestDeliveryFromNodes 8_200 seems to be flaky on macOS; the test is removed altogether, because we no longer blindly forward requests to our peers (FindPeer was changed).
  • review and fix flaky TestRetrieval
    • make sure it works when retrieval is done after syncing completes
    • make sure it works when retrieval is done before syncing completes
  • review and fix flaky TestFileRetrieval

TODO:

  • fix log levels - currently a chunk not being found results in error-level logs, which is not right.
  • address all new todos and fixme leftovers in the change set
  • add a global timeout for fetchers - if a chunk is never delivered, we don't want to keep a fetcher around forever. For this we need proper lifecycle and ownership handling for fetchers, which we don't have right now.

NICE TO HAVE:

  • RequestFromPeers quit channel - get notified when a peer disconnects, so that we don't have to wait for timeouts.SearchTimeout and can immediately try the next peer.
  • add test for FindPeer - it has at least one new condition - not going out of depth - Add test for FindPeer #1362
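
Point 5 above and the timeouts.SearchTimeout mentioned in the last bullet refer to the new timeouts package. A hypothetical sketch of its shape - only the three timeout names appear in this PR's text, the values and comments are assumptions:

	package timeouts

	import "time"

	// RequestTimeout: how long a retrieve request may stay open overall.
	var RequestTimeout = 10 * time.Second

	// FailedPeerSkipDelay: how long a peer that failed to deliver a chunk is
	// skipped before we request from it again.
	var FailedPeerSkipDelay = 20 * time.Second

	// SearchTimeout: how long we wait on one peer before asking the next one.
	var SearchTimeout = 1 * time.Second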

@nonsense nonsense requested a review from zelig June 17, 2019 08:32
@nonsense nonsense added this to Backlog in Swarm Core - Sprint planning via automation Jun 17, 2019
@nonsense nonsense moved this from Backlog to In review (includes Documentation) in Swarm Core - Sprint planning Jun 17, 2019
@nonsense nonsense added this to the 0.4.4 milestone Jun 17, 2019
Review thread on this hunk:

	-	return wait
	+	return loaded, func(ctx context.Context) error {
	+		select {
Member:
so should we not select on the ctx.Done?

@nonsense (author):

Why does this matter? Tests seem to be passing with the current select?
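
For context, a hedged sketch of what the returned wait function would look like if it also selected on ctx.Done, as the reviewer suggests; the Fetcher type and the timeout are illustrative assumptions:

	package syncer

	import (
		"context"
		"errors"
		"time"
	)

	// Fetcher mirrors the PR's fetcher: Delivered is closed when the chunk is stored.
	type Fetcher struct {
		Delivered chan struct{}
	}

	// wait blocks until the chunk is delivered, the timeout fires, or - the
	// point of the review comment - the caller's context is cancelled.
	func wait(ctx context.Context, fi *Fetcher, timeout time.Duration) error {
		select {
		case <-fi.Delivered:
			return nil
		case <-time.After(timeout):
			return errors.New("timed out waiting for chunk delivery")
		case <-ctx.Done():
			return ctx.Err()
		}
	}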

Review thread on this hunk:

		log.Trace("checking offered hash", "ref", fmt.Sprintf("%x", hash))
	-	if wait := c.NeedData(ctx, hash); wait != nil {
	+	if _, wait := c.NeedData(ctx, hash); wait != nil {
Member:
so we are still asking all offering peers, not just the first. note TODO here

@nonsense (author):

Yes, we are still asking all peers. We decided that we will be addressing this later, in order to reduce the scope of an already big PR.

Review thread on this hunk:

	func (s *SwarmSyncerClient) NeedData(ctx context.Context, key []byte) (loaded bool, wait func(context.Context) error) {
		start := time.Now()

		fi, loaded, ok := s.netStore.GetOrCreateFetcher(ctx, key, "syncer")
Member:
TODO this logic should be part of netstore.Get no?

@nonsense (author):
I don't understand this question. Why do you want this to be part of NetStore.Get? In order to adhere to the interface of the syncer (which this PR is not addressing), we need to know how long to wait for a chunk, hence the returned Fetcher and the select on <-fi.Delivered below.

If this were part of NetStore.Get, we would just be pushing that complexity into NetStore.Get; I don't really see why that would be an improvement.

Member:
Let's leave it as it is for the moment. @zelig I've also discussed this at length with @nonsense; there's no benefit to having it here or there. I vote we leave it as is, because the whole syncer part is going to be heavily simplified anyway, and this might as well go into that effort.
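
A hedged reconstruction of how the quoted NeedData might continue, based on the hunk above and the author's mention of the select on <-fi.Delivered; the exact meaning of the loaded/ok return values and the timeout are assumptions, not the actual code:

	package syncer

	import (
		"context"
		"errors"
		"time"
	)

	type Fetcher struct {
		Delivered chan struct{}
	}

	// netStore is only the part of NetStore that the syncer client needs here.
	type netStore interface {
		GetOrCreateFetcher(ctx context.Context, key []byte, interestedParty string) (fi *Fetcher, loaded bool, ok bool)
	}

	type SwarmSyncerClient struct {
		netStore netStore
	}

	func (s *SwarmSyncerClient) NeedData(ctx context.Context, key []byte) (loaded bool, wait func(context.Context) error) {
		fi, loaded, ok := s.netStore.GetOrCreateFetcher(ctx, key, "syncer")
		if !ok {
			// assumed: no fetcher means there is nothing to wait for
			return loaded, nil
		}
		return loaded, func(ctx context.Context) error {
			select {
			case <-fi.Delivered: // the chunk arrived and was stored in the LocalStore
				return nil
			case <-time.After(10 * time.Second):
				// illustrative timeout; the earlier thread discusses whether
				// ctx.Done should also be selected on here
				return errors.New("timed out waiting for chunk")
			}
		}
	}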

Review thread on this hunk:

	// LNetStore is a wrapper of NetStore, which implements the chunk.Store interface. It is used only by the FileStore,
	// the component used by the Swarm API to store and retrieve content and to split and join chunks.
Member:
TODO we should simplify this. AFAIU this extra object is only needed because Origin is not handled via peers to skip.

@nonsense (author):
How do you propose we simplify this?

This extra struct LNetStore is needed because of the FileStore, which has no notion of network requests or of the NetStore; it just expects a simple Get interface that returns chunks. We want to be able to test the FileStore independently of the NetStore.

Any concrete ideas are welcome.
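
To make the discussion concrete, a minimal sketch of the LNetStore idea: the NetStore API carries network-level detail (here an assumed origin parameter, following the Member's remark about peers to skip), while the FileStore/LazyChunkReader only understands a plain Get. Only NetStore, LNetStore and FileStore are names from this PR; everything else is illustrative:

	package storage

	import "context"

	type Address []byte

	type Chunk struct {
		Addr Address
		Data []byte
	}

	// NetStore's Get is assumed to need the origin of a request, so a chunk is
	// not requested back from the peer that asked for it.
	type NetStore struct{}

	func (n *NetStore) Get(ctx context.Context, origin string, addr Address) (Chunk, error) {
		// real implementation: LocalStore lookup, then network retrieval,
		// skipping the origin peer
		return Chunk{Addr: addr}, nil
	}

	// LNetStore adapts NetStore to the simple interface the FileStore expects,
	// which has no notion of peers or network requests.
	type LNetStore struct {
		net *NetStore
	}

	func (l *LNetStore) Get(ctx context.Context, addr Address) (Chunk, error) {
		return l.net.Get(ctx, "", addr) // no origin: a local, API-initiated request
	}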

@acud acud merged commit d589af1 into master Jun 18, 2019
Swarm Core - Sprint planning automation moved this from In review (includes Documentation) to Done Jun 18, 2019
@nonsense nonsense modified the milestones: 0.4.4, 0.4.2 Jun 18, 2019
@nonsense nonsense deleted the simple-fetchers branch July 8, 2019 08:17