This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

network/stream: fix flaky tests and first delivered batch of chunks for an unwanted stream #1952

Merged
merged 4 commits into from
Nov 15, 2019

Conversation

acud
Member

@acud acud commented Nov 14, 2019

This PR addresses three problems:

  1. TestStarNetworkSyncWithBogusNodes was flaking on Travis because the sync delay timer was too short. This, at least by my interpretation, means Swarm is getting slower over time; we should keep this in mind
  2. TestNodesCorrectBinsDynamic was flaking because syncing was hardcoded to Autostart every time. This resulted in incorrect bin indexes being exchanged, because some syncing had already occurred, introducing non-determinism in the state of the binIDs and therefore in the cursors of the nodes. This is solved by a configurable parameter on SyncSimServiceOptions that toggles whether streams should be autostarted
  3. A bug in the stream package caused a first batch of chunks to be delivered before an unwanted stream was actually dropped.
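The toggle in point 2 can be sketched as follows. This is a simplified, hypothetical model: the real SyncSimServiceOptions struct and constructor in this PR carry more fields, and the return values here stand in for actual service wiring.

```go
package main

import "fmt"

// Simplified model of the configurable autostart parameter: a nil options
// value keeps the old behaviour (streams autostarted), while tests that
// need deterministic binIDs and cursors pass explicit options with
// Autostart disabled.
type SyncSimServiceOptions struct {
	Autostart bool
}

func newSyncSimService(o *SyncSimServiceOptions) string {
	if o == nil {
		o = &SyncSimServiceOptions{Autostart: true} // default: streams start on their own
	}
	if o.Autostart {
		return "streams autostarted"
	}
	return "streams started only on demand"
}

func main() {
	fmt.Println(newSyncSimService(nil))
	fmt.Println(newSyncSimService(&SyncSimServiceOptions{Autostart: false}))
}
```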

The scenario is the following: a node initially establishes streams on certain bins that become obsolete once the kademlia depth changes (for example, depth is initially 0 and a stream on SYNC|0 is requested from a node with PO 2; subsequently, the kademlia depth changes to 3, and the subscription on SYNC|0 is no longer relevant). Before this fix, received ChunkDelivery messages would put the chunks into the localstore without checking that provider.WantStream() == true.

This resulted in the first requested range of a live stream always being persisted, regardless of whether that stream was still wanted at the time of reception. The data race happened when the depth changed between HandleOfferedHashes and the call to requestSubsequentRange at the end of the clientHandleOfferedHashes method.

This has now been amended with the appropriate checks in the clientHandleOfferedHashes and clientHandleChunkDelivery handlers.
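A toy model of the delivery-side guard, assuming the WantStream name from the PR; the depth/PO rule sketched here (peers within depth sync bins at or above depth, peers outside depth sync only their own proximity bin) is my own simplification, and the actual subscription logic in Swarm is more involved:

```go
package main

import "fmt"

// provider models the state the WantStream check depends on: the current
// kademlia depth and the proximity order (PO) of the peer.
type provider struct {
	depth  int
	peerPO int
}

// WantStream reports whether a SYNC|bin stream from this peer is still
// wanted (simplified rule, see the lead-in for the assumption made here).
func (p *provider) WantStream(bin int) bool {
	if p.peerPO >= p.depth {
		return bin >= p.depth // peer within depth: sync bins at or above depth
	}
	return bin == p.peerPO // peer outside depth: only its own proximity bin
}

// handleChunkDelivery checks WantStream before persisting, mirroring the
// guard added in clientHandleChunkDelivery: if the stream became unwanted
// (e.g. the depth changed mid-batch), the chunks are not stored.
func handleChunkDelivery(p *provider, bin int, chunks []string, store map[string]bool) error {
	if !p.WantStream(bin) {
		return fmt.Errorf("dropping %d chunks for unwanted stream SYNC|%d", len(chunks), bin)
	}
	for _, c := range chunks {
		store[c] = true
	}
	return nil
}

func main() {
	store := map[string]bool{}
	p := &provider{depth: 0, peerPO: 2}
	fmt.Println(handleChunkDelivery(p, 0, []string{"c1"}, store)) // nil: SYNC|0 wanted at depth 0

	p.depth = 3 // kademlia depth changes between deliveries
	fmt.Println(handleChunkDelivery(p, 0, []string{"c2"}, store)) // error: SYNC|0 now unwanted
	fmt.Println(len(store))                                       // only the first chunk persisted
}
```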

A change in kademlia depth is still possible between sending the WantedHashes message and receiving the first ChunkDelivery message. Another depth change can occur between ChunkDelivery messages when the batch is split across several messages. Both of these cases, however, are mitigated by the check added in the clientHandleChunkDelivery handler, which returns without processing the chunks, eventually causing the batch to time out within clientSealBatch and the subsequent range never to be requested.

@acud acud self-assigned this Nov 14, 2019
@acud acud added this to Backlog in Swarm Core - Sprint planning via automation Nov 14, 2019
@acud acud moved this from Backlog to In progress in Swarm Core - Sprint planning Nov 14, 2019
@acud acud force-pushed the stream-test-flake branch 2 times, most recently from 7e500e7 to 48d0ae3 Compare November 15, 2019 07:47
@acud acud requested review from janos, nolash and zelig November 15, 2019 07:47
@acud acud moved this from In progress to In review (includes Documentation) in Swarm Core - Sprint planning Nov 15, 2019
@acud acud changed the title network/stream: fix flaky test network/stream: fix flaky test and first delivered batch of chunks for an unwanted stream Nov 15, 2019
become unwanted due to depth change in kademlia. this resulted in a
batch of chunks being delivered on the now unwanted stream before
_not_ requesting the next interval (due to WantStream returning false)
@acud acud changed the title network/stream: fix flaky test and first delivered batch of chunks for an unwanted stream network/stream: fix flaky tests and first delivered batch of chunks for an unwanted stream Nov 15, 2019
	StreamConstructorFunc func(state.Store, []byte, ...StreamProvider) node.Service
}

func newSyncSimServiceFunc(o *SyncSimServiceOptions) func(ctx *adapters.ServiceContext, bucket *sync.Map) (s node.Service, cleanup func(), err error) {
	if o == nil {
		o = new(SyncSimServiceOptions)
		o.Autostart = true // start stream on by default
This may result in different behaviour for the Autostart default value. If o is nil, Autostart is true; but if o is not nil yet does not specify Autostart, as in &SyncSimServiceOptions{SyncOnlyWithinDepth: true}, Autostart will be false, which is inconsistent. I think the boolean should be named so that its default value is consistent in both cases, like NoAutosync.

This is even visible in this PR: options are constructed just to set Autostart to false explicitly, even though false is the zero value for the field.
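The reviewer's point can be illustrated with a minimal sketch, using the suggested NoAutosync field on a hypothetical, trimmed-down options struct: because the field is named after the non-default behaviour, the struct's zero value matches the nil-options default, and both construction paths agree.

```go
package main

import "fmt"

// Options names its boolean after the non-default behaviour, so the zero
// value (NoAutosync == false, i.e. autosync on) matches the nil default.
type Options struct {
	NoAutosync bool
}

func autosyncEnabled(o *Options) bool {
	if o == nil {
		o = new(Options) // zero value: NoAutosync == false
	}
	return !o.NoAutosync
}

func main() {
	fmt.Println(autosyncEnabled(nil))                       // true: nil default
	fmt.Println(autosyncEnabled(&Options{}))                // true: consistent with nil
	fmt.Println(autosyncEnabled(&Options{NoAutosync: true})) // false: explicitly disabled
}
```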

@acud acud merged commit 9b0c910 into master Nov 15, 2019
Swarm Core - Sprint planning automation moved this from In review (includes Documentation) to Done Nov 15, 2019
@acud acud deleted the stream-test-flake branch November 15, 2019 16:10
acud added a commit that referenced this pull request Nov 18, 2019
…chunks for an unwanted stream (#1952)"

This reverts commit 9b0c910.
@acud acud added this to the 0.5.3 milestone Nov 25, 2019