on restart miner shouldn't dial client #463

Merged: 3 commits merged into master from feat/miner-dont-dial-client on Dec 16, 2020

Conversation


@dirkmc dirkmc commented Dec 2, 2020

Addresses filecoin-project/lotus#4991
Depends on filecoin-project/go-data-transfer#127

I think it may make more sense to let go-data-transfer take care of restarts because it already has retry logic, so

  • it keeps retry logic in one place (just in go-data-transfer)
  • it makes the go-fil-markets FSM simpler
  • it prevents retries in go-data-transfer and go-fil-markets from getting out of sync

To that end in this PR:

  • When the miner restarts, don't try to dial the client (let the client attempt to reconnect to the miner)
  • When the connection from the client to the miner goes down, rely on go-data-transfer to attempt to redial
  • If the connection fails, just error out the deal
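
For illustration, here is a minimal sketch of what the first point can look like in the provider FSM, written in the go-statemachine/fsm builder style this repo already uses. The state StorageDealProviderTransferAwaitRestart comes from this PR's diff, but the exact event wiring and message are assumptions, not the merged code:

fsm.Event(storagemarket.ProviderEventRestart).
	From(storagemarket.StorageDealTransferring).
	To(storagemarket.StorageDealProviderTransferAwaitRestart).
	Action(func(deal *storagemarket.MinerDeal) error {
		// On restart, don't dial the client; just record that we're
		// waiting for the client to reconnect and resume the transfer.
		deal.Message = "awaiting restart of data transfer by client"
		return nil
	}),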

This PR does not change the existing behaviour on the miner when the connection (but not the miner itself) goes down:

  • stay in StorageDealTransferring (don't error out the deal)
  • go-data-transfer attempts to retry the connection

This PR does not change the existing behaviour on the client when the client restarts:

  • attempt to reconnect to the miner

Current retry behaviour in go-data-transfer is:

  • 5 attempts, with an exponential backoff factor of 5
  • starting at a 1s delay
  • delays capped at 5 min

I suggest

  • leaving retry behaviour as it is now on the miner (i.e. try to reconnect for a few minutes, then give up and wait for the client to reconnect)
  • changing retry behaviour at the network level on the client to a maximum of 15 attempts with an exponential backoff factor of 5:
    1s + 5s + 25s + 125s + 5m + (10 x 5m) ~= 1 hour
    Note: This will be configured in lotus (not in go-fil-markets)
  • changing go-data-transfer to restart push transfers
    Implemented in Automatically restart push channel go-data-transfer#127
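
As a sanity check on those numbers, here is a small self-contained Go sketch that sums the waits for both schedules (plain arithmetic, assuming one backoff wait after each failed attempt; this is not the actual go-data-transfer retry code):

package main

import (
	"fmt"
	"time"
)

// totalBackoff sums the waits of an exponential backoff schedule: the delay
// starts at `start`, is multiplied by `factor` after each attempt, and is
// capped at `max`.
func totalBackoff(start time.Duration, factor float64, max time.Duration, attempts int) time.Duration {
	var total time.Duration
	delay := start
	for i := 0; i < attempts; i++ {
		total += delay
		delay = time.Duration(float64(delay) * factor)
		if delay > max {
			delay = max
		}
	}
	return total
}

func main() {
	// Current behaviour: 5 attempts, 1s start, factor 5, capped at 5m.
	fmt.Println(totalBackoff(time.Second, 5, 5*time.Minute, 5)) // 7m36s
	// Proposed client behaviour: 15 attempts with the same schedule.
	fmt.Println(totalBackoff(time.Second, 5, 5*time.Minute, 15)) // 57m36s (~1 hour)
}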

Notes:

  • While testing I discovered that on the provider side we perform validation of incoming data-transfer requests before the provider has completed starting up, which means that when the provider is restarted, it may reject a voucher because it doesn't find the deal in its local store. So I modified the validation code to block until the provider has completed starting up.
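
The fix can be pictured as a readiness gate. Below is a minimal sketch in the spirit of the shared/ready.go helper this PR adds (the names and signatures are illustrative, not the exact ReadyManager API); validation calls awaitReady before looking up the deal in the local store:

package shared

import "context"

// readyGate lets the provider's request validator block until startup has
// completed, so an incoming voucher isn't rejected just because the deal
// hasn't been loaded into the local store yet.
type readyGate struct {
	done chan struct{} // closed exactly once, when startup finishes
	err  error         // startup error, if any
}

func newReadyGate() *readyGate {
	return &readyGate{done: make(chan struct{})}
}

// fireReady marks startup as complete; err is non-nil if startup failed.
// It must be called exactly once.
func (g *readyGate) fireReady(err error) {
	g.err = err
	close(g.done)
}

// awaitReady blocks until startup completes or the context is cancelled.
func (g *readyGate) awaitReady(ctx context.Context) error {
	select {
	case <-g.done:
		return g.err
	case <-ctx.Done():
		return ctx.Err()
	}
}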

TODO:

  • update tests

@dirkmc dirkmc changed the title from "on disconnect miner shouldn't dial client" to "on restart miner shouldn't dial client" on Dec 2, 2020
@dirkmc dirkmc force-pushed the feat/miner-dont-dial-client branch from 75e7831 to dc6bb03 on December 2, 2020 15:13
@hannahhoward hannahhoward left a comment (Collaborator)

I definitely have no problem removing the provider restart

Regarding the "move restart to data transfer" -- this will become clearer when you're in the guts of data transfer, but there is a qualitative difference between restarting in the libp2p protocol and the concept of "stalled".

The current retry logic you identify is for the data transfer libp2p protocol. However, data transfer is both its own libp2p protocol and go-graphsync. go-graphsync has its own retry logic (which, side note, needs some work). A "stall" occurs when go-graphsync experiences a network error -- which happens when it exhausts its own internal retry logic, which is, among other things, fairly short. At that point the graphsync request fails, and to "restart", data transfer actually has to internally make a new graphsync request (this is when you call "Restart Data Transfer").

So we have:

  • retrying the data transfer libp2p protocol (internal to go-data-transfer)
  • retrying the go-graphsync libp2p protocol (internal to go-graphsync)
  • retrying the go-graphsync request (internal to go-data-transfer, but currently ONLY when you externally call RestartDataTransfer)

I just want to point out that there are some layers of complexity here -- not saying they are necessary, but to make a clean solution you probably need to understand what they are :)
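
For reference, the external "Restart Data Transfer" call mentioned above is a Manager-level restart. A sketch -- RestartDataTransferChannel matches the go-data-transfer Manager interface of this era, but treat the exact signature as an assumption:

import (
	"context"

	datatransfer "github.com/filecoin-project/go-data-transfer"
)

// restartTransfer asks go-data-transfer to internally make a new graphsync
// request for an existing channel -- the third retry layer listed above.
func restartTransfer(ctx context.Context, dt datatransfer.Manager, chid datatransfer.ChannelID) error {
	return dt.RestartDataTransferChannel(ctx, chid)
}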

@@ -265,6 +265,7 @@ const (
ProviderEventRestart

// ProviderEventDataTransferRestartFailed means a data transfer that was restarted by the provider failed
// Deprecated: this event is no longer used
Collaborator:

Just want to flag a thing for the future: at some point we are going to want to re-arrange these events and clean up deprecated ones.

When this happens, two things will be needed:

  • a migration - see our whole migrations process using go-ds-migrations
  • a protocol upgrade - meaning that we will need to define a new version of the protocol, and translate event numbers back and forth for older protocol versions

Collaborator:

NOTE: this is the markets libp2p protocol here FWIW


dirkmc commented Dec 3, 2020

Thanks for the detailed explanation @hannahhoward

Looking at the code for go-data-transfer it seems like it will fire the Disconnected event whenever:

  • there's a network error from the transport layer (graphsync)
  • there's a network error at the go-data-transfer layer

As you mentioned, in both go-data-transfer and go-graphsync there is retry logic at the network (libp2p) layer. So I suggest we simply adjust the parameters to add more network-level retry attempts in both protocols, on the client side. If after all those retries we still can't reach the provider, then I think it's reasonable to error out the deal.

Edit: My mistake, there is retry logic in go-data-transfer and go-fil-markets at the network layer, but the retry logic in go-graphsync is at a higher layer


dirkmc commented Dec 4, 2020

I had misunderstood how the retry logic works in go-graphsync; in summary:

  • go-graphsync makes 10 attempts to send a message
  • each attempt waits up to 10 minutes for a successful connection
  • the attempt may fail immediately if the other side rejects the connection
  • if the attempt fails go-graphsync closes the connection and waits 100ms before trying again

Therefore

  • if the other end rejects connection attempts, a send can fail within 10 x 100ms = 1s
  • if every connection attempt times out, a send can take up to 10 x 10 minutes = 1h 40m
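
In loop form, the behaviour described above looks roughly like this (an illustrative sketch, not go-graphsync's actual code):

import (
	"context"
	"fmt"
	"time"
)

const (
	maxAttempts    = 10
	connectTimeout = 10 * time.Minute       // per-attempt wait for a connection
	retryPause     = 100 * time.Millisecond // pause after a failed attempt
)

// sendWithRetry retries a send up to maxAttempts times. Immediate rejections
// exhaust it in ~10 x 100ms = 1s; full timeouts stretch it to 10 x 10m = 1h40m.
func sendWithRetry(ctx context.Context, send func(context.Context) error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, connectTimeout)
		err = send(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		// The connection is closed, then we pause before the next attempt.
		time.Sleep(retryPause)
	}
	return fmt.Errorf("send failed after %d attempts: %w", maxAttempts, err)
}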

I think to keep things moving forward we should keep the original intention of this PR: to prevent the miner from failing the deal if the miner can't contact the client on restart. We can tackle the case of the client reconnecting to the miner in a separate PR.


dirkmc commented Dec 4, 2020

This PR is currently blocked by an issue with go-graphsync: ipfs/go-graphsync#124

Essentially graphsync is not propagating some network errors up the stack, which means that when the client attempts to make a data transfer, the transfer fails silently and the deal gets stuck in StorageDealStartDataTransfer or StorageDealTransferring.

The PR is also blocked on writing a test that:

  • starts a data transfer
  • shuts down the provider
  • restarts the provider
  • verifies that the data transfer completes


dirkmc commented Dec 10, 2020

When the client wants to send data to a provider:

  1. go-data-transfer client opens a "push" channel to the provider
  2. when go-data-transfer provider receives the "push" message, it opens a go-graphsync query to the client for the file
  3. on the client go-graphsync responds to the query from the provider by sending the file data

If the miner goes down in step 3, currently the client will not automatically try to restart the channel.
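
Step 1 corresponds to a single call on the client. A sketch -- OpenPushDataChannel is the go-data-transfer Manager method of this era, though the exact signature should be treated as an assumption:

import (
	"context"

	datatransfer "github.com/filecoin-project/go-data-transfer"
	"github.com/ipfs/go-cid"
	ipld "github.com/ipld/go-ipld-prime"
	"github.com/libp2p/go-libp2p-core/peer"
)

// openPush starts step 1: ask the provider to pull the payload from us.
// Steps 2 and 3 then happen inside go-data-transfer and go-graphsync.
func openPush(ctx context.Context, dt datatransfer.Manager, provider peer.ID,
	voucher datatransfer.Voucher, payload cid.Cid, sel ipld.Node) (datatransfer.ChannelID, error) {
	return dt.OpenPushDataChannel(ctx, provider, voucher, payload, sel)
}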

Proposal

After the client opens a "push" channel to the provider, it should monitor the transfer's data rate; if the rate drops below a minimum, the channel is considered stalled and the client attempts to restart it, with exponential backoff between attempts.

Default config params:

  • minimum data rate: 1MB / minute
  • restart after stall:
    • 10 attempts
    • min backoff: 1s
    • max backoff: 5m
    • backoff factor: 5

Edit: This proposal is implemented in filecoin-project/go-data-transfer#127
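
Expressed as a config struct, those defaults might look like this (every field name here is hypothetical; the real options landed with the channel monitor in go-data-transfer#127 and may be named differently):

import "time"

// pushRestartConfig mirrors the proposed defaults above.
type pushRestartConfig struct {
	MinBytesPerInterval uint64        // minimum data rate: bytes per check interval
	CheckInterval       time.Duration // how often the data rate is sampled
	MaxRestartAttempts  int           // give up after this many restarts
	MinBackoff          time.Duration
	MaxBackoff          time.Duration
	BackoffFactor       float64
}

var defaultPushRestartConfig = pushRestartConfig{
	MinBytesPerInterval: 1 << 20,     // 1MB...
	CheckInterval:       time.Minute, // ...per minute
	MaxRestartAttempts:  10,
	MinBackoff:          time.Second,
	MaxBackoff:          5 * time.Minute,
	BackoffFactor:       5,
}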


codecov-io commented Dec 15, 2020

Codecov Report

Merging #463 (e9d31f4) into master (aa8533f) will decrease coverage by 0.65%.
The diff coverage is 43.55%.


@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   65.58%   64.93%   -0.64%     
==========================================
  Files          46       49       +3     
  Lines        3198     3259      +61     
==========================================
+ Hits         2097     2116      +19     
- Misses        875      916      +41     
- Partials      226      227       +1     
Impacted Files Coverage Δ
storagemarket/impl/clientstates/client_fsm.go 82.73% <0.00%> (ø)
storagemarket/impl/provider_environments.go 0.00% <0.00%> (-2.12%) ⬇️
storagemarket/impl/providerstates/provider_fsm.go 73.95% <0.00%> (+2.41%) ⬆️
...oragemarket/impl/providerstates/provider_states.go 85.12% <ø> (+0.93%) ⬆️
storagemarket/impl/requestvalidation/types.go 0.00% <ø> (ø)
storagemarket/impl/provider.go 27.21% <60.00%> (-0.20%) ⬇️
shared/ready.go 60.47% <62.86%> (ø)
storagemarket/impl/clientstates/client_states.go 89.93% <100.00%> (ø)
storagemarket/impl/requestvalidation/common.go 86.21% <100.00%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dirkmc dirkmc marked this pull request as ready for review December 15, 2020 15:18
@hannahhoward hannahhoward left a comment (Collaborator)

LGTM, couple of suggestions. Like the ready manager refactor -- shoulda added this a while ago!

go.mod (outdated review suggestion, resolved)
FromMany(storagemarket.StorageDealTransferring, storagemarket.StorageDealProviderTransferAwaitRestart).
ToJustRecord().
Action(func(deal *storagemarket.MinerDeal) error {
deal.Message = "data transfer appears to be stalled, awaiting reconnect from client"
Collaborator:

just want to flag that based on how I'm reading this, it looks like you want to delete the miner restart data transfer commands in Lotus when you merge this.

Collaborator:

I mean, I guess they can still get called manually -- at least the miner bears responsibility if they do. Also, they would probably never actually do that anyway.

dirkmc (Contributor Author):

Yeah, that's a good point -- we will need to think through whether we still need those commands.

@dirkmc dirkmc merged commit b0c1f6d into master Dec 16, 2020
@dirkmc dirkmc deleted the feat/miner-dont-dial-client branch December 16, 2020 14:25
@dirkmc dirkmc mentioned this pull request Dec 16, 2020