-
Notifications
You must be signed in to change notification settings - Fork 112
swap: introduce pending cheques, drop peers on io errors #1887
Conversation
196f0b3
to
96f540a
Compare
1e4be68
to
75bf887
Compare
swap/peer.go
Outdated
@@ -40,6 +40,7 @@ type Peer struct { | |||
contractAddress common.Address | |||
lastReceivedCheque *Cheque | |||
lastSentCheque *Cheque | |||
pendingCheque *Cheque |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small comment to each field in the peer struct would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
swap/swap.go
Outdated
} | ||
|
||
if err := p.setLastSentCheque(cheque); err != nil { | ||
p.Drop("persistence error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we drop the peer if our node cannot save the cheque? I would propose to shut down Swarm with this i/o error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not entirely clear to me how to shut down swarm and I didn't want to ignore the error entirely, so I used drop because that's what the other protocols do on io error as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting @janos to give his opinion here.
After an IO error, is it sufficient to just drop the peer (as done by @ralph-pichler here) or should Swarm be stopped entirely? If we need to stop Swarm, how to do this safely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't support a shutdown invoked by some high level components. you could panic
and risk data corruption with the databases since a shutdown is not invoked when panicking. dropping the peer in this case is the safer option because that way, if you're getting local IO errors and you start dropping peers - you isolate the problem to your local node and technically you do not participate in the network anymore (ideally). I'd stick with dropping peers at the moment since node behavior with IO errors is something we have not tested yet and it is quite an involved task in its own
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A very interesting question. I agree with @acud, as we do not have a way to shut down the node, except with panic. If this error happens, we will drop the peer, but it is very likely that the same I/O error will happen in all next calls, so we will drop all peers, practically removing our node from the network. And maybe recovering if the I/O problem gets resolved. This could happen if the disk is full, for example. Maybe we could drop all peers if this happens which would prevent other peers to step into the same problem. I am not sure how easy is to do so currently. If it requires non trivial changes, this drop is ok for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #1515
swap/peer.go
Outdated
@@ -154,16 +169,18 @@ func (p *Peer) createCheque() (*Cheque, error) { | |||
// To be called with mutex already held |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update the comment, mentioning the added functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
swap/swap.go
Outdated
} | ||
|
||
// since more swap traffic might have occurred since this cheque was already sent, we redo the payment threshold check | ||
if err := s.checkPaymentThreshold(p); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the disconnect threshold is set at less than 200% of the payment threshold, this function will never result in sending another cheque.
Proposing to update the comment:
since more swap traffic might have occurred (when disconnect threshold >= 2 * paymentThreshold)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible that more traffic has occurred when this function is under lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not during handleConfirmChequeMsg
. But traffic might have occurred since the call to checkPaymentThreshold
in Add
where this cheque was initially created.
swap/swap.go
Outdated
} | ||
|
||
// since more swap traffic might have occurred since this cheque was already sent, we redo the payment threshold check | ||
if err := s.checkPaymentThreshold(p); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If sendCheque
returns an error here, it is handled differently than when sendCheque
returns an error while being called from the Add
function. I don't think this is desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I intentionally didn't change the error handling in Add
in this PR to keep it simple. Add
shouldn't return an error either if cheque sending fails but I would leave that (and the rest of the errors issues in Add
) to a separate PR and only deal with pending cheques in this one.
@@ -43,3 +43,7 @@ type HandshakeMsg struct { | |||
type EmitChequeMsg struct { | |||
Cheque *Cheque | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -251,6 +251,10 @@ func TestTriggerPaymentThreshold(t *testing.T) { | |||
t.Fatal(err) | |||
} | |||
|
|||
debitorSwap.handleConfirmChequeMsg(ctx, creditor, &ConfirmChequeMsg{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would place these additional lines line just above the lines containing:
if creditor.getBalance() != 0 {
t.Fatalf("Expected debitorSwap balance to be 0, but is %d", creditor.getBalance())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean here. Move which lines where exactly?
@@ -198,6 +198,7 @@ const ( | |||
balancePrefix = "balance_" | |||
sentChequePrefix = "sent_cheque_" | |||
receivedChequePrefix = "received_cheque_" | |||
pendingChequePrefix = "pending_cheque_" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add this new key to the test TestStoreKeys
in swap_test.go
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work. This makes SWAP much more robust.
I left a few comments.
Apart from those, I think it would be good to have test scenario's covering (make new tests or add to existing tests):
handleEmit:
- After receiving a new cheque, a confirm message is sent, containing the new cheque
- After receiving an already sent cheque, a confirm message is sent, containing the old cheque
handleConfirm - Balance is updated
sentCheque - balance is not updated before getting the confirm message
swap/swap.go
Outdated
} | ||
|
||
if err := p.setLastSentCheque(cheque); err != nil { | ||
p.Drop("persistence error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't support a shutdown invoked by some high level components. you could panic
and risk data corruption with the databases since a shutdown is not invoked when panicking. dropping the peer in this case is the safer option because that way, if you're getting local IO errors and you start dropping peers - you isolate the problem to your local node and technically you do not participate in the network anymore (ideally). I'd stick with dropping peers at the moment since node behavior with IO errors is something we have not tested yet and it is quite an involved task in its own
swap/swap.go
Outdated
@@ -318,6 +336,12 @@ func (s *Swap) handleEmitChequeMsg(ctx context.Context, p *Peer, msg *EmitCheque | |||
return err | |||
} | |||
|
|||
if err := p.Send(ctx, &ConfirmChequeMsg{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ralph-pichler the convention to use is:
err := p.Send(...)
if err != nil {
return err
}
for all error checks in the same block, this way you prevent allocating a new variable for every function call that checks for an error. please fix it in the calls below too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it for handleConfirmChequeMsg
and handleEmitChequeMsg
. Hope that's ok now, I remember the opposite recommendation from other PRs.
swap/swap.go
Outdated
} | ||
|
||
// since more swap traffic might have occurred since this cheque was already sent, we redo the payment threshold check | ||
if err := s.checkPaymentThreshold(p); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible that more traffic has occurred when this function is under lock?
e96d98f
to
8fcc32c
Compare
8fcc32c
to
b80884a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no major gripes, looks good overall.
i'll take another look after having the discussions addressed.
i recommend rebasing/merging with master
often to avoid falling behind too much and having to deal with complicated conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even though not all my comments have been addressed, i am going to go ahead and approve this PR, as i don't perceive any of them as blockers 👍
0c67a7c
to
e22d28e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am OK with the change to the moment in which the paying node adjusts its balance. it doesn't directly solve the problem (i.e. prevent it completely) but it makes it harder to occur.
aside from that, please address the open threads from my previous reviews so they can be resolved
This PR implements the pending cheque mechanism as described here (#1601 (comment)).
There are two differences: