-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server/swap: inaction check fixes #680
Conversation
@@ -1002,7 +1017,7 @@ func (s *Swapper) tryConfirmSwap(status *swapStatus) { | |||
if confs >= int64(swapConf) { | |||
log.Debugf("Swap %v (%s) has reached %d confirmations (%d required)", | |||
status.swap, dex.BipIDSymbol(status.swapAsset), confs, swapConf) | |||
status.swapConfirmed = time.Now().UTC() | |||
status.swapConfirmed = confTime.UTC() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the block time is used to set inaction check timer, and this time.Now was just a hair later than block time, the inaction check could just barely pass. With extra logging, I saw inaction checks pass with "59.998358778s ago".
if match.makerStatus.swapAsset != assetID { | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first action timeout check is not relative to a block time.
if match.takerStatus.swapAsset != assetID { | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These swapAsset checks were not necessary, but in this MakerSwapCast case, it should have been checking makerStatus.swapAsset not takerStatus.swapAsset. Part of why we were seeing very long delays in penalization, even indefinite delays in this case.
// If the maker has sent their swap tx, check the taker's broadcast | ||
// timeout against the time of the swap's SwapConf'th confirmation. | ||
if match.takerStatus.swapTime.IsZero() && | ||
!match.makerStatus.swapConfirmed.IsZero() && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the status is MakerSwapCast
, takerStatus.swapTime
cannot be set. When it gets set, it would be TakerSwapCast
. We only need to check makerStatus.swapConfirmed
.
// If the taker has sent their swap tx, check the maker's broadcast | ||
// timeout (for redemption) against the time of the swap's SwapConf'th | ||
// confirmation. | ||
if match.makerStatus.redeemTime.IsZero() && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous case, if makerStatus.redeemTime
were set, it wouldn't be TakerSwapCast
.
// If the maker has redeemed, the taker can redeem immediately, so | ||
// check the timeout against the time the Swapper received the | ||
// maker's `redeem` request (and sent the taker's 'redemption'). | ||
if match.takerStatus.redeemTime.IsZero() && | ||
!match.makerStatus.redeemTime.IsZero() && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This goes further and eliminates the !match.makerStatus.redeemTime.IsZero()
because this is a bcast time that is set when it becomes MakerRedeem
when the redeem is at least seen, unlike the swap cast states where swapConfirmed
is only set after SwapConf
340c632
to
a659e5f
Compare
use Timers and AfterFunc for inaction checks more precise latencyQ expirations dcrdex: 5 min default bcast timeout fixing up broadcast timeout tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recognize this is still a draft, I'm eager to have this in, so took an early look. Haven't gone all through but one recurrent observation is inaction checks not accounting for the time it may take for an actor to receive and maybe validate the previous match step e.g. after server sends maker's swap, it might take some time for taker to receive the audit
request and maybe some more time to find and audit the swap. Server expecting taker to act bTimeout
after maker's swap gets the required confs may be unrealistic.
case order.NewlyMatched: | ||
// Maker has not broadcast their swap. They have until match time | ||
// plus bTimeout. | ||
if tooOld(match.time) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match.time
is set to current time in readMatches
:
Line 2528 in 9e720a7
matches := readMatches(matchSets, feeRates) |
but the match requests are sent a little later
Lines 2651 to 2654 in 9e720a7
// Send the ack request. | |
s.authMgr.RequestWhenConnected(u, req, func(_ comms.Link, resp *msgjson.Message) { | |
s.processMatchAcks(u, resp, m) | |
}, auth.DefaultRequestTimeout, auth.DefaultConnectTimeout, expireFunc) |
Also, maker may not get the request until auth.DefaultRequestTimeout + auth.DefaultConnectTimeout
later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "sent a little later" bit is about as brief as it gets, but as for the request/connect timeouts, that's intentionally deducted from the inaction timeout. If they aren't connected, that's should be dealt with by the inaction check. Users should not be able to draw out the swap intentionally by using some exception added for the request timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also meant to comment that DefaultRequestTimeout
is perhaps poorly named since it's a response timeout that starts when the request is successfully sent. But I admit there is a peculiar interaction between the response and connect timeouts (a user who is not connected can delay their response by just not being connected when the request is initially generated and queued for send).
Anyway, inaction doesn't kick in until bTimeout after the match request is sent (or queued for send), and I think that's the right way to do this. If the user is not online and that eats into their action time, that is to be held against them. In light of this, I don't have a strong objection to having the response timeout being the time until inaction/penalization, except that the RequestWhenConnected
syntax doesn't permit an absolute response deadline, which would make this straightforward. I can imagine a different request function that replaces the response timeout and connect timeout with a single deadline, but I don't want to explore that in this PR.
server/swap/swap.go
Outdated
// If the maker has redeemed, the taker can redeem immediately, so | ||
// check the timeout against the time the Swapper received the | ||
// maker's `redeem` request (and sent the taker's 'redemption'). | ||
match.makerStatus.mtx.RLock() | ||
defer match.makerStatus.mtx.RUnlock() | ||
if tooOld(match.makerStatus.redeemTime) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Server sends the redemption to taker a brief moment after setting the redeemTime; and taker may not receive the server-sent redemption until auth.DefaultRequestTimeout + auth.DefaultConnectTimeout
later. It'd be basically pointless resending the request if taker reconnects after s.bTimeout
because the match would have been revoked by then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment regarding the brief moment. I don't think we need to account for a few milliseconds.
I do agree it is pointless to send the request after bTimeout, so ideally the connect timeout would be capped at bTimeout.
Thanks for having a look @itswisdomagain. I'm gonna leave this PR as is for a bit while I work on other things, so it's good to have eyes on it now.
I think this is all built into the bTimeout setting. There has to be a limit to how long an actor is reasonably allowed to do these things. Admittedly, some steps like the ones involving an audit require more time than others. But with the audit, the server has already found the txn, so the actor should be able to find it too. The bTimeout defines how long they have to take these actions, including the audit. Further, recall that the audit happens before the swap txn reaches swapConf, so there's essentially no time pressure. I have considered a separate timeout for the audit. In fact, at one point I toyed with inserting Penalize on the audit request expiry, which would allow the txn bcast to be a completely separate timeout. But I'm not convinced that's really necessary. |
This comment has been minimized.
This comment has been minimized.
mt.mtx.Lock() | ||
defer mt.mtx.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there were two concurrent match acks (e.g. different users) targeting the same match IDs, this could deadlock in the loop. Really a poor idea in retrospect.
} | ||
err = storFn(mid, ack.Sig) | ||
err = storFn(mid, ackSig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DB operation now comes after all the ack sigs are stored in memory so they can all be registered asap as they are needed to handle the maker's 'init', which can come at any moment.
Co-authored-by: Wisdom Arerosuoghene <wisdom.arerosuoghene@gmail.com>
case order.TakerSwapCast: | ||
if match.takerStatus.swapAsset != block.assetID { | ||
break statusSwitch | ||
} | ||
// If the taker has broadcast their transaction, the maker's broadcast | ||
// timeout (for redemption) starts once the maker's swap has SwapConf | ||
// confs. | ||
s.tryConfirmSwap(match.takerStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it occurred to me that I don't see much point in the RedeemStatus stuff below in the MatchComplete case, where we check the confirmation count on the redeem transactions. Planning to investigate changes w.r.t. this as there's not much reason to wait for redeem confirms to stop tracking the match. This is closely related to the issue of the maker's ack of the taker's redeem, so I'm going to consider that at the same time and how these things interact w.r.t. timing (e.g. I think we've seen matches removed before the maker's final pointless redeem ack -- on both client and server!).
// The counterparty does not need to actually locate the redemption txn, but | ||
// they do need to author and broadcast their own redeem, so allow the | ||
// request to wait for a response up until the inaction penalty. | ||
s.authMgr.RequestWhenConnected(ack.user, notification, func(_ comms.Link, resp *msgjson.Message) { | ||
s.processAck(resp, ack) // resp.ID == notification.ID | ||
}, auth.DefaultRequestTimeout, auth.DefaultConnectTimeout, expireFunc) | ||
}, s.bTimeout, auth.DefaultConnectTimeout, expireFunc) | ||
// maybe: s.bTimeout-time.Since(redeemTime) or time.Until(redeemTime.Add(s.bTimeout)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pertains to the issue resolved by the final commit, where we saw a maker with simultaneous 20+ matches that were processed in series instead of concurrently. The timeout on the response was too short. Even though the client will be doing them concurrently and faster with 4a22c9d in PR #688, I'm not sure I see a reason to timeout on the response before inaction time.
However, a different RequestWhenConnected with a single deadline instead of two timeouts for response and connect would be more appropriate and less confusing (presently the response timeout only starts after the request is sent, which could be after a good deal of time waiting for connect, but is only accurate as a deadline if the user is already connected at the time this request is made right here).
// Do not search for the transaction past the inaction deadline. For maker, | ||
// this is bTimeout after match request. For taker, this is bTimeout after | ||
// maker's swap reached swapConfs. | ||
lastEvent := stepInfo.match.time // NewlyMatched - the match request time, not matchTime | ||
if stepInfo.step == order.MakerSwapCast { | ||
lastEvent = stepInfo.match.makerStatus.swapConfTime() | ||
} | ||
expireTime := time.Now().Add(txWaitExpiration).UTC() | ||
if lastEvent.IsZero() { | ||
log.Warnf("Prematurely received 'init' from %v at step %v", makerTaker(stepInfo.actor.isMaker), stepInfo.step) | ||
} else if deadline := lastEvent.Add(s.bTimeout); expireTime.After(deadline) { | ||
expireTime = deadline | ||
} | ||
log.Debugf("Allowing until %v (%v) to locate contract from %v (%v), match %v", | ||
expireTime, time.Until(expireTime), makerTaker(stepInfo.actor.isMaker), | ||
stepInfo.step, matchID) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fair bit of complexity for little gain in appropriateness of the latencyQ waiter timeout. But it is more appropriate than always going with txWaitExpiration from now, which might be after the inaction time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good on a first pass. Since this is critical code, I want to look again with fresh eyes later. Anything to add?
No further additions planned for this PR, although I have my eye on the seemingly unnecessary redeem confirm checks and the related issue of the final redeem ack from the maker. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just curious about some stuffs...
assetID, nBlocks := assetToMine.ID, uint16(assetToMine.SwapConf)+1 // mine 1 extra to ensure tx gets req. confs | ||
assetID, nBlocks := assetToMine.ID, uint16(assetToMine.SwapConf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity... does mining an additional block (i.e. 2 blocks in quick succession) cause issues for the swap inaction checker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but the extra blocks were added in (pretty sure) because we were grappling with issues getting txns mined on simnet, then we figured out we needed the regentemplate RPC in mine-alpha, which seems to have entirely resolved the mining issue. Do you think we still need an extra block in any of these cases? The appropriate tests would only mine the blocks that reach inaction (just SwapConf).
@@ -184,7 +184,7 @@ func (m *Matcher) Match(book Booker, queue []*OrderRevealed) (seed []byte, match | |||
removed, ok := book.Remove(o.TargetOrderID) | |||
if !ok { | |||
// The targeted order might be down queue or non-existent. | |||
log.Debugf("Failed to remove order %v set by a cancel order %v", | |||
log.Debugf("Order %v not removed by a cancel order %v (target either non-existent or down queue in this epoch)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does down queue in this context mean? Could the target order be booked at this point but not matched to this cancel order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Down queue means after shuffle of an epoch queue that contains both a trade and a cancel order targeting that trade, the target order would get processed after the cancel order, so the cancel order would not be able to cancel it since it hasn't been booked yet.
The contains a number if fixes to the server's inaction checks:
NewlyMatched
->MakerSwapCast
timing starts at the time the match requests are sent (event-based),MakerSwapCast
->TakerSwapCast
starts at the time the maker's swap reaches swapConf (block-based),TakerSwapCast
->MakerRedeemed
starts at the time the taker's swap reaches swapConf (block-based), andMakerRedeemed
->MatchComplete
starts at the time the maker broadcasts their redeem, notifies server, and server informs taker (event based).Other changes:
AfterFunc
s.