Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server/swap: inaction check fixes #680

Merged
merged 12 commits into from
Sep 21, 2020
Merged

server/swap: inaction check fixes #680

merged 12 commits into from
Sep 21, 2020

Conversation

chappjc
Copy link
Member

@chappjc chappjc commented Sep 9, 2020

The contains a number if fixes to the server's inaction checks:

  • Fix inaction asset ID checks skipping checks because of incorrect swapAsset.
  • Set swapConfirmed time from the block time, not time.Now, which was causing inaction check to pass by fractions of a second.
  • On startup, schedule an inaction check for each asset.
  • Separate block based and "event"-based checks. NewlyMatched->MakerSwapCast timing starts at the time the match requests are sent (event-based), MakerSwapCast->TakerSwapCast starts at the time the maker's swap reaches swapConf (block-based), TakerSwapCast->MakerRedeemed starts at the time the taker's swap reaches swapConf (block-based), and MakerRedeemed->MatchComplete starts at the time the maker broadcasts their redeem, notifies server, and server informs taker (event based).

Other changes:

  • Triggering of inaction checks is refactored, usings AfterFuncs.
  • Default broadcast timeout is not 5 minutes, up from 1 minute, but the dcrdex harness still uses 1 min via the config setting.
  • Swapper's txn waiter expiration is now 2 min, up from 1 min.
  • On Swapper construction, the tx waiter expiration is checked against the configured bcast timeout so that it is never longer than bcast timeout.
  • Swapper's coin recheck interval is slightly smaller (5 sec -> 3 sec)

@@ -1002,7 +1017,7 @@ func (s *Swapper) tryConfirmSwap(status *swapStatus) {
if confs >= int64(swapConf) {
log.Debugf("Swap %v (%s) has reached %d confirmations (%d required)",
status.swap, dex.BipIDSymbol(status.swapAsset), confs, swapConf)
status.swapConfirmed = time.Now().UTC()
status.swapConfirmed = confTime.UTC()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the block time is used to set inaction check timer, and this time.Now was just a hair later than block time, the inaction check could just barely pass. With extra logging, I saw inaction checks pass with "59.998358778s ago".

Comment on lines -1224 to -1226
if match.makerStatus.swapAsset != assetID {
return
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first action timeout check is not relative to a block time.

Comment on lines -1234 to -1236
if match.takerStatus.swapAsset != assetID {
return
}
Copy link
Member Author

@chappjc chappjc Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These swapAsset checks were not necessary, but in this MakerSwapCast case, it should have been checking makerStatus.swapAsset not takerStatus.swapAsset. Part of why we were seeing very long delays in penalization, even indefinite delays in this case.

// If the maker has sent their swap tx, check the taker's broadcast
// timeout against the time of the swap's SwapConf'th confirmation.
if match.takerStatus.swapTime.IsZero() &&
!match.makerStatus.swapConfirmed.IsZero() &&
Copy link
Member Author

@chappjc chappjc Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the status is MakerSwapCast, takerStatus.swapTime cannot be set. When it gets set, it would be TakerSwapCast. We only need to check makerStatus.swapConfirmed.

// If the taker has sent their swap tx, check the maker's broadcast
// timeout (for redemption) against the time of the swap's SwapConf'th
// confirmation.
if match.makerStatus.redeemTime.IsZero() &&
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the previous case, if makerStatus.redeemTime were set, it wouldn't be TakerSwapCast.

// If the maker has redeemed, the taker can redeem immediately, so
// check the timeout against the time the Swapper received the
// maker's `redeem` request (and sent the taker's 'redemption').
if match.takerStatus.redeemTime.IsZero() &&
!match.makerStatus.redeemTime.IsZero() &&
Copy link
Member Author

@chappjc chappjc Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes further and eliminates the !match.makerStatus.redeemTime.IsZero() because this is a bcast time that is set when it becomes MakerRedeem when the redeem is at least seen, unlike the swap cast states where swapConfirmed is only set after SwapConf

@chappjc chappjc force-pushed the inaction branch 2 times, most recently from 340c632 to a659e5f Compare September 13, 2020 02:16
use Timers and AfterFunc for inaction checks

more precise latencyQ expirations

dcrdex: 5 min default bcast timeout

fixing up broadcast timeout tests
Copy link
Member

@itswisdomagain itswisdomagain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognize this is still a draft, I'm eager to have this in, so took an early look. Haven't gone all through but one recurrent observation is inaction checks not accounting for the time it may take for an actor to receive and maybe validate the previous match step e.g. after server sends maker's swap, it might take some time for taker to receive the audit request and maybe some more time to find and audit the swap. Server expecting taker to act bTimeout after maker's swap gets the required confs may be unrealistic.

client/core/trade_simnet_test.go Show resolved Hide resolved
dex/testing/dcr/harness.sh Outdated Show resolved Hide resolved
server/swap/swap.go Show resolved Hide resolved
server/swap/swap.go Show resolved Hide resolved
case order.NewlyMatched:
// Maker has not broadcast their swap. They have until match time
// plus bTimeout.
if tooOld(match.time) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match.time is set to current time in readMatches:

matches := readMatches(matchSets, feeRates)

but the match requests are sent a little later

dcrdex/server/swap/swap.go

Lines 2651 to 2654 in 9e720a7

// Send the ack request.
s.authMgr.RequestWhenConnected(u, req, func(_ comms.Link, resp *msgjson.Message) {
s.processMatchAcks(u, resp, m)
}, auth.DefaultRequestTimeout, auth.DefaultConnectTimeout, expireFunc)

Also, maker may not get the request until auth.DefaultRequestTimeout + auth.DefaultConnectTimeout later.

Copy link
Member Author

@chappjc chappjc Sep 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "sent a little later" bit is about as brief as it gets, but as for the request/connect timeouts, that's intentionally deducted from the inaction timeout. If they aren't connected, that's should be dealt with by the inaction check. Users should not be able to draw out the swap intentionally by using some exception added for the request timeouts.

Copy link
Member Author

@chappjc chappjc Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also meant to comment that DefaultRequestTimeout is perhaps poorly named since it's a response timeout that starts when the request is successfully sent. But I admit there is a peculiar interaction between the response and connect timeouts (a user who is not connected can delay their response by just not being connected when the request is initially generated and queued for send).

Anyway, inaction doesn't kick in until bTimeout after the match request is sent (or queued for send), and I think that's the right way to do this. If the user is not online and that eats into their action time, that is to be held against them. In light of this, I don't have a strong objection to having the response timeout being the time until inaction/penalization, except that the RequestWhenConnected syntax doesn't permit an absolute response deadline, which would make this straightforward. I can imagine a different request function that replaces the response timeout and connect timeout with a single deadline, but I don't want to explore that in this PR.

Comment on lines 1256 to 1261
// If the maker has redeemed, the taker can redeem immediately, so
// check the timeout against the time the Swapper received the
// maker's `redeem` request (and sent the taker's 'redemption').
match.makerStatus.mtx.RLock()
defer match.makerStatus.mtx.RUnlock()
if tooOld(match.makerStatus.redeemTime) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Server sends the redemption to taker a brief moment after setting the redeemTime; and taker may not receive the server-sent redemption until auth.DefaultRequestTimeout + auth.DefaultConnectTimeout later. It'd be basically pointless resending the request if taker reconnects after s.bTimeout because the match would have been revoked by then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment regarding the brief moment. I don't think we need to account for a few milliseconds.

I do agree it is pointless to send the request after bTimeout, so ideally the connect timeout would be capped at bTimeout.

@chappjc
Copy link
Member Author

chappjc commented Sep 15, 2020

Thanks for having a look @itswisdomagain. I'm gonna leave this PR as is for a bit while I work on other things, so it's good to have eyes on it now.

it might take some time for taker to receive the audit request and maybe some more time to find and audit the swap. Server expecting taker to act bTimeout after maker's swap gets the required confs may be unrealistic.

I think this is all built into the bTimeout setting. There has to be a limit to how long an actor is reasonably allowed to do these things. Admittedly, some steps like the ones involving an audit require more time than others. But with the audit, the server has already found the txn, so the actor should be able to find it too. The bTimeout defines how long they have to take these actions, including the audit. Further, recall that the audit happens before the swap txn reaches swapConf, so there's essentially no time pressure.

I have considered a separate timeout for the audit. In fact, at one point I toyed with inserting Penalize on the audit request expiry, which would allow the txn bcast to be a completely separate timeout. But I'm not convinced that's really necessary.

@chappjc chappjc marked this pull request as ready for review September 15, 2020 13:58
@chappjc

This comment has been minimized.

server/swap/swap.go Outdated Show resolved Hide resolved
server/swap/swap.go Outdated Show resolved Hide resolved
server/swap/swap.go Outdated Show resolved Hide resolved
server/swap/swap.go Outdated Show resolved Hide resolved
server/swap/swap.go Show resolved Hide resolved
server/swap/swap.go Outdated Show resolved Hide resolved
server/swap/swap.go Outdated Show resolved Hide resolved
server/swap/swap_test.go Outdated Show resolved Hide resolved
Comment on lines -2302 to -2303
mt.mtx.Lock()
defer mt.mtx.Unlock()
Copy link
Member Author

@chappjc chappjc Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there were two concurrent match acks (e.g. different users) targeting the same match IDs, this could deadlock in the loop. Really a poor idea in retrospect.

}
err = storFn(mid, ack.Sig)
err = storFn(mid, ackSig)
Copy link
Member Author

@chappjc chappjc Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DB operation now comes after all the ack sigs are stored in memory so they can all be registered asap as they are needed to handle the maker's 'init', which can come at any moment.

chappjc and others added 2 commits September 17, 2020 21:59
Co-authored-by: Wisdom Arerosuoghene <wisdom.arerosuoghene@gmail.com>
case order.TakerSwapCast:
if match.takerStatus.swapAsset != block.assetID {
break statusSwitch
}
// If the taker has broadcast their transaction, the maker's broadcast
// timeout (for redemption) starts once the maker's swap has SwapConf
// confs.
s.tryConfirmSwap(match.takerStatus)
Copy link
Member Author

@chappjc chappjc Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it occurred to me that I don't see much point in the RedeemStatus stuff below in the MatchComplete case, where we check the confirmation count on the redeem transactions. Planning to investigate changes w.r.t. this as there's not much reason to wait for redeem confirms to stop tracking the match. This is closely related to the issue of the maker's ack of the taker's redeem, so I'm going to consider that at the same time and how these things interact w.r.t. timing (e.g. I think we've seen matches removed before the maker's final pointless redeem ack -- on both client and server!).

Comment on lines +1990 to +1996
// The counterparty does not need to actually locate the redemption txn, but
// they do need to author and broadcast their own redeem, so allow the
// request to wait for a response up until the inaction penalty.
s.authMgr.RequestWhenConnected(ack.user, notification, func(_ comms.Link, resp *msgjson.Message) {
s.processAck(resp, ack) // resp.ID == notification.ID
}, auth.DefaultRequestTimeout, auth.DefaultConnectTimeout, expireFunc)
}, s.bTimeout, auth.DefaultConnectTimeout, expireFunc)
// maybe: s.bTimeout-time.Since(redeemTime) or time.Until(redeemTime.Add(s.bTimeout))
Copy link
Member Author

@chappjc chappjc Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pertains to the issue resolved by the final commit, where we saw a maker with simultaneous 20+ matches that were processed in series instead of concurrently. The timeout on the response was too short. Even though the client will be doing them concurrently and faster with 4a22c9d in PR #688, I'm not sure I see a reason to timeout on the response before inaction time.

However, a different RequestWhenConnected with a single deadline instead of two timeouts for response and connect would be more appropriate and less confusing (presently the response timeout only starts after the request is sent, which could be after a good deal of time waiting for connect, but is only accurate as a deadline if the user is already connected at the time this request is made right here).

Comment on lines +2086 to +2102
// Do not search for the transaction past the inaction deadline. For maker,
// this is bTimeout after match request. For taker, this is bTimeout after
// maker's swap reached swapConfs.
lastEvent := stepInfo.match.time // NewlyMatched - the match request time, not matchTime
if stepInfo.step == order.MakerSwapCast {
lastEvent = stepInfo.match.makerStatus.swapConfTime()
}
expireTime := time.Now().Add(txWaitExpiration).UTC()
if lastEvent.IsZero() {
log.Warnf("Prematurely received 'init' from %v at step %v", makerTaker(stepInfo.actor.isMaker), stepInfo.step)
} else if deadline := lastEvent.Add(s.bTimeout); expireTime.After(deadline) {
expireTime = deadline
}
log.Debugf("Allowing until %v (%v) to locate contract from %v (%v), match %v",
expireTime, time.Until(expireTime), makerTaker(stepInfo.actor.isMaker),
stepInfo.step, matchID)

Copy link
Member Author

@chappjc chappjc Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fair bit of complexity for little gain in appropriateness of the latencyQ waiter timeout. But it is more appropriate than always going with txWaitExpiration from now, which might be after the inaction time.

Copy link
Member

@buck54321 buck54321 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on a first pass. Since this is critical code, I want to look again with fresh eyes later. Anything to add?

server/swap/swap.go Outdated Show resolved Hide resolved
server/swap/swap.go Outdated Show resolved Hide resolved
@chappjc
Copy link
Member Author

chappjc commented Sep 18, 2020

No further additions planned for this PR, although I have my eye on the seemingly unnecessary redeem confirm checks and the related issue of the final redeem ack from the maker.

Copy link
Member

@buck54321 buck54321 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Member

@itswisdomagain itswisdomagain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just curious about some stuffs...

assetID, nBlocks := assetToMine.ID, uint16(assetToMine.SwapConf)+1 // mine 1 extra to ensure tx gets req. confs
assetID, nBlocks := assetToMine.ID, uint16(assetToMine.SwapConf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity... does mining an additional block (i.e. 2 blocks in quick succession) cause issues for the swap inaction checker?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but the extra blocks were added in (pretty sure) because we were grappling with issues getting txns mined on simnet, then we figured out we needed the regentemplate RPC in mine-alpha, which seems to have entirely resolved the mining issue. Do you think we still need an extra block in any of these cases? The appropriate tests would only mine the blocks that reach inaction (just SwapConf).

@@ -184,7 +184,7 @@ func (m *Matcher) Match(book Booker, queue []*OrderRevealed) (seed []byte, match
removed, ok := book.Remove(o.TargetOrderID)
if !ok {
// The targeted order might be down queue or non-existent.
log.Debugf("Failed to remove order %v set by a cancel order %v",
log.Debugf("Order %v not removed by a cancel order %v (target either non-existent or down queue in this epoch)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does down queue in this context mean? Could the target order be booked at this point but not matched to this cancel order?

Copy link
Member Author

@chappjc chappjc Sep 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Down queue means after shuffle of an epoch queue that contains both a trade and a cancel order targeting that trade, the target order would get processed after the cancel order, so the cancel order would not be able to cancel it since it hasn't been booked yet.

server/swap/swap.go Show resolved Hide resolved
server/swap/swap.go Show resolved Hide resolved
server/swap/swap.go Outdated Show resolved Hide resolved
@chappjc chappjc merged commit e557253 into decred:master Sep 21, 2020
@chappjc chappjc deleted the inaction branch September 21, 2020 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants