Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trading: timed out waiting for "init" response #1561

Closed
JoeGruffins opened this issue Apr 5, 2022 · 14 comments
Closed

trading: timed out waiting for "init" response #1561

JoeGruffins opened this issue Apr 5, 2022 · 14 comments

Comments

@JoeGruffins
Copy link
Member

JoeGruffins commented Apr 5, 2022

Have been seeing this a lot while testing. Not sure if eth related. Today I hit for the first time on testnet so decided to just go ahead and make an issue.

It ultimately leads to a confusing state for the client. For example:

image

If you refresh, it does show refunding for a few seconds... I guess until something happens, a tick?
image

The client did get an error from the server that matches were revoked. But what are we doing now? Waiting for refund? The other client seems sure that they are refunding:

image

But besides the UI uncertainty here, isn't it odd the init failed? Just one of four of the same trade was ok? This is on the same machine too, so probably not a communication issue.

This is on #1552 with the confCheckTimeout set to 10 seconds but I've see the same init response timeout error on master as well on simnet. Can also be seen in #1538 so maybe solved by #1541 ? If so there is still the wierd UI issue for the errored client.

From logs it looks like the three failed matches were Taker, and the success was Maker. It looks like all three of the failed matches sent their inits at 02:28:26 and server got them all. Then server searched for a minute and failed them for timeout. Then it found them all the next second!? Seems suspicious. Is it possible s.matchMtx.Lock() is being held somewhere preventing processInit from continuing?

Then the client doesn't know how to interpret the message: [ERR] CORE[wss://127.0.0.1:17273/ws]: No handler found for response: {"type":2,"id":18,"payload":{"result":null,"error":{"code":25,"message":"match already revoked due to inaction"}}} which I think is an old issue.

Errored client log:
dexc.log

Other client log:
dexc.log

Server log:
dcrdex.log

@chappjc
Copy link
Member

chappjc commented Apr 5, 2022

Thank you for dog-fooding so hard. It is absolutely critical and you should get a medal when this is done.

Now, there is a lot to discuss in this issue, but, an issue you have is that the server's broadcast timeout of 1 minute is very short.

[DBG] SWAP: Swapper started with 1m0s broadcast timeout.

Here's the timeline for match b8ec6651 with the problematic eth txn https://goerli.etherscan.io/tx/0xbeaff99d1c8f29d903263b8eb744fc073d606af169d62d9fa6622aaffcfe2e63 that was initiating 3 swaps in a batch:

  • matched at 11:22:00.008
  • maker's DCR contract init received at 11:22:00.044 and immediately validated (server saw it)
  • the DCR contract was mined at 11:28:26.438, becoming takers turn to swap
2022-04-05 11:28:26.438 [DBG] ASSET[dcr][RPC]: New block 0000006a4fa709ebda222fa5efdf4a4239c26e5313c26b1f6fa957600071fb66 (897776)
2022-04-05 11:28:26.439 [DBG] SWAP: Swap {txid = 621c92f5417d0e3524882d459299d43789a265fc53e0756a894b0e6c522e2fb9, vout = 0} (dcr) has reached 1 confirmations (1 required)
...
2022-04-05 11:28:26.635 [DBG] SWAP: handleInit: 'init' received from user 1aa4aec190d355ead10b956fed232f8329a3234a2062e467b97e1b2727c291aa for match b8ec6651c01b9a077fc3dc6eaa44d41e9193292fa4cd56c552d6234f8c2f61a1, order 39b2e43eb1bff05c6c3814d3b75318786815979e6059ab24330d94cbdd0c77d7
2022-04-05 11:28:26.635 [DBG] SWAP: Allowing until 2022-04-05 02:29:26.438500481 +0000 UTC (59.802540271s) to locate contract from taker (MakerSwapCast), match b8ec6651c01b9a077fc3dc6eaa44d41e9193292fa4cd56c552d6234f8c2f61a1
  • server is allowing 1 minute to locate the taker's ETH swap. It's also searching for 6 other new inits at the same time. NOTE: It is block 6659787 at this point.

Here's where I have questions. Why couldn't server find it? Doesn't the full node on server have the ability to do this? I thought so. Are we relying on it getting mined before even accepting it and acknowledging the action so it does not revoke the match? This would be a big fundamental deviation from the utxo coin handling.

It also seems to have not been mined even with eth blocks going by.

2022-04-05 11:28:40.454 [DBG] ASSET[eth]: Tip change from 0x0aa5a99f1732678c825d8f304bcf0edc5a8f639fc507d01dcda622176a31428c (6659787) to 0x4d76b54a9479e23c3ba03f623200da48d9807f4a5fc1ca2de50951355e32be61 (6659788).
2022-04-05 11:28:55.454 [DBG] ASSET[eth]: Tip change from 0x4d76b54a9479e23c3ba03f623200da48d9807f4a5fc1ca2de50951355e32be61 (6659788) to 0xb6f1c56094629dbb88fe7709c36010e7ac6bc486772f6adf0d390474192704b3 (6659789).
...
2022-04-05 11:28:57.478 [DBG] SWAP: processInit: valid contract 0x475ae72b4d9a87615370ebb919cd440ee994399f9ff78c8f7cf62b8de99e8f43 (eth) received at 2022-04-05 02:28:57.477 +0000 UTC from user 1aa4aec190d355ead10b956fed232f8329a3234a2062e467b97e1b2727c291aa (taker) for match 1378a2f2b588aa8797bf590c39c6a2119157705df89b1a34890c7dbc59fab0fd, fee rate = 200, swapStatus MakerSwapCast => TakerSwapCast
2022-04-05 11:28:57.478 [DBG] SWAP: processInit: sending contract 'audit' request to counterparty c70306e9e04ddb38debee66192fd35abf678794500e6eb1ce1fb84538600c884 (maker) for match 1378a2f2b588aa8797bf590c39c6a2119157705df89b1a34890c7dbc59fab0fd
2022-04-05 11:28:57.481 [DBG] SWAP: Received contract 'audit' acknowledgement from user c70306e9e04ddb38debee66192fd35abf678794500e6eb1ce1fb84538600c884 (maker) for match 1378a2f2b588aa8797bf590c39c6a2119157705df89b1a34890c7dbc59fab0fd (TakerSwapCast)
2022-04-05 11:29:12.456 [DBG] ASSET[eth]: Tip change from 0xb6f1c56094629dbb88fe7709c36010e7ac6bc486772f6adf0d390474192704b3 (6659789) to 0xa1fec3cef4520e3b70703f768015cfa225f3e79409eb100d2d5a40a6859585cd (6659790).
2022-04-05 11:29:26.442 [DBG] SWAP: failMatch: swap b8ec6651c01b9a077fc3dc6eaa44d41e9193292fa4cd56c552d6234f8c2f61a1 failing (maker fault = false) at MakerSwapCast
...
2022-04-05 11:29:26.455 [DBG] ASSET[eth]: Tip change from 0xa1fec3cef4520e3b70703f768015cfa225f3e79409eb100d2d5a40a6859585cd (6659790) to 0x576ee41a087a5f340df05d58e4892057432ce8f6a6b8bff02655b1526cb49dc7 (6659791).
2022-04-05 11:29:27.478 [ERR] SWAP: Contract txn located after match was revoked (match id=b5be74d903f78caa9fddb158343376ac1538f5757c2db75ce620d97ce2a2d0b6, maker=false)
2022-04-05 11:29:27.478 [DBG] SWAP: Error going to user 1aa4aec190d355ead10b956fed232f8329a3234a2062e467b97e1b2727c291aa, code: 25, msg: match already revoked due to inaction
2022-04-05 11:29:27.480 [ERR] SWAP: Contract txn located after match was revoked (match id=b8ec6651c01b9a077fc3dc6eaa44d41e9193292fa4cd56c552d6234f8c2f61a1, maker=false)
2022-04-05 11:29:27.480 [DBG] SWAP: Error going to user 1aa4aec190d355ead10b956fed232f8329a3234a2062e467b97e1b2727c291aa, code: 25, msg: match already revoked due to inaction
2022-04-05 11:29:27.481 [ERR] SWAP: Contract txn located after match was revoked (match id=7fd11ad1f663521cbbb7b4ffea15841812f2df141c6bbf72cb148d804795f426, maker=false)
2022-04-05 11:29:27.481 [DBG] SWAP: Error going to user 1aa4aec190d355ead10b956fed232f8329a3234a2062e467b97e1b2727c291aa, code: 25, msg: match already revoked due to inaction
2022-04-05 11:29:41.455 [DBG] ASSET[eth]: Tip change from 0x576ee41a087a5f340df05d58e4892057432ce8f6a6b8bff02655b1526cb49dc7 (6659791) to 0x2c2fe9d52d783956c2995ed7657ecba054797dcd83ed9a044b61baaf27197079 (6659792).
2022-04-05 11:29:57.454 [DBG] ASSET[eth]: Tip change from 0x2c2fe9d52d783956c2995ed7657ecba054797dcd83ed9a044b61baaf27197079 (6659792) to 0x829c8122a4a906f08ac204dea60042f772153904c926aa25922d026b86d28361 (6659793).

Again, it was finally mined in 6659792.

Although there was no need to revoke at 1 minute, we really need to get to the bottom of server's locating the ETH contracts.

@chappjc
Copy link
Member

chappjc commented Apr 5, 2022

Ah I think what we're really seeing here is serious lock contention because of timeouts checking the eth contracts, just like the timeouts you've seen client-side (#1552).
The matchMtx is held while Swapper does these checks. Combined with repeated blocks and many event-based checks, the requests pile up because they take so long, and the mutex is blocked.

Note this was almost a minute late and it really does indicate the contract was found:

2022-04-05 11:29:27.480 [ERR] SWAP: Contract txn located after match was revoked

But I think it found it a good deal earlier, but it was hanging at the RLock at

dcrdex/server/swap/swap.go

Lines 1473 to 1477 in fb76b59

s.matchMtx.RLock()
if _, found := s.matches[matchID]; !found {
s.matchMtx.RUnlock()
log.Errorf("Contract txn located after match was revoked (match id=%v, maker=%v)",
matchID, actor.isMaker)

Combined with the fact that the latencyQ's recheck interval is every three seconds, I think we're having major lock contention problems because of contract call timeouts.

@chappjc
Copy link
Member

chappjc commented Apr 5, 2022

Continuing to look at this, I now think server is also having a hard time locating pending txns in under a minute (your broadcast timeout). I suspect processInit -> Contract -> newSwapCoin -> baseCoin -> (ethFetcher).transaction is getting ethereum.NotFound quite a bit longer than we expect it to for a full node.

@JoeGruffins increase the broadcast timeout back to closer to 20 minutes, at least 5 minutes, and see how it goes.
I'm still looking at some updates to reduce lock contention though.

When you consider that the tx was not mined for several blocks, but it was found before it was mined, it suggest that the tx broadcast was taking forever and it simply did not propagate promptly (to the server node or miners). It was close, but not under a minute.

@JoeGruffins

This comment was marked as resolved.

@JoeGruffins
Copy link
Member Author

JoeGruffins commented Apr 8, 2022

Actually that last one was fine for that client. They called the match inactive and did not init at all as taker.

The maker this time failed for timeout. Will leave the logs here for this second failure I've seen on testnet.

The broadcast timeout was not increased, but I will do that now...

Leaving logs anyhow.
Errored client log:
dexc.log

Other client:
dexc.log

Server:
dcrdex.log

@chappjc
Copy link
Member

chappjc commented Apr 8, 2022

Yeah, increase btimeout please.

I just merged #1541 because there was an error getting silenced that might be relevant here: bd5e52e#diff-6af7c56e40de4eb4dba203d67b1b07ccfc43ed88e99811b9697e9d754d01e21cR882

@chappjc
Copy link
Member

chappjc commented Apr 8, 2022

Oh BTW, I started tweaking Swapper to reduce lock contention in #1563 Might be worth a test for this issue (with a slightly more reasonable btimeout though) if you can reproduce it fairly regularly @JoeGruffins.

@JoeGruffins
Copy link
Member Author

Does broadcast timer above two minutes matter here? We are gated by txWaitExpiration so can only wait up to two minutes to find the tx.

@JoeGruffins
Copy link
Member Author

JoeGruffins commented Apr 8, 2022

Are we relying on it getting mined before even accepting it and acknowledging the action so it does not revoke the match? This would be a big fundamental deviation from the utxo coin handling.

For eth, don't we not error as long as the contract data looks good? I need to look at server more I guess, but I don't think finding the swap is even necessary to get things going. It should be super forgiving.

edit: I see we do need to find the tx that started the swap. If the server can't find that, might as well throw in the towel joking. Hoping it's something else. The light clients do have trouble randomly though... We should see the coin not found warning in that case, and I don't think we've seen that.

@buck54321
Copy link
Member

I'm comfortable upping the txWaitExpiration by quite a lot, but we should consider implementing a non-constant recheckInterval, so that after the first couple of checks we don't bog ourselves down with slow chains.

@chappjc
Copy link
Member

chappjc commented Apr 21, 2022

Whoops sorry I dropped the ball on this conversation.

Does broadcast timer above two minutes matter here? We are gated by txWaitExpiration so can only wait up to two minutes to find the tx.

I believe the 2 min txWaitExpiration is alright, for the following reasons. The client will get an error response to their init request and try the init request again, with, importantly, plenty of time before match revoke after broadcast timeout. It's intentional that the tx wait expiration is less than btimeout so client can try to rectify this situation, if possible. I'm fairly certain we had txWaitExpiration set to btimeout in previous revisions, but realized the folly with that.

In this case, AFAICT we discovered the issue was the txn not propagating. Not to the miners or the server (and I'm sure ethscan would not have seen it either). This is a clear indication of light client connectivity. It will resolve itself as it regains peers and rebroadcasts, steps that just take time and which we cannot influence.

For eth, don't we not error as long as the contract data looks good? I need to look at server more I guess, but I don't think finding the swap is even necessary to get things going. It should be super forgiving.

edit: I see we do need to find the tx that started the swap. If the server can't find that, might as well throw in the towel joking.

Right, server does need to find the tx before bothering the taker. If the tx is never found on the network or on chain (never confirmed), the maker has just wasted at least 8 hrs of everyone's time... and they wouldn't even have to actually broadcast a txn to do it, just pretend. More than that, our current design is to have server send the participant the "txData" that it retrieves from its node, not from the init tx creator.

we should consider implementing a non-constant recheckInterval, so that after the first couple of checks we don't bog ourselves down with slow chains

Yah, that's definitely something to consider.

@JoeGruffins
Copy link
Member Author

In this case, AFAICT we discovered the issue was the txn not propagating.

I really don't think this is it. Going to find the issue tomorrow.

@chappjc
Copy link
Member

chappjc commented Apr 21, 2022

There may be multiple issues yet, but the timings described here in the logs files make it clear that miners and server couldn't see the transaction for almost a minute. Several blocks were mined without the transaction, therefore, miners did not have it.
Server is a full node just like miners, so detecting transactions in txpool is fundamental.

@JoeGruffins
Copy link
Member Author

Have been testing a bit on testnet/simnet lately and not seeing this anymore. Maybe fixed. Will close if I can't hit it in the next few days.

@chappjc chappjc closed this as completed Dec 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants