Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Give LocalBitcoinNode a timeout in case BitcoinJ hangs during handshake attempt #4058

Merged
merged 2 commits into from Mar 14, 2020

Conversation

dmos62
Copy link
Contributor

@dmos62 dmos62 commented Mar 13, 2020

Fixes #4057

@wiz discovered (#4057) than in some cases LocalBitcoinNode will hang during Bisq setup. This patch gives LocalBitcoinNode's handshake attempt a timeout. Specifically, the part where we block while waiting for BitcoinJ to attempt to return a local BTC node's VersionMessage. I had not originally implemented a timeout, because I was under the impression that in such a case the NioClient would timeout (which does have a timeout).

@wiz has tested this patch and verified that with this patch Bisq handles BitcoinJ hanging gracefully (by timing out the relevant future after 5000 ms).

This patch also prints an error in case a timeout happens, because it normally shouldn't.

The underlying cause is unknown, but seems to come from BitcoinJ.

@dmos62
Copy link
Contributor Author

dmos62 commented Mar 13, 2020

The fix is such that if BitcoinJ hangs (like @wiz was sometimes experiencing), a 5 second timeout will get triggered. This 5 second prolonging isn't ideal, but the other alternative is diving into BitcoinJ for patching. That would be difficult, because I can't reproduce the issue myself. Another alternative is reducing the timeout. Even a 2 seconds timeout shouldn't get triggered under normal circumstances.

@ripcurlx ripcurlx requested a review from wiz March 13, 2020 15:40
Copy link
Member

@wiz wiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this PR and it does prevent my issue #4057 from reproducing. However, I feel I should NACK this PR because of the multiple issues this feature has introduced (including #4052), and I personally feel we should revert the feature for now and re-evaluate it (with heavy testing) before merging it back into master. The most important thing is to keep Bisq stable and we must be risk averse when it comes to these types of changes.

@ripcurlx
Copy link
Member

and I personally feel we should revert the feature for now and re-evaluate it (with heavy testing) before merging it back into master. The most important thing is to keep Bisq stable and we must be risk averse when it comes to these types of changes.

That was my first intention, but the problem with reverting this feature is that unfortunately too much code touching this parts of the codebase occurred, so I can't just revert it by the GitHub feature. So it would also need lots of manual code changes which are also error-prone as well. I agree that keeping Bisq stable is key, so I think for the future we should also discuss having more devs than just two involved in release testing. In our last three releases we always had issues based on very specific OS setups. That can only be detected if lots of people are at least starting and smoke testing new releases. In short for the v1.2.9 update I think we need to go with the best local bitcoin node detection solution we can come up with.

Copy link
Member

@sqrrm sqrrm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK

@sqrrm
Copy link
Member

sqrrm commented Mar 13, 2020

I think, like @ripcurlx, that we should merge this PR since it's fixing something that's wrong in the current code base. If we decide to not release the local bitcoin node detection we'll make a separate revert commit later on. I won't merge until at least one more person gives an ACK or @wiz changes his mind though.

@wiz this PR doesn't affect the inclusion of local bitcoin node in the next release. If that's your reason for NACK I think that's the wrong way to suggest reverting the entire feature.

@ripcurlx ripcurlx merged commit d309fd6 into bisq-network:master Mar 14, 2020
@ripcurlx
Copy link
Member

@dmos62 When running with a local full node I recognized that I always the following exception log in the console:

März-14 11:54:29.683 [NioClientManager] WARN  o.b.net.ConnectionHandler: Error handling SelectionKey: java.nio.channels.CancelledKeyException  java.nio.channels.CancelledKeyException: null
	at java.base/sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
	at java.base/sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
	at java.base/java.nio.channels.SelectionKey.isWritable(SelectionKey.java:310)
	at org.bitcoinj.net.ConnectionHandler.handleKey(ConnectionHandler.java:229)
	at org.bitcoinj.net.NioClientManager.handleKey(NioClientManager.java:86)
	at org.bitcoinj.net.NioClientManager.run(NioClientManager.java:122)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
	at org.bitcoinj.utils.ContextPropagatingThreadFactory$1.run(ContextPropagatingThreadFactory.java:49)
	at java.base/java.lang.Thread.run(Thread.java:844)

java.nio.channels.CancelledKeyException: null
	at java.base/sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
	at java.base/sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
	at java.base/java.nio.channels.SelectionKey.isWritable(SelectionKey.java:310)
	at org.bitcoinj.net.ConnectionHandler.handleKey(ConnectionHandler.java:229)
	at org.bitcoinj.net.NioClientManager.handleKey(NioClientManager.java:86)
	at org.bitcoinj.net.NioClientManager.run(NioClientManager.java:122)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
	at org.bitcoinj.utils.ContextPropagatingThreadFactory$1.run(ContextPropagatingThreadFactory.java:49)
	at java.base/java.lang.Thread.run(Thread.java:844)

The connection is established successfully to the node, but I did get this exception.

@wiz
Copy link
Member

wiz commented Mar 14, 2020

@sqrrm yes good points, you're right. I spoke with @ripcurlx and he told me the release plans for monday

@dmos62
Copy link
Contributor Author

dmos62 commented Mar 14, 2020

@dmos62 When running with a local full node I recognized that I always the following exception log in the console:

März-14 11:54:29.683 [NioClientManager] WARN  o.b.net.ConnectionHandler: Error handling SelectionKey: java.nio.channels.CancelledKeyException  java.nio.channels.CancelledKeyException: null
	at java.base/sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
	at java.base/sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
	at java.base/java.nio.channels.SelectionKey.isWritable(SelectionKey.java:310)
	at org.bitcoinj.net.ConnectionHandler.handleKey(ConnectionHandler.java:229)
	at org.bitcoinj.net.NioClientManager.handleKey(NioClientManager.java:86)
	at org.bitcoinj.net.NioClientManager.run(NioClientManager.java:122)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
	at org.bitcoinj.utils.ContextPropagatingThreadFactory$1.run(ContextPropagatingThreadFactory.java:49)
	at java.base/java.lang.Thread.run(Thread.java:844)

java.nio.channels.CancelledKeyException: null
	at java.base/sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
	at java.base/sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
	at java.base/java.nio.channels.SelectionKey.isWritable(SelectionKey.java:310)
	at org.bitcoinj.net.ConnectionHandler.handleKey(ConnectionHandler.java:229)
	at org.bitcoinj.net.NioClientManager.handleKey(NioClientManager.java:86)
	at org.bitcoinj.net.NioClientManager.run(NioClientManager.java:122)
	at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
	at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
	at org.bitcoinj.utils.ContextPropagatingThreadFactory$1.run(ContextPropagatingThreadFactory.java:49)
	at java.base/java.lang.Thread.run(Thread.java:844)

The connection is established successfully to the node, but I did get this exception.

Those should be some of the expected exceptions I have not muted. I'll double check and submit PR to mute them.

@dmos62
Copy link
Contributor Author

dmos62 commented Mar 14, 2020

So the misguiding exceptions pointed out by @ripcurlx are a more difficult problem than I expected. To be clear, they are harmless. They are thrown because we close the Peer while it's still doing something in parallel. The "in parallel" bit is what complicates things. I'll continue investigating. Should also lead to a better understanding of the edge cases. Can't guarantee a solution for Monday.

We can consider, as a temporary measure, a log entry preceding these exceptions that explains them.

Edit: solved; see mentioned PR.

dmos62 added a commit to dmos62/bisq that referenced this pull request Mar 14, 2020
Replaces NioClient and NioClientManager with BlockingClient, which
gets rid of misguiding log entries as reported in bisq-network#4058 PR's comments by
@ripcurlx. Also introduces a few minor changes including better logs and
clearer code.
@ripcurlx
Copy link
Member

I've now also tested the two configuration errors again on Regtest and it is working as before 👍

@ripcurlx
Copy link
Member

Ups wrong PR 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bisq v1.2.8 fails to start on Linux about 25% of the time
4 participants