Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate causes of and solutions to transaction broadcast failures in v0.6.3 #1195

Closed
cbeams opened this issue Jan 15, 2018 · 29 comments
Closed
Assignees
Milestone

Comments

@cbeams
Copy link
Member

cbeams commented Jan 15, 2018

@ManfredKarrer wrote in #1168 (comment):

I did not implement the retry because I think that when we broadcast to all nodes the likelihood is very small that the tx did not get broadcasted and retry handling might introduce bugs and cannot easily be tested.

We're now seeing seeing a significant number of issues coming into arbitration (~10 at time of writing) where maker and deposit transactions have not been broadcast. @keo-bisq and I are tracking these cases in #1193.

This sort of message is what we see in the logs:

Jan-14 12:09:50.420 [JavaFX Application Thread] WARN i.b.c.b.w.Broadcaster: Broadcast of tx <txid> not completed after 8 sec. We optimistically assume that the tx broadcast succeeded and call onSuccess on the callback handler.

And that log message is part of the change that was implemented in #1168, specifically in commits 1b7db08 and dc310d9.

@ManfredKarrer
Copy link
Member

One additional problem can be that the user closed the app before the broadcast is completed. That can happen already now but if we add a retry it can extend the time when the user must not close the app for several minutes. So we should add some information in the UI to deal with that.

I would like to first investigate if the Tor issues are really external issues or (mixed) with issues introduced in the 0.6 release. I will try to setup old 0.5 or even 0.4.9 nodes and see if the Tor network is slow and less reliable there as well.

@cbeams
Copy link
Member Author

cbeams commented Jan 16, 2018

@ManfredKarrer wrote in #1193 (comment) (which I'm bringing over to this issue, because the other issue is just about keeping track of incidences of the problem. This issue is about fixing it):

@cbeams Damn, seems the assumption that the broadcast will succeed sufficiently even we don't hear back from peers was wrong. I will not be able to look closer to it until weekend due traveling...

When searching the log the tx id and the broadcast attempt should be visible. If there are no logs for hearing back from peers then the broadcast did not get confirmed by back reporting peers and our timeout triggered the completion of the broadcast. The app logs a msg like:
"Broadcast of tx .... not completed after .... sec. We optimistically assume that the tx broadcast succeeded....".

People should not post json contract here as it contains the payment method details. I will remove the one above from user @Splitter8.

Instead the json contract the trade ID, taker fee tx, maker fee tx and deposit tx are sufficient. Would be good to post them in the posts so its easy to look up explorer and search log files with it. All those data are not privacy relevant and can be posted in plain text here.

The deposit tx is always failing if one of the trade fee txs was not getting into the blockchain.

We should also check logs on the bitcoin nodes to see if there was any issue.

Reporting users should also post the Bisq version they used at the time when they did the tx.

We should find anyone who knows more about the Tor network of has contact to Tor devs to get more background about the Tor issues. I saw that Tor connections are much slower and less reliable as they have been in the past. For BitcoinJ we could fall back to clear-net connections in the worst case (has some privacy issues). For the P2P network we don't have any alternative though.

@cbeams
Copy link
Member Author

cbeams commented Jan 16, 2018

Hey @ManfredKarrer, responding to the above:

seems the assumption that the broadcast will succeed sufficiently even we don't hear back from peers was wrong. I will not be able to look closer to it until weekend due traveling...

It’s Tuesday morning now. Per #1193, I’ve had 5 of these issues show up in the last 48 hours, and the oldest of my trades is one that was offered on Jan 12th / taken on Jan 13th.

@keo-bisq has had 5 of these issues, too, but strangely, his oldest one is from all the way back on Dec 31st, and his newest one is from Jan 12th.

I do not have a theory as to why I've started seeing this issue in trades that happened only after v0.6.3 was released and Keo has seen these issues in trades that happened before it. The situation doesn't make sense for 2 reasons:

  1. if this is happening because of the changes in Tolerate transaction broadcast timeouts #1168 (as mentioned above in Investigate causes of and solutions to transaction broadcast failures in v0.6.3 #1195 (comment)), then we shouldn't have been seeing these issues at all before then.
  2. both Keo and I should be seeing new issues of this type come into our arbitration queues at roughly the same rate. It's possible that randomness has just been routing them all my way over the last two days, but the odds of that continuing much longer are low and getting lower.

In any case, the issues seem to be coming in at an increasing rate. As I type this, we've had 100 trades since Jan 12th (a little over 3 full days), and 5 of those trades have ended up in arbitration with this problem. Surely more are suffering from it already, but have just not had a dispute ticket opened yet. So it's at least 5/100 trades, and probably something more like 8/100.

If we keep going at the current rate of ~30 trades per day, that'll mean another ~180 trades between now and Sunday (inclusive), which at the current error rate will mean something like another 15 cases on top of the existing 10. This means we should expect to do something like ~25 reimbursements as a result of this problem, assuming we ship a fix by the end of the weekend. Note that this is on top of the 29 reimbursements I'm currently in the middle of processing from the earlier timeout issues (see bisq-network/support#1 and its child issues).

In any case, I’m in full firefighting mode here, between working through the earlier timeout reimbursements and now working through these new failed tx broadcast issues.

I don’t think it makes sense for me to try to fix this problem myself and ship a new release on my own. That will just introduce more risk, and it will cause my arbitration / reimbursement efforts to queue up and get further delayed.

At this point I think what makes the most sense is to do the following every time one of these issues comes in:

  1. Log it immediately with a comment in Track failed tx broadcasts #1193 like we have been doing.
  2. Create a new issue in bisq-network/support to track reimbursing whichever transaction(s) did successfully get broadcast for that trade:
    a. usually this is just one of the maker or taker transactions, but include whichever transactions actually made it
    b. Always double-check the deposit transaction, because the Details screen doesn't always show a checkmark for the deposit transaction (known issue).
    c. when creating the GitHub issue, include the trade ID and the transaction id(s) that need to be reimbursed. Do not include any other information.
  3. Send arbitration chat messages to both buyer and seller explaining what happened and what to expect next, specifically:
    a. Explain that their trade is suffering from a bug that we're working on fixing, and provide links to Track failed tx broadcasts #1193 and Investigate causes of and solutions to transaction broadcast failures in v0.6.3 #1195 (this issue) so they can track resolution of the problem.
    b. Include a link to the reimbursement support issue created in (2) above. This link should be sent to the maker and/or taker according to which transactions were actually broadcast. Let them know that this is their issue and that they should log into GitHub (creating an account first if necessary) and then click the Subscribe on the issue to get notified about changes to it.
    c. Let both traders know that you are now going to close the dispute ticket, and that any further communication about reimbursement should happen via the linked GitHub issue or via email with chris@beams.io.
  4. Stage each transaction for reimbursement in a csv file in the bisq-network/support repository (like we've been doing for the timeout reimbursements in this file)
  5. Continue steps 1–4 until @ManfredKarrer has a fix for the issue and has shipped a new release
  6. Announce the release and wait a few days for everyone to upgrade and for any in-flight trades suffering from this problem to make their way into arbitration.
  7. Craft and send a batch transaction that pays out all the transactions staged for reimbursement in step (4).
  8. Close all the GitHub issues created in step (2) with a comment linking to the batch reimbursement transaction in step (7).
  9. Have a beer.

I will start working through steps 1–4 for my 5 issues as soon as possible, but I'm going to focus on finishing staging the timeout reimbursements first.

@keo-bisq, @ManfredKarrer, if you have any comments or questions about the plan above, please let me know.

@ManfredKarrer
Copy link
Member

Regarding the old reports at @keo-bisq: If maker and taker fee tx was happening before the 0.6.3. release the log should show different msg and it seems that the tx got broadcasted to our btc nodes but not further out. That can be bc of too low fees. Analyzing those log files should show more.

What would be a great help if anyone can start analyzing the log files.
To gather log files from our btc nodes and cross check it would be another thing to do.

We decreased the tx fee estimation. That might be an additional factor. We don't know for sure if the tx got broadcasted but the the min fee policies in full nodes ignores the tx because of too low fee.
Another thing we can do is to restart the btc full nodes to reset that min fee value which gets higher the longer the node runs.

What I can do today:

  • Increase tx fee in estimation service and tell others to do the same
  • Restart my btc nodes and tell others to do the same

@cbeams What do you think? Should we do that?

@ManfredKarrer
Copy link
Member

Our fee ist atm 435 sat/byte (http://37.139.14.34:8080/getFees)
Best fee at https://bitcoinfees.earn.com/ is 470. But at the 410-420 byte range there are 90k new txs (last 24 hours).
So seems it would make sense to bump up the fees to play more safe.

@ManfredKarrer
Copy link
Member

The data on https://dedi.jochen-hoenicke.de/queue/#24h are very different from those on earn.com.

@ManfredKarrer
Copy link
Member

Just checked mempoolminfee on my btc nodes and it is about 50 sat/bytes. So that cannot be the issue as we use about 400. I will ask the other node operators to post their values but assume it will not be much higher.

@cbeams
Copy link
Member Author

cbeams commented Jan 16, 2018

@bisq-network/btcnode-operators, could you please report your getmempoolinfo in a comment here?

e.g.:

$ bitcoin-cli getmempoolinfo
{
  "size": 3619,
  "bytes": 2149263,
  "usage": 6459840,
  "maxmempool": 300000000,
  "mempoolminfee": 0.00000000
}

And @ManfredKarrer, regarding your idea about restarting btc nodes to reset mempoolminfee values, why would this work at all if indeed -persistmempool is enabled (by default) on all nodes? (see bisq-network/bisq-btcnode#2) If restarting a node does in fact reset this value, then there must be something we (or I) am not understanding.

I would be very surprised if our current transaction fees were causing our transactions to get rejected in any case. I see transactions coming into the mempool all the time that are well below our fee rates. And furthermore, if this were the cause, then we should expect to see a much higher tx broadcast failure rate that we are actually seeing. Right now it's 1 out of every 15 or 20 trades that suffers from failed tx broadcasts. If our estimates were getting rejected from even entering the mempool because they're too low, we'd probably be seeing a majority of trades with failed tx broadcasts.

@ManfredKarrer
Copy link
Member

@cbeams Yes agree.

Just looked into one log and saw:
"TransactionBroadcast: Waiting for 4 peers required for broadcast, we have 1 ... "
So there was the problem that the peer was connected to only 1 btc node.

@ManfredKarrer
Copy link
Member

@cbeams What do u think about bumping the tx fee? now we use maxblocks 10 we could go up to maxblocks 5.

@Emzy
Copy link
Contributor

Emzy commented Jan 16, 2018

“size”: 51234,
“bytes”: 125062324,
“usage”: 294769472,
“maxmempool”: 300000000,
“mempoolminfee”: 0.00050423

“size”: 51259,
“bytes”: 125128925,
“usage”: 294981120,
“maxmempool”: 300000000,
“mempoolminfee”: 0.00050390

“size”: 51319,
“bytes”: 124982740,
“usage”: 295031968,
“maxmempool”: 300000000,
“mempoolminfee”: 0.00050381

“size”: 51557,
“bytes”: 125329619,
“usage”: 295318352,
“maxmempool”: 300000000,
“mempoolminfee”: 0.00050335

@mrosseel
Copy link
Member

mrosseel commented Jan 16, 2018

mike@bitcoin:~$ btc getmempoolinfo
[sudo] password for mike:
{
“size”: 50993,
“bytes”: 123479482,
“usage”: 290613712,
“maxmempool”: 300000000,
“mempoolminfee”: 0.00052143
}

cat bitcoin.conf:
listen=1
maxconnections=500
timeout=30000

mike@bitcoin:~$ btc getnetworkinfo
{
“connections”: 499,


Second node:

mike@bitcoin:~$ btc getmempoolinfo
[sudo] password for mike:
{
“size”: 50017,
“bytes”: 124128503,
“usage”: 296729456,
“maxmempool”: 300000000,
“mempoolminfee”: 0.00051501
}

cat bitcoin.conf:
listen=1
maxconnections=500
timeout=30000

mike@bitcoin:~$ btc getnetworkinfo
{
“connections”: 495,

@ManfredKarrer
regarding: "TransactionBroadcast: Waiting for 4 peers required for broadcast, we have 1 ... "
could the nr of connected peers be checked earlier in the process, so we can error out? Not a fix, but might make the error less nasty.

@ManfredKarrer
Copy link
Member

@cbeams The case where the tx fee was 100 sat/bytes is probably caused due not connecting to the fee service. The default fee is 100 sat/byte in the code (if connection to fee service has not succeeded yet). I saw often that it takes quite a while to get the connection.
I think the main issue of all those problems is that Tor is very slow and less reliable since 2 months.

@cbeams
Copy link
Member Author

cbeams commented Jan 16, 2018

The default fee is 100 sat/byte in the code (if connection to fee service has not succeeded yet)

Well, this fact alone could explain the failed tx broadcasts.

Typical Bisq maker/taker transactions are between 225-270 bytes in size. At a 100 sat/byte feerate, that means these users are broadcasting transactions with ~22,500 and 27,000 satoshi fees. We just determined that current mempoolminfee values across @bisq-network/btcnode-operators' nodes are in the 50,000 satoshi range.

The hypothesis, then is as follows:

  1. Because of poor Tor network conditions, users are unable to communicate with pricenodes to get fee estimates
  2. Bisq falls back to its hard-coded 100 sat/byte fee rate and creates maker/taker/deposit txns with absolute fee values in the 20,000-satoshi range
  3. mempoolminfee values on Bisq nodes (and likely the rest of the Bitcoin network) are in the 50,000-satoshi range
  4. These maker/taker/deposit txns get rejected immediately by our nodes
  5. We ignore the tx confidence timeout that follows as of v0.6.1
  6. Trade protocol progresses, but with missing txns and trades ends up in arbitration like we're seeing now in Track failed tx broadcasts #1193.

What we should do about it:

I see a few options:

  1. Increase the default sat/byte value by 3-5x to guarantee these txns exceed the current ~50000 satoshi mempoolminfee threshold
  2. Start connecting to pricenodes via clearnet by default
  3. Connect to pricenodes via clearnet as a fallback if we can't connect via Tor

Option (1) is something we could do and ship a fix for ASAP. I'm going to trawl through the logs we have to see if I can find evidence of the hypothesized series of events playing out.

@cbeams
Copy link
Member Author

cbeams commented Jan 16, 2018

Major flaw in the thinking above:

The value of mempoolminfee represents the minimum fee rate in BTC per kilobyte for transactions to get into a node's mempool. I is not an floor on the absolute fee value for a transaction as I assumed above.

@cbeams
Copy link
Member Author

cbeams commented Jan 16, 2018

This means that, with mempoolminfee values around 50000 satoshis (per kilobyte), we're talking about 50 satoshis per byte. And that means txns that fall back to our our 100 sat/byte default should get accepted.

I would say this invalidates the hypothesis above completely, but I did just see at least one report of a node with a mempoolminfee greater than 100 sat/byte:

image

Perhaps we deal with spikes across our federation of nodes where our mempoolminfee values are > 100 sat/byte as well? And when this happens, users falling back to the 100 sat/byte minimum are getting priced out? I don't know. Still thinking. Will try to find any data in the logs.

@Emzy
Copy link
Contributor

Emzy commented Jan 16, 2018

"size": 54836,
"bytes": 127244700,
"usage": 299995216,
"maxmempool": 300000000,
"mempoolminfee": 0.00050924

"size": 55063,
"bytes": 127056044,
"usage": 299996896,
"maxmempool": 300000000,
"mempoolminfee": 0.00050945

"size": 54520,
"bytes": 126728224,
"usage": 298437968,
"maxmempool": 300000000,
"mempoolminfee": 0.00049453

"size": 55204,
"bytes": 127184337,
"usage": 299927872,
"maxmempool": 300000000,
"mempoolminfee": 0.00050945

@ManfredKarrer
Copy link
Member

I assume that 100 sat/byte tx was an exception. But in the logs the tx will show the tx fee. At least I have not seen any so far (also from the last weeks) with 100 sat/byte.
If a price node is not reachable it will try the other ones, though it can take a bit. We can definitely improve that area to make either parallel connection attempts or decrease the timeouts there to fall back faster to another node.

There is also code for checking that the min. nr. of bitcoin peers is >=4 for make offer and take offer views (at time when the user clicks take offer in the offer book view, not at take offer confirm button -> should be added there as well as the user could lose connections in the meantime).

@ManfredKarrer
Copy link
Member

Maybe anyone can check Tor mailing lists, etc. to see if there is anything relevant regarding our issues....

@sneurlax
Copy link

sneurlax commented Jan 17, 2018

It could be that lowering @bisq-network/btcnode-operators' mempoolminfee does mitigate accruing any (or many) more reimbursements (maker and taker tx gets broadcast properly.) Raising the fallback is a bandaid until bitcoinj extends support for BIP173 (Bech32 / P2WKPH.)

Could we also cache the transactions meant for broadcast, saving them for re-broadcast/debugging purposes? I'll try tinkering to satisfy my curiousity... I'd appreciate a file/class tip in that direction!

Edit: the fee estimation spits out ... This is expected if the user pays from an external wallet., leading me to troubleshoot a bit. When transacting via Bisq's wallet, the order went through fine. Paying via an external wallet--even with a high fee and getting confirmation in the next block--borke things. I don't know how repeatable that is, but you can try it for yourself if you have funds to spare!

It'd be great to have a bisq branch for testnet. Is there such a thing? Tips... much appreciated :P

@ManfredKarrer
Copy link
Member

BitcoinJ re-broadcasts the tx at startup as far I have seen. Not 100% sure if that is the case also if the tx has been committed to the wallet (as we do in the timeout).

The broadcast happens in the Broadcaster class (including the timeout handling). Reason was that we did not heard back from the number of required nodes (usually 2) in time and that caused a timeout in the trade protocol (taker fee was broadcasted but as we waited too long to hear back the trade protocol timed out and the take-offer-attempt failed, leading to paid trade fee tx but no trade).

Re external funding:
Maybe the external wallet used a tx type not supported by BitcoinJ. Support for version 2 transactions have been added in master of BitcoinJ but is not merged to our fork (based on last BitcoinJ release). That is definitely on our TODO list but was not enough time for latest release to put that in. As BitcoinJ master is far away form the 0.14.4 BitcoinJ release the merge is a bit cumbersome. Not sure if other features not supported in BitcoinJ (RBF, Segwit) might lead to issues, but I am not aware of any.

Re Testnet:
You can run your own Testnet seednode and edit the CoreSeedNodeRepository class to add the onion address (GH wiki has some instructions). I had in the past one node running (or maybe its still running will check the next days - traveling now). You have to deactivate Tor and provided nodes for BitcoinJ as there are no provided Bitcoin testnet nodes and not enough public tor nodes on testnet. Having such a test setup will then diverge much from mainnet. I assume the unstable/slow tor network is the main reason for those issues...

@cbeams cbeams closed this as completed Jan 17, 2018
@cbeams cbeams reopened this Jan 17, 2018
@MaximFL
Copy link

MaximFL commented Jan 17, 2018

@cbeams I was told that the dispute ticket for issue #1193 will be closing soon and as noted above to reach out to you about reimbursement. Please let me know how I should proceed with this process and if you would like me to provide any further information.

@cbeams
Copy link
Member Author

cbeams commented Jan 17, 2018

@MaximFL, please send me an email at chris@beams.io with the transaction id(s) that you believe should be reimbursed and a screenshot of them from your Funds->Transactions screen. Thanks.

@cbeams
Copy link
Member Author

cbeams commented Jan 17, 2018

FYI, here is what looks like a textbook fee transaction that fell back to the 100 sat/byte default. From https://tradeblock.com/bitcoin/tx/0573815a72c9bd69947084d80ddc73d42cca3461f460c5dd9480d061c248df03:

image

@Splitter8
Copy link

Most recent log regarding trade EJOLOw
bisq.zip

@cbeams
Copy link
Member Author

cbeams commented Jan 20, 2018

@ManfredKarrer wrote in #1193 (comment):

From the log files I checked so far the clear problem is that if a user uses a localhost Bitcoin node which is not fully synced (or there are other reasons why the connection fails) that it connects to a public clearnet node and thus has only 1 Bitcoin peer, which creates a higher probability that the tx does not get broadcasted.

If anyone can provide more logs I will check it. Important is to get the logs from both traders as it is likely only one side causing the issue.

And shortly after posting the above, @ManfredKarrer created issue #1244 to implement a fix for this.

I'll leave this investigation issue open for the time being, until we're sure that this fix is indeed going to take care of the problem. In the meantime, keep an eye on #1244 and upgrade to v0.6.4 right when it ships.

@cbeams
Copy link
Member Author

cbeams commented Jan 22, 2018

Closing as complete with the changes in #1244. That is, we believe we have fixed the cause of these issues in Bisq v0.6.4. Everyone should update to that version as soon as possible, and certainly before engaging in further trades.

If we find that this issue crops back up, we can re-open this issue to continue the investigation.

Thanks to all who helped us track this down by posting their logs, etc!

@grisq
Copy link

grisq commented Jan 24, 2018

I got this message in Bisq but didn't reply before the dispute was closed:
"
Hi can you send your logs to github: #1195 or to me at stibor@protonmail.com
This particular trade seems to have two bugs and to find them we need the logs.
"

Instead of getting the deposit back I got a error-message and was told to open a github issue (#1261) - and this issue includes my log-file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants