Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wallet: Fix wallet loading race during node start #19876

Closed
wants to merge 3 commits into from

Conversation

fjahr
Copy link
Contributor

@fjahr fjahr commented Sep 4, 2020

Intended to fix 19853

There appears to be a race condition when using the wallet RPCs right after starting a node and that causes the tests to be flaky: in the wallet_basic.py a node is restarted and getbalance gets called right after. getbalance then sometimes reports a balance of 0 because the wallet is not loaded yet. It looks like this only appeared after the zapwallettxs tests were removed from this test but it should not directly be caused by the removal, it rather hides the issue less often. I wouldn't be surprised if this is a known issue that has popped up in a different context already but I didn't find anything on it yet.

My naive solution is to track a ready state for CWallet but I am very much looking for concept feedback on that. Returning Null seems to be the default behavior for wallet RPCs in early failure cases like this. An explicit error might be more helpful for users to figure out what happened but I am opting for consistency for now. Either option is much better than reporting a balance 0 that is incorrect but that could be real and could cause alarm systems to go off because all of a sudden a wallet balance is 0 after a node restarts.

Whether this ends up being the final solution or not, after the conceptual discussion I think it will make sense to add this check to other wallet RPCs and I will also check if there are other open flaky wallet test issues that might be caused by the same bug.

To reproduce the issue see the commented out block of code that I left in the test file. This reproduces the issue on ~4 out of 5 tries on my machine.

EDIT: The actual error that appears in the CI is hit here where node0_balance is 0 and then sendtoaddress is called with a negative amount.

@jonatack
Copy link
Member

jonatack commented Sep 4, 2020

Thanks for looking at this, will review.

@fjahr fjahr force-pushed the wallet_race branch 2 times, most recently from 7f34859 to dc729f8 Compare September 4, 2020 20:59
@achow101
Copy link
Member

achow101 commented Sep 4, 2020

We currently track what wallets are currently loading (see CWallet::LoadWallet) to avoid a different race condition. I think we could reuse that for the startup wallet loading. Then in the RPC, GetWalletForJSONRPCRequest should check that the wallet isn't in the list of loading wallets. If it is, it should raise an exception. I don't think that returning NullUniValue is the correct way to handle this error.

Copy link
Member

@promag promag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just needs a call to syncwithvalidationinterfacequeue before getbalance?

@fjahr
Copy link
Contributor Author

fjahr commented Sep 5, 2020

Just needs a call to syncwithvalidationinterfacequeue before getbalance?

Yes, this fixes the test. Initially, I was hoping I could fix the test and the underlying race issue in one PR but since the race fix is a bit more complicated I think it makes sense to split it from the test fix, so the CI failures stop as fast as possible. I opened it in #19887 and made you co-author.

@promag
Copy link
Member

promag commented Sep 5, 2020

Is this really a bug? The wallet is just "catching up" (?)

@fjahr
Copy link
Contributor Author

fjahr commented Sep 5, 2020

Is this really a bug? The wallet is just "catching up" (?)

Yes, it's catching up but while it's doing that it responds with a false balance to a getbalance request, which could mean the wallet has been hacked or it's broken. Instead, there should be an actual error that indicates that it's catching up or just anything that does not look like a real balance.

@promag
Copy link
Member

promag commented Sep 5, 2020

The balance is fine I think, the wallet isn't aware of the mempool yet right? So, the wallet itself is loaded, but not synced.

@fjahr
Copy link
Contributor Author

fjahr commented Sep 5, 2020

The balance is fine I think, the wallet isn't aware of the mempool yet right? So, the wallet itself is loaded, but not synced.

As a user, I would expect that the wallet would report my own change output in the balance even if the mempool is still syncing. But I will have to review that code and maybe I am off on that. The many reports in the CI failure issue in the span of 2 days suggest that this is hit pretty frequently on startup and I think we should minimize the potential of reporting unexpected balances to users. But if others agree that this is not a big deal, I will close it.

maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Sep 6, 2020
56b018c test: Fix flaky wallet_basic test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#19853

  I investigated the issue in bitcoin#19876 and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, bitcoin#19876 will only provide an error where before it was reporting a false balance.

Top commit has no ACKs.

Tree-SHA512: 52bb2388a3e77aa20d26ab0fd45796bc1781483b1cffe49cbb44e2488a72e76998edfb1198495373f9c6fd2ec26064d4176bd1a64dd59806622d5e50a4f4e870
@maflcko
Copy link
Member

maflcko commented Sep 6, 2020

When this bug is fixed, it would be good to remove the temporary workarounds. git grep -W syncwithval test. Specifically:

test/functional/mempool_persist.py-        # Verify accounting of mempool transactions after restart is correct
test/functional/mempool_persist.py:        self.nodes[2].syncwithvalidationinterfacequeue()  # Flush mempool to wallet
test/functional/mempool_persist.py-        assert_equal(node2_balance, self.nodes[2].getbalance())

test/functional/wallet_basic.py-        # Prevent potential race condition when calling wallet RPCs right after restart
test/functional/wallet_basic.py:        self.nodes[0].syncwithvalidationinterfacequeue()

This is used to prevent race conditions when using wallet RPCs right after starting a node. A wallet that is not synced with the mempool could temporarily report false balances because it's own transactions with change outputs do not appear to be in the mempool and are not trusted because of that.
@fjahr fjahr marked this pull request as ready for review September 18, 2020 21:31
@fjahr
Copy link
Contributor Author

fjahr commented Sep 18, 2020

I think this is ready for review now. It got harder to reproduce the issue with my script for some reason I can not explain. But through detailed logging I am certain that the issue is that postInitProcess() has not run yet after startup and thus the transactions (our own with change) that are in the mempool are not marked as in the mempool. This means they are not trusted and don't get included in getbalance.

We only want to ensure that the mempool sync runs once on startup. A flag indicating that it has in fact run still seems a reasonable solution after thinking about it a bit more, but happy to consider other approaches. Otherwise, I have improved the implementation a lot by not failing anymore when the issue is encountered and I also gave the flag a more reasonable name. I have combined this with a refactor which let me remove a lot of repetitive code. I have also removed the temporary fixes for the tests which used syncwithvalidationinterfacequeue.

@DrahtBot
Copy link
Contributor

DrahtBot commented Sep 20, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Member

@promag promag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not ensure wallet is ready on load/startup? No changes would be needed on RPC code.

public:
/*
* Main wallet lock.
* This lock protects all the fields added by CWallet.
*/
mutable RecursiveMutex cs_wallet;

/** Indicated the wallet was initially synced with the mempool. */
void setMempoolSynced() { m_mempool_synced = true; };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d37756f

Should be private?

pwallet->BlockUntilSyncedToCurrentChain();

// Ensure that the mempool has been synced at least once on startup
if (!pwallet->mempoolSynced()) pwallet->syncMempool();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d37756f

Just call pwallet->syncMempool() and early return there - no need to expose mempool synced flag.

@@ -725,13 +725,22 @@ class CWallet final : public WalletStorage, public interfaces::Chain::Notificati

bool CreateTransactionInternal(const std::vector<CRecipient>& vecSend, CTransactionRef& tx, CAmount& nFeeRet, int& nChangePosInOut, bilingual_str& error, const CCoinControl& coin_control, bool sign);

/** Wallet was initially synced with the mempool. */
bool m_mempool_synced{false};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d37756f

Make it atomic? Currently accessed in multiple threads.

@@ -77,6 +77,20 @@ static bool ParseIncludeWatchonly(const UniValue& include_watchonly, const CWall
}


static CWallet* GetReadyWallet(std::shared_ptr<CWallet> const wallet) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d37756f

NAK the dereferencing because the wallet can be destroyed after the call, like

CWallet* const pwallet = GetReadyWallet(GetWalletForJSONRPCRequest(request));

Also it's not clear the function does the dereference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the following signature might be better?

 static CWallet& GetReadyWallet(CWallet& wallet)
{

@DrahtBot
Copy link
Contributor

DrahtBot commented Dec 2, 2020

🐙 This pull request conflicts with the target branch and needs rebase.

Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a "draft".

@bitcoin bitcoin deleted a comment Jan 19, 2021
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jun 27, 2021
56b018c test: Fix flaky wallet_basic test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#19853

  I investigated the issue in bitcoin#19876 and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, bitcoin#19876 will only provide an error where before it was reporting a false balance.

Top commit has no ACKs.

Tree-SHA512: 52bb2388a3e77aa20d26ab0fd45796bc1781483b1cffe49cbb44e2488a72e76998edfb1198495373f9c6fd2ec26064d4176bd1a64dd59806622d5e50a4f4e870
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jun 28, 2021
56b018c test: Fix flaky wallet_basic test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#19853

  I investigated the issue in bitcoin#19876 and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, bitcoin#19876 will only provide an error where before it was reporting a false balance.

Top commit has no ACKs.

Tree-SHA512: 52bb2388a3e77aa20d26ab0fd45796bc1781483b1cffe49cbb44e2488a72e76998edfb1198495373f9c6fd2ec26064d4176bd1a64dd59806622d5e50a4f4e870
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jun 29, 2021
56b018c test: Fix flaky wallet_basic test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#19853

  I investigated the issue in bitcoin#19876 and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, bitcoin#19876 will only provide an error where before it was reporting a false balance.

Top commit has no ACKs.

Tree-SHA512: 52bb2388a3e77aa20d26ab0fd45796bc1781483b1cffe49cbb44e2488a72e76998edfb1198495373f9c6fd2ec26064d4176bd1a64dd59806622d5e50a4f4e870
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 1, 2021
56b018c test: Fix flaky wallet_basic test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#19853

  I investigated the issue in bitcoin#19876 and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, bitcoin#19876 will only provide an error where before it was reporting a false balance.

Top commit has no ACKs.

Tree-SHA512: 52bb2388a3e77aa20d26ab0fd45796bc1781483b1cffe49cbb44e2488a72e76998edfb1198495373f9c6fd2ec26064d4176bd1a64dd59806622d5e50a4f4e870
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 1, 2021
56b018c test: Fix flaky wallet_basic test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#19853

  I investigated the issue in bitcoin#19876 and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, bitcoin#19876 will only provide an error where before it was reporting a false balance.

Top commit has no ACKs.

Tree-SHA512: 52bb2388a3e77aa20d26ab0fd45796bc1781483b1cffe49cbb44e2488a72e76998edfb1198495373f9c6fd2ec26064d4176bd1a64dd59806622d5e50a4f4e870
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 15, 2021
56b018c test: Fix flaky wallet_basic test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#19853

  I investigated the issue in bitcoin#19876 and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, bitcoin#19876 will only provide an error where before it was reporting a false balance.

Top commit has no ACKs.

Tree-SHA512: 52bb2388a3e77aa20d26ab0fd45796bc1781483b1cffe49cbb44e2488a72e76998edfb1198495373f9c6fd2ec26064d4176bd1a64dd59806622d5e50a4f4e870
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 16, 2021
56b018c test: Fix flaky wallet_basic test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#19853

  I investigated the issue in bitcoin#19876 and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, bitcoin#19876 will only provide an error where before it was reporting a false balance.

Top commit has no ACKs.

Tree-SHA512: 52bb2388a3e77aa20d26ab0fd45796bc1781483b1cffe49cbb44e2488a72e76998edfb1198495373f9c6fd2ec26064d4176bd1a64dd59806622d5e50a4f4e870
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 23, 2021
Summary:
> I investigated the issue in [[bitcoin/bitcoin#19876 | core#19876]] and I still intend to fix the underlying issue of a race when using wallet RPCs right after starting a node in that PR. However, since that is a bit more complicated than I initially thought it makes sense to merge the fix of the test so the intermittent test failures stop. This fix in the test is going to be needed, either way, #19876 will only provide an error where before it was reporting a false balance

Co-authored-by: João Barbosa <joao.paulo.barbosa@gmail.com>

This is a backport of [[bitcoin/bitcoin#19887 | core#19887]]

Test Plan: `ninja check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D10183
@DrahtBot
Copy link
Contributor

There hasn't been much activity lately and the patch still needs rebase. What is the status here?
  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

1 similar comment
@DrahtBot
Copy link
Contributor

There hasn't been much activity lately and the patch still needs rebase. What is the status here?
  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

@achow101
Copy link
Member

Closing this as it has not had any activity in a while. If you are interested in continuing work on this, please leave a comment so that it can be reopened.

@achow101 achow101 closed this Oct 12, 2022
@bitcoin bitcoin locked and limited conversation to collaborators Oct 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants