
Improve transaction relay logic #4985

Open · wants to merge 11 commits into base: develop
Conversation

@ximinez (Collaborator) commented Apr 10, 2024

High Level Overview of Change

This PR, if merged, will improve transaction relay logic around a few edge cases.

(I'll write a single commit message later, but before this PR is squashed and merged.)

Context of Change

A few months ago, while examining some of the issues around the 2.0.0 release and auditing the transaction relay code, I identified a few areas with potential for improvement.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Tests (you added tests for code that already exists, or your new feature in this PR includes tests)

Before / After

This PR is divided into four mostly independent changes.

  1. "Decrease shouldRelay limit to 30s." Pretty self-explanatory. Currently, the limit is 5 minutes, by which point the HashRouter entry could have expired, making this transaction look brand new (and thus causing it to be relayed back to peers which have sent it to us recently).
  2. "Give a transaction more chances to be retried." Will put a transaction into LedgerMaster's held transactions if the transaction gets a ter, tel, or tef result. Old behavior was just ter.
    • Additionally, to prevent a transaction from being repeatedly held indefinitely, it must meet some extra conditions. (Documented in a comment in the code.)
  3. "Pop all transactions with sequential sequences, or tickets." When a transaction is processed successfully, currently, one held transaction for the same account (if any) will be popped out of the held transactions list, and queued up for the next transaction batch. This change pops all transactions for the account, but only if they have sequential sequences (for non-ticket transactions) or use a ticket. This issue was identified from interactions with @mtrippled's Apply transaction batches in periodic intervals. #4504, which was merged, but unfortunately reverted later by Revert "Apply transaction batches in periodic intervals (#4504)" #4852. When the batches were spaced out, it could potentially take a very long time for a large number of held transactions for an account to get processed through. However, whether batched or not, this change will help get held transactions cleared out, particularly if a missing earlier transaction is what held them up.
  4. "Process held transactions through existing NetworkOPs batching." In the current processing, at the end of each consensus round, all held transactions are directly applied to the open ledger, then the held list is reset. This bypasses all of the logic in NetworkOPs::apply which, among other things, broadcasts successful transactions to peers. This means that the transaction may not get broadcast to peers for a really long time (5 minutes in the current implementation, or 30 seconds with this first commit). If the node is a bottleneck (either due to network configuration, or because the transaction was submitted locally), the transaction may not be seen by any other nodes or validators before it expires or causes other problems.

* "Decrease shouldRelay limit to 30s":
  * Allows transactions, validator lists, proposals, and validations to be
    relayed more often, but only when triggered by another event, such as
    receiving it from a peer.
  * Decrease from 5min.
  * Expected to help transaction throughput on poorly connected networks.
* "Give a transaction more chances to be retried" (see the sketch after this list):
  * Hold if the transaction gets a ter, tel, or tef result.
  * Use the new SF_HELD flag to ultimately prevent the transaction from
    being held and retried too many times.
* "Process held transactions through existing NetworkOPs batching":
  * Ensures that successful transactions are broadcast to peers,
    appropriate failed transactions are held for later attempts, fee
    changes are sent to subscribers, etc.
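
A hedged sketch combining the retry bullets above (simplified standalone types; the real logic uses rippled's TER result classes and HashRouter flags, and the exact extra conditions are more nuanced and documented in the code):

```cpp
#include <cstdint>
#include <map>

// Simplified stand-ins for rippled's types.
enum class ResultClass { tes, tem, ter, tel, tef };
constexpr std::uint32_t SF_HELD = 0x01;  // illustrative bit value

// txID -> flags; stands in for the HashRouter's per-hash flags.
std::map<std::uint64_t, std::uint32_t> routerFlags;

// Hold a transaction for retry when its result class is ter/tel/tef,
// but refuse once the SF_HELD-style flag shows it was held before.
bool shouldHold(ResultClass rc, std::uint64_t txID)
{
    bool const retryable = rc == ResultClass::ter ||
        rc == ResultClass::tel || rc == ResultClass::tef;
    if (!retryable || (routerFlags[txID] & SF_HELD))
        return false;              // not retryable, or already held before
    routerFlags[txID] |= SF_HELD;  // record the hold to bound retries
    return true;
}
```
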
@ximinez added the "Perf Attn Needed" label (Attention needed from RippleX Performance Team) on Apr 10, 2024
@codecov-commenter commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 86.95652%, with 12 lines in your changes missing coverage. Please review.

Project coverage is 70.9%. Comparing base (5aa1106) to head (3863b01).

Additional details and impacted files


@@           Coverage Diff           @@
##           develop   #4985   +/-   ##
=======================================
  Coverage     70.9%   70.9%           
=======================================
  Files          796     796           
  Lines        66792   66851   +59     
  Branches     11002   11003    +1     
=======================================
+ Hits         47379   47429   +50     
- Misses       19413   19422    +9     
| Files | Coverage Δ |
|---|---|
| src/ripple/app/ledger/LocalTxs.h | 100.0% <ø> (ø) |
| src/ripple/app/ledger/impl/LedgerMaster.cpp | 40.0% <100.0%> (-<0.1%) ⬇️ |
| src/ripple/app/ledger/impl/LocalTxs.cpp | 100.0% <100.0%> (ø) |
| src/ripple/app/main/Application.cpp | 63.1% <100.0%> (+<0.1%) ⬆️ |
| src/ripple/app/misc/CanonicalTXSet.cpp | 100.0% <100.0%> (ø) |
| src/ripple/app/misc/HashRouter.cpp | 100.0% <100.0%> (ø) |
| src/ripple/app/misc/HashRouter.h | 100.0% <100.0%> (ø) |
| src/ripple/app/misc/NetworkOPs.h | 100.0% <ø> (ø) |
| src/ripple/app/misc/NetworkOPs.cpp | 66.2% <82.6%> (+0.5%) ⬆️ |

... and 4 files with indirect coverage changes


@scottschurr (Collaborator) left a comment

Very nice! I really appreciate the way the commits were divided up. It made the code review much easier.

I mostly left complimentary comments. But there's one bool that I suspect needs to be changed to a std::atomic<bool>. See what you think...

Comment on lines +83 to +84
(!itrNext->second->getSeqProxy().isSeq() ||
itrNext->second->getSeqProxy().value() == seqProxy.value() + 1))

Nice! This takes full advantage of the "unusual" sort order of SeqProxy, that all sequence numbers sort in front of all tickets.
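
For readers unfamiliar with SeqProxy, here is a self-contained sketch of why the quoted condition works (modeled loosely on the rippled class, heavily simplified): sequence-based proxies sort before all ticket-based ones, and a held transaction qualifies if it uses a ticket or follows the current sequence exactly.

```cpp
#include <cstdint>
#include <iostream>

// Simplified model of SeqProxy: a transaction is identified either by an
// account sequence number or by a ticket. In the real class, all
// sequence-based proxies sort in front of all ticket-based ones.
struct SeqProxy
{
    enum class Type : std::uint8_t { seq = 0, ticket = 1 };
    Type type;
    std::uint32_t val;

    bool isSeq() const { return type == Type::seq; }
    std::uint32_t value() const { return val; }
};

// Mirrors the quoted condition: pop the next held transaction if it uses
// a ticket, or if its sequence immediately follows the current one.
bool popNext(SeqProxy cur, SeqProxy next)
{
    return !next.isSeq() || next.value() == cur.value() + 1;
}

int main()
{
    SeqProxy const cur{SeqProxy::Type::seq, 7};
    std::cout << popNext(cur, {SeqProxy::Type::seq, 8})     // 1: next seq
              << popNext(cur, {SeqProxy::Type::seq, 9})     // 0: gap
              << popNext(cur, {SeqProxy::Type::ticket, 3})  // 1: ticket
              << '\n';
}
```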

-    auto const txNext = m_ledgerMaster.popAcctTransaction(txCur);
-    if (txNext)
+    auto txNext = m_ledgerMaster.popAcctTransaction(txCur);
+    while (txNext)

I was initially worried that this while loop might submit_held() a boatload of transactions. But the TxQ defaults maximumTxnPerAccount to 10. So the largest number of times this loop could run (ordinarily) would be 10. That seems reasonable, and a good way to clear out an account that has a lot of transactions queued.

This new characteristic may be worth pointing out to the performance folks, in case they want to stress it.
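
A self-contained sketch of the drain behavior under discussion (illustrative containers, not rippled's types): each successfully applied transaction now pulls in every consecutively-sequenced held transaction behind it, bounded in practice by TxQ's ten-per-account default.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>

int main()
{
    // Held sequence numbers for one account; at most ~10 under TxQ's
    // default maximumTxnPerAccount.
    std::deque<std::uint32_t> held{2, 3, 5};
    std::uint32_t current = 1;  // sequence just applied successfully

    // Pop and resubmit while the next held entry follows consecutively;
    // stops at the gap before 5, which must wait for sequence 4.
    while (!held.empty() && held.front() == current + 1)
    {
        current = held.front();
        held.pop_front();
        std::cout << "resubmitting seq " << current << '\n';  // 2, then 3
    }
}
```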

// VFALCO NOTE The hash for an open ledger is undefined so we use
// something that is a reasonable substitute.
CanonicalTXSet set(app_.openLedger().current()->info().parentHash);
std::swap(mHeldTransactions, set);

This is a great way to minimize processing while the lock is held. Good spotting!
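
The general pattern being praised, as a minimal standalone example (illustrative types, not the PR's exact code): swap the shared container into a local one while the lock is held, then do the expensive iteration after releasing it.

```cpp
#include <mutex>
#include <utility>
#include <vector>

std::mutex m;
std::vector<int> shared;  // stands in for mHeldTransactions

void processHeld()
{
    std::vector<int> local;
    {
        std::lock_guard lock(m);
        std::swap(shared, local);  // O(1) exchange while locked
    }
    // Lock released: iterate the local copy without blocking other
    // threads that want to add new held items to `shared`.
    for (int item : local)
    {
        (void)item;  // apply/process each held item here
    }
}
```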

-    bool bLocal,
-    FailHard failType)
+bool
+NetworkOPsImp::preProcessTransaction(std::shared_ptr<Transaction>& transaction)

Splitting the validation and canonicalization part of processTransaction into this other method was a good idea.

@@ -1276,6 +1300,17 @@ NetworkOPsImp::doTransactionSync(
transaction->setApplying();
}

    doTransactionSyncBatch(
        lock, [&transaction](std::unique_lock<std::mutex>& lock) {
            return transaction->getApplying();
        });

I looked into this call to Transaction::getApplying(). The bool being read is neither protected by a lock nor atomic. I think we need to change the Transaction::mApplying member variable into a std::atomic<bool>, since the bool is being accessed across threads.

This problem was present before your change. But please fix it while we're thinking about it.
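
A sketch of the suggested fix, assuming the member is Transaction's mApplying as named above (memory ordering left at the sequentially consistent default):

```cpp
#include <atomic>

class Transaction
{
public:
    // Both methods may be called from different threads; std::atomic
    // makes the otherwise-unsynchronized reads and writes well-defined.
    void setApplying() { mApplying = true; }
    bool getApplying() const { return mApplying; }

private:
    std::atomic<bool> mApplying{false};
};
```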

@@ -1224,12 +1232,28 @@ NetworkOPsImp::processTransaction(
         transaction->setStatus(INVALID);
         transaction->setResult(temBAD_SIGNATURE);
         app_.getHashRouter().setFlags(transaction->getID(), SF_BAD);
-        return;
+        return false;

Similar to above, these 5 lines are not hit by the unit tests. 🤷

Comment on lines 1211 to +1214

         JLOG(m_journal.warn()) << transaction->getID() << ": cached bad!\n";
         transaction->setStatus(INVALID);
         transaction->setResult(temBAD_SIGNATURE);
-        return;
+        return false;

Interesting. According to my local code coverage these four lines are never hit. I know there are a few places in the unit tests that produce corrupted signatures. They must be handled elsewhere. Just noticing, no need to address this in this pull request.

        mTransactions.swap(transactions);
    else
    {
        for (auto& t : transactions)

Does it make sense to reserve space in mTransactions? Consider

mTransactions.reserve(mTransactions.size() + transactions.size());

* upstream/develop:
  fix: Remove redundant STAmount conversion in test (4996)
  fix: resolve database deadlock: (4989)
  test: verify the rounding behavior of equal-asset AMM deposits (4982)
  test: Add tests to raise coverage of AMM (4971)
  chore: Improve codecov coverage reporting (4977)
  test: Unit test for AMM offer overflow (4986)
  fix amendment to add `PreviousTxnID`/`PreviousTxnLgrSequence` (4751)
* upstream/develop:
  Set version to 2.2.0-b3
* upstream/develop:
  Ignore more commits
  Address compiler warnings
  Add markers around source lists
  Fix source lists
  Rewrite includes
  Format formerly .hpp files
  Rename .hpp to .h
  Simplify protobuf generation
  Consolidate external libraries
  Remove packaging scripts
  Remove unused files
ximinez added a commit to ximinez/rippled that referenced this pull request Apr 26, 2024
    Process held transactions through existing NetworkOPs batching:

    * Ensures that successful transactions are broadcast to peers,
      appropriate failed transactions are held for later attempts, fee
      changes are sent to subscribers, etc.

    Pop all transactions with sequential sequences, or tickets

    Give a transaction more chances to be retried:

    * Hold if the transaction gets a ter, tel, or tef result.
    * Use the new SF_HELD flag to ultimately prevent the transaction from
      being held and retried too many times.

    Decrease `shouldRelay` limit to 30s:

    * Allows transactions, validator lists, proposals, and validations to be
      relayed more often, but only when triggered by another event, such as
      receiving it from a peer
    * Decrease from 5min.
    * Expected to help transaction throughput on poorly connected networks.
* upstream/develop:
  Set version to 2.2.0-rc1
* upstream/develop:
  Remove flow assert: (5009)
  Update list of maintainers: (4984)
* upstream/develop:
  Add external directory to Conan recipe's exports (5006)
  Add missing includes (5011)