Improve transaction relay logic #4985
base: develop
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##           develop    #4985    +/-   ##
=========================================
  Coverage     70.9%    70.9%
=========================================
  Files          796      796
  Lines        66792    66851    +59
  Branches     11002    11003     +1
=========================================
+ Hits         47379    47429    +50
- Misses       19413    19422     +9
Very nice! I really appreciate the way the commits were divided up. It made the code review much easier.
I mostly left complimentary comments. But there's one `bool` that I suspect needs to be changed to a `std::atomic<bool>`. See what you think...
(!itrNext->second->getSeqProxy().isSeq() ||
 itrNext->second->getSeqProxy().value() == seqProxy.value() + 1))
Nice! This takes full advantage of the "unusual" sort order of `SeqProxy`, that all sequence numbers sort in front of all tickets.
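The ordering described above can be sketched as follows. This is a simplified illustration, not the rippled implementation: `SeqProxySketch`, `HeldSet`, `AccountId`, and `popNextForAccount` are hypothetical names, and transactions are reduced to integer ids. It only shows why "sequences sort before tickets" lets a single `upper_bound` lookup find either the next sequence or the first ticket.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

// Simplified stand-in for SeqProxy: a value tagged as a sequence or a ticket.
struct SeqProxySketch
{
    enum class Type : std::uint8_t { seq = 0, ticket = 1 };
    Type type;
    std::uint32_t value;

    bool isSeq() const { return type == Type::seq; }

    // Every sequence sorts in front of every ticket; ties break on value.
    friend bool operator<(SeqProxySketch const& a, SeqProxySketch const& b)
    {
        if (a.type != b.type)
            return a.type < b.type;
        return a.value < b.value;
    }
};

using AccountId = int;  // placeholder for a real account identifier
using HeldKey = std::pair<AccountId, SeqProxySketch>;
using HeldSet = std::map<HeldKey, int /* transaction id */>;

// Pop the next held transaction for `account` that may follow `cur`:
// either the immediately following sequence number, or any ticket.
std::optional<int>
popNextForAccount(HeldSet& held, AccountId account, SeqProxySketch cur)
{
    auto it = held.upper_bound(HeldKey{account, cur});
    if (it == held.end() || it->first.first != account)
        return std::nullopt;

    SeqProxySketch const& next = it->first.second;
    if (next.isSeq() && next.value != cur.value + 1)
        return std::nullopt;  // gap in the sequence numbers: stop here

    int const tx = it->second;
    held.erase(it);
    return tx;
}
```

Because tickets sort last, the lookup only reaches a ticket once no higher sequence number remains for the account.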
- auto const txNext = m_ledgerMaster.popAcctTransaction(txCur);
- if (txNext)
+ auto txNext = m_ledgerMaster.popAcctTransaction(txCur);
+ while (txNext)
I was initially worried that this `while` loop might `submit_held()` a boatload of transactions. But the `TxQ` defaults `maximumTxnPerAccount` to 10. So the largest number of times this loop could run (ordinarily) would be 10. That seems reasonable, and a good way to clear out an account that has a lot of transactions queued.
This new characteristic may be worth pointing out to the performance folks, in case they want to stress it.
// VFALCO NOTE The hash for an open ledger is undefined so we use
// something that is a reasonable substitute.
CanonicalTXSet set(app_.openLedger().current()->info().parentHash);
std::swap(mHeldTransactions, set);
This is a great way to minimize processing while the lock is held. Good spotting!
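For readers unfamiliar with the pattern being praised, here is a minimal sketch of swapping a container out under a lock and processing it afterwards. `HeldQueueSketch` and its members are hypothetical names, and integers stand in for transactions; the real code swaps a `CanonicalTXSet`.

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

class HeldQueueSketch
{
public:
    void add(int tx)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        held_.push_back(tx);
    }

    // Move everything out while holding the lock, then process without it.
    // Returns the number of items processed.
    template <class Fn>
    std::size_t drain(Fn&& process)
    {
        std::vector<int> local;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(held_, local);  // cheap: only internal pointers move
        }
        // The lock is released here, so `process` may safely call add().
        for (int tx : local)
            process(tx);
        return local.size();
    }

private:
    std::mutex mutex_;
    std::vector<int> held_;
};
```

The critical section is reduced to a constant-time swap, and the callback can re-enter `add()` without deadlocking on the non-recursive mutex.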
bool bLocal,
FailHard failType)
bool
NetworkOPsImp::preProcessTransaction(std::shared_ptr<Transaction>& transaction)
Splitting the validation and canonicalization part of `processTransaction` into this other method was a good idea.
@@ -1276,6 +1300,17 @@ NetworkOPsImp::doTransactionSync(
        transaction->setApplying();
    }

    doTransactionSyncBatch(
        lock, [&transaction](std::unique_lock<std::mutex>& lock) {
            return transaction->getApplying();
I looked into this call to `Transaction::getApplying()`. The `bool` being read is neither protected by a lock nor `atomic`. I think we need to change the `Transaction::mApplying` member variable into a `std::atomic<bool>`, since the `bool` is being accessed across threads.
This problem was present before your change. But please fix it while we're thinking about it.
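A minimal sketch of the suggested change, assuming the flag is only ever set, cleared, and read. `TransactionSketch` and `applying_` are hypothetical stand-ins for `Transaction` and `mApplying`; the accessor names mirror the ones mentioned in the comment, but the bodies are assumptions.

```cpp
#include <atomic>

class TransactionSketch
{
public:
    // Safe to call from any thread without holding an external lock.
    bool getApplying() const
    {
        return applying_.load();
    }

    void setApplying()
    {
        applying_.store(true);
    }

    void clearApplying()
    {
        applying_.store(false);
    }

private:
    // A plain bool read and written from different threads is a data race;
    // std::atomic<bool> makes each individual access well defined.
    std::atomic<bool> applying_{false};
};
```

Note that atomicity only covers each single load or store. A check-then-set sequence across two calls would still need a lock or `compare_exchange_strong`.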
@@ -1224,12 +1232,28 @@ NetworkOPsImp::processTransaction(
         transaction->setStatus(INVALID);
         transaction->setResult(temBAD_SIGNATURE);
         app_.getHashRouter().setFlags(transaction->getID(), SF_BAD);
-        return;
+        return false;
Similar to above, these 5 lines are not hit by the unit tests. 🤷
         JLOG(m_journal.warn()) << transaction->getID() << ": cached bad!\n";
         transaction->setStatus(INVALID);
         transaction->setResult(temBAD_SIGNATURE);
-        return;
+        return false;
Interesting. According to my local code coverage these four lines are never hit. I know there are a few places in the unit tests that produce corrupted signatures. They must be handled elsewhere. Just noticing, no need to address this in this pull request.
    mTransactions.swap(transactions);
else
{
    for (auto& t : transactions)
Does it make sense to reserve space in `mTransactions`? Consider
mTransactions.reserve(mTransactions.size() + transactions.size());
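The suggestion can be sketched in isolation like this. `appendBatch` is a hypothetical helper, not a function from the PR; it mirrors the swap-or-append shape of the snippet under review, with the `reserve` call added.

```cpp
#include <utility>
#include <vector>

// Append everything in `src` to `dest`, leaving `src` empty.
template <class T>
void appendBatch(std::vector<T>& dest, std::vector<T>&& src)
{
    if (dest.empty())
    {
        // Nothing to preserve: take over the source buffer wholesale.
        dest.swap(src);
        return;
    }

    // One allocation up front instead of several during the loop.
    dest.reserve(dest.size() + src.size());
    for (auto& t : src)
        dest.push_back(std::move(t));
    src.clear();
}
```

One trade-off to weigh: reserving exactly `size() + n` on every call can bypass the vector's geometric growth if many small batches arrive in a row, so the benefit is largest when batches are big or infrequent.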
* upstream/develop:
  * fix: Remove redundant STAmount conversion in test (4996)
  * fix: resolve database deadlock: (4989)
  * test: verify the rounding behavior of equal-asset AMM deposits (4982)
  * test: Add tests to raise coverage of AMM (4971)
  * chore: Improve codecov coverage reporting (4977)
  * test: Unit test for AMM offer overflow (4986)
  * fix amendment to add `PreviousTxnID`/`PreviousTxnLgrSequence` (4751)
* upstream/develop:
  * Set version to 2.2.0-b3
* upstream/develop:
  * Ignore more commits
  * Address compiler warnings
  * Add markers around source lists
  * Fix source lists
  * Rewrite includes
  * Format formerly .hpp files
  * Rename .hpp to .h
  * Simplify protobuf generation
  * Consolidate external libraries
  * Remove packaging scripts
  * Remove unused files
Process held transactions through existing NetworkOPs batching:
* Ensures that successful transactions are broadcast to peers, appropriate failed transactions are held for later attempts, fee changes are sent to subscribers, etc.
Pop all transactions with sequential sequences, or tickets
Give a transaction more chances to be retried:
* Hold if the transaction gets a ter, tel, or tef result.
* Use the new SF_HELD flag to ultimately prevent the transaction from being held and retried too many times.
Decrease `shouldRelay` limit to 30s:
* Allows transactions, validator lists, proposals, and validations to be relayed more often, but only when triggered by another event, such as receiving it from a peer
* Decrease from 5min.
* Expected to help transaction throughput on poorly connected networks.
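As a rough illustration of the "hold on ter, tel, or tef, but bound the retries with a flag" idea, here is a sketch under loud assumptions: `ResultClass`, `kHeldFlag`, and `HoldTrackerSketch` are hypothetical, the flag value is arbitrary, and the "hold at most once per transaction" policy is an assumption for the example rather than the PR's exact rule.

```cpp
#include <cstdint>
#include <unordered_map>

// Coarse result classes, named after the transaction result code prefixes.
enum class ResultClass { tes, tec, ter, tel, tef, tem };

// Hypothetical stand-in for the SF_HELD flag; the real value may differ.
constexpr std::uint32_t kHeldFlag = 0x01;

class HoldTrackerSketch
{
public:
    // Returns true if the transaction should be placed in the held set.
    bool shouldHold(std::uint64_t txId, ResultClass result)
    {
        bool const retryable = result == ResultClass::ter ||
            result == ResultClass::tel || result == ResultClass::tef;
        if (!retryable)
            return false;

        std::uint32_t& flags = flags_[txId];
        if (flags & kHeldFlag)
            return false;  // assumption: already held once, do not hold again

        flags |= kHeldFlag;
        return true;
    }

private:
    std::unordered_map<std::uint64_t, std::uint32_t> flags_;
};
```

The per-transaction flag is what keeps a transaction that keeps failing from cycling through the held set indefinitely.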
* upstream/develop:
  * Set version to 2.2.0-rc1
* upstream/develop:
  * Remove flow assert: (5009)
  * Update list of maintainers: (4984)
* upstream/develop:
  * Add external directory to Conan recipe's exports (5006)
  * Add missing includes (5011)
High Level Overview of Change
This PR, if merged, will improve transaction relay logic around a few edge cases.
(I'll write a single commit message later, but before this PR is squashed and merged.)
Context of Change
A few months ago, while examining some of the issues around the 2.0.0 release, and auditing transaction relay code, I identified a few areas with potential for improvement.
Type of Change
Before / After
This PR is divided into four mostly independent changes.
* "Decrease `shouldRelay` limit to 30s." Pretty self-explanatory. Currently, the limit is 5 minutes, by which point the `HashRouter` entry could have expired, making this transaction look brand new (and thus causing it to be relayed back to peers which have sent it to us recently).
* [...] `LedgerMaster`'s held transactions if the transaction gets a `ter`, `tel`, or `tef` result. Old behavior was just `ter`.
* [...] `NetworkOPs::apply` which, among other things, broadcasts successful transactions to peers. This means that the transaction may not get broadcast to peers for a really long time (5 minutes in the current implementation, or 30 seconds with this first commit). If the node is a bottleneck (either due to network configuration, or because the transaction was submitted locally), the transaction may not be seen by any other nodes or validators before it expires or causes other problems.
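The relay limit in the first item can be sketched as a per-item timestamp check. This is an illustration only: `RelayTrackerSketch` is a hypothetical class, items are keyed by an integer instead of a hash, and expiry of old entries (which the real `HashRouter` handles) is left out.

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

class RelayTrackerSketch
{
public:
    using Clock = std::chrono::steady_clock;

    explicit RelayTrackerSketch(
        std::chrono::seconds limit = std::chrono::seconds{30})
        : limit_(limit)
    {
    }

    // Returns true if the item has not been relayed within the limit,
    // and records `now` as its latest relay time in that case.
    bool shouldRelay(std::uint64_t key, Clock::time_point now)
    {
        auto it = lastRelayed_.find(key);
        if (it == lastRelayed_.end())
        {
            lastRelayed_.emplace(key, now);
            return true;  // first time this item is seen
        }
        if (now - it->second < limit_)
            return false;  // relayed too recently
        it->second = now;
        return true;
    }

private:
    std::chrono::seconds limit_;
    std::unordered_map<std::uint64_t, Clock::time_point> lastRelayed_;
};
```

Passing `now` in as a parameter keeps the check deterministic. With a 30 second limit, a second arrival 10 seconds after the first is suppressed, while one arriving 30 seconds or more later is relayed again.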