checkAllLedgers in Auditor supports read throttle #2973
Conversation
rerun failure checks
averageEntrySize needs synchronization (or a design change to avoid it)
@@ -76,6 +84,11 @@
@Override
public void readEntryComplete(int rc, long ledgerId, long entryId,
        ByteBuf buffer, Object ctx) {
    if (readThrottle != null && buffer != null) {
        int readSize = buffer.readableBytes();
        averageEntrySize = (int) (averageEntrySize * AVERAGE_ENTRY_SIZE_RATIO
This is async code with concurrent reads, plus updates of a non-volatile field and later reads from that non-volatile field.
You'll need to add synchronization around use of this field.
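One way to address the reviewer's concern without a `synchronized` block is to hold the moving average in an `AtomicInteger` and update it with `updateAndGet`. This is a minimal sketch only; the class name `AverageEntrySizeTracker` and the smoothing factor `0.9` are assumptions for illustration (the PR's actual `AVERAGE_ENTRY_SIZE_RATIO` value is not shown here), and the final PR dropped the average-size approach entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical thread-safe version of the averageEntrySize update.
class AverageEntrySizeTracker {
    // Assumed values, mirroring INITIAL_AVERAGE_ENTRY_SIZE from the diff
    // and a guessed smoothing ratio.
    private static final int INITIAL_AVERAGE_ENTRY_SIZE = 1024;
    private static final double AVERAGE_ENTRY_SIZE_RATIO = 0.9;

    private final AtomicInteger averageEntrySize =
            new AtomicInteger(INITIAL_AVERAGE_ENTRY_SIZE);

    // Safe to call from concurrent read callbacks: the exponential
    // moving average is updated atomically via compare-and-set.
    void record(int readSize) {
        averageEntrySize.updateAndGet(avg ->
                (int) (avg * AVERAGE_ENTRY_SIZE_RATIO
                        + readSize * (1 - AVERAGE_ENTRY_SIZE_RATIO)));
    }

    int get() {
        return averageEntrySize.get();
    }
}
```

The atomic update avoids the lost-update race the reviewer points out, though as discussed below the design was ultimately simplified to avoid tracking sizes at all.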
OK I will fix
@@ -51,6 +51,14 @@
public final BookieClient bookieClient;
public final BookieWatcher bookieWatcher;

private static int averageEntrySize;

private static final int INITIAL_AVERAGE_ENTRY_SIZE = 1024;
Consider simplifying by limiting the number of entries in flight (using a Semaphore, released when each entry is processed) instead of guessing average sizes and rate limiting by bytes. Or rate limit by number of entries.
This also removes the need for synchronization.
One ledger may have 512K entries, the next 1K entries, and a third mixed sizes.
I don't see how tracking the average size significantly helps in this case, especially if backpressure is not enabled.
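The reviewer's suggestion can be sketched as follows. This is not the PR's actual implementation, just a minimal illustration of the technique: a `Semaphore` bounds the number of in-flight entry reads, a permit is acquired before each async read and released in the completion callback, so no shared mutable field needs synchronization. The class and method names are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: throttle auditor reads by number of entries
// in flight rather than by estimated bytes.
class EntryReadThrottle {
    private final Semaphore inFlight;

    EntryReadThrottle(int maxInFlightEntries) {
        this.inFlight = new Semaphore(maxInFlightEntries);
    }

    // Block until a permit is available before issuing the next
    // asynchronous entry read.
    void beforeRead() {
        inFlight.acquireUninterruptibly();
    }

    // Called from the read callback (success or failure alike),
    // so permits are never leaked.
    void afterReadComplete() {
        inFlight.release();
    }

    int availablePermits() {
        return inFlight.availablePermits();
    }
}
```

Because the semaphore itself is thread-safe, concurrent completion callbacks can release permits without any extra locking, which is the "removes the need for synchronization" point above.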
OK I will fix
I will rate limit by number of entries
I made the changes in accordance with your suggestions.
Please review again, thank you! @dlg99
LGTM
@michaeljmarshall PTAL, thanks!
@eolivelli @nicoloboschi PTAL, thanks!
But I am still not sure how it addresses the root cause. I understand that throttling with a semaphore reduces the pressure on the bookie, and there is another mechanism for that too: percentageOfLedgerFragmentToBeVerified (a slight misnomer; it actually checks a percentage of the entries in each ledger fragment). I understand throttling will reduce timeouts from the bookie, but timeouts can still happen, and will happen. My question is: why not address the issue that an occasional timeout should not be considered a failure, or maybe should be retried? Thoughts? Looks good otherwise.
percentageOfLedgerFragmentToBeVerified is not very useful. The parameter applies per fragment, but in practice most fragments have only one entry, and at least the first and last entries of a fragment are always checked.
I agree that the timeout problem should be solved through retries, but at the same time I think the rate should also be limited, to prevent checking all ledgers from putting too much pressure on the cluster! @pkumar-singh
ping
Sure. It may not be sufficient, but it's accurate regardless. So OK from my end.
ping
* support read throttle in checkAllLedgers
* using number of entries in flight (using Semaphore, releasing when processed) instead of guessing avg sizes and rate limiting
* check style

(cherry picked from commit 525a4a0)
(cherry picked from commit ebd5e4c)
Motivation
When checkAllLedgers runs on its periodic schedule, it tries to read almost all entry data. This can cause reads from the bookies to time out, and the affected ledgers are then incorrectly marked via markLedgerUnderreplicatedAsync.
Every time the check-all-ledgers run executes, a large number of ledgers are marked as under-replicated; this is clearly a wrong judgment.
In our cluster, the execution cycle of checkAllLedgers is one week. We found that a large number of ledgers were marked via markLedgerUnderreplicatedAsync each time it ran. Analyzing the logs showed some bookie read timeouts:
Due to the large number of read requests, the pressure on the cluster was too high, and Pulsar's write latency kept climbing until the check completed: