Bookies should prioritize recovery reads/writes #898
Conversation
@@ -90,6 +90,7 @@ message ReadRequest {
    enum Flag {
        FENCE_LEDGER = 1;
        ENTRY_PIGGYBACK = 2;
        RECOVERY_READ = 3;
@merlimat @sijie
There's a potentially thorny BC issue here. On the yahoo branch, RECOVERY_READ is 2: https://github.com/yahoo/bookkeeper/blob/165971209f933e0bcefaac997db8e75c8776f942/bookkeeper-server/src/main/proto/BookkeeperProtocol.proto#L83
@merlimat are any of the yahoo clients using protobufs? Or are they all on V2? If so, it may not be a problem.
I think the only clients using the v3 protocol are the auto-recovery daemons, which are using the default settings. What would be the worst case if they do a read request with the ENTRY_PIGGYBACK flag?
The RECOVERY_READ flag is optional anyway, in the sense that it is currently just used to signal urgency to the bookie.
a read request with ENTRY_PIGGYBACK is a normal read request. So it only means you don't assign priority to the recovery read, which I don't think is a big deal.
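To spell out the wire-level concern being discussed (values taken from the hunk above and the linked yahoo proto; this is an illustration, not part of the patch):

    // apache master after this patch:        yahoo branch (per the link above):
    //   FENCE_LEDGER    = 1;                   RECOVERY_READ = 2;
    //   ENTRY_PIGGYBACK = 2;
    //   RECOVERY_READ   = 3;
    //
    // A yahoo v3 client setting RECOVERY_READ (wire value 2) would be decoded by an
    // apache bookie as ENTRY_PIGGYBACK, i.e. treated as a normal read, as noted above.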
I think this definitely needs to be BC tested. The question is whether we block this change to do it, or just make it a blocker on the 4.7.0 release.
my feeling is we need to do this as early as possible. I am not sure if there are other similar changes coming up or not. I understand this is a port from the yahoo branch, so it probably doesn't make sense to block this change on being BC tested. I am fine with creating a blocker issue for adding BC tests for this change.
SGTM. Created issue #903
This is an important protocol change; how can we do BC tests?
I think we need some BC tests here:
- we need to make sure the yahoo client can talk to the new bookies
- the new bookie client can talk to yahoo bookies
@@ -265,7 +265,7 @@ ByteBuf getData() {
    }

    boolean isRecoveryAdd() {
the name should be changed here
done
@@ -152,6 +159,14 @@ public BookieRequestProcessor(ServerConfiguration serverCfg, Bookie bookie,
            createExecutor(
                this.serverCfg.getNumLongPollWorkerThreads(),
                "BookieLongPollThread-" + serverCfg.getBookiePort(), OrderedScheduler.NO_TASK_LIMIT);
        this.fenceThreadPool = Executors.newCachedThreadPool(
#283 is about submitting a runnable to the read processor thread for checking the fencing? I don't think that applies here. This fenceThreadPool is for V3. The problem was that writeThreadPool and readThreadPool can reject executions if they hit a limit. This was causing failures in one of the tests for fencing (testRecoveryNotRejected). I would think that fencing should be outside of these execution limits, as it is part of recovery.
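A standard-library-only sketch of that behaviour (toy code, not BookKeeper's actual executor classes): a task-limited pool starts rejecting submissions once its queue fills, so work that must always make progress, such as fencing, has to go to a pool outside that limit.

    import java.util.concurrent.*;

    public class RejectingPoolSketch {
        public static void main(String[] args) throws Exception {
            // A read/write pool with a task limit: once the queue is full,
            // submit() throws RejectedExecutionException (fail fast).
            ExecutorService readPool = new ThreadPoolExecutor(
                    1, 1, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(2),
                    new ThreadPoolExecutor.AbortPolicy());
            // A separate pool for fencing/recovery work, outside the read limit.
            ExecutorService fencePool = Executors.newCachedThreadPool();

            Runnable slowRead = () -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            };
            for (int i = 0; i < 10; i++) {
                try {
                    readPool.submit(slowRead);
                } catch (RejectedExecutionException e) {
                    System.out.println("read " + i + " rejected (fail fast)");
                }
            }
            // The fencing read is not subject to the read pool's limit.
            fencePool.submit(() -> System.out.println("fencing read still runs"));

            readPool.shutdown();
            fencePool.shutdown();
        }
    }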
ah sorry, I read it as being for the v2 processing. I thought you had moved the fence read to a separate thread.
then I have a few questions here /cc @merlimat
- this change moves the fence read to the netty I/O thread, where it would block subsequent requests. If the read threads are accumulating requests, that typically means I/O is slow. A fence read can then be very slow due to overwhelmed I/O, and it will block all network I/O. I am not sure how much we gain from moving the fence read to the netty I/O thread. It goes against the purpose of having thread pools, which is to avoid blocking network I/O. @merlimat do any metrics or experience show this change improved things a lot?
- if this change is to prioritize recovery-related ops, isn't it better to give them their own recovery read thread pool, rather than blocking network I/O?
- it is even worse for V2: the fence read waits for both the read I/O and the journal I/O (v3 makes the fence request a non-blocking op when waiting for the fence result). If the bookie is overwhelmed (e.g. hitting its bandwidth limit), the journal can become a bottleneck and hence block the netty I/O thread. (I assume this would potentially block sending responses back.)
All the questions are related to the goal of not blocking the network thread. If I read this change correctly, it works against that goal, which can potentially have a very bad impact.
then shall v2 fencing be executed in this fencing thread or not? Because v2 fencing is actually a blocking call, while v3 fencing is not. If you run v2 fencing in the netty I/O thread, isn't it blocking other I/O operations?
@sijie This change is a companion to #706. The story here is that in BK 4.6, if the read rate is very high and maxes out the bookie IO capacity, read requests start to pile up in the read executor task queue.
The read thread keeps processing the read requests, but since they spent a lot of time in the queue, the client will have already timed them out and re-sent other read requests.
In #706 we introduced a max limit on the size of the executor queue, in order to limit the amount of time spent in the queue, rejecting additional requests so as to fail fast.
The rationale is to avoid doing read ops that are going to be timed out and that will not make any progress.
Another problem with the read operations is that when the timeouts/rejections happen, they also impact the fencing reads which are used in recovery.
This change is to make sure the recovery reads can make progress even when there is a very high rate of regular read requests. As Ivan pointed out, the recovery reads are critical because, until they complete, they typically prevent a writer from being operative, while regular reads are "just" reading data and can thus be deferred to a lower priority.
The original change did the fence read directly in the IO thread as the way to skip the executor thread queue. I guess that using a dedicated thread pool for fence reads is probably a much better option, in order not to block the IO threads.
+1 on dedicated recovery thread pool. I'm currently working on BC stuff first though, but will come back once that's done.
@merlimat +1 on a separate recovery thread pool.
Separate thread pool does sound like a good idea, but I would really restrict recovery reads to real fencing reads only, not under-replication reads.
// If it's a recovery read, we want to make sure it gets submitted right away,
// bypassing the eventual executor queue,
// to reduce the chances of ledger recovery exceptions.
boolean isRecoveryRead = r.getReadRequest().hasFlag()
Can we make this into some util function and have a test case cover it?
Will do.
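A sketch of what such a helper could look like (the method name and its placement are hypothetical; the hasFlag()/getFlag() accessors are the standard protobuf-generated ones for an optional enum field):

    // Hypothetical utility method, extracted so it can be unit tested on its own.
    static boolean isRecoveryRead(BookkeeperProtocol.Request request) {
        return request.hasReadRequest()
                && request.getReadRequest().hasFlag()
                && request.getReadRequest().getFlag()
                        == BookkeeperProtocol.ReadRequest.Flag.RECOVERY_READ;
    }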
 * @param ctx
 *          control object
 */
public void asyncReadLastEntry(ReadCallback cb, Object ctx) {
Can you remove this from this pull request and create a separate one for it?
- we need to make this part of the new API as well
- we need a test case to cover it
It's unused it seems. Removing completely for now.
// Only one flag can be set on the read requests
if (((short) flags & BookieProtocol.FLAG_DO_FENCING) == BookieProtocol.FLAG_DO_FENCING) {
    readBuilder.setFlag(ReadRequest.Flag.FENCE_LEDGER);
    checkArgument(masterKey != null);
Don't do this; fail the callback with IncorrectParameterException.
+1
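A sketch of the suggested shape (assuming a ReadEntryCallback along the lines of readEntryComplete(rc, ledgerId, entryId, buffer, ctx); the error-handling details are illustrative, not the final code):

    // Instead of checkArgument(masterKey != null), fail the callback and return.
    if ((((short) flags & BookieProtocol.FLAG_DO_FENCING) == BookieProtocol.FLAG_DO_FENCING)
            && masterKey == null) {
        cb.readEntryComplete(BKException.Code.IncorrectParameterException,
                             ledgerId, entryId, null, ctx);
        return;
    }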
@@ -122,4 +131,60 @@ public void addComplete(int rc, LedgerHandle lh, long entryId, Object ctx) {

        assertTrue(receivedTooManyRequestsException.get());
    }

    @Test(timeout = 60000)
Nit: Remember to drop these timeouts
@@ -90,6 +90,7 @@ message ReadRequest {
    enum Flag {
        FENCE_LEDGER = 1;
        ENTRY_PIGGYBACK = 2;
        RECOVERY_READ = 3;
In the case of pure BookKeeper this feature may be very useful, as the ledger is the boundary of the writer. Is this the use case for Pulsar? In general, for the Salesforce use case, regular reads are more important (or at least not less important) than the background recovery reads. That said, there won't be many recovery reads compared to regular reads, so maybe it is OK. This is just a first comment, and I haven't read through the patch fully.
    checkArgument(masterKey != null);
    readBuilder.setMasterKey(ByteString.copyFrom(masterKey));
} else if (((short) flags & BookieProtocol.FLAG_RECOVERY) == BookieProtocol.FLAG_RECOVERY) {
    readBuilder.setFlag(ReadRequest.Flag.RECOVERY_READ);
Hmm, so the ledger replicator, which could be recovering under-replicated ledgers, gets higher priority than the actual client reads? That doesn't sound right to me.
@@ -320,7 +320,7 @@ public void readComplete(int rc, LedgerHandle lh,
    for (BookieSocketAddress newBookie : newBookies) {
        bkc.getBookieClient().addEntry(newBookie, lh.getId(),
                lh.getLedgerKey(), entryId, toSend.retainedSlice(),
                multiWriteCallback, dataLength, BookieProtocol.FLAG_RECOVERY_ADD);
                multiWriteCallback, dataLength, BookieProtocol.FLAG_RECOVERY);
Are we giving high priority to replicator reads too? It should not...
This is an addEntry. The recovery flag has a different meaning for adds: it means that even if the ledger has been fenced, the entry is still stored (as it's not a new entry, and the client will not see an entry acknowledged as successful after the point of the fencing event).
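A toy restatement of that rule (not the actual Bookie code):

    // After fencing, normal adds are refused, but a recovery add is still stored,
    // since it re-writes an entry that was acknowledged before the fencing event.
    static boolean shouldStoreAdd(boolean ledgerFenced, boolean recoveryAdd) {
        return !ledgerFenced || recoveryAdd;
    }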
oh, seems we're shortcutting those too. hmm.
I think I need to create a new explicit PRIORITY flag for these operations, rather than using the semantics from RECOVERY, as RECOVERY already means something for adds.
Can one of the admins verify this patch?
Pushing to see how it does on CI; it was a pretty big merge from master after 2 months dormant. Still need to change the flags and fix up the executor stuff.
Force-pushed from 5cad488 to 91ceda9.
Rebased because it was getting messy to see what had changed after merging master. This version is the same as what was pushed 2 months ago except:
@sijie @merlimat @jiazhai @jvrao @eolivelli This is ready for review again
Overall it is good, but I think there was some problem with the merge/rebase; see the comment.
 * @return
 * @throws ParseJsonException
 */
public String asJson() throws ParseJsonException {
This change seems unrelated
try {
    configAsString = conf.asJson();
    LOG.info(configAsString);
} catch (ParseJsonException pe) {
Yes you are reverting a change from @charanreddy
Ah crap, must have been in flight as I was turning the merge into a rebase. Will fix.
updated
The change adds a new flag for reads, RECOVERY_READ. This flag is set when a client is retrying to recover a ledger. Recovery reads should be higher priority than normal reads, as recovery usually happens when a writer has failed and another node needs to take over the stream that the ledger belonged to. All writes on that stream are blocked until the ledger has been recovered, so giving these reads priority means the stream is out of action for the shortest amount of time possible. Priority is given to these reads by executing them directly in the BookieRequestProcessor, rather than adding them to the readThreadPool where they could end up queuing. This change was originally 372a99d in the yahoo branch.
Force-pushed from 91ceda9 to 385cf9e.
    }
    optional Flag flag = 100;
    repeated Flag flag = 100;
hmm, I think this changes too much for a portback. I know a repeated field is compatible with an optional one. However, I am not sure how it would impact other existing code logic. I don't feel comfortable with this change, especially since it is coupled with a portback change.
This definitely requires more thought and testing. I would suggest not changing it to repeated at this moment. Since v2 and v3 are different protocols, we don't necessarily need to follow what v2 is doing for v3. If it is a 'priority' field, we should just separate it from a flag: keep the flag as it is and add a field called priority to the request header. That way you don't need to change the Flag field from optional to repeated, and the priority field can later be used to further optimize the io scheduler if we want to.
There is no BC concern in the v3 protocol, since pulsar is using the v2 protocol only. The BC concern between the apache version and the yahoo version only remains at the v2 protocol.
If this approach doesn't make sense to you, I would suggest decoupling the v3 protocol change from this PR and getting the v2 change in first to complete this merge, since that is what pulsar requires. Defer changing the v3 protocol to a different PR. I would prefer not to take any risks on this, because apart from pulsar, most people are using the v3 protocol, and this feature is used by pulsar in the v2 protocol only.
To clarify what I proposed above:
- for the v2 protocol, add the flag FLAG_HIGH_PRIORITY for BC purposes with the yahoo branch.
- for the v3 protocol, don't change/add any flag. Add an integer field priority to the packet header, used to indicate the priority of requests (sketched just below this list). Set the default value of priority to 0, which means "no priority". If a positive number or a special number like 100 is set, it means high priority. In future, we can use this priority field to achieve optimized io scheduling. For example, if we develop a metadata store, the requests from the ledgers used by the metadata store can have higher priorities than the requests of the user ledgers.
- in future, if we implement priority based io scheduling, we can bring the priority feature back to the v2 protocol, or, if we want to make v2 have the same features as the v3 protocol, we can bring in the priority field and deprecate the high_priority flag in the v2 protocol.
In this way, we can get the merge done without violating BC, and we also have better extensibility for future improvements.
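A sketch of what the proposed v3 priority field could look like (field number and exact placement are illustrative, not the final change):

    message BKPacketHeader {
        // ... existing header fields ...
        optional uint32 priority = 100 [default = 0];  // 0 = no priority; larger = more urgent
    }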
> hmm, I think this changes too much for a portback. I know a repeated field is compatible with an optional one. However, I am not sure how it would impact other existing code logic. I don't feel comfortable with this change, especially since it is coupled with a portback change.

The code changes are forced by the protobuf change, because the API for accessing a repeated field is completely different from the one for accessing an optional field.
I agree with what you're saying though. I was actually expecting this to fail BC in CI, as enums in protobuf don't take kindly to new additions, so I was surprised when it passed. My contingency was to add something like you suggest, though a boolean; an int is a good idea though. I'll leave everything else the same with regards to BookieClient.
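For reference, the generated-API difference being referred to (standard protobuf-java codegen, shown only to illustrate why the change ripples through the code):

    // With `optional Flag flag = 100;` the generated accessors are:
    //     req.hasFlag();  req.getFlag();  builder.setFlag(Flag.RECOVERY_READ);
    // With `repeated Flag flag = 100;` they become:
    //     req.getFlagCount();  req.getFlagList();  req.getFlag(index);  builder.addFlag(Flag.RECOVERY_READ);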
@@ -276,10 +291,12 @@ public void processRequest(Object msg, Channel c) {
        // process packet
        switch (r.getOpCode()) {
        case BookieProtocol.ADDENTRY:
            processAddRequest(r, c);
            assert(r instanceof BookieProtocol.ParsedAddRequest);
please don't use assert; its behavior is unpredictable when assertions are disabled. Either fail it explicitly or just ignore it.
Will change.
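A sketch of an explicit check in place of the assert (the exact error handling here is illustrative, not the final code):

    case BookieProtocol.ADDENTRY:
        if (!(r instanceof BookieProtocol.ParsedAddRequest)) {
            // assert is a no-op unless the JVM runs with -ea, so fail loudly instead
            LOG.error("ADDENTRY with unexpected request type {}", r.getClass());
            break;
        }
        processAddRequest((BookieProtocol.ParsedAddRequest) r, c);
        break;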
                statsLogger);
                "BookieLongPollThread-" + serverCfg.getBookiePort(),
                OrderedScheduler.NO_TASK_LIMIT, statsLogger);
        this.highPriorityThreadPool = Executors.newCachedThreadPool(
why use a CachedThreadPool here?
no idea, I just reused what was there for fenceThreadPool. Probably something about newCachedThreadPool being harder to fail because of its unboundedness, but that introduces problems of its own. Will change to something bounded.
On a side note, with these thread pools the architecture is starting to look more like something which could benefit from an actor architecture using something like the LMAX disruptor. Way in the future though, if at all.
if (null == writeThreadPool) {
    write.run();
} else if (r.isHighPriority()) {
    highPriorityThreadPool.submit(write);
I think we should have the same handling logic as normal writes. If you make highPriorityThreadPool the same type of scheduler as writeThreadPool and readThreadPool, you can just choose the scheduler based on priority and reuse the logic between lines 481-487.
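A self-contained sketch of that shape (class and method names are illustrative, not the actual BookieRequestProcessor code): pick the executor by priority, then share one submit path and its overload handling.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.RejectedExecutionException;

    class PrioritySubmitSketch {
        private final ExecutorService normalPool;
        private final ExecutorService highPriorityPool;

        PrioritySubmitSketch(ExecutorService normalPool, ExecutorService highPriorityPool) {
            this.normalPool = normalPool;
            this.highPriorityPool = highPriorityPool;
        }

        void submit(Runnable task, boolean highPriority, Runnable onOverloaded) {
            ExecutorService executor = highPriority ? highPriorityPool : normalPool;
            if (executor == null) {
                task.run();               // no pool configured: run inline
                return;
            }
            try {
                executor.submit(task);    // same path for normal and high-priority work
            } catch (RejectedExecutionException e) {
                onOverloaded.run();       // e.g. respond with "too many requests"
            }
        }
    }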
if (null == readThreadPool) {
    read.run();
} else if (r.isHighPriority() || r.isFencing()) {
    highPriorityThreadPool.submit(read);
same comments as write scheduler
Left some minor comments.
Should we care about Yahoo migration? This change will surely not be compatible with the actual bookies in the Yahoo env.
I think that, anyway, the most important point is that we do not break things; at the very least, bookies won't treat fence reads from clients as high priority. And that is achieved by the current patch. Let's see IT.
Yahoo uses the V2 protocol, so the change should be compatible (the whole set of BC test changes is there to ensure this; there's a test with the yahoo custom version now).
+1 great
overall looks good now. I have only two comments left, otherwise it is ready to go.
@@ -101,6 +101,7 @@ message ReadRequest {
    optional int64 previousLAC = 4;
    // Used as a timeout (in milliseconds) for the long polling request
    optional int64 timeOut = 5;
    optional uint32 priority = 6 [default = 0];
Can you make priority part of the PacketHeader, so it would be available to all types of requests? As far as I can see here, it could potentially be reused by GetBookieInfoRequest, WriteLacRequest and ReadLacRequest.
sure
@@ -127,6 +127,7 @@
    protected static final String MAX_PENDING_READ_REQUESTS_PER_THREAD = "maxPendingReadRequestsPerThread";
    protected static final String MAX_PENDING_ADD_REQUESTS_PER_THREAD = "maxPendingAddRequestsPerThread";
    protected static final String NUM_LONG_POLL_WORKER_THREADS = "numLongPollWorkerThreads";
    protected static final String NUM_HIGH_PRIORITY_WORKER_THREADS = "numHighPriorityWorkerThreads";
add this to bookkeeper-server/conf/bk_server.conf and site/_data/config/bk_server.yaml
will add it
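For example, the bk_server.conf entry could look like the following (the key name comes from the diff above; the value and comment are illustrative):

    # Number of threads handling high-priority requests, such as recovery
    # reads/writes and fencing. Value shown here is only an example.
    # numHighPriorityWorkerThreads=8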
@jvrao please review as well, since you were in the conversation on this PR.
I moved it out of GC settings because it didn't belong there.
Will merge this tomorrow if there are no more comments (cc: @jvrao)
The problem was introduced at apache#898
The change adds a new flag for reads, RECOVERY_READ. This flag is set
when a client is retrying to recover a ledger. Recovery reads should
be higher priority than normal reads, as recovery usually happens when
a writer has failed, and another node needs to take over the stream
that the ledger belonged to. All writes on that stream are blocked
until the ledger has been recovered, so giving these reads priority
will mean the stream is out of action for the shortest amount of time
possible.
Priority is given to these reads by executing the reads directly in
the BookieRequestProcessor, rather than adding them to the
readThreadPool where they could end up queuing.
This change was originally 372a99d in the yahoo branch.