
Bookies should prioritize recovery reads/writes #898

Closed
ivankelly wants to merge 6 commits

Conversation

ivankelly
Contributor

@ivankelly ivankelly commented Dec 20, 2017

The change adds a new flag for reads, RECOVERY_READ. This flag is set
when a client is trying to recover a ledger. Recovery reads should
be higher priority than normal reads, as recovery usually happens when
a writer has failed and another node needs to take over the stream
that the ledger belonged to. All writes on that stream are blocked
until the ledger has been recovered, so giving these reads priority
means the stream is out of action for the shortest amount of time
possible.

Priority is given to these reads by executing them directly in
the BookieRequestProcessor, rather than adding them to the
readThreadPool where they could end up queuing.

This change was originally 372a99d in the yahoo branch.
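[Editor's sketch] A minimal illustration of the dispatch idea described above, using plain java.util.concurrent; the readThreadPool field and isRecoveryRead flag are simplified stand-ins for the actual BookieRequestProcessor internals, not the real API.

import java.util.concurrent.ExecutorService;

class ReadDispatchSketch {
    private final ExecutorService readThreadPool; // null when no read pool is configured

    ReadDispatchSketch(ExecutorService readThreadPool) {
        this.readThreadPool = readThreadPool;
    }

    void dispatchRead(Runnable read, boolean isRecoveryRead) {
        if (readThreadPool == null || isRecoveryRead) {
            // recovery reads skip the executor queue entirely so they cannot
            // sit behind a backlog of ordinary reads
            read.run();
        } else {
            readThreadPool.submit(read);
        }
    }
}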

@@ -90,6 +90,7 @@ message ReadRequest {
enum Flag {
FENCE_LEDGER = 1;
ENTRY_PIGGYBACK = 2;
RECOVERY_READ = 3;
Contributor Author

@merlimat @sijie
There's a potentially thorny BC issue here. On the yahoo branch, RECOVERY_READ is 2: https://github.com/yahoo/bookkeeper/blob/165971209f933e0bcefaac997db8e75c8776f942/bookkeeper-server/src/main/proto/BookkeeperProtocol.proto#L83

@merlimat are any of the yahoo clients using protobufs? Or are they all on V2? If so it may not be a problem.

Contributor

I think the only clients using the v3 protocol are the auto-recovery daemons, which are using the default settings. What would be the worst case if they do a read request with the ENTRY_PIGGYBACK flag?

The RECOVERY_READ flag is optional anyway, in the sense that it is currently just used to signal urgency to the bookie.

Member

a read request with ENTRY_PIGGYBACK is a normal read request, so it only means you don't assign priority to the recovery read, which I don't think is a big deal.

Contributor Author

I think this definitely needs to be BC tested. The question is whether we block this change to do it, or just make it a blocker on the 4.7.0 release.

Member

My feeling is that we need to do this as early as possible. I am not sure if there are other similar changes coming up or not. I understand this is a port from the yahoo branch, so it probably doesn't make sense to block this change on being BC tested. I am fine with creating a blocker issue for adding BC tests for this change.

Contributor Author

SGTM. Created issue #903

@eolivelli
Contributor

This is an important protocol change. How can we do BC tests?

Member

@sijie sijie left a comment

I think we need some BC tests here:

  • we need to make sure the yahoo client can talk to these new bookies
  • the new bookie client can talk to a yahoo bookie

@@ -265,7 +265,7 @@ ByteBuf getData() {
}

boolean isRecoveryAdd() {
Member

the name should be changed here

Contributor Author

done

@@ -152,6 +159,14 @@ public BookieRequestProcessor(ServerConfiguration serverCfg, Bookie bookie,
createExecutor(
this.serverCfg.getNumLongPollWorkerThreads(),
"BookieLongPollThread-" + serverCfg.getBookiePort(), OrderedScheduler.NO_TASK_LIMIT);
this.fenceThreadPool = Executors.newCachedThreadPool(
Member

I don't think you need a separate thread pool if you fix #283. I would suggest fixing #283 first.

Contributor Author

#283 is to submit a runnable to the read processor thread for checking the fencing? I don't think this applies here. This fenceThreadPool is for V3. The problem was that writeThreadPool and readThreadPool can reject executions if they hit a limit. This was causing failures in one of the tests for fencing (testRecoveryNotRejected). I would think that fencing should be outside of these execution limits, as it is part of recovery.

Member

ah sorry, I read it as being for the v2 processing. I thought you were moving the fence read to a separate thread.

then I have a few questions here /cc @merlimat

  1. this change will move the fence read to the netty I/O thread, which would block subsequent requests. If the read threads are accumulating requests, that typically means I/O is slow. A fence read can potentially be very slow due to overwhelmed I/O, and it would then block all the network I/O. I am not sure how much we can gain from moving the fence read to the netty I/O thread. It goes against the purpose of having thread pools, which is to avoid blocking network I/O.

@merlimat any metrics or experience showing this change improved things a lot?

  2. If this change is to prioritize recovery related ops, isn't it better to have a dedicated recovery read thread pool, rather than blocking network I/O?

  3. even worse, in V2 the fence read waits for both read I/O and journal I/O (v3 makes the fence request non-blocking while waiting for the fence result). If the bookie is overwhelmed (e.g. hitting the bandwidth limit), the journal can become a bottleneck and hence block the netty I/O thread. (I assume this would potentially block sending responses back.)

All the questions are related to the goal of not blocking the network thread. If I read this change correctly, it goes against that goal, which can potentially have a very bad impact.

Also: should v2 fencing be executed in this fencing thread or not? v2 fencing is actually a blocking call, while v3 fencing is not. If you run v2 fencing in the netty I/O thread, isn't it blocking other I/O operations?

Contributor

@sijie This change is a companion to #706. The story here is that in BK 4.6, if the read rate is very high, maxing out the bookie IO capacity, read requests start to pile up in the read executor task queue.

The read threads do process the read requests, but since the requests have spent a lot of time in the queue, the client will have already timed them out and re-sent other read requests.

In #706 we introduced a max limit on the size of the executor queue, in order to limit the amount of time spent in the queue, rejecting additional requests so that they fail fast.

The rationale is to avoid read ops that are going to be timed out and will not make any progress.

Another problem with the read operations is that when the timeouts/rejections happen, they also impact fencing reads, which are used in recovery.

This change is to make sure the recovery reads can make progress even when there is a very high rate of regular read requests. As Ivan pointed out, the recovery reads are critical because they typically prevent a writer from being operative, while regular reads are "just" reading data and can thus be deferred to lower priority.

The original change did the fence read directly in the IO thread as a way to skip the executor queue. I guess that using a dedicated thread pool for fence reads is probably a much better option, in order not to block the IO threads.
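[Editor's sketch] A rough illustration (not BookKeeper's actual executors) of the two ideas in this thread: a bounded read pool that rejects work once its queue is full, in the spirit of #706, and a separate pool reserved for recovery/fence reads so they are not subject to that limit.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ReadExecutorsSketch {
    // regular reads: bounded queue, AbortPolicy makes submit() throw
    // RejectedExecutionException once the queue is full (fail fast)
    final ExecutorService readPool = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(1000),
            new ThreadPoolExecutor.AbortPolicy());

    // recovery/fence reads: small dedicated pool, unaffected by the
    // regular read queue limit above
    final ExecutorService recoveryPool = Executors.newFixedThreadPool(2);
}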

Contributor Author

+1 on a dedicated recovery thread pool. I'm working on the BC stuff first though, but will come back to this once that's done.

Member

@merlimat +1 on a separate recovery thread pool.

Contributor

A separate thread pool does sound like a good idea, but I would really restrict recovery reads to real fencing reads only, not under-replication reads.

Contributor Author

@sijie @jvrao @merlimat Trying to get back the context on this again. Any objection to reusing the fencing thread pool for the recovery reads/adds?

// If it's a recovery read, we want to make sure it gets submitted right away,
// bypassing the eventual executor queue,
// to reduce the chances of ledger recovery exceptions.
boolean isRecoveryRead = r.getReadRequest().hasFlag()
Member

Can we make this into some util function and have a test case cover it?

Contributor Author

Will do.
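[Editor's sketch] A hedged sketch of what such a util method plus test might look like. The builder and accessor names follow the standard protobuf-generated API for the ReadRequest message shown in the diff; the exact field setup and test class placement are assumptions, not the PR's actual code.

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.apache.bookkeeper.proto.BookkeeperProtocol.ReadRequest;
import org.junit.Test;

public class RecoveryReadFlagTest {

    // candidate util method: true only when the optional flag is present
    // and set to RECOVERY_READ
    static boolean isRecoveryRead(ReadRequest read) {
        return read.hasFlag() && read.getFlag() == ReadRequest.Flag.RECOVERY_READ;
    }

    @Test
    public void testIsRecoveryRead() {
        ReadRequest recovery = ReadRequest.newBuilder()
                .setLedgerId(1L).setEntryId(0L)
                .setFlag(ReadRequest.Flag.RECOVERY_READ)
                .build();
        ReadRequest regular = ReadRequest.newBuilder()
                .setLedgerId(1L).setEntryId(0L)
                .build();

        assertTrue(isRecoveryRead(recovery));
        assertFalse(isRecoveryRead(regular));
    }
}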

* @param ctx
* control object
*/
public void asyncReadLastEntry(ReadCallback cb, Object ctx) {
Member

Can you remove this from this pull request and create a separate one for it?

  • we need to make this part of the new api as well
  • we need a test case to cover it

Contributor Author

It's unused it seems. Removing completely for now.

// Only one flag can be set on the read requests
if (((short) flags & BookieProtocol.FLAG_DO_FENCING) == BookieProtocol.FLAG_DO_FENCING) {
readBuilder.setFlag(ReadRequest.Flag.FENCE_LEDGER);
checkArgument(masterKey != null);
Member

don't do this. Fail the callback with IncorrectParameterException.

Contributor Author

+1
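[Editor's sketch] A hedged sketch of the suggestion above: instead of checkArgument(masterKey != null), complete the read callback with an error code. The callback interface and error constant below are simplified stand-ins, not the exact BookKeeper types.

class FencingReadValidationSketch {

    // simplified stand-in for the real read callback interface
    interface ReadCallback {
        void readComplete(int rc, long ledgerId, long entryId, Object ctx);
    }

    // stand-in for an "incorrect parameter" return code
    static final int INCORRECT_PARAMETER_RC = -1;

    // returns true if the request may proceed; otherwise the callback has
    // already been failed and the caller should stop
    static boolean validateFencingRead(byte[] masterKey, long ledgerId, long entryId,
                                       ReadCallback cb, Object ctx) {
        if (masterKey == null) {
            // fail through the normal callback path rather than throwing
            cb.readComplete(INCORRECT_PARAMETER_RC, ledgerId, entryId, ctx);
            return false;
        }
        return true;
    }
}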

@@ -122,4 +131,60 @@ public void addComplete(int rc, LedgerHandle lh, long entryId, Object ctx) {

assertTrue(receivedTooManyRequestsException.get());
}

@Test(timeout = 60000)
Contributor

Nit: Remember to drop these timeouts

@@ -90,6 +90,7 @@ message ReadRequest {
enum Flag {
FENCE_LEDGER = 1;
ENTRY_PIGGYBACK = 2;
RECOVERY_READ = 3;
Contributor

In the case of pure BookKeeper this feature may be very useful, as the ledger is the boundary of the writer. Is this the use case for Pulsar? In general, for the Salesforce use case regular reads are more important (or at least not less important) than the background recovery reads. That being said, there won't be many recovery reads compared to regular reads, so maybe it is OK. This is just a first comment, and I haven't read through the patch fully.

checkArgument(masterKey != null);
readBuilder.setMasterKey(ByteString.copyFrom(masterKey));
} else if (((short) flags & BookieProtocol.FLAG_RECOVERY) == BookieProtocol.FLAG_RECOVERY) {
readBuilder.setFlag(ReadRequest.Flag.RECOVERY_READ);
Contributor

Hmm, so the ledger replicator, which could be recovering under-replicated ledgers, gets higher priority than the actual client reads? That doesn't sound right to me.

@@ -320,7 +320,7 @@ public void readComplete(int rc, LedgerHandle lh,
for (BookieSocketAddress newBookie : newBookies) {
bkc.getBookieClient().addEntry(newBookie, lh.getId(),
lh.getLedgerKey(), entryId, toSend.retainedSlice(),
multiWriteCallback, dataLength, BookieProtocol.FLAG_RECOVERY_ADD);
multiWriteCallback, dataLength, BookieProtocol.FLAG_RECOVERY);
Contributor

Are we giving high priority to replicator reads too? It should not...

Contributor Author

This is an addEntry. The recovery flag has a different meaning for adds. It means that even if the ledger has been fenced, the entry is still stored (it's not a new entry, so the client will not see an entry acknowledged as successful after the point of the fencing event).
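[Editor's sketch] A hedged sketch of the add-side semantics described above: a recovery add is still stored even when the ledger has been fenced, while a normal add is refused. The fenced flag and store() method are simplified stand-ins for the bookie's actual internals.

class RecoveryAddSketch {

    // simplified stand-in for the bookie's per-ledger fenced state
    private boolean fenced;

    void fence() {
        fenced = true;
    }

    // returns true if the entry was stored
    boolean addEntry(long entryId, boolean isRecoveryAdd) {
        if (fenced && !isRecoveryAdd) {
            // normal adds are refused once the ledger is fenced
            return false;
        }
        // recovery adds (re-writes of already-acknowledged entries) are
        // still stored even after fencing
        store(entryId);
        return true;
    }

    private void store(long entryId) {
        // placeholder for the actual journal/ledger-storage write
    }
}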

Contributor Author

oh, seems we're shortcutting those too. hmm.

Contributor Author

I think I need to create a new explicit PRIORITY flag for these operations, rather than using the semantics from RECOVERY, as RECOVERY already means something for adds.

@asfgit

asfgit commented Feb 16, 2018

Can one of the admins verify this patch?

@ivankelly
Contributor Author

Pushing to see how it does on CI; it was a pretty big merge from master after 2 months dormant. Still need to change the flags and fix up the executor stuff.

@ivankelly
Contributor Author

Rebased because it was getting messy to see what had changed after merging master.

This version is the same as what was pushed 2 months ago, except:

  • The flag has been renamed to HIGH_PRIORITY, which is separate from RECOVERY_ADD, so that we can have recovery adds which are not high priority (for rereplication) // @jvrao
  • The fencing thread pool has been renamed to the high priority thread pool. It handles all requests with the high priority flag or the fencing flag.

@sijie @merlimat @jiazhai @jvrao @eolivelli This is ready for review again

@ivankelly ivankelly closed this Feb 21, 2018
@ivankelly ivankelly reopened this Feb 21, 2018
Contributor

@eolivelli eolivelli left a comment

Overall it is good, but I think there was some problem with the merge/rebase, see the comment.

* @return
* @throws ParseJsonException
*/
public String asJson() throws ParseJsonException {
Contributor

This change seems unrelated

try {
configAsString = conf.asJson();
LOG.info(configAsString);
} catch (ParseJsonException pe) {
Contributor

Yes, you are reverting a change from @charanreddy

Contributor Author

Ah crap, must have been in flight as I was turning the merge into a rebase. Will fix.

Contributor Author

updated

}
optional Flag flag = 100;
repeated Flag flag = 100;
Member

hmm I think this changes too much for a backport. I know a repeated field is compatible with an optional one, however I am not sure how it would impact other existing code logic. I don't feel comfortable with this change, especially as it is coupled with a backport.

This definitely requires more thought and testing. I would suggest not changing it to repeated at this moment. Since v2 and v3 are different protocols, we don't necessarily need to follow what v2 is doing for v3. If it is a 'priority' field, we should just separate it from a flag: keep flag as it is and add a field called priority in the request header. That way you don't need to change the Flag field from optional to repeated, and the priority field can then be used later to further optimize the io scheduler if we want to.

There is no BC concern in the v3 protocol, since pulsar is using the v2 protocol only. The BC concern between the apache version and the yahoo version only remains at the v2 protocol.

If this approach doesn't make sense to you, I would suggest decoupling the v3 protocol change from this PR: get the change in for v2 first to complete this merge, since this is required by pulsar, and defer changing the v3 protocol to a different PR. I would prefer not to take any risks on this, because except for pulsar, most people are using the v3 protocol, and this feature is used by pulsar in the v2 protocol only.

Member

To clarify what I proposed above:

  1. for the v2 protocol, add the flag FLAG_HIGH_PRIORITY for BC purposes with the yahoo branch.

  2. for the v3 protocol, don't change/add any flag. Add an integer priority field in the packet header, used to indicate the priority of requests. Set the default value of priority to 0, which means "no priority". If a positive number, or a special number like 100, is set, it means high priority. In future we can use this priority field to implement optimized io scheduling; for example, if we develop a metadata store, requests from the ledgers used by the metadata store can have higher priority than requests for user ledgers. (See the sketch after this list.)

  3. in future, if we implement priority based io scheduling, we can bring the priority feature back to the v2 protocol, or, if we want v2 to have the same features as the v3 protocol, we can bring in the priority field and deprecate the high_priority flag in the v2 protocol.

In this way, we can complete the merge without introducing BC problems, and we also get better extensibility for future improvements.
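[Editor's sketch] A small sketch of how the proposed v3 priority field could be interpreted on the bookie side (default 0 meaning no priority, any positive value meaning high priority). The hasPriority()/getPriority() accessors assume the proposed optional priority field in the packet header; they do not exist in the current proto, so a stand-in interface is used here.

class PrioritySketch {

    static final int NO_PRIORITY = 0;

    // stand-in for the generated accessors of the proposed header field
    interface HeaderWithPriority {
        boolean hasPriority();
        int getPriority();
    }

    static boolean isHighPriority(HeaderWithPriority header) {
        // default (0) means no priority; any positive value is treated as high
        return header.hasPriority() && header.getPriority() > NO_PRIORITY;
    }
}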

Contributor Author

> hmm I think this changes too much for a backport. I know a repeated field is compatible with an optional one, however I am not sure how it would impact other existing code logic. I don't feel comfortable with this change, especially as it is coupled with a backport.

The code changes are forced by the proto change, because the API for accessing a repeated field is completely different from that for accessing an optional one.

I agree with what you're saying though. I was actually expecting this to fail BC in CI, as enums in protobuf don't take kindly to new additions, so I was surprised when it passed. My contingency was to add something like you suggest, though as a boolean; an int is a good idea though. I'll leave everything else the same with regards to BookieClient.
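[Editor's sketch] For illustration, with the repeated Flag definition from this diff, callers would have to switch from the hasFlag()/getFlag() accessors generated for an optional field to the list-style accessors generated for a repeated field, which is why the change ripples through the code. This assumes standard protobuf codegen for the ReadRequest message and only compiles against the repeated definition.

import java.util.List;

import org.apache.bookkeeper.proto.BookkeeperProtocol.ReadRequest;

class RepeatedFlagAccessSketch {

    // with `optional Flag flag` the check was hasFlag()/getFlag();
    // with `repeated Flag flag` it becomes getFlagList()/getFlagCount()
    static boolean isRecoveryRead(ReadRequest read) {
        List<ReadRequest.Flag> flags = read.getFlagList();
        return flags.contains(ReadRequest.Flag.RECOVERY_READ);
    }
}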

@@ -276,10 +291,12 @@ public void processRequest(Object msg, Channel c) {
// process packet
switch (r.getOpCode()) {
case BookieProtocol.ADDENTRY:
processAddRequest(r, c);
assert(r instanceof BookieProtocol.ParsedAddRequest);
Member

please don't use assert; the behavior is unpredictable when assertions are disabled. Either fail explicitly or just ignore it.

#995

Contributor Author

Will change.

statsLogger);
"BookieLongPollThread-" + serverCfg.getBookiePort(),
OrderedScheduler.NO_TASK_LIMIT, statsLogger);
this.highPriorityThreadPool = Executors.newCachedThreadPool(
Member

why use a CachedThreadPool here?

Contributor Author

No idea. I just reused what was there for fenceThreadPool. Probably something about newCachedThreadPool being harder to fail, because of its unboundedness, but that introduces problems of its own. Will change to something bounded.

On a side note, with all these thread pools the architecture is starting to look like something that could benefit from an actor architecture using something like the LMAX Disruptor. Way in the future though, if at all.
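[Editor's sketch] A minimal example of a bounded alternative to Executors.newCachedThreadPool(): a capped thread count and queue, with CallerRunsPolicy as back-pressure. The sizes are illustrative only, not values from the PR.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedHighPriorityPool {
    // at most 8 threads and 10000 queued tasks; if both are exhausted,
    // CallerRunsPolicy makes the submitting thread run the task itself
    // instead of letting the backlog grow without bound
    final ExecutorService pool = new ThreadPoolExecutor(
            2, 8, 60L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(10000),
            new ThreadPoolExecutor.CallerRunsPolicy());
}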

if (null == writeThreadPool) {
write.run();
} else if (r.isHighPriority()) {
highPriorityThreadPool.submit(write);
Member

I think we should have the same handling logic as normal writes. If you make highPriorityThreadPool the same type of scheduler as writeThreadPool and readThreadPool, you can just choose the scheduler based on priority and reuse the logic between lines 481-487.
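[Editor's sketch] A sketch of the refactor suggested here: keep one submission path and only choose which executor handles the request, so high-priority requests reuse the same logic as normal ones. The field and method names are assumptions, not the PR's actual code.

import java.util.concurrent.ExecutorService;

class SchedulerChoiceSketch {
    private final ExecutorService writeThreadPool;
    private final ExecutorService highPriorityThreadPool;

    SchedulerChoiceSketch(ExecutorService writeThreadPool,
                          ExecutorService highPriorityThreadPool) {
        this.writeThreadPool = writeThreadPool;
        this.highPriorityThreadPool = highPriorityThreadPool;
    }

    void submitWrite(Runnable write, boolean highPriority) {
        // pick the scheduler, then fall through to the shared handling logic
        ExecutorService executor = highPriority ? highPriorityThreadPool : writeThreadPool;
        if (executor == null) {
            write.run();
        } else {
            executor.submit(write);
        }
    }
}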

if (null == readThreadPool) {
read.run();
} else if (r.isHighPriority() || r.isFencing()) {
highPriorityThreadPool.submit(read);
Member

same comments as write scheduler

Contributor

@eolivelli eolivelli left a comment

Left some minor comments.
Should we care about Yahoo migration? This change will surely not be compatible with the actual bookies in the Yahoo env.

I think that anyway the most important point is that we do not break things; at least bookies won't consider fence reads from clients as high priority. And this is achieved by the current patch. Let's see it.

@ivankelly
Contributor Author

Yahoo uses the V2 protocol, so the change should be compatible (the whole point of the BC test changes is to ensure this; there's a test with the yahoo custom version now).

Contributor

@eolivelli eolivelli left a comment

+1 great

Member

@sijie sijie left a comment

overall looks good now. I have only two comments left, otherwise it is ready to go.

@@ -101,6 +101,7 @@ message ReadRequest {
optional int64 previousLAC = 4;
// Used as a timeout (in milliseconds) for the long polling request
optional int64 timeOut = 5;
optional uint32 priority = 6 [default = 0];
Member

Can you make priority part of the PacketHeader, so it would be available to all types of requests?

As far as I can see, it could potentially be reused by GetBookieInfoRequest, WriteLacRequest and ReadLacRequest.

Contributor Author

sure

@@ -127,6 +127,7 @@
protected static final String MAX_PENDING_READ_REQUESTS_PER_THREAD = "maxPendingReadRequestsPerThread";
protected static final String MAX_PENDING_ADD_REQUESTS_PER_THREAD = "maxPendingAddRequestsPerThread";
protected static final String NUM_LONG_POLL_WORKER_THREADS = "numLongPollWorkerThreads";
protected static final String NUM_HIGH_PRIORITY_WORKER_THREADS = "numHighPriorityWorkerThreads";
Member

add this to bookkeeper-server/conf/bk_server.conf and site/_data/config/bk_server.yaml

Contributor Author

will add it

@sijie
Member

sijie commented Feb 23, 2018

@jvrao please review as well since you were in the conversation of this PR.

I moved it out of GC settings because it didn't belong there.
@ivankelly
Contributor Author

Will merge this tomorrow if there are no more comments (cc: @jvrao)

@ivankelly ivankelly added this to the 4.7.0 milestone Feb 28, 2018
@ivankelly ivankelly closed this in 7540adf Feb 28, 2018
sijie added a commit to sijie/bookkeeper that referenced this pull request Feb 28, 2018
sijie added a commit that referenced this pull request Feb 28, 2018
Descriptions of the changes in this PR:

The problem was introduced at #898

Author: Sijie Guo <sijie@apache.org>

Reviewers: Luc Perkins <lucperkins@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>

This closes #1222 from sijie/fix_website