
Allow to configure sticky reads #1808

Merged

Conversation

merlimat (Contributor)

Motivation

Currently the BK client issues read requests in a round-robin fashion across all the bookies in the write set.

One issue with this approach is that it does not take full advantage of the read-ahead cache, either explicit (as in DbLedgerStorage) or implicit (reading data through the Linux page cache, which does some prefetching).

With e=2, w=2, when we read e-0 from bookie-1 and e-1 from bookie-2, we fail to take advantage of the fact that bookie-1 will already have e-1 in memory.

Effectively, with e=2, w=2 the disk read IO will be doubled compared to the amount of data served to BK clients. The larger the quorum, the bigger the overhead (e.g., e=5, w=5 will lead to 5x reads from disk).
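The amplification described above can be illustrated with a toy model (purely illustrative, not BookKeeper code): each bookie prefetches the next few entries on a cache miss, and a client scans the ledger sequentially, issuing reads either round-robin or all to one bookie.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReadAheadModel {

    // Total entries pulled from disk across all bookies while a client scans
    // entries 0..numEntries-1 sequentially. On a cache miss, the serving
    // bookie prefetches the next `readAhead` entries (the toy read-ahead).
    static long entriesPrefetched(int numEntries, int numBookies,
                                  boolean sticky, int readAhead) {
        List<Set<Long>> cache = new ArrayList<>();
        for (int b = 0; b < numBookies; b++) {
            cache.add(new HashSet<>());
        }
        long prefetched = 0;
        for (long e = 0; e < numEntries; e++) {
            int bookie = sticky ? 0 : (int) (e % numBookies);
            if (!cache.get(bookie).contains(e)) {
                for (long p = e; p < e + readAhead; p++) {
                    cache.get(bookie).add(p);
                }
                prefetched += readAhead;
            }
        }
        return prefetched;
    }

    public static void main(String[] args) {
        // With e=2, w=2 the round-robin scan pulls twice as much from disk,
        // because both bookies prefetch the same entries.
        System.out.println("round-robin: " + entriesPrefetched(1000, 2, false, 4));
        System.out.println("sticky:      " + entriesPrefetched(1000, 2, true, 4));
    }
}
```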

Changes

Added a BK client flag for "sticky reads". When reading from a ledger that has E=W (every bookie has all the entries), sticky reads will direct all read requests to a single bookie in the ensemble.

// We can only enable sticky reads if the ensemble==writeQuorum
// otherwise the same bookie will not have all the entries
// stored
entryIdToConsiderForWriteSet = 0;
Contributor

How do you deal with ensemble changes?

Member

The change is to pick a "sticky" bookie. The idea here is to pick a fixed "write set" (a certain read order). The write set is a sequence of bookie indexes, not fixed bookie addresses, so there is nothing to worry about regarding "ensemble changes".

Contributor

Okay

@eolivelli (Contributor)

We also need tests.

Is this a WIP ?

// We can only enable sticky reads if the ensemble==writeQuorum
// otherwise the same bookie will not have all the entries
// stored
entryIdToConsiderForWriteSet = 0;
Member

Hmm, I don't think it is a good idea to stick to ensemble index 0: if the first bookie of the ensemble goes down, all the reads will pay a huge penalty. I think "stickiness" here should be at the ledger handle level. When a ledger handle is constructed, one bookie is chosen as the sticky bookie, and most of the requests are sent to this sticky bookie; when a failure occurs, a new sticky bookie is chosen. That being said, the ledger handle should probably implement a Supplier<WriteSet> which supplies the WriteSet for pending read ops to read.

So the pending read ops only talk to this supplier to get the write set to read entries.

  • For normal settings, the supplier will return lh.distributionSchedule.getWriteSet(entryId).
  • If reorder is enabled, the supplier will return the reordered write set.
  • If sticky read is enabled, the supplier will return a fixed write set most of the time; if errors occur, the supplier will update the write set, so subsequent reads will choose a different sticky bookie to read from.
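A minimal sketch of that supplier idea (class and method names here are illustrative, not the actual BookKeeper API, and it assumes E == W so every bookie holds every entry):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class StickyWriteSetSupplier {
    private final int ensembleSize;
    private volatile int stickyIndex;

    public StickyWriteSetSupplier(int ensembleSize) {
        this.ensembleSize = ensembleSize;
        // Start from a random bookie so concurrent readers spread out.
        this.stickyIndex = ThreadLocalRandom.current().nextInt(ensembleSize);
    }

    // Write set for a sticky read: the sticky bookie first, the rest of the
    // ensemble as fallbacks. Only valid when E == W, i.e. every bookie
    // stores every entry, so the entryId does not change the candidates.
    public List<Integer> writeSetFor(long entryId) {
        List<Integer> writeSet = new ArrayList<>(ensembleSize);
        for (int i = 0; i < ensembleSize; i++) {
            writeSet.add((stickyIndex + i) % ensembleSize);
        }
        return writeSet;
    }

    // On a read failure, rotate to the next bookie so subsequent reads
    // pick a different sticky bookie.
    public void onReadFailure() {
        stickyIndex = (stickyIndex + 1) % ensembleSize;
    }
}
```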

Contributor

I believe this change needs to be more elaborate. Sticking to a fixed set is a good idea, but it should be driven by a feedback loop rather than being strictly fixed. The bookie could be down, or overloaded, or in a different zone in a public cloud, etc.

Contributor

Thinking out loud: maybe the PlacementPolicy should decide which bookie is to be used for the 'sticky read'.

Member

Maybe the PlacementPolicy should decide about which bookie is to be used for the 'sticky read'

The PlacementPolicy should be used for choosing bookies for locality, but it doesn't have to distinguish sticky reads from normal reads. "Stickiness" should not be part of the placement policy; it should be part of the ledger handle.

Contributor Author

I agree with Sijie on having the ledger handle return the write set directly.

The placement policy is always applied after considering the sticky read flag. E.g., if you want to read from a particular bookie because it's in the same rack as the client, then you will want to read all entries from that same bookie.

@jvrao (Contributor) commented Nov 13, 2018

@nicmichael can you please look into this too?

*
* @return true if reorder read sequence is enabled, otherwise false.
*/
public boolean isStickyReadsEnabled() {
Contributor

Can it be a per-read-request flag rather than per-client?
I.e., we have various cases: in some cases reads are random and sticky reads probably won't help.
In other cases reads are strictly sequential and sticky reads may help. "May" because we can hit a bottleneck on the bookie NIC side, and spreading reads across multiple bookies might actually help more than hurt.
Having two clients for these cases would mean 2x connections to bookies, executors/threads, etc.

Member

You mean a per-ledger-handle flag, not a per-read-request flag, right?

Contributor Author

can it be a per read request flag rather than per client?

@dlg99 It could work per-request as long as the mechanism works across multiple requests. E.g., with read(0, 99) and read(100, 199), all 200 requests should go to the same bookie.

Contributor

We have sequential reads and random reads from the same ledger (at different points in time).
We cache open ledger handles for a while, to avoid reopening them.

I guess we can cache two ledger handles if needed, one for sequential and one for random reads, if this is absolutely necessary.
TBH, why does that make any difference? I.e., the flag is passed to the pending read op; if it is false, it just uses the regular first-from-write-set rule (plus read reordering). If it is true, it picks the "preferred reader" bookie.
The only difference is whether a readEntry() variant would expose this flag or LedgerHandle would.

Contributor

read(0, 99) and read(100, 199), all the 200 requests should go to same bookie.

That won't happen for us, ES > WQ. But with https://github.com/apache/bookkeeper/pull/1808/files#r233679775 in mind, that should not make any difference.
Also, we don't use the read(0, 99) option; as it happens, we issue read(entryId) requests individually.

long entryIdToConsiderForWriteSet = eId;

if (clientCtx.getConf().enableStickyReads
&& lh.getLedgerMetadata().getWriteQuorumSize() == lh.getLedgerMetadata().getEnsembleSize()) {
@dlg99 (Contributor) Nov 15, 2018

WQ == ES is quite a limiting requirement; e.g., we run with ES = 7 and WQ = 3.
I'd suggest picking the bookie with ensemble index % WQ == 0.

I.e., in case of 7 bookies in the ensemble and WQ 3, with regular round-robin placement:

entry 0 is on bookies 0,1,2 (from ensemble) - reads stick to 0 (0 % 3 == 0)
entry 1 is on bookies 1,2,3 - reads stick to 3
entry 2 on 2,3,4 - reads stick to 3
entry 3 on 3,4,5 - stick to 3
entry 4 on 4,5,6 - stick to 6
entry 5 on 5,6,0 - stick to 6 (choice of 6 and 0, go to the first one in order)
entry 6 on 6,0,1 - stick to 6
entry 7 on 0,1,2 - stick to 0.
etc

^^ these can be pre-calculated e.g. when ensemble is created.

with WQ == ES it will stick to 0 all the time.

Plus it has to support read reordering (as I understand, that did not change) - don't try to use a bookie if its netty channel is not writable.
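The mapping above can be written down as a small helper (a sketch of the rule as stated in the comment, not code from this PR): for each entry's round-robin write set, pick the first bookie whose ensemble index is a multiple of WQ.

```java
public class StickyIndexSketch {

    // First bookie in entryId's round-robin write set whose ensemble index
    // is a multiple of writeQuorum; falls back to the first replica if the
    // write set contains no such index.
    static int stickyBookie(long entryId, int ensembleSize, int writeQuorum) {
        for (int i = 0; i < writeQuorum; i++) {
            int idx = (int) ((entryId + i) % ensembleSize);
            if (idx % writeQuorum == 0) {
                return idx;
            }
        }
        return (int) (entryId % ensembleSize);
    }

    public static void main(String[] args) {
        // Reproduces the table above for ES=7, WQ=3.
        for (long e = 0; e <= 7; e++) {
            System.out.println("entry " + e + " -> bookie " + stickyBookie(e, 7, 3));
        }
    }
}
```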

Contributor Author

I thought of it, but I think the IO read savings will be just a fraction compared to the E=W case.

E.g., in your example, Bookie-0 will read e-0 and prefetch e-5 and e-6, though these 2 entries will be read from bookie-6.

Contributor

But the OS page cache will load more than e-5 and e-6, so all in all it will help.
I mean, it is speculation without perf tests anyway.

Contributor Author

Sure, but the reading of e-5 and e-6 will be "wasted" anyway.

Contributor

From a strictly theoretical POV:
It works the same as picking bookie 0 for the ES == WQ case.
For the ES > WQ case it pins reads to a limited subset of bookies, and yes, something will get wasted, but what else can we do other than round-robin?
In a fully sequential read of a ledger, will this waste more or less than not handling the ES > WQ case at all?

From a practical POV, even with ES=WQ something will get wasted.
There is no guarantee that entries are strictly sequential in the file. First, they can interleave with other ledgers.
Even with an entry log per ledger and sorted ledger storage, we can end up with out-of-order writes and out-of-order entries in the entry log.

Also, when you say "prefetched", do you mean the OS file system cache or the BufferedChannel in the bookie?

@dlg99 (Contributor) commented Nov 15, 2018

do you have any perf numbers to share to show what kind of perf improvements we should expect with this feature enabled?

@merlimat (Contributor Author)

do you have any perf numbers to share to show what kind of perf improvements we should expect with this feature enabled?

I haven't completed full-scale tests yet. Though I was able to measure that the read-ahead cache hit/miss rate was being driven down by the prefetching happening on all 3 bookies for the same entries, driving the IO reads to 3x of what they should be.

@eolivelli (Contributor)

We could have a ReadFlag instead of a global client-wide config parameter.

The discussion seems broader than expected; should we start a BP? Or anyway, discuss on an 'issue' and then go back to 'code'?

@sijie (Member) commented Nov 16, 2018

The discussion seems broader than the expected, should we start a BP? Or any way discuss on an 'issue' and then go back to 'code'

@eolivelli there is already an implementation in progress (it is working, although not perfect), so I don't think it is necessary to go through the heavy-lifting BP discussion loop. Let's keep the discussion here and get the ball moving.

@nicmichael (Contributor)

I actually like @dlg99's suggestion: if each group of W adjacent entries has the same primary bookie, but the next W entries another primary bookie, we'd get W times better data locality but still evenly distribute reads across all bookies in E. Plus, it would work for E>W. Ideally E and W are coprime (e.g. E=7, W=3); then the primary bookie would be easy to calculate and guaranteed to be evenly distributed.

If even distribution is guaranteed, this policy wouldn't even need a flag to turn it on/off. It could be the default policy, as it wouldn't harm random reads (due to even distribution across all bookies in E) but would benefit sequential reads (at least in the absence of out-of-order writes and entry logs per ledger...).

The benefits of read-ahead depend anyway on the size of the entries and the file system configuration. Unless entries are tiny, a policy like this that benefits from read-ahead of W entries might bring most of the performance gain (and stickiness of the entire ledger to just one bookie might be only marginally better for sequential reads).

I suppose reordering of reads (e.g. due to slow bookies or number of outstanding requests to a bookie, Issue #1489) would still kick in if enabled, and rearrange the order if a primary bookie is found to be slow?

@merlimat force-pushed the single-bookie-per-ledger-read-master branch from b75e2c5 to 5df8771 on December 13, 2018
@merlimat (Contributor Author)

@eolivelli @jvrao @dlg99 @sijie @ivankelly I have updated the PR with the following:

  • When enabled, LedgerHandle has the notion of a sticky bookie
  • The sticky bookie is initially selected at random (this helps when there are multiple readers that are independently reading different portions of the ledger concurrently).
  • If there are read failures, the ledger handle will pick a new sticky bookie, to avoid read impact.
  • Added tests

I haven't addressed the case where E > W because I believe the benefits in that scenario are not that big compared to the case where E=W. It would be easy to add that afterward in any case.

Please take a look.

@sijie (Member) left a comment

overall looks good to me. I have one comment.

@@ -180,6 +194,13 @@

this.ledgerId = ledgerId;

if (clientCtx.getConf().enableStickyReads
&& getLedgerMetadata().getEnsembleSize() == getLedgerMetadata().getWriteQuorumSize()) {
stickyBookieIndex = ThreadLocalRandom.current().nextInt() % getLedgerMetadata().getEnsembleSize();
Member

it might be better to add a method in EnsemblePlacementPolicy

int getStickyReadBookieIndex(LedgerMetadata metadata);

It can pick a bookie randomly by default. For rack-aware and region-aware placement, it can be implemented using network locations.
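A possible shape for that hook, sketched from the signature in the comment (the LedgerMetadata stand-in and the default body are assumptions, not the PR's code). Note that ThreadLocalRandom.nextInt(bound) returns a value in [0, bound), which sidesteps the negative result a bare nextInt() % n can produce.

```java
import java.util.concurrent.ThreadLocalRandom;

public class PlacementSketch {
    // Minimal stand-in for the real LedgerMetadata.
    interface LedgerMetadata {
        int getEnsembleSize();
    }

    interface EnsemblePlacementPolicy {
        // Default: pick uniformly at random. Rack-aware / region-aware
        // policies would override this to prefer a nearby bookie.
        default int getStickyReadBookieIndex(LedgerMetadata metadata) {
            return ThreadLocalRandom.current().nextInt(metadata.getEnsembleSize());
        }
    }

    public static void main(String[] args) {
        EnsemblePlacementPolicy policy = new EnsemblePlacementPolicy() { };
        System.out.println("sticky index: " + policy.getStickyReadBookieIndex(() -> 5));
    }
}
```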

// We can only enable sticky reads if the ensemble==writeQuorum
// otherwise the same bookie will not have all the entries
// stored
return distributionSchedule.getWriteSet(stickyBookieIndex);
Contributor

Is this an entryId or a reference to a bookie?
I am a bit confused.

Contributor Author

The write set is typically dependent on the entry id. If a stickyBookieIndex was set, it means we're trying to override that distribution and instead send all reads to a particular bookie. Bookie-i will be the "master" bookie for entry-i. Therefore, a write set for bookie-i will first try to read from bookie-i, falling back to the other bookies in case of failure.

Contributor Author

@eolivelli Does this answer the question?

Contributor

Okay, it works.
Sorry for the late reply.

@sijie (Member) commented Jan 6, 2019

@merlimat are we including this for 4.9.0?

@merlimat (Contributor Author) commented Jan 7, 2019

@sijie Added getStickyReadBookieIndex() in EnsemblePlacementPolicy so it can be overridden. I haven't done the region-aware selection since I'm not too familiar with that part.

@eolivelli (Contributor) left a comment

Lgtm

Let's merge this one

Thanks
