
BOOKKEEPER-1106: Introduce write FileInfo cache and read FileInfo cache #513

Status: Closed · wants to merge 2 commits

Conversation

@yzang commented Sep 16, 2017

Problem: when read-behind happens, the bookie quickly reads a large number of ledgers, which evicts the currently active write ledgers from the ledger cache. With those ledgers evicted from the cache, write performance suffers.

This feature was contributed by @sijie. We introduce a write file info cache and a read file info cache so that the ledger index is cached separately for reads and writes; when a catch-up read happens, it does not evict the file info entries used by writes.
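To illustrate the idea, here is a minimal sketch of two separate Guava caches, one per access path. The class and member names (FileInfoCaches, getFileInfoForWrite, flushAndClose) and the 2x read-cache sizing are simplifications for illustration, not the actual IndexPersistenceMgr code:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;

public class FileInfoCaches {

    // Stand-in for org.apache.bookkeeper.bookie.FileInfo.
    static class FileInfo {
        final long ledgerId;
        FileInfo(long ledgerId) { this.ledgerId = ledgerId; }
    }

    private final LoadingCache<Long, FileInfo> writeFileInfoCache;
    private final LoadingCache<Long, FileInfo> readFileInfoCache;

    public FileInfoCaches(long writeCacheSize) {
        CacheLoader<Long, FileInfo> loader = new CacheLoader<Long, FileInfo>() {
            @Override
            public FileInfo load(Long ledgerId) {
                return new FileInfo(ledgerId); // real code reads the index file from disk
            }
        };
        RemovalListener<Long, FileInfo> onEviction =
                notification -> flushAndClose(notification.getValue());

        // Active write ledgers live here; catch-up reads never touch this cache.
        writeFileInfoCache = CacheBuilder.newBuilder()
                .maximumSize(writeCacheSize)
                .removalListener(onEviction)
                .build(loader);

        // The read cache is sized larger (2x here, matching the sizing mentioned
        // later in this thread), so a burst of catch-up reads only churns reads.
        readFileInfoCache = CacheBuilder.newBuilder()
                .maximumSize(2 * writeCacheSize)
                .removalListener(onEviction)
                .build(loader);
    }

    public FileInfo getFileInfoForWrite(long ledgerId) throws Exception {
        return writeFileInfoCache.get(ledgerId);
    }

    public FileInfo getFileInfoForRead(long ledgerId) throws Exception {
        // Prefer an entry the write path already holds; otherwise load through
        // the read cache, so reads can never evict write-path entries.
        FileInfo fi = writeFileInfoCache.getIfPresent(ledgerId);
        return fi != null ? fi : readFileInfoCache.get(ledgerId);
    }

    private void flushAndClose(FileInfo fi) {
        // Real code flushes the index header and releases the file handle.
    }
}
```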


Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    <Issue # or BOOKKEEPER-#>: Description of pull request
    e.g. Issue 123: Description ...
    e.g. BOOKKEEPER-1234: Description ...
  • Make sure tests pass via mvn clean apache-rat:check install findbugs:check.
  • Replace <Issue # or BOOKKEEPER-#> in the title with the actual Issue/JIRA number.

Yiming Zang added 2 commits September 16, 2017 16:35
…stenceMgr so that ledger eviction wouldn't impact write performance during catching up.
@yzang (Author) commented Sep 17, 2017

All checks pass locally.

@asfgit commented Sep 17, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/bookkeeper-precommit-pullrequest-docker/23/

@eolivelli (Contributor) left a comment

Overall LGTM
One little comment on tests.

Great improvement!

```java
final long lid = 1L;
final byte[] masterKey = "write".getBytes();

@Test(timeout = 60000)
```
@eolivelli (Contributor) commented:

Please drop the explicit timeout; we recently decided not to use it.

@yzang (Author) replied:
Sure, I'll fix the timeout
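For reference, the suggested change is simply dropping the timeout attribute from the @Test annotation. A minimal sketch, with a hypothetical test method name:

```java
import org.junit.Test;

public class IndexPersistenceMgrTest {

    final long lid = 1L;
    final byte[] masterKey = "write".getBytes();

    @Test // was: @Test(timeout = 60000)
    public void testFileInfoEviction() throws Exception {
        // test body unchanged
    }
}
```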

@jiazhai (Member) left a comment

+1. LGTM. Thanks for fixing this issue.

@eolivelli (Contributor) commented:
@yzang the change is good to go; if you can fix the @Test timeout annotation, we can ship it.

@sijie (Member) commented Oct 2, 2017

@eolivelli @jiazhai if @yzang doesn't have time to fix the timeout annotation, we can open a separate issue to address it. Let's move forward with this change.

@eolivelli (Contributor) commented:
Ok will merge today

@eolivelli (Contributor) left a comment

+1 LGTM merging now

```java
evictedLedgersCounter.inc();
// we need to acquire the write lock in another thread,
// otherwise there could be dead lock happening.
evictionThreadPool.execute(() -> {
```
@ivankelly (Contributor) commented:
@yzang @sijie
There's a serious bug here (which also causes the flake in IndexPersistenceMgrTest#testEvictFileInfoWhenUnderlyingFileExists).

When evicting from the cache, the FileInfo is removed from the cache but only flushed asynchronously. If someone reloads that FileInfo before the flush runs, they will read stale information from disk, and that stale state will be written back when the reloaded FileInfo is eventually evicted.

Do either of you recall what the deadlock mentioned in the comment was? LoadingCache notifications happen in the same thread as the get(), so when handleLedgerEviction runs we should already hold FileInfoLock.readLock().
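To make the interleaving concrete, here is a compilable sketch of the window being described; the types and method names (flushHeader, loadFromDisk) are simplified stand-ins, not the actual IndexPersistenceMgr code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class EvictionRaceSketch {

    interface FileInfo {
        void flushHeader(); // persists in-memory state such as the fence bit
    }

    static final ExecutorService evictionThreadPool =
            Executors.newSingleThreadExecutor();

    // Removal listener path: the entry is already gone from the cache here,
    // but the header flush is only scheduled, not yet performed.
    static void handleLedgerEviction(FileInfo fi) {
        evictionThreadPool.execute(fi::flushHeader);
    }

    // Cache loader path: a miss on the same ledger between execute() above
    // and flushHeader() actually running re-reads the index file and sees
    // the pre-flush header (e.g. without a newly set fence bit). That stale
    // FileInfo is then flushed back on its own later eviction, overwriting
    // the state the first flush wrote.
    static FileInfo loadFromDisk(long ledgerId) {
        return () -> { /* real code parses the on-disk index header */ };
    }
}
```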

Contributor commented:

@ivankelly do you think this can be a blocker for the 4.6 release? @jiazhai

Member commented:

@ivankelly nice catch here. @yzang has the freshest context on the locking part. However, I think the problem existed even before this change: it occurs when closing a fileinfo happens after removing it from the fileinfo cache, since another thread can load a new fileinfo in between and cause the issue. It wasn't noticed before because of how the test case was written: it was single-threaded.

I don't think it is a blocker for 4.6. It is a real corner case: a file info is evicted with a cached fenced bit while a concurrent read request loads the fileinfo into the cache again. It can only happen when a file info is evicted from both the write and the read fileinfo caches, which is unlikely in practice, because the fileinfo cache is usually large and the read fileinfo cache is twice as large as the write fileinfo cache.

But it should go into 4.6.1 for sure, and if necessary we need a fix for 4.5.2 as well.

Contributor commented:

Ya, I don't think it's a blocker for 4.6.0, but it is for 4.6.1 for sure.

The tragic part is that I didn't catch it; CI did, multiple times, but because there were already flakes it wasn't seen as a big issue :/

Looking forward to hearing solutions around the locking.

Member commented:

@yzang do you mind taking a look at this and commenting here?

Member commented:

@ivankelly: I have the same feeling as @yzang. It can cause the case I described in my first comment here and trigger this flaky test, but it won't cause the fence bit to be physically lost. If you think it can be lost, a test case demonstrating the sequence would be a good way to show it.

I don't think this problem is particularly related to the guava cache and reference counting; this corner case existed even before the guava cache change. It happens whenever there is concurrent loading and eviction.

Regarding FileInfo#close(), that was my mistake when introducing the watcher/watchable. It is a change that came after the guava cache, so we can move the recycling logic to the actual release.
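A minimal sketch of what "moving the recycling logic to the actual release" could look like with reference counting; all names here (tryRetain, markEvictedAndRelease, recycle) are hypothetical, not the actual FileInfo API:

```java
import java.util.concurrent.atomic.AtomicInteger;

class RefCountedFileInfo {

    // The cache itself holds one reference until eviction.
    private final AtomicInteger refCount = new AtomicInteger(1);
    private volatile boolean evicted = false;

    // Readers/writers take a reference before using the FileInfo;
    // a false return means it was already recycled and must be reloaded.
    boolean tryRetain() {
        int count;
        do {
            count = refCount.get();
            if (count == 0) {
                return false;
            }
        } while (!refCount.compareAndSet(count, count + 1));
        return true;
    }

    // Cache eviction drops the cache's reference instead of closing directly.
    void markEvictedAndRelease() {
        evicted = true;
        release();
    }

    void release() {
        if (refCount.decrementAndGet() == 0 && evicted) {
            recycle(); // flush + close runs only after the last user is done
        }
    }

    private void recycle() {
        // Real code would flush the index header and close the file channel.
    }
}
```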

@ivankelly (Contributor) commented:

@sijie @yzang there is an issue for sure.

Both the flaky one and the one I described today (see ivankelly@15609e5 for a demo).

Yes, they're unlikely, but we cannot base guarantees on "unlikely".

I have half a fix already; I should get it finished tomorrow morning.

Member commented:

@ivankelly I think we all agree that we need a fix.

@yzang (Author) commented:

@ivankelly Could you add an example of which test is flaky?

@ivankelly (Contributor) replied:

IndexPersistenceMgrTest.testEvictFileInfoWhenUnderlyingFileExists

I already have a fix in review; see #913.
