Provide async version of markLedgerUnderreplicated for LedgerUnderreplicationManager #1619

Closed
wants to merge 4 commits into from

@sijie sijie commented Aug 22, 2018

Descriptions of the changes in this PR:

Motivation

Auditor has multiple places that call sync methods in async callbacks.
This raises the possibility of hitting a deadlock. Issue #1578 is one example.

After looking into the LedgerUnderreplicationManager, markLedgerUnderreplicated
is the only interface that will be called in async callbacks. This change is
to provide an async version of markLedgerUnderreplicated.
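
To illustrate the hazard described above, here is a minimal sketch, assuming a single-threaded executor that plays the role of the callback thread. It is not Auditor code; the class and method names are hypothetical. The "callback" blocks on a result that can only be delivered by the thread it is currently occupying, so the wait cannot succeed. A timeout is used only to keep the demo finite.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SyncInCallbackDemo {

    /**
     * Runs a "callback" on a single-threaded executor that blocks on a second
     * task queued on the same executor. Returns true if the blocking wait timed out.
     */
    static boolean blockingCallTimesOut() {
        ExecutorService callbackThread = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> callback = callbackThread.submit(() -> {
                // The "sync method": its response needs this same thread to be delivered.
                Future<String> response = callbackThread.submit(() -> "done");
                try {
                    // A real sync call would wait forever here.
                    response.get(200, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true; // the response could not run while we were blocking
                }
            });
            return callback.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            callbackThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking call timed out: " + blockingCallTimesOut());
    }
}
```

Returning a future from the inner call and chaining on it, instead of blocking, removes the dependency on the occupied thread.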

Changes

  • add markLedgerUnderreplicatedAsync interface in LedgerUnderreplicationManager.
  • implement the logic of markLedgerUnderreplicated using async callbacks
  • use markLedgerUnderreplicatedAsync in the Auditor

Related Issues: #1578
Master Issue: #1617
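
The shape of the change listed above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `UnderreplicationManager` and `InMemoryManager` are hypothetical stand-ins, the sync signature is trimmed down (no checked exceptions), and an in-memory map replaces the real metadata store. Only the `markLedgerUnderreplicatedAsync` signature is taken from the diff in this conversation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class AsyncMarkSketch {

    /** Hypothetical, trimmed-down stand-in for LedgerUnderreplicationManager. */
    interface UnderreplicationManager {
        /** Non-blocking variant: callers chain on the returned future. */
        CompletableFuture<Void> markLedgerUnderreplicatedAsync(long ledgerId, Collection<String> missingReplicas);

        /** Blocking variant, expressed as "start the async call, then wait for it". */
        default void markLedgerUnderreplicated(long ledgerId, String missingReplica) {
            markLedgerUnderreplicatedAsync(ledgerId, Collections.singleton(missingReplica)).join();
        }
    }

    /** In-memory implementation used only to make the sketch runnable. */
    static class InMemoryManager implements UnderreplicationManager {
        final Map<Long, Set<String>> underreplicated = new ConcurrentHashMap<>();

        @Override
        public CompletableFuture<Void> markLedgerUnderreplicatedAsync(long ledgerId,
                                                                      Collection<String> missingReplicas) {
            underreplicated
                .computeIfAbsent(ledgerId, id -> ConcurrentHashMap.<String>newKeySet())
                .addAll(missingReplicas);
            return CompletableFuture.completedFuture(null);
        }
    }

    public static void main(String[] args) {
        InMemoryManager manager = new InMemoryManager();
        manager.markLedgerUnderreplicated(1L, "bookie-1:3181");
        manager.markLedgerUnderreplicatedAsync(1L, Arrays.asList("bookie-2:3181", "bookie-3:3181"))
               .thenRun(() -> System.out.println("ledger 1 is missing: " + manager.underreplicated.get(1L)));
    }
}
```

Keeping the sync method as a thin wrapper means existing single-threaded callers keep working, while callback code can switch to the async variant.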


sijie commented Aug 22, 2018

@jvrao @reddycharan @merlimat

This PR follows up on the comment in #1608. After looking into LedgerUnderreplicationManager, I think markLedgerUnderreplicated is the only method that will be called in async callbacks, so the change is pretty straightforward: rewrite markLedgerUnderreplicated as an async method.

FutureUtils.completeExceptionally(processFuture, BKException.create(rc));
}
}, null, BKException.Code.OK, BKException.Code.ReadException);
FutureUtils.result(processFuture, BKException.HANDLER);
Contributor

If processing is done asynchronously, will it be affected by the admin.close() and client.close() calls below?

Member Author

processFuture is completed only when finished processing the last ledger, so checkAllLedgers is still sync, which is fine, since checkAllLedgers is executing in Auditor's own executor.
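
A rough sketch of that pattern, assuming a hypothetical `checkAllLedgers` that fans work out to a worker pool: the per-ledger work runs asynchronously, one future is completed when the last ledger finishes, and the caller blocks on it, so anything closed after the call (such as an admin or client handle) is closed only once processing is done. The names and the counter-based completion are illustrative, not the Auditor's actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ProcessFutureSketch {

    /**
     * Processes every ledger on a worker pool and blocks the caller until the
     * last one is done. Returns the number of ledgers processed.
     */
    static int checkAllLedgers(List<Long> ledgerIds) {
        if (ledgerIds.isEmpty()) {
            return 0;
        }
        ExecutorService workers = Executors.newFixedThreadPool(2);
        CompletableFuture<Void> processFuture = new CompletableFuture<>();
        AtomicInteger remaining = new AtomicInteger(ledgerIds.size());
        AtomicInteger processed = new AtomicInteger();
        try {
            for (int i = 0; i < ledgerIds.size(); i++) {
                workers.execute(() -> {
                    processed.incrementAndGet(); // placeholder for the real per-ledger check
                    if (remaining.decrementAndGet() == 0) {
                        processFuture.complete(null); // only the last ledger completes the future
                    }
                });
            }
            // The caller stays synchronous: nothing after this line runs early.
            processFuture.join();
            return processed.get();
        } finally {
            workers.shutdown(); // stands in for closing resources after processing
        }
    }

    public static void main(String[] args) {
        System.out.println("processed " + checkAllLedgers(Arrays.asList(1L, 2L, 3L)) + " ledgers");
    }
}
```

Blocking here is acceptable only because the caller runs on its own executor, not on a thread that the workers depend on.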

sijie commented Aug 22, 2018

Regarding tests: this PR only focuses on rewriting markLedgerUnderreplicated in an async way. It doesn't change existing logic or add new functionality, so no tests are added.

@eolivelli eolivelli left a comment

Makes sense. Great work. +1

@merlimat merlimat left a comment

👍

* @param missingReplicas missing replicas
* @return a future presents the mark result.
*/
CompletableFuture<Void> markLedgerUnderreplicatedAsync(long ledgerId, Collection<String> missingReplicas);
Contributor

nit: don't we need an async version of "void markLedgerReplicated(long ledgerId)" as well?

Member Author

We don't. markLedgerReplicated is used by the replication worker, which is single-threaded and uses sync methods, so markLedgerReplicated is fine in that context.
Only markLedgerUnderreplicated is the issue.

Ideally, LedgerUnderreplicationManager should be split into at least 2 interfaces, one for the Auditor and the other for the ReplicationWorker. That would make things much clearer.

if (cause instanceof ReplicationException) {
return (ReplicationException) cause;
} else {
if (cause instanceof InterruptedException) {
Contributor

FutureUtils.result would throw InterruptedException without applying the exceptionHandler function https://github.com/apache/bookkeeper/blob/master/bookkeeper-common/src/main/java/org/apache/bookkeeper/common/concurrent/FutureUtils.java#L77

return (BKException) cause;
} else {
BKException ex;
if (cause instanceof InterruptedException) {
Contributor

FutureUtils.result would throw InterruptedException without applying the exceptionHandler function https://github.com/apache/bookkeeper/blob/master/bookkeeper-common/src/main/java/org/apache/bookkeeper/common/concurrent/FutureUtils.java#L77

Member Author

nice. I would remove this here
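
The point raised in this thread can be sketched as follows. This is a simplified stand-in for a `FutureUtils.result`-style helper, written under the assumption described in the review comment (interrupts are rethrown directly, only failures of the future go through the handler); it is not BookKeeper's actual source, and `ResultHelperSketch` and `HANDLER` are hypothetical names.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.Function;

public class ResultHelperSketch {

    /** Waits for the future; maps its failure through the handler, but not interrupts. */
    static <T, E extends Throwable> T result(CompletableFuture<T> future, Function<Throwable, E> handler)
            throws E, InterruptedException {
        try {
            return future.get();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            throw ie;                           // rethrown as-is: the handler never sees it
        } catch (ExecutionException ee) {
            throw handler.apply(ee.getCause());
        }
    }

    /** A handler therefore needs no InterruptedException branch. */
    static final Function<Throwable, IllegalStateException> HANDLER =
        cause -> cause instanceof IllegalStateException
            ? (IllegalStateException) cause
            : new IllegalStateException("operation failed", cause);

    public static void main(String[] args) throws InterruptedException {
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IOException("boom"));
        try {
            result(failed, HANDLER);
        } catch (IllegalStateException e) {
            System.out.println("mapped by handler, cause: " + e.getCause());
        }
    }
}
```

Under that assumption, an `instanceof InterruptedException` check inside the handler is dead code, which is why it can be removed.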

* @param missingReplicas missing replicas
* @return a future presents the mark result.
*/
CompletableFuture<Void> markLedgerUnderreplicatedAsync(long ledgerId, Collection<String> missingReplicas);
@reddycharan reddycharan Aug 23, 2018

nit: is there any need to make 'missingReplicas' a collection, instead of just a String missingReplica?

Member Author

I changed this for https://github.com/apache/bookkeeper/pull/1619/files#diff-7525f06ad3a1ad0a00a462df4deb4698L581, so that it doesn't need multiple zk calls to create a UL.

final List<ACL> zkAcls = ZkUtils.getACLs(conf);
final String znode = getUrLedgerZnode(ledgerId);
final CompletableFuture<Void> createFuture = new CompletableFuture<>();
tryMarkLedgerUnderreplicatedAsync(znode, missingReplicas, zkAcls, createFuture);
Contributor

nit: why does the caller have to create 'createFuture' (CompletableFuture) and pass it in? Why can't the method return the CompletableFuture?

Member Author

This allows tryMarkLedgerUnderreplicatedAsync to be re-attempted with the same CompletableFuture.
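
A minimal sketch of why passing the future in helps with retries. The store call `writeOnce`, the conflict simulation, and the retry budget are all hypothetical; the only idea taken from the reply above is that each attempt completes the same caller-supplied future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class RetrySameFutureSketch {

    /** Hypothetical async store call: reports a conflict while conflictsLeft is positive. */
    static CompletableFuture<Void> writeOnce(AtomicInteger conflictsLeft) {
        CompletableFuture<Void> attempt = new CompletableFuture<>();
        if (conflictsLeft.getAndDecrement() > 0) {
            attempt.completeExceptionally(new IllegalStateException("version conflict"));
        } else {
            attempt.complete(null);
        }
        return attempt;
    }

    /** Retries on conflict, always completing the future supplied by the caller. */
    static void tryMark(AtomicInteger conflictsLeft, int retriesLeft, CompletableFuture<Void> result) {
        writeOnce(conflictsLeft).whenComplete((ignored, cause) -> {
            if (cause == null) {
                result.complete(null);
            } else if (retriesLeft > 0) {
                tryMark(conflictsLeft, retriesLeft - 1, result); // same future, next attempt
            } else {
                result.completeExceptionally(cause);
            }
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Void> createFuture = new CompletableFuture<>();
        tryMark(new AtomicInteger(2), 3, createFuture);
        System.out.println("completed normally: "
            + (createFuture.isDone() && !createFuture.isCompletedExceptionally()));
    }
}
```

The caller observes a single future no matter how many attempts were needed, and no intermediate futures have to be stitched together.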

sijie commented Aug 23, 2018

@reddycharan I have addressed your comments.

Sets.newHashSet(lh.getId())
).whenComplete((result, cause) -> {
if (null != cause) {
callback.processResult(Code.ReplicationException, null, null);
Contributor

Just want to make sure that this 'cause' is not lost from the error logs. Won't this hide the cause completely?

Member Author

yeah. I added the logging back. please take a look at the latest commit.
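
For illustration, a minimal sketch of keeping the cause visible when a failure is collapsed into a return code. The return codes, the `IntConsumer` callback, and the use of `java.util.logging` are assumptions made to keep the example self-contained; the PR itself uses the project's own callback and logging types.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.IntConsumer;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogCauseSketch {
    private static final Logger LOG = Logger.getLogger(LogCauseSketch.class.getName());

    // Hypothetical return codes standing in for the real ones.
    static final int OK = 0;
    static final int REPLICATION_ERROR = -1;

    /** Reports the outcome as a return code, logging the cause before it is dropped. */
    static void reportResult(long ledgerId, CompletableFuture<Void> markFuture, IntConsumer callback) {
        markFuture.whenComplete((result, cause) -> {
            if (cause != null) {
                // The callback only carries a code, so the log is the last place the cause is seen.
                LOG.log(Level.SEVERE, "Failed to mark ledger " + ledgerId + " as underreplicated", cause);
                callback.accept(REPLICATION_ERROR);
            } else {
                callback.accept(OK);
            }
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Void> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("metadata store unavailable"));
        reportResult(42L, failed, rc -> System.out.println("callback rc = " + rc));
    }
}
```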

}

callback.processResult(rc, null, null);
lh.closeAsync().whenComplete((result, cause) -> {
Contributor

Why is it the responsibility of ProcessLostFragmentsCb to close the ledger handle? I'm not sure it is appropriate to do it here, since ProcessLostFragmentsCb is not the owner of 'lh'.

Member Author

I didn’t change the behavior. In the original logic, ProcessLostFragmentsCb closes it.

Member Author

https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java#L679

this line here is wrong.

However it was "okay" because close in ReadOnlyHandle is no-op.

@reddycharan
Contributor

Changes LGTM. I understand that there is no functional change and that these are not public API methods, but a change of this magnitude needs some new test coverage for confidence.

sijie commented Aug 24, 2018

change of this magnitude needs some new test coverage for confidence.

if there is no functionality change, what new tests do you expect?

sijie commented Aug 24, 2018

run bookkeeper-server replication tests

sijie commented Aug 24, 2018

This PR only rewrites sync calls to use async calls. Although markLedgerUnderreplicatedAsync is an async method, it is still used in a sync way everywhere in the Auditor. That is, this PR doesn't change any existing logic in how the Auditor uses markLedgerUnderreplicated. From a functionality standpoint, we don't need separate test cases.

Agreed that we need better coverage, and since markLedgerUnderreplicatedAsync is now part of the interface, people might use it in other places. That should be done in a separate PR rather than this one, so I created a separate issue for tracking test coverage: #1626. I am not trying to combine changes with different purposes into one PR.

@sijie sijie modified the milestones: 4.9.0, 4.8.0 Aug 27, 2018
@sijie sijie closed this in 73b428c Aug 27, 2018
sijie added a commit that referenced this pull request Aug 27, 2018
Author: Sijie Guo <sijie@apache.org>

Reviewers: Charan Reddy Guttapalem <reddycharan18@gmail.com>, Enrico Olivelli <eolivelli@gmail.com>, Matteo Merli <mmerli@apache.org>

This closes #1619 from sijie/async_sync_autorecovery

(cherry picked from commit 73b428c)
Signed-off-by: Sijie Guo <sijie@apache.org>
sijie added a commit that referenced this pull request Aug 27, 2018
@sijie sijie modified the milestones: 4.8.0, 4.9.0 Aug 27, 2018
sijie added a commit to sijie/bookkeeper that referenced this pull request Aug 27, 2018
…derreplicatedAsync

*Motivation*

We introduced LedgerUnderreplicationManager#markLedgerUnderreplicatedAsync in apache#1619. This exposes the async
API to the public. Let's make sure we have enough test coverage for this new async API, including negative tests
and concurrent tests.

*Changes*

- Add basic tests
- Add negative tests
- Add concurrent tests on resolving conflicts
reddycharan pushed a commit to reddycharan/bookkeeper that referenced this pull request Oct 17, 2018
(cherry picked from commit 3e01125)
Signed-off-by: JV Jujjuri <vjujjuri@salesforce.com>

Checkstyle fix