Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue #2879 : let bookie quit if journal thread exit #2887

Merged
merged 4 commits into from Jul 31, 2022

Conversation

aloyszhang
Copy link
Contributor

Descriptions of the changes in this PR:
fix #2879
This pull request let bookie quit when there's journal thread exit

Motivation

As described in #2879, now if a bookie has multi journal directories means it has multi journal thread. Once a journal thread exits, the bookie will be unhealthy due to the block of all bookie-io threads, and then the bookie will not work but progress is still alive.
This pull request tries to fix this problem.

Changes

check the journal thread alive in a fixed interval, let bookie quit once there's a journal thread exit

@Vanlightly
Copy link
Contributor

We don't need a new configuration for this. This kind of thing should be a hidden implementation detail, there is no benefit to allowing users to configure an interval.

Another approach would be something like passing a single CountdownLatch to each journal that can be waited on from the bookie thread.

Also note that any dependency changes need to be added to Gradle or won't compile.

@aloyszhang
Copy link
Contributor Author

@Vanlightly
CountdownLatch sounds good. Thanks for your suggestions.

@aloyszhang
Copy link
Contributor Author

I'm not familiar with gradle and I'll check the dependency problem first.

@aloyszhang
Copy link
Contributor Author

@Vanlightly PTAL, thanks.

@aloyszhang
Copy link
Contributor Author

cc @Vanlightly @eolivelli

Copy link
Contributor

@Vanlightly Vanlightly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JournalAliveListener approach looks good. Should it be named JournalDeathListener though? We do need to check that concurrent invocations to shutdown() either directly due to BookieServer shutdown or a journal exiting is handled. There is already handling for multiple invocations of triggerBookieShutdown, but not from that and a BookierServer shutdown. BookieServer.shutdown() should block on an already in-progress BookieImpl.shutdown() to avoid the Java process exiting too soon.

@Vanlightly
Copy link
Contributor

@aloyszhang I mentioned in a previous comment that we need to ensure that if a shutdown is trigger by an OS signal, but a shutdown is in progress, then it needs to wait for that shutdown to complete, else the process may exit before the shutdown is complete. I didn't see that the BookieImpl.shutdown() method is synchronized but it is, and so should do that blocking for us.

Being synchronized also means that we shouldn't need that AtomicBoolean shuttingDown as if only one thread can execute it at a time then the isRunning() check is enough. However, when I tested it I found that two threads were able to enter the method concurrently! This seems due to the this.join() as that is the moment when another thread is permitted to enter the method. So synchronized isn't doing the job for us. The AtomicBoolean you added also isn't enough because we need an OS signal triggered shutdown to be blocked until an in-progress shutdown is completed.

So that leads us to using an explicit lock on the shutdown method to replace synchronized and the AtomicBoolean you added. With those changes, we should have everything covered.

Note that I have done a test of a crashing journal followed by an OS sigint and the bookie does terminate before the shutdown was completed, so this is something that does need addressing.

@aloyszhang
Copy link
Contributor Author

aloyszhang commented Nov 18, 2021

@Vanlightly

This seems due to the this.join() as that is the moment when another thread is permitted to enter the method. So synchronized isn't doing the job for us.

Yes, the original synchronized does not work due to the join().

Note that I have done a test of a crashing journal followed by an OS signal and the bookie does terminate before the shutdown was completed, so this is something that does need addressing.

Do you mean that there is a common shutdown in progress which invoked by the stop command bookkeeper-deamon stop bookie, at the same time, the journal thread crashed due to os failure, and the bookie server will shutdown without waiting for the common shutdown to complete.
I think this can't happen. Because the common shutdown will trigger ComponentShutdownHook which invokes BookieImpl.shutdown() directly, and if journal crash at the same time, this will trigger a shutdown by BookieImpl.triggerBookieShutdown() which will go into BookieImpl.shutdown() but execute nothing since shuttingDown flag is already set to true.

I think the bad case is that if a journal crash first and already triggers BookieImpl.triggerBookieShutdown(), then we shutdown bookie by the command bookkeeper-deamon stop bookie, the shutdown progress by the command may not execute completely. And we have add another controll to make sure the graceful shutdown will complete.

@Vanlightly
Copy link
Contributor

@aloyszhang Exactly, if the triggerBookieShutdown happens first, which can happen due to a number of reasons, and then while it is in progress an operator initiates a controlled shutdown then the bookie can terminate before the BookieImpl shutdown is complete.

@aloyszhang
Copy link
Contributor Author

aloyszhang commented Nov 18, 2021

PR Validation failed.
It seems details can be found in
file:///home/runner/work/bookkeeper/bookkeeper/bookkeeper-server/build/reports/spotbugs/main.xml, but I can't open or download this file.
@Vanlightly Can you help with this?

@Vanlightly
Copy link
Contributor

@aloyszhang run this to get the report: ./gradlew bookkeeper-server:spotbugsMain

@aloyszhang
Copy link
Contributor Author

@Vanlightly Validation issue has been resolved, PTAL, thanks!

@Vanlightly
Copy link
Contributor

@aloyszhang An exception thrown during the shutdown would leave the CountdownLatch at 1, blocking the second caller. I would replace the AtomicBoolean and CountdownLatch with a simple lock, that is acquired before if (isRunning()) and that would do everything needed.

@aloyszhang
Copy link
Contributor Author

@Vanlightly PTAL

Copy link
Contributor

@Vanlightly Vanlightly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aloyszhang
Copy link
Contributor Author

CC @eolivelli PTAL, thanks!

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

can you please resolve the conflicts ?
the patch is ready to be merged

@aloyszhang
Copy link
Contributor Author

aloyszhang commented Nov 25, 2021

after rebasing the master branch, there're some tests that failed, I'll resolve this soon.

@dlg99
Copy link
Contributor

dlg99 commented Feb 14, 2022

@aloyszhang Please rebase/resolve merge conflicts

@aloyszhang
Copy link
Contributor Author

Sorry for late, I'll do this soon

@hangc0276
Copy link
Contributor

ping @aloyszhang, Would you please rebase the master?

pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
@zymap
Copy link
Member

zymap commented Jul 29, 2022

@aloyszhang Could you rebase the master and address the comment?

Copy link
Contributor

@StevenLuMT StevenLuMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function looks nice

Copy link
Contributor

@StevenLuMT StevenLuMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check failed,please rebase from the newest master branch

image

Copy link
Contributor

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

@hangc0276 hangc0276 merged commit 67208fb into apache:master Jul 31, 2022
zymap pushed a commit that referenced this pull request Aug 1, 2022
Descriptions of the changes in this PR:
fix #2879
This pull request let bookie quit when there's journal thread exit

### Motivation

As described in #2879, now if a bookie has multi journal directories means it has multi journal thread. Once a journal thread exits, the bookie will be unhealthy due to the block of all bookie-io threads, and then the bookie will not work but progress is still alive.
This pull request tries to fix this problem.

### Changes

check the journal thread alive in a fixed interval, let bookie quit once there's a journal thread exit

(cherry picked from commit 67208fb)
hangc0276 pushed a commit that referenced this pull request Feb 9, 2023
### Motivation

PR #2887 introduced the feature of shutdown the bookie service after any Journal thread exits, so we don't need to wait in BookieImpl for all Journal threads to exit before shutdown the Bookie service, because this cannot happen.
zymap pushed a commit that referenced this pull request Feb 16, 2023
### Motivation

PR #2887 introduced the feature of shutdown the bookie service after any Journal thread exits, so we don't need to wait in BookieImpl for all Journal threads to exit before shutdown the Bookie service, because this cannot happen.

(cherry picked from commit b85ac48)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Journal thread exit while bookie is still on
8 participants