Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skipping placementPolicyCheck when ledger replication disabled #3561

Merged
merged 1 commit into from Oct 25, 2022

Conversation

wenbingshen
Copy link
Member

@wenbingshen wenbingshen commented Oct 20, 2022

Motivation

When placementPolicyCheck is enabled and then bookkeeper shell autorecovery -disable is executed, placementPolicyCheck will detect ledgers that do not satisfy placementPolicy and write to zookeeper, but ReplicationWorker no longer obtains ledgers from zookeeper for replication work because autorecovery is disabled, which results in a large number of temporary nodes on zookeeper , when there are more ledgers that do not satisfy placementPolicy, the problem will get worse.

The method of ReplicationWorker to get ledger:

private boolean rereplicate() throws InterruptedException, BKException,
            UnavailableException {
        long ledgerIdToReplicate = underreplicationManager
                .getLedgerToRereplicate();
       
       ...

So we should also disable placementPolicyCheck.

@wenbingshen
Copy link
Member Author

ping @zymap @dlg99 @eolivelli @hangc0276 @shoothzj PTAL. Thanks.

Copy link
Contributor

@dlg99 dlg99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be in line with the rest of Auditor's logic, looks like all the tasks scheduled by

            scheduleBookieCheckTask();
            scheduleCheckAllLedgersTask();
            schedulePlacementPolicyCheckTask();
            scheduleReplicasCheckTask();

won't do anything.

This brings another question: we also don't see what's going on in this case (independent of this change).
E.g. pausing AR for some time (maintenance window etc.) is fine but is it ok to not update counts of underreplicated ledgers?

@wenbingshen
Copy link
Member Author

This brings another question: we also don't see what's going on in this case (independent of this change). E.g. pausing AR for some time (maintenance window etc.) is fine but is it ok to not update counts of underreplicated ledgers?

@dlg99 Good catch. Thanks for your review and comments. :)
AutoRecovery's responsibilities may need to be divided into the following departments:

Auditor: 1. Different ledger status detection.
2. and initiate ledger recovery tasks.
ReplicationWorker: Execute tasks triggered by Auditor.

maybe disable AutoRecovery needs to be divided into 2 parts:

  1. Completely prohibit AutoRecovery work, including Auditor, ReplicationWorker
  2. Only the replication work of AutoRecovery is prohibited, and the Auditor still performs the detection work and updates the relevant indicators.

This should be a new feature, we need other discussions and PRs.

@hangc0276 hangc0276 merged commit cfc6b97 into apache:master Oct 25, 2022
@wenbingshen wenbingshen deleted the disablePlacementPolicyCheck branch October 25, 2022 02:36
zymap pushed a commit that referenced this pull request Oct 26, 2022
### Motivation

When `placementPolicyCheck` is enabled and then `bookkeeper shell autorecovery -disable` is executed, `placementPolicyCheck` will detect ledgers that do not satisfy `placementPolicy` and write to zookeeper, but `ReplicationWorker` no longer obtains ledgers from zookeeper for replication work because autorecovery is disabled, which results in a large number of temporary nodes on zookeeper , when there are more ledgers that do not satisfy placementPolicy, the problem will get worse.

The method of `ReplicationWorker` to get ledger:
```java
private boolean rereplicate() throws InterruptedException, BKException,
            UnavailableException {
        long ledgerIdToReplicate = underreplicationManager
                .getLedgerToRereplicate();

       ...
```

So we should also disable `placementPolicyCheck`.

(cherry picked from commit cfc6b97)
hangc0276 pushed a commit to hangc0276/bookkeeper that referenced this pull request Nov 7, 2022
…e#3561)

### Motivation

When `placementPolicyCheck` is enabled and then `bookkeeper shell autorecovery -disable` is executed, `placementPolicyCheck` will detect ledgers that do not satisfy `placementPolicy` and write to zookeeper, but `ReplicationWorker` no longer obtains ledgers from zookeeper for replication work because autorecovery is disabled, which results in a large number of temporary nodes on zookeeper , when there are more ledgers that do not satisfy placementPolicy, the problem will get worse.

The method of `ReplicationWorker` to get ledger:
```java
private boolean rereplicate() throws InterruptedException, BKException,
            UnavailableException {
        long ledgerIdToReplicate = underreplicationManager
                .getLedgerToRereplicate();

       ...
```

So we should also disable `placementPolicyCheck`.

(cherry picked from commit cfc6b97)
hangc0276 pushed a commit to hangc0276/bookkeeper that referenced this pull request Nov 7, 2022
…e#3561)

### Motivation

When `placementPolicyCheck` is enabled and then `bookkeeper shell autorecovery -disable` is executed, `placementPolicyCheck` will detect ledgers that do not satisfy `placementPolicy` and write to zookeeper, but `ReplicationWorker` no longer obtains ledgers from zookeeper for replication work because autorecovery is disabled, which results in a large number of temporary nodes on zookeeper , when there are more ledgers that do not satisfy placementPolicy, the problem will get worse.

The method of `ReplicationWorker` to get ledger:
```java
private boolean rereplicate() throws InterruptedException, BKException,
            UnavailableException {
        long ledgerIdToReplicate = underreplicationManager
                .getLedgerToRereplicate();

       ...
```

So we should also disable `placementPolicyCheck`.

(cherry picked from commit cfc6b97)
nicoloboschi pushed a commit to datastax/bookkeeper that referenced this pull request Jan 11, 2023
…e#3561)

### Motivation

When `placementPolicyCheck` is enabled and then `bookkeeper shell autorecovery -disable` is executed, `placementPolicyCheck` will detect ledgers that do not satisfy `placementPolicy` and write to zookeeper, but `ReplicationWorker` no longer obtains ledgers from zookeeper for replication work because autorecovery is disabled, which results in a large number of temporary nodes on zookeeper , when there are more ledgers that do not satisfy placementPolicy, the problem will get worse.

The method of `ReplicationWorker` to get ledger:
```java
private boolean rereplicate() throws InterruptedException, BKException,
            UnavailableException {
        long ledgerIdToReplicate = underreplicationManager
                .getLedgerToRereplicate();

       ...
```

So we should also disable `placementPolicyCheck`.

(cherry picked from commit cfc6b97)
(cherry picked from commit e61a913)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants