Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Fix rackaware placement policy init error #12097

Merged
merged 1 commit into from
Sep 20, 2021
Merged

Conversation

addisonj
Copy link
Contributor

Since the release of Pulsar 2.8 and upgrade to BK 4.12, the default
rackAwarePlacementPolicy has been failing with the following exception:

org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to initialize DNS Resolver org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping, used default subnet resolver : java.lang.RuntimeException: java.lang.NullPointerException java.lang.NullPointerException

This regression occured in commit 4c60262

The core of the issue is that setConf is called before
setBookieAddressResolver has been set (see
https://github.com/apache/bookkeeper/blob/034ef8566ad037937a4d58a28f70631175744f53/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L264-L286)

This results in the NPE.

We are safe to simply not eagerly init the cache in setConf as the
getRack call will re-check the cache.

We also protect against this possible NPE for safety

This is difficult to test directly but we should have an integration test to validate that PlacementPolicy is working as expected.

CC @eolivelli

Since the release of Pulsar 2.8 and upgrade to BK 4.12, the default
rackAwarePlacementPolicy has been failing with the following exception:

```
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to initialize DNS Resolver org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping, used default subnet resolver : java.lang.RuntimeException: java.lang.NullPointerException java.lang.NullPointerException
```

This regression occured in commit 4c60262

The core of the issue is that `setConf` is called before
`setBookieAddressResolver` has been set (see
https://github.com/apache/bookkeeper/blob/034ef8566ad037937a4d58a28f70631175744f53/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L264-L286)

This results in the NPE.

We are safe to simply not eagerly init the cache in setConf as the
getRack call will re-check the cache.

We also protect against this possible NPE for safety
@addisonj
Copy link
Contributor Author

/pulsarbot run-failure-checks

1 similar comment
@addisonj
Copy link
Contributor Author

/pulsarbot run-failure-checks

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work Addison.

I wonder why we didn't catch this problem with tests.

Would you mind adding a reproducer test case for the problem?

If we cannot go with a unit test we can make a smoke integration test that simply configures rackaware bookie placement. No need to fully verify the behaviour but just that Pulsar does not break

}
BookieAddressResolver addressResolver = getBookieAddressResolver();
if (addressResolver == null) {
LOG.warn("Bookie address resolver not yet initialized, skipping resolution");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this case happen?
Or is this only a precautionary null check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

precautionary, I figure if it can return null we should be defensive

@sijie
Copy link
Member

sijie commented Sep 20, 2021

@eolivelli I believe the test set the DNS resolver manually before the test. Hence it doesn't catch the real usage.

@eolivelli
Copy link
Contributor

I believe the test set the DNS resolver manually
Which test ?

My idea is to add an integration test that reproduces the behaviour of a sys admin that configures this feature.
It should be easy to see the error in the logs of the test.

It looks like we do not have such kind of integration tests.

@addisonj do you think it is worth do add it ?

btw please answer to my comment about the null check, and I am happy with this patch

@addisonj
Copy link
Contributor Author

@eolivelli I do think it is worth the test, but I would suggest we make a follow up item to do that. While we could do a smoke test and see the logs, the log is pretty subtle and easy to miss... I would instead suggest we just put the time in to do a proper test that validates placement by querying metadata. However, I don't want to hold up this fix for that test as this is quite a big breakage

@eolivelli
Copy link
Contributor

I don't want to hold up this fix for that test
This usually means that we won't be adding such test :-)
Usually as soon as the problem/urgency is fixed no one takes care of following up (there is no more interest in spending cycles).

By the way I don't want to block this patch, I hope you will find time to follow up.
Please create an issue

@eolivelli eolivelli merged commit 9975fe4 into master Sep 20, 2021
@eolivelli eolivelli deleted the fix_placement_npe branch September 20, 2021 15:22
@merlimat merlimat added this to the 2.9.0 milestone Sep 20, 2021
@merlimat merlimat added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Sep 20, 2021
merlimat pushed a commit that referenced this pull request Sep 21, 2021
Since the release of Pulsar 2.8 and upgrade to BK 4.12, the default
rackAwarePlacementPolicy has been failing with the following exception:

```
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to initialize DNS Resolver org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping, used default subnet resolver : java.lang.RuntimeException: java.lang.NullPointerException java.lang.NullPointerException
```

This regression occured in commit 4c60262

The core of the issue is that `setConf` is called before
`setBookieAddressResolver` has been set (see
https://github.com/apache/bookkeeper/blob/034ef8566ad037937a4d58a28f70631175744f53/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L264-L286)

This results in the NPE.

We are safe to simply not eagerly init the cache in setConf as the
getRack call will re-check the cache.

We also protect against this possible NPE for safety
@merlimat merlimat added release/2.8.2 type/bug The PR fixed a bug or issue reported a bug labels Sep 21, 2021
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Oct 20, 2021
Since the release of Pulsar 2.8 and upgrade to BK 4.12, the default
rackAwarePlacementPolicy has been failing with the following exception:

```
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to initialize DNS Resolver org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping, used default subnet resolver : java.lang.RuntimeException: java.lang.NullPointerException java.lang.NullPointerException
```

This regression occured in commit apache@4c60262

The core of the issue is that `setConf` is called before
`setBookieAddressResolver` has been set (see
https://github.com/apache/bookkeeper/blob/034ef8566ad037937a4d58a28f70631175744f53/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L264-L286)

This results in the NPE.

We are safe to simply not eagerly init the cache in setConf as the
getRack call will re-check the cache.

We also protect against this possible NPE for safety

(cherry picked from commit 9975fe4)
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
Since the release of Pulsar 2.8 and upgrade to BK 4.12, the default
rackAwarePlacementPolicy has been failing with the following exception:

```
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to initialize DNS Resolver org.apache.pulsar.zookeeper.ZkBookieRackAffinityMapping, used default subnet resolver : java.lang.RuntimeException: java.lang.NullPointerException java.lang.NullPointerException
```

This regression occured in commit apache@4c60262

The core of the issue is that `setConf` is called before
`setBookieAddressResolver` has been set (see
https://github.com/apache/bookkeeper/blob/034ef8566ad037937a4d58a28f70631175744f53/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L264-L286)

This results in the NPE.

We are safe to simply not eagerly init the cache in setConf as the
getRack call will re-check the cache.

We also protect against this possible NPE for safety
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.8 Archived: 2.8 is end of life release/2.8.2 release/2.9.0 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants