Describe the bug
An NPE can occur in IntermediateStateCalcStage when applying pending messages to the intermediateStateMap. Specifically, when it applies a message whose toState is DROPPED, it calls .remove(..) on a map that is null.
2024/10/29 01:48:13.046 ERROR [GenericHelixController] [HelixController-pipeline-default-CLUSTERNAME-(70ae9461_DEFAULT)] [helix] [] Exception while executing DEFAULT pipeline for cluster CLUSTERNAME. Will not continue to next pipeline
java.lang.NullPointerException: null
at org.apache.helix.controller.stages.IntermediateStateCalcStage.lambda$computeIntermediateMap$2(IntermediateStateCalcStage.java:868) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
at java.util.HashMap.forEach(HashMap.java:1337) ~[?:?]
at org.apache.helix.controller.stages.IntermediateStateCalcStage.computeIntermediateMap(IntermediateStateCalcStage.java:864) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
at org.apache.helix.controller.stages.IntermediateStateCalcStage.computeIntermediatePartitionState(IntermediateStateCalcStage.java:402) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
at org.apache.helix.controller.stages.IntermediateStateCalcStage.compute(IntermediateStateCalcStage.java:180) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
at org.apache.helix.controller.stages.IntermediateStateCalcStage.process(IntermediateStateCalcStage.java:85) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
for (Map.Entry<Partition, Map<String, Message>> entry : pendingMessageMap.entrySet()) {
  entry.getValue().forEach((key, value) -> {
    if (!value.getToState().equals(HelixDefinedState.DROPPED.name())) {
      intermediateStateMap.setState(entry.getKey(), value.getTgtName(), value.getToState());
    } else {
      // NPE here when getStateMap() has no entry for this partition
      intermediateStateMap.getStateMap().get(entry.getKey()).remove(value.getTgtName());
    }
  });
}
To Reproduce
Unable to reproduce outside of unit tests. I currently think the behavior occurs when:
- A resource has a partition with 1 replica.
- A message is sent to instance A to drop the replica, but the replica no longer exists in the instance's current state.
- The controller snapshots the cluster and runs the pipeline.
- IntermediateStateCalcStage attempts to call .remove() on a map that does not exist.
I think the above state can be reached when:
- A race condition occurs where the node reads the message and drops the current state, but hasn't deleted the message yet, so it is still seen as a pending message.
- The node goes offline, so there is no current state.
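The failing lookup can be imitated outside Helix with plain Java collections. This is only a sketch of the suspected condition, not a reproduction against a real cluster: the class name `DropNpeSketch`, the helper `dropThrowsNpe`, and the partition and instance names are made up, and a nested `HashMap` stands in for whatever `intermediateStateMap.getStateMap()` returns.

```java
import java.util.HashMap;
import java.util.Map;

public class DropNpeSketch {
    // Mimics the DROPPED branch: look up the partition's per-instance map,
    // then remove the target instance. Returns true if the lookup hit a null map.
    static boolean dropThrowsNpe(Map<String, Map<String, String>> stateMap,
                                 String partition, String instance) {
        try {
            stateMap.get(partition).remove(instance);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Assumed scenario: the only replica's state is already gone,
        // so the partition has no entry at all when the drop message is applied.
        Map<String, Map<String, String>> stateMap = new HashMap<>();
        System.out.println(dropThrowsNpe(stateMap, "partition_0", "instanceA"));
    }
}
```

With more than one replica, the partition would presumably still have a map holding the other instances, which may be why the single-replica case is the one that fails.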
Expected behavior
In my opinion, failing to remove because the map is null should not raise an error. A null check, or a getOrDefault that returns an empty map, could be added.
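Both options might look roughly like the following, shown on plain Java maps rather than on the Helix types. This is a sketch, not a patch: `NullSafeDropSketch` and its method names are hypothetical, and `String` keys stand in for `Partition` and the instance name.

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeDropSketch {
    // Option 1: explicit null check before removing the instance.
    static void dropWithNullCheck(Map<String, Map<String, String>> stateMap,
                                  String partition, String instance) {
        Map<String, String> instanceStates = stateMap.get(partition);
        if (instanceStates != null) {
            instanceStates.remove(instance);
        }
    }

    // Option 2: getOrDefault with a throwaway mutable map, so remove() is a no-op
    // when the partition is missing. An immutable default such as Map.of() would
    // throw UnsupportedOperationException on remove().
    static void dropWithDefault(Map<String, Map<String, String>> stateMap,
                                String partition, String instance) {
        stateMap.getOrDefault(partition, new HashMap<>()).remove(instance);
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> stateMap = new HashMap<>();
        // Neither call throws, even though "partition_0" has no entry.
        dropWithNullCheck(stateMap, "partition_0", "instanceA");
        dropWithDefault(stateMap, "partition_0", "instanceA");
        System.out.println(stateMap.isEmpty());
    }
}
```

Neither variant inserts an entry for the missing partition, so the map is left as it was found.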