NullPointerException in IntermediateStateCalcStage #2973

@GrantPSpencer

Describe the bug

An NPE can occur in IntermediateStateCalcStage when applying pending messages to the intermediateStateMap. Specifically, when it applies a message whose toState is DROPPED, it calls .remove(..) on the per-partition map returned by getStateMap().get(..), and that map is null when the partition has no entry.

2024/10/29 01:48:13.046 ERROR [GenericHelixController] [HelixController-pipeline-default-CLUSTERNAME-(70ae9461_DEFAULT)] [helix] [] Exception while executing DEFAULT pipeline for cluster CLUSTERNAME. Will not continue to next pipeline
java.lang.NullPointerException: null
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.lambda$computeIntermediateMap$2(IntermediateStateCalcStage.java:868) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at java.util.HashMap.forEach(HashMap.java:1337) ~[?:?]
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.computeIntermediateMap(IntermediateStateCalcStage.java:864) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.computeIntermediatePartitionState(IntermediateStateCalcStage.java:402) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.compute(IntermediateStateCalcStage.java:180) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.process(IntermediateStateCalcStage.java:85) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]

The failing code in computeIntermediateMap:

    for (Map.Entry<Partition, Map<String, Message>> entry : pendingMessageMap.entrySet()) {
      entry.getValue().forEach((key, value) -> {
        if (!value.getToState().equals(HelixDefinedState.DROPPED.name())) {
          intermediateStateMap.setState(entry.getKey(), value.getTgtName(), value.getToState());
        } else {
          intermediateStateMap.getStateMap().get(entry.getKey()).remove(value.getTgtName());
        }
      });
    }

To Reproduce

I have been unable to reproduce this outside of unit tests. I currently think the behavior occurs when:

  1. A resource has a partition with 1 replica.
  2. A message is sent to instance A to drop the replica, but the replica no longer exists in the instance's current state.
  3. The controller snapshots the cluster and runs the pipeline.
  4. IntermediateStateCalcStage attempts to call .remove() on a map that does not exist.
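
The failure mode in step 4 can be illustrated without Helix at all. The sketch below is only an approximation: it uses a plain `Map<String, Map<String, String>>` as a hypothetical stand-in for the partition-to-instance state map, and the class and method names are made up for illustration. It shows that chaining `.get(partition).remove(instance)` throws when the partition has no entry.

```java
import java.util.HashMap;
import java.util.Map;

public class DropOnMissingPartitionRepro {

  // Hypothetical stand-in for the partition -> (instance -> state) map;
  // this is NOT the real Helix type, just the same nesting shape.
  static boolean dropThrowsNpe(Map<String, Map<String, String>> stateMap,
                               String partition, String instance) {
    try {
      // Same shape as the failing call: get(..) returns null for an
      // absent partition, so remove(..) is invoked on null.
      stateMap.get(partition).remove(instance);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> stateMap = new HashMap<>();

    // No entry for the partition: its only replica is already gone.
    System.out.println(dropThrowsNpe(stateMap, "partition_0", "instanceA")); // prints true

    // With an entry present, the same call succeeds.
    Map<String, String> instanceStates = new HashMap<>();
    instanceStates.put("instanceA", "ONLINE");
    stateMap.put("partition_0", instanceStates);
    System.out.println(dropThrowsNpe(stateMap, "partition_0", "instanceA")); // prints false
  }
}
```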

I think the above state can be reached when:

  1. There is a race condition: the node reads the message and drops the current state, but has not yet deleted the message, so it is still seen as a pending message.
  2. The node goes offline, so there is no current state.

Expected behavior

In my opinion, failing to remove an entry because the map is null should not raise an error. We could add a null check, or use getOrDefault to fall back to an empty map.
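
The two options could look roughly like the sketch below. It is not a patch against Helix: the nested `Map<String, Map<String, String>>` is a hypothetical stand-in for the map returned by getStateMap(), and `NullSafeDrop` and its methods are illustrative names only.

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeDrop {

  // Option 1: explicit null check before removing.
  static void dropWithNullCheck(Map<String, Map<String, String>> stateMap,
                                String partition, String instance) {
    Map<String, String> instanceStates = stateMap.get(partition);
    if (instanceStates != null) {
      instanceStates.remove(instance);
    }
  }

  // Option 2: getOrDefault with a throwaway empty map, so remove(..)
  // always has a non-null receiver. The default is not stored in stateMap.
  static void dropWithGetOrDefault(Map<String, Map<String, String>> stateMap,
                                   String partition, String instance) {
    stateMap.getOrDefault(partition, new HashMap<>()).remove(instance);
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> stateMap = new HashMap<>();

    // Partition is absent: both variants are no-ops instead of throwing.
    dropWithNullCheck(stateMap, "partition_0", "instanceA");
    dropWithGetOrDefault(stateMap, "partition_0", "instanceA");

    // Partition is present: the replica is removed as before.
    Map<String, String> instanceStates = new HashMap<>();
    instanceStates.put("instanceA", "ONLINE");
    stateMap.put("partition_0", instanceStates);
    dropWithNullCheck(stateMap, "partition_0", "instanceA");
    System.out.println(stateMap.get("partition_0").containsKey("instanceA")); // prints false
  }
}
```

Both variants treat a DROPPED message for an absent partition as a no-op. The null check avoids allocating a temporary map, while getOrDefault keeps the call on a single line.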
