Investigate and fix TestEnqueueJobs #1235

@kaisun2000

Description

Log:

[ERROR] testQueueParallelJobs(org.apache.helix.integration.task.TestEnqueueJobs) Time elapsed: 6,457.223 s <<< FAILURE!
org.apache.helix.HelixException: Workflow "testQueueParallelJobs" context is null or job "testQueueParallelJobs_JOB3" is not in states: [COMPLETED]
at org.apache.helix.integration.task.TestEnqueueJobs.testQueueParallelJobs(TestEnqueueJobs.java:145)

---------- Test Class TestWorkflowAndJobPoll thread leakage detected! ---------------
START testJobQueueAddingJobsAtSametime at Fri Aug 07 01:20:25 PDT 2020
END testJobQueueAddingJobsAtSametime at Fri Aug 07 01:20:27 PDT 2020, took: 2487ms.
START testJobQueueAddingJobsOneByOne at Fri Aug 07 01:20:27 PDT 2020
END testJobQueueAddingJobsOneByOne at Fri Aug 07 01:20:33 PDT 2020, took: 5983ms.
START testJobSubmitGenericWorkflows at Fri Aug 07 01:20:33 PDT 2020
END testJobSubmitGenericWorkflows at Fri Aug 07 01:20:37 PDT 2020, took: 4011ms.
START testQueueJobsMaxCapacity at Fri Aug 07 01:20:37 PDT 2020
END testQueueJobsMaxCapacity at Fri Aug 07 01:57:57 PDT 2020, took: 2240047ms.
START testQueueParallelJobs at Fri Aug 07 01:57:57 PDT 2020
4980696 [Timer-594] ERROR org.apache.helix.controller.GenericHelixController - Time task failed. Rebalance task type: OnDemandRebalance, cluster: CLUSTER_TestTaskStage
org.apache.helix.HelixException: HelixManager (ZkClient) is not connected. Call HelixManager#connect()
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:387)
at org.apache.helix.manager.zk.ZKHelixManager.getHelixDataAccessor(ZKHelixManager.java:658)
at org.apache.helix.controller.GenericHelixController$RebalanceTask.run(GenericHelixController.java:288)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

END testQueueParallelJobs at Fri Aug 07 03:45:34 PDT 2020, took: 6457223ms.
11399885 [HelixController-pipeline-default-CLUSTER_TestEnqueueJobs-(8c59dfb2_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: controller_0 is not leader for CLUSTER_TestEnqueueJobs. Pipeline will not be invoked
TestEnqueueJobs has active threads cnt:1591
zookeeper server threads has 4 thread
Failure zkclient/zooKeeper session threads has 310 thread
0 thread:main-EventThread

Note that the test hung for 6,457 seconds, close to two hours.


Some analysis of why the _propertyStore appears to be the problem.

  // Add 4 jobs to the queue
  for (int i = 0; i < numberOfJobsAddedBeforeControllerSwitch; i++) {
    _driver.enqueueJob(queueName, "JOB" + i, jobBuilder);
  }
  // Wait until all of the enqueued jobs (Job0 to Job3) are finished
  for (int i = 0; i < numberOfJobsAddedBeforeControllerSwitch; i++) {
    _driver.pollForJobState(queueName, TaskUtil.getNamespacedJobName(queueName, "JOB" + i),
        TaskState.COMPLETED);   // <-- this call throws
  }
  // Stop the Controller
  _controller.syncStop();

The highlighted pollForJobState call throws:

HelixException: Workflow "testQueueParallelJobs" context is null or job "testQueueParallelJobs_JOB3"

Then, in @AfterClass:

  public void afterClass() throws Exception {
    if (_controller != null && _controller.isConnected()) {
      _controller.syncStop();
    }
    if (_manager != null && _manager.isConnected()) {
      _manager.disconnect();
    }
    stopParticipants();
    deleteCluster(CLUSTER_NAME);
  }

This tears down the controller and closes its ZkClient, but does not cancel the _onDemand timer, which is what caused the "ERROR org.apache.helix.controller.GenericHelixController - Time task failed" message above. The timer issue itself is benign and is fixed in another pull request.
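To illustrate why a leftover timer can still fire after the controller's ZkClient is closed: a java.util.Timer owns a non-daemon thread that keeps running until cancel() is called. A minimal sketch (the names below are hypothetical, not Helix code):

```java
import java.util.Timer;
import java.util.TimerTask;

// Sketch: a java.util.Timer thread is non-daemon by default and stays alive
// until cancel() is called. An uncancelled rebalance timer can therefore
// fire after its owning controller has already disconnected.
public class TimerLeakSketch {
  // Returns true if a live thread with the given name exists in this JVM.
  public static boolean timerThreadAlive(String name) {
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (t.getName().equals(name) && t.isAlive()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    Timer timer = new Timer("leaked-rebalance-timer"); // non-daemon thread starts now
    timer.schedule(new TimerTask() {
      @Override public void run() { /* pretend to rebalance */ }
    }, 0, 100);

    Thread.sleep(200);
    System.out.println("alive before cancel: " + timerThreadAlive("leaked-rebalance-timer"));

    timer.cancel(); // without this, the thread lingers past teardown
    Thread.sleep(300);
    System.out.println("alive after cancel:  " + timerThreadAlive("leaked-rebalance-timer"));
  }
}
```

The fix referenced above presumably amounts to cancelling the timer during controller shutdown, before the ZkClient is closed.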

public TaskState pollForJobState(String workflowName, String jobName, long timeout,
    TaskState... states) throws InterruptedException {
  // Get workflow config
...
  Set<TaskState> allowedStates = new HashSet<>(Arrays.asList(states));
  // Wait for state
  long st = System.currentTimeMillis();
  do {
    Thread.sleep(timeToSleep);
    ctx = getWorkflowContext(workflowName);   // <-- suspected to hang here
  } while ((ctx == null || ctx.getJobState(jobName) == null
      || !allowedStates.contains(ctx.getJobState(jobName)))
      && System.currentTimeMillis() < st + timeout);
  if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {
    throw new HelixException(
        String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s",
            workflowName, jobName, allowedStates));
  }
  return ctx.getJobState(jobName);
}

It must be that ctx = getWorkflowContext(workflowName) hangs; otherwise the loop would have hit its timeout much earlier.
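The key observation is that the timeout in a loop of this shape is only evaluated between reads; if the read itself blocks, the loop hangs with it and the HelixException is thrown only after the read finally returns. A minimal sketch of that loop structure (hypothetical names, not Helix code):

```java
import java.util.function.Supplier;

// Sketch of the pollForJobState loop shape: the timeout condition is only
// checked after each read returns, so a read that blocks indefinitely
// defeats the timeout entirely.
public class PollLoopSketch {
  /** Polls `read` until it returns non-null or timeoutMs elapses. */
  public static boolean pollUntil(Supplier<Object> read, long timeoutMs, long sleepMs)
      throws InterruptedException {
    long start = System.currentTimeMillis();
    Object value;
    do {
      Thread.sleep(sleepMs);
      value = read.get(); // if this call hangs, the loop hangs with it
    } while (value == null && System.currentTimeMillis() < start + timeoutMs);
    return value != null;
  }

  public static void main(String[] args) throws InterruptedException {
    // A read that always returns null: the loop exits via the timeout path.
    long start = System.currentTimeMillis();
    boolean found = pollUntil(() -> null, 300, 50);
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("found=" + found + ", elapsed ms=" + elapsed);
    // A read that blocks would instead hang here indefinitely, which matches
    // the 6,457-second hang observed in the log.
  }
}
```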

public WorkflowContext getWorkflowContext(String workflow) {
  return TaskUtil.getWorkflowContext(_propertyStore, workflow);
}

protected static WorkflowContext getWorkflowContext(HelixPropertyStore<ZNRecord> propertyStore,
    String workflow) {
  ZNRecord r = propertyStore.get(
      Joiner.on("/").join(TaskConstants.REBALANCER_CONTEXT_ROOT, workflow, CONTEXT_NODE), null,
      AccessOption.PERSISTENT);
  return r != null ? new WorkflowContext(r) : null;
}

So it seems that the _propertyStore read hangs; there is a bug there.
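Until the underlying hang is fixed, one possible way to keep a blocked property-store read from stalling a test forever is to run it on a worker thread and bound it with Future.get(timeout). This is only a mitigation sketch under that assumption, not the actual Helix fix:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: bound a potentially-blocking read with a hard timeout so a hung
// client cannot stall the caller indefinitely.
public class BoundedReadSketch {
  public static <T> T readWithTimeout(Callable<T> read, long timeoutMs)
      throws TimeoutException, InterruptedException {
    ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-read");
      t.setDaemon(true); // a hung read must not keep the JVM alive
      return t;
    });
    try {
      Future<T> f = exec.submit(read);
      try {
        return f.get(timeoutMs, TimeUnit.MILLISECONDS);
      } catch (ExecutionException e) {
        throw new RuntimeException(e.getCause());
      }
    } finally {
      exec.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // A fast read returns normally; a read that outlives the bound times out.
    System.out.println(readWithTimeout(() -> "ctx", 500));
    try {
      readWithTimeout(() -> { Thread.sleep(1000); return "never"; }, 200);
    } catch (TimeoutException e) {
      System.out.println("read timed out as expected");
    }
  }
}
```

The trade-off is an extra thread per read; in a test utility like pollForJobState that overhead is acceptable in exchange for a bounded failure instead of a two-hour hang.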
