Investigate and fix TestEnqueueJobs #1235

@kaisun2000

Description

Log:

[ERROR] testQueueParallelJobs(org.apache.helix.integration.task.TestEnqueueJobs) Time elapsed: 6,457.223 s <<< FAILURE!
org.apache.helix.HelixException: Workflow "testQueueParallelJobs" context is null or job "testQueueParallelJobs_JOB3" is not in states: [COMPLETED]
at org.apache.helix.integration.task.TestEnqueueJobs.testQueueParallelJobs(TestEnqueueJobs.java:145)

---------- Test Class TestWorkflowAndJobPoll thread leakage detected! ---------------
START testJobQueueAddingJobsAtSametime at Fri Aug 07 01:20:25 PDT 2020
END testJobQueueAddingJobsAtSametime at Fri Aug 07 01:20:27 PDT 2020, took: 2487ms.
START testJobQueueAddingJobsOneByOne at Fri Aug 07 01:20:27 PDT 2020
END testJobQueueAddingJobsOneByOne at Fri Aug 07 01:20:33 PDT 2020, took: 5983ms.
START testJobSubmitGenericWorkflows at Fri Aug 07 01:20:33 PDT 2020
END testJobSubmitGenericWorkflows at Fri Aug 07 01:20:37 PDT 2020, took: 4011ms.
START testQueueJobsMaxCapacity at Fri Aug 07 01:20:37 PDT 2020
END testQueueJobsMaxCapacity at Fri Aug 07 01:57:57 PDT 2020, took: 2240047ms.
START testQueueParallelJobs at Fri Aug 07 01:57:57 PDT 2020
4980696 [Timer-594] ERROR org.apache.helix.controller.GenericHelixController - Time task failed. Rebalance task type: OnDemandRebalance, cluster: CLUSTER_TestTaskStage
org.apache.helix.HelixException: HelixManager (ZkClient) is not connected. Call HelixManager#connect()
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:387)
at org.apache.helix.manager.zk.ZKHelixManager.getHelixDataAccessor(ZKHelixManager.java:658)
at org.apache.helix.controller.GenericHelixController$RebalanceTask.run(GenericHelixController.java:288)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

END testQueueParallelJobs at Fri Aug 07 03:45:34 PDT 2020, took: 6457223ms.
11399885 [HelixController-pipeline-default-CLUSTER_TestEnqueueJobs-(8c59dfb2_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: controller_0 is not leader for CLUSTER_TestEnqueueJobs. Pipeline will not be invoked
TestEnqueueJobs has active threads cnt:1591
zookeeper server threads has 4 thread
Failure zkclient/zooKeeper session threads has 310 thread
0 thread:main-EventThread

Note that the test hung for 6,457 seconds, close to two hours.


Some analysis of why the _propertyStore appears to be the problem.

  // Add 4 jobs to the queue
  for (int i = 0; i < numberOfJobsAddedBeforeControllerSwitch; i++) {
    _driver.enqueueJob(queueName, "JOB" + i, jobBuilder);
  }
  // Wait until all of the enqueued jobs (Job0 to Job3) are finished
  for (int i = 0; i < numberOfJobsAddedBeforeControllerSwitch; i++) {
    _driver.pollForJobState(queueName, TaskUtil.getNamespacedJobName(queueName, "JOB" + i),
        TaskState.COMPLETED);   // <-- this call throws
  }
  // Stop the Controller
  _controller.syncStop();

The highlighted pollForJobState call throws:

HelixException: Workflow "testQueueParallelJobs" context is null or job "testQueueParallelJobs_JOB3"

Then, in @AfterClass:

  public void afterClass() throws Exception {
    if (_controller != null && _controller.isConnected()) {
      _controller.syncStop();
    }
    if (_manager != null && _manager.isConnected()) {
      _manager.disconnect();
    }
    stopParticipants();
    deleteCluster(CLUSTER_NAME);
  }

This tears down the controller and closes its ZkClient, but does not cancel the _onDemand timer, which is what caused the "ERROR org.apache.helix.controller.GenericHelixController - Time task failed" message above. The timer issue itself is benign and is fixed in another pull request.
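To illustrate why a leftover timer can still fire after the controller's ZkClient is closed: a java.util.Timer owns a non-daemon thread that keeps running until cancel() is called. A minimal sketch (the names below are hypothetical, not Helix code):

```java
import java.util.Timer;
import java.util.TimerTask;

// Sketch: a java.util.Timer thread is non-daemon by default and stays alive
// until cancel() is called. An uncancelled rebalance timer can therefore
// fire after its owning controller has already disconnected.
public class TimerLeakSketch {
  // Returns true if a live thread with the given name exists in this JVM.
  public static boolean timerThreadAlive(String name) {
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (t.getName().equals(name) && t.isAlive()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    Timer timer = new Timer("leaked-rebalance-timer"); // non-daemon thread starts now
    timer.schedule(new TimerTask() {
      @Override public void run() { /* pretend to rebalance */ }
    }, 0, 100);

    Thread.sleep(200);
    System.out.println("alive before cancel: " + timerThreadAlive("leaked-rebalance-timer"));

    timer.cancel(); // without this, the thread lingers past teardown
    Thread.sleep(300);
    System.out.println("alive after cancel:  " + timerThreadAlive("leaked-rebalance-timer"));
  }
}
```

The fix referenced above presumably amounts to cancelling the timer during controller shutdown, before the ZkClient is closed.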

public TaskState pollForJobState(String workflowName, String jobName, long timeout,
    TaskState... states) throws InterruptedException {
  // Get workflow config
...
  Set<TaskState> allowedStates = new HashSet<>(Arrays.asList(states));
  // Wait for state
  long st = System.currentTimeMillis();
  do {
    Thread.sleep(timeToSleep);
    ctx = getWorkflowContext(workflowName);   // <-- suspected to hang here
  } while ((ctx == null || ctx.getJobState(jobName) == null
      || !allowedStates.contains(ctx.getJobState(jobName)))
      && System.currentTimeMillis() < st + timeout);
  if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {
    throw new HelixException(
        String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s",
            workflowName, jobName, allowedStates));
  }
  return ctx.getJobState(jobName);
}

It must be that ctx = getWorkflowContext(workflowName) hangs; otherwise the loop would have hit its timeout much earlier.
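The key observation is that the timeout in a loop of this shape is only evaluated between reads; if the read itself blocks, the loop hangs with it and the HelixException is thrown only after the read finally returns. A minimal sketch of that loop structure (hypothetical names, not Helix code):

```java
import java.util.function.Supplier;

// Sketch of the pollForJobState loop shape: the timeout condition is only
// checked after each read returns, so a read that blocks indefinitely
// defeats the timeout entirely.
public class PollLoopSketch {
  /** Polls `read` until it returns non-null or timeoutMs elapses. */
  public static boolean pollUntil(Supplier<Object> read, long timeoutMs, long sleepMs)
      throws InterruptedException {
    long start = System.currentTimeMillis();
    Object value;
    do {
      Thread.sleep(sleepMs);
      value = read.get(); // if this call hangs, the loop hangs with it
    } while (value == null && System.currentTimeMillis() < start + timeoutMs);
    return value != null;
  }

  public static void main(String[] args) throws InterruptedException {
    // A read that always returns null: the loop exits via the timeout path.
    long start = System.currentTimeMillis();
    boolean found = pollUntil(() -> null, 300, 50);
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("found=" + found + ", elapsed ms=" + elapsed);
    // A read that blocks would instead hang here indefinitely, which matches
    // the 6,457-second hang observed in the log.
  }
}
```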

public WorkflowContext getWorkflowContext(String workflow) {
  return TaskUtil.getWorkflowContext(_propertyStore, workflow);
}

protected static WorkflowContext getWorkflowContext(HelixPropertyStore<ZNRecord> propertyStore,
    String workflow) {
  ZNRecord r = propertyStore.get(
      Joiner.on("/").join(TaskConstants.REBALANCER_CONTEXT_ROOT, workflow, CONTEXT_NODE), null,
      AccessOption.PERSISTENT);
  return r != null ? new WorkflowContext(r) : null;
}

So it seems that the _propertyStore read hangs; there is a bug there.
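Until the underlying hang is fixed, one possible way to keep a blocked property-store read from stalling a test forever is to run it on a worker thread and bound it with Future.get(timeout). This is only a mitigation sketch under that assumption, not the actual Helix fix:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: bound a potentially-blocking read with a hard timeout so a hung
// client cannot stall the caller indefinitely.
public class BoundedReadSketch {
  public static <T> T readWithTimeout(Callable<T> read, long timeoutMs)
      throws TimeoutException, InterruptedException {
    ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-read");
      t.setDaemon(true); // a hung read must not keep the JVM alive
      return t;
    });
    try {
      Future<T> f = exec.submit(read);
      try {
        return f.get(timeoutMs, TimeUnit.MILLISECONDS);
      } catch (ExecutionException e) {
        throw new RuntimeException(e.getCause());
      }
    } finally {
      exec.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // A fast read returns normally; a read that outlives the bound times out.
    System.out.println(readWithTimeout(() -> "ctx", 500));
    try {
      readWithTimeout(() -> { Thread.sleep(1000); return "never"; }, 200);
    } catch (TimeoutException e) {
      System.out.println("read timed out as expected");
    }
  }
}
```

The trade-off is an extra thread per read; in a test utility like pollForJobState that overhead is acceptable in exchange for a bounded failure instead of a two-hour hang.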
