-
Notifications
You must be signed in to change notification settings - Fork 244
Description
Log:
[ERROR] testQueueParallelJobs(org.apache.helix.integration.task.TestEnqueueJobs) Time elapsed: 6,457.223 s <<< FAILURE!
org.apache.helix.HelixException: Workflow "testQueueParallelJobs" context is null or job "testQueueParallelJobs_JOB3" is not in states: [COMPLETED]
at org.apache.helix.integration.task.TestEnqueueJobs.testQueueParallelJobs(TestEnqueueJobs.java:145)
---------- Test Class TestWorkflowAndJobPoll thread leakage detected! ---------------
START testJobQueueAddingJobsAtSametime at Fri Aug 07 01:20:25 PDT 2020
END testJobQueueAddingJobsAtSametime at Fri Aug 07 01:20:27 PDT 2020, took: 2487ms.
START testJobQueueAddingJobsOneByOne at Fri Aug 07 01:20:27 PDT 2020
END testJobQueueAddingJobsOneByOne at Fri Aug 07 01:20:33 PDT 2020, took: 5983ms.
START testJobSubmitGenericWorkflows at Fri Aug 07 01:20:33 PDT 2020
END testJobSubmitGenericWorkflows at Fri Aug 07 01:20:37 PDT 2020, took: 4011ms.
START testQueueJobsMaxCapacity at Fri Aug 07 01:20:37 PDT 2020
END testQueueJobsMaxCapacity at Fri Aug 07 01:57:57 PDT 2020, took: 2240047ms.
START testQueueParallelJobs at Fri Aug 07 01:57:57 PDT 2020
4980696 [Timer-594] ERROR org.apache.helix.controller.GenericHelixController - Time task failed. Rebalance task type: OnDemandRebalance, cluster: CLUSTER_TestTaskStage
org.apache.helix.HelixException: HelixManager (ZkClient) is not connected. Call HelixManager#connect()
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:387)
at org.apache.helix.manager.zk.ZKHelixManager.getHelixDataAccessor(ZKHelixManager.java:658)
at org.apache.helix.controller.GenericHelixController$RebalanceTask.run(GenericHelixController.java:288)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
END testQueueParallelJobs at Fri Aug 07 03:45:34 PDT 2020, took: 6457223ms.
11399885 [HelixController-pipeline-default-CLUSTER_TestEnqueueJobs-(8c59dfb2_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: controller_0 is not leader for CLUSTER_TestEnqueueJobs. Pipeline will not be invoked
TestEnqueueJobs has active threads cnt:1591
zookeeper server threads has 4 thread
Failure zkclient/zooKeeper session threads has 310 thread
0 thread:main-EventThread
Note, hanging 6457 second, close to 2 hours.
Some analysis as why it seems propertyStore has issue.
// Add 4 jobs to the queue
for (int i = 0; i < numberOfJobsAddedBeforeControllerSwitch; i++) {
_driver.enqueueJob(queueName, "JOB" + i, jobBuilder);
}
// Wait until all of the enqueued jobs (Job0 to Job3) are finished
for (int i = 0; i < numberOfJobsAddedBeforeControllerSwitch; i++) {
**_driver.pollForJobState(queueName, TaskUtil.getNamespacedJobName(queueName, "JOB" + i),
TaskState.COMPLETED);**
}
// Stop the Controller
_controller.syncStop();
The highlight line throws exception
HelixException: Workflow "testQueueParallelJobs" context is null or job "testQueueParallelJobs_JOB3"
Then @afterclass
public void afterClass() throws Exception {
if (_controller != null && _controller.isConnected()) {
_controller.syncStop();
}
if (_manager != null && _manager.isConnected()) {
_manager.disconnect();
}
stopParticipants();
deleteCluster(CLUSTER_NAME);
}
tearing down controller and close its zkclient, but not the timer _onDemand which caused the ERROR org.apache.helix.controller.GenericHelixController - Time task failed.
This timer issue is fine. Fixed in another pull request.
public TaskState pollForJobState(String workflowName, String jobName, long timeout,
TaskState... states) throws InterruptedException {
// Get workflow config
...
Set<TaskState> allowedStates = new HashSet<>(Arrays.asList(states));
// Wait for state
long st = System.currentTimeMillis();
do {
Thread.sleep(timeToSleep);
**ctx = getWorkflowContext(workflowName);
} while ((ctx == null || ctx.getJobState(jobName) == null
|| !allowedStates.contains(ctx.getJobState(jobName)))
&& System.currentTimeMillis() < st + timeout);
if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {
throw new HelixException(
String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s",
workflowName, jobName, allowedStates));
}
return ctx.getJobState(jobName);
}
It must be ctx = getWorkflowContext(workflowName) hangs or it would timeout a lot eariler.
public WorkflowContext getWorkflowContext(String workflow) {
return TaskUtil.getWorkflowContext(_propertyStore, workflow);
}
protected static WorkflowContext getWorkflowContext(HelixPropertyStore<ZNRecord> propertyStore,
String workflow) {
ZNRecord r = propertyStore.get(
Joiner.on("/").join(TaskConstants.REBALANCER_CONTEXT_ROOT, workflow, CONTEXT_NODE), null,
AccessOption.PERSISTENT);
return r != null ? new WorkflowContext(r) : null;
}
So it seems that _propertyStore hangs. There is bug over there.