Skip to content

Conversation

@ivoson
Copy link
Contributor

@ivoson ivoson commented Nov 14, 2025

What changes were proposed in this pull request?

Currently, worker will schedule tasks forwarding SendHeartbeat and WorkDirCleanup while handleRegisterResponse.

While worker registration could happen multiple times in case of heartbeat timeout/disconnected from master, in these cases the tasks would be scheduled multiple times.

To fix the issue:

  • Adding heartbeatTask and workDirCleanupTask in worker to tell whether these tasks have been scheduled
  • heartbeatTask and workDirCleanupTask will be initialized after the 1st registration, and then skipped scheduling these tasks in later registration.
  • Cancel the task and reset heartbeatTask and workDirCleanupTask when worker stops.

Why are the changes needed?

Fix the issue repeatedly scheduling SendHeartbeat/WorkDirClean tasks after worker registration.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT added

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Nov 14, 2025
@ivoson ivoson marked this pull request as ready for review November 14, 2025 02:08
@ivoson
Copy link
Contributor Author

ivoson commented Nov 14, 2025

cc @Ngone51 @LuciferYang @dongjoon-hyun can you please take a look? Thanks

Copy link
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! LGTM.


private var registerMasterFutures: Array[JFuture[_]] = null
private var registrationRetryTimer: Option[JScheduledFuture[_]] = None
private[worker] var heartbeatTask: Option[JScheduledFuture[_]] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The identifier marked as [work] seems to serve the purpose of merely being callable within test cases, right? Given that the current WorkerSuite already has with PrivateMethodTester, can we adopt the approach of using invokePrivate for testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

cleanupThreadExecutor.shutdownNow()
metricsSystem.report()
cancelLastRegistrationRetry()
heartbeatTask.foreach(_.cancel(true))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handleRegisterResponse is a synchronized code block. Don't the operations on heartbeatTask and workDirCleanupTask within onStop also require synchronized protection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The synchronized block was introduced by #9138 to avoid some race conditions in very early implementation with some async call back...

Looks like not be necessary now...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worker is a ThreadSafeRpcEndpoint already. The synchronized protection seems to be unnecessary today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so can we remove that unnecessary synchronized in a separate pr ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create a separate task to revisit the synchronized usage here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants