[CELEBORN-1829] Replace waitThreadPoll's thread pool with ScheduledExecutorService in Controller#3059
[CELEBORN-1829] Replace waitThreadPoll's thread pool with ScheduledExecutorService in Controller#3059zaynt4606 wants to merge 20 commits intoapache:mainfrom
Conversation
| shuffleKey, | ||
| JavaUtils.newConcurrentHashMap[Long, (Int, RpcCallContext)]()) | ||
| val epochWaitTimeMap = shuffleCommitTime.get(shuffleKey) | ||
| epochWaitTimeMap.synchronized { |
There was a problem hiding this comment.
deadlock exists when lock both commitInfo and epochWaitTimeMap
There was a problem hiding this comment.
Done~
Read and write security of epoch in epochWaitTimeMap is guaranteed by commitInfo's lock so the lock of epochWaitTimeMap is needless.
| } | ||
| } | ||
|
|
||
| private def checkCommitTimeout(shuffleCommitTime: ConcurrentHashMap[ |
There was a problem hiding this comment.
AddUt for checkCommitTimeout
| String, | ||
| ConcurrentHashMap[Long, (Int, RpcCallContext)]]): Unit = { | ||
| val delta = 100 | ||
| val shuffleCommitTimeout = conf.workerShuffleCommitTimeout |
There was a problem hiding this comment.
conf.workerShuffleCommitTimeout can be used as property of the class
| val commitInfo = shuffleCommitInfos.get(shuffleKey).get(epoch) | ||
| commitInfo.synchronized { | ||
| if (waitTime * delta < shuffleCommitTimeout) { | ||
| if (commitInfo.status == CommitInfo.COMMIT_FINISHED) { |
There was a problem hiding this comment.
COMMIT_FINISHED check can be move ahead of timeout check.
Once the commit is complete, it can return even if it has timed out
| var shuffleMapperAttempts: ConcurrentHashMap[String, AtomicIntegerArray] = _ | ||
| // shuffleKey -> (epoch -> CommitInfo) | ||
| var shuffleCommitInfos: ConcurrentHashMap[String, ConcurrentHashMap[Long, CommitInfo]] = _ | ||
| var shuffleCommitTime: ConcurrentHashMap[String, ConcurrentHashMap[Long, (Int, RpcCallContext)]] = |
There was a problem hiding this comment.
pls add description for shuffleCommitTime
| shuffleKey, | ||
| JavaUtils.newConcurrentHashMap[Long, (Int, RpcCallContext)]()) | ||
| val epochWaitTimeMap = shuffleCommitTime.get(shuffleKey) | ||
| epochWaitTimeMap.putIfAbsent(epoch, (0, context)) |
There was a problem hiding this comment.
Maybe there is already a context here.
There was a problem hiding this comment.
context is a parameter passed to handleCommitFiles and can not used by non handleCommitFiles internal functions
| shuffleCommitTime.asScala.foreach { | ||
| case (shuffleKey, epochWaitTimeMap) => | ||
| epochWaitTimeMap.asScala.foreach { case (epoch, (waitTime, context)) => | ||
| val commitInfo = shuffleCommitInfos.get(shuffleKey).get(epoch) |
There was a problem hiding this comment.
should check shuffleCommitInfos.get(shuffleKey) is null or not
| epochWaitTimeMap.remove(epoch) | ||
| } else { | ||
| if (waitTime * delta < shuffleCommitTimeout) { | ||
| shuffleCommitTime.get(shuffleKey).put(epoch, (waitTime + 1, context)) |
There was a problem hiding this comment.
waitTime -> use startTimestamp
There was a problem hiding this comment.
has been updated~
| commitInfo.response.failedReplicaIds) | ||
| shuffleCommitInfos.get(shuffleKey).put( | ||
| epoch, | ||
| new CommitInfo(replyResponse, CommitInfo.COMMIT_FINISHED)) |
There was a problem hiding this comment.
You should not create a commit info here because you have synchronized operations on the commitInfo object. I think you can change the value of this commit info object.
However, this won't be a trouble here because the epoch will not be replicated.
There was a problem hiding this comment.
Yes! Modifying commitInfo looks more elegant.
Has been updated.
|
Thanks. merge to main(v0.6.0) |
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Cluster test & UT.