Avoid reducer file group load convoys#3687
Closed
sunchao wants to merge 204 commits into
Closed
Conversation
…obin mode ### What changes were proposed in this pull request? SlotsAllocator should select disks randomly in RoundRobin mode ### Why are the changes needed? The current round robin selection mechanism is to select the first disk of each worker first, then the second disk of each worker, and finally the third disk. This can easily cause disk storage space skew. We should select disks randomly instead of selecting the first disk first. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes apache#3275 from leixm/CELEBORN-2008. Authored-by: Xianming Lei <xianming.lei@shopee.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit d2befe0) Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request? ### Why are the changes needed? Driver may have a large number of `PartitionLocation` objects, reducing some unnecessary fields of `PartitionLocation` can reduce the memory pressure of Driver. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes apache#3274 from cxzl25/CELEBORN-2007. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit 634343e) Signed-off-by: Wang, Fei <fwang12@ebay.com>
…nType every time ### What changes were proposed in this pull request? ### Why are the changes needed? `org.apache.celeborn.client.LifecycleManager.getPartitionType` may be called frequently, but in Spark scenario it requires each parsing configuration, which is not necessary. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes apache#3273 from cxzl25/CELEBORN-2006. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit a554261) Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request? Add release guide and fix several issues during 0.6.0 release. ### Why are the changes needed? Add docs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested locally. Closes apache#3271 from turboFei/release_guide. Authored-by: Fei Wang <fwang12@ebay.com> Signed-off-by: Fei Wang <fwang12@ebay.com> (cherry picked from commit 81c3d91) Signed-off-by: Fei Wang <fwang12@ebay.com>
### What changes were proposed in this pull request? Use `tmp` subfolder for svc staging dir. ### Why are the changes needed? Refer: https://github.com/apache/celeborn/blob/81c3d91f75e0ffcf3e347748adf4017a9e560d62/build/release/release.sh#L67 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Local. Closes apache#3278 from turboFei/release_guide_follow. Authored-by: Fei Wang <fwang12@ebay.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit 637c423) Signed-off-by: Wang, Fei <fwang12@ebay.com>
…ng release notes ### What changes were proposed in this pull request? Add release utils tool. Copied from: https://github.com/apache/kyuubi/blob/master/build/release/pre_gen_release_notes.py https://github.com/apache/kyuubi/blob/master/build/release/release_utils.py ### Why are the changes needed? To reduce the release efforts ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` RELEASE_TAG=v0.6.0-rc0 PREVIOUS_RELEASE_TAG=v0.5.0 build/release/pre_gen_release_notes.py ``` ``` (base) ➜ celeborn-p2 git:(release_utils) RELEASE_TAG=v0.6.0-rc0 PREVIOUS_RELEASE_TAG=v0.5.0 build/release/pre_gen_release_notes.py Gathering new commits between tags v0.5.0 and v0.6.0-rc0 ================================================================================== Release tag: v0.6.0-rc0 Previous release tag: v0.5.0 Number of commits in this range: 535 Show all commits? [y/n]: y 2d3c484 Wang, Fei [RELEASE] Bump 0.6.0 f7be341 Jinqian Fan [CELEBORN-1902] Read client throws PartitionConnectionException (Closes apache#3147) 2a847ba Wang, Fei [MINOR] Change some config version (Closes apache#3269) ... 142d0a0 mingji Bump 0.6.0-SNAPSHOT ================================================================================== Does this look correct? [y/n]: y ================================================================================== Found 1 release commits Found 2 revert commits Found 19 commits with no Ticket ==================== Warning: these commits will be ignored ====================== Release (1) 2d3c484 Wang, Fei [RELEASE] Bump 0.6.0 Revert (2) c316fdb zaynt4606 Revert "[CELEBORN-1376] Push data failed should always release request body" (Closes apache#2992) (Reverts b65b543) 8d0b4cf waitinfuture [CELEBORN-1506][BUG] Revert "[CELEBORN-1036][FOLLOWUP] totalInflightReqs should decrement when batchIdSet contains the batchId to avoid duplicate caller of removeBatch" (Closes apache#2621) No Ticket (19) 2a847ba Wang, Fei [MINOR] Change some config version (Closes apache#3269) 54732c7 Nicolas Fraison Update celeborn conf to add S3 in default and doc for policy (Closes apache#3218) a063622 Cheng Pan [MINOR][INFRA] Do not cancel GHA jobs on committing to main/branch-* branches (Closes apache#3235) 529fd6e cxzl25 [MINOR] Avoid use `_$eq` in Scala file (Closes apache#3208) dfeaef1 Cheng Pan [MINOR] Add spec link to JavaSerializer (Closes apache#3194) 05b6ad4 Sanskar Modi [MINOR] Change config versions (Closes apache#3142) 6f5ad2d Wang, Fei [MINOR] Refine the log for fetch failure and rpc metrics dump (Closes apache#3136) b9e4bbb cxzl25 [MINOR] Change some config version (Closes apache#3082) 4ccb0c7 SteNicholas [MINOR] Rename org.apache.celeborn.plugin.flink.readclient to org.apache.celeborn.plugin.flink.client (Closes apache#3048) 8052321 Sanskar Modi [MINOR] Add documentation for `CELEBORN_NO_DAEMONIZE` (Closes apache#3020) 43e1b8a FMX [MINOR] Update DingTalk group link (Closes apache#2948) 7fbf0e2 Wang, Fei [MINOR] Fix missing blanks in docs (Closes apache#2917) 71e3c03 Wang, Fei [MINOR] Fix docs typo (Closes apache#2890) d44b23c Weijie Guo [MINOR] Remove unused TODO comments in CelebornTierProducerAgent#processBuffer (Closes apache#2883) 7018996 jiang13021 [MINOR] Fix typo in ExceptionUtils (Closes apache#2841) 8bd5ac0 SteNicholas [MINOR] Add navigation for REST API document (Closes apache#2775) 3cc043a SteNicholas [MINOR] Delete DEPLOY_ON_K8S.md (Closes apache#2752) f226424 Bowen Liang [CLEBORN-1555] Replace deprecated config celeborn.storage.activeTypes in docs and tests (Closes apache#2675) 142d0a0 mingji Bump 0.6.0-SNAPSHOT ==================== Warning: the above commits will be ignored ================== 513 effective commits left to process after filtering. OK to proceed? [y/n]: y =========================== Compiling contributor list =========================== Processed commit f7be341 authored by Jinqian Fan on Wed May 21 16:58:30 2025 -0700 Processed commit 082f0dd authored by Sanskar Modi on Wed May 21 16:37:38 2025 -0700 Processed commit 45b94bf authored by Yi Chen on Wed May 21 01:21:45 2025 -0700 ... Processed commit 450dac8 authored by Nicholas Jiang on Mon Jun 3 17:47:01 2024 +0800 ================================================================================== Commits list is successfully written to commits-v0.6.0-rc0.txt! Contributors list is successfully written to contributors-v0.6.0-rc0.txt! ============ Warnings encountered while creating the contributor list ============ Found the following invalid authors: avishnus Madhukar525722 xy2953396112 Please update 'known_translations'. Please correct these in the final contributors list at contributors-v0.6.0-rc0.txt. ================================================================================== ``` ``` cat build/release/contributors-v0.6.0-rc0.txt * Tao Zheng * Amandeep Singh * Ziyi Wu * Sanskar Modi * Yuting Wang * Cheng Pan * Aravind Patnam * Zhao Zhao * Saurabh Dubey * xy2953396112 * Bowen Liang * Shlomi Uubul * Erik Fang * Yi Chen * Leo Li * Nicolas Fraison * Jiashu Xiong * Pengqi Li * Jiaming Xie * Keyong Zhou * Jinqian Fan * Guangwei Hong * Yi Zhu * Madhukar525722 * Jianfu Li * Chongchen Chen * Biao Geng * Lianne Li * Fei Wang * Mridul Muralidharan * Wang, Fei * avishnus * Xu Huang * Weijie Guo * Xinyu Wang * Yajun Gao * He Zhao * Björn Boschman * Shaoyun Chen * Kerwin Zhang * Kun Wan * Zhengqi Zhang * Minchu Yang * Haotian Cao * Xianming Lei * Shengjie Wang * Veli Yang * Arsen Gumin * Mingxiao Feng * Yuxin Tan * Aidar Bariev * Nan Zhu * Fu Chen * Binjie Yang * Yanze Jiang * Nicholas Jiang ``` ``` cat build/release/commits-v0.6.0-rc0.txt|wc -l 532 ``` Closes apache#3280 from turboFei/release_utils. Authored-by: Fei Wang <fwang12@ebay.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit 48fb71e) Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request? Add license for http5. https://github.com/apache/celeborn/blob/637c42338eab27adc21b8a266deac46bfe9d167d/dev/deps/dependencies-server#L36-L38 ### Why are the changes needed? Fix license. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual Closes apache#3281 from turboFei/license. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit d65ff56) Signed-off-by: Wang, Fei <fwang12@ebay.com>
… spark-3.5 profile to 2.13.8 ### What changes were proposed in this pull request? Upgrade scala binary version of `spark-3.3`, `spark-3.4`, `spark-3.5` profile to 2.13.8. ### Why are the changes needed? The scala binary version of `spark-3.3`, `spark-3.4`, `spark-3.5` profile is 2.13.8 for scala 2.13. - https://github.com/apache/spark/blob/v3.3.4/pom.xml#L3548 - https://github.com/apache/spark/blob/v3.4.4/pom.xml#L3646 - https://github.com/apache/spark/blob/v3.5.5/pom.xml#L3681 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI. Closes apache#3284 from SteNicholas/CELEBORN-2013. Authored-by: 子懿 <ziyi.jxf@antgroup.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit 0f89d3d) Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request? ### Why are the changes needed? Config `celeborn.master.slot.assign.loadAware.fetchTimeWeight` default value is 1, and slotsallocation document is configured as 0. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes apache#3287 from cxzl25/minor_doc_slot. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit 14d7212) Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request? - Support retry on IOException failures for RpcRequest in addition with RpcTimeoutException. - Moved duplicate code to Utils ### Why are the changes needed? Currently if a request fails with SocketException or IOException it does not get retried which leads to stage failures. Celeborn should retry on such connection failures. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA Closes apache#3286 from s0nskar/setup_lifecycle_exception. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit 612464c) Signed-off-by: Wang, Fei <fwang12@ebay.com>
…orage backend ### What changes were proposed in this pull request? - Add support for configuring volumeClaimTemplates by adding new values `master.volumeClaimTemplates` and `worker.volumeClaimTemplates`. ### Why are the changes needed? Use volume claim template to support various storage backend. ### Does this PR introduce _any_ user-facing change? Yes. New Helm values `master.volumeClaimTemplates` and `worker.volumeClaimTemplates` are added. ### How was this patch tested? ```bash helm unittest charts/celeborn --file "tests/**/*_test.yaml" --strict --debug ``` Closes apache#3277 from ChenYi015/helm/volume-claim-templates. Authored-by: Yi Chen <github@chenyicn.net> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit c83d498) Signed-off-by: Wang, Fei <fwang12@ebay.com>
… LifecycleManager ### What changes were proposed in this pull request? Exclude worker in lifecycle manager if the commit files request on workers fails with `COMMIT_FILE_EXCEPTION` or after multiple retries. ### Why are the changes needed? If worker is under high load and not able to process request because of high CPU, we should exclude it so it will not affect the next retry to shuffle stage. Internally, we are seeing commit file futures in worker under high load are getting timed out and next retry of the stage is again picking same servers and failing. Similarly, we are seeing continuous RpcTimeout for workers but those workers are again getting selected for next retry. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? NA Closes apache#3276 from s0nskar/worker_exlude_on_commit_exception. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit aeac31f) Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request? Bump spark 4.0 version to 4.0.0. ### Why are the changes needed? Spark 4.0.0 is ready. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes apache#3282 from turboFei/spark_4.0. Lead-authored-by: Fei Wang <fwang12@ebay.com> Co-authored-by: Wang, Fei <fwang12@ebay.com> Co-authored-by: Fei Wang <cn.feiwang@gmail.com> Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request? Support dependencies of `spark-4.0` profile. Follow up apache#3282. ### Why are the changes needed? apache#3282 is lack of dependencies support of `spark-4.0` profile. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Dependencies check: maven-jdk17 (spark-4.0). Closes apache#3298 from SteNicholas/CELEBORN-1413. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: SteNicholas <programgeek@163.com> (cherry picked from commit c79e02f) Signed-off-by: SteNicholas <programgeek@163.com>
…s the metrics dashboard ### What changes were proposed in this pull request? Revert role name change in [CELEBORN-1627](apache#2777) ### Why are the changes needed? Fix the issue where the case of name affects the metrics dashboard ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual Closes apache#3299 from RexXiong/CELEBORN-1627-FOLLOWUP. Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit 0227a1a) Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request? Support http authentication for Celeborn CLI. ### Why are the changes needed? Current CLI does not work if the authentication is enabled for master or worker. ### Does this PR introduce _any_ user-facing change? Yes, a new option. ### How was this patch tested? UT. Closes apache#3300 from turboFei/cli_auth. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit 5f58fb1) Signed-off-by: Wang, Fei <fwang12@ebay.com>
…d, numBytesOutPerSecond metrics for RemoteShuffleServiceFactory ### What changes were proposed in this pull request? Introduce `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics for `RemoteShuffleServiceFactory`. Scope | Infix | Metrics | Description | Type -- | -- | -- | -- | -- Task | Shuffle.Remote.[ShuffleId] | numBytesIn | The total number of bytes this shuffle has read. | Counter | Task | Shuffle.Remote.[ShuffleId] | numBytesOut | The total number of bytes this shuffle has written. | Counter | Task | Shuffle.Remote.[ShuffleId] | numRecordsOut | The total number of records this shuffle has written. | Counter | Task | Shuffle.Remote.[ShuffleId] | numBytesInPerSecond | The number of bytes this shuffle reads per second. | Meter | Task | Shuffle.Remote.[ShuffleId] | numBytesOutPerSecond | The number of bytes this shuffle writes per second. | Meter | Task | Shuffle.Remote.[ShuffleId] | numRecordsOutPerSecond | The number of records this shuffle writes per second. | Meter | Note: - `numBytesIn` and `numBytesOut` metrics include the total number of bytes for records and events. - `numRecordsOut` metric only includes the total number of records, instead of records and events. ### Why are the changes needed? There is no any metrics related to shuffle read operations and operations writing shuffle data for flink shuffle. It's proposed to introduce `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics for `RemoteShuffleServiceFactory`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - `RemoteShuffleOutputGateSuiteJ#testSimpleWriteData` - `RemoteShuffleResultPartitionSuiteJ` Closes apache#3272 from SteNicholas/CELEBORN-2005. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
…, numBytesOut, numRecordsOut, numBytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics ### What changes were proposed in this pull request? Introduce `ShuffleMetricGroup` for `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics. Follow up apache#3272. ### Why are the changes needed? `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics should put shuffle id into variables, which could introduce `ShuffleMetricGroup` to support. Meanwhile, apache#3272 would print many same logs as follows that shoud be improved: ``` 2025-05-28 10:48:54,433 WARN [flink-akka.actor.default-dispatcher-18] org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numRecordsOut'. Metric will not be reported.[11.66.62.202, taskmanager, antc4flink3980005426-taskmanager-3-70, antc4flink3980005426, [vertex-2]HashJoin(joinType=[LeftOuterJoin], where=[(f0 = f00)], select=[f0, f1, f2, f3, f4, f5, f6, f00, f10, f20, f30, f40, f50, f60], build=[right]) -> Sink: Sink(table=[default_catalog.default_database.sink], fields=[f0, f1, f2, f3, f4, f5, f6, f00, f10, f20, f30, f40, f50, f60]), 2, Shuffle, Remote, 1] ``` ### Does this PR introduce _any_ user-facing change? Introduce `celeborn.client.flink.metrics.scope.shuffle` config option to define the scope format string that is applied to all metrics scoped to a shuffle: - Variables: - Shuffle: `<task_id>, <task_name>, <task_attempt_id>, <task_attempt_num>, <subtask_index>, <shuffle_id>`. - Metrics: Scope | Metrics | Description | Type -- | -- | -- | -- Shuffle | numBytesIn | The total number of bytes this shuffle has read. | Counter | Shuffle | numBytesOut| The total number of bytes this shuffle has written. | Counter | Shuffle | numRecordsOut | The total number of records this shuffle has written. | Counter | Shuffle | numBytesInPerSecond | The number of bytes this shuffle reads per second. | Meter | Shuffle | numBytesOutPerSecond | The number of bytes this shuffle writes per second. | Meter | Shuffle | numRecordsOutPerSecond | The number of records this shuffle writes per second. | Meter | ### How was this patch tested? Manual test.   Closes apache#3296 from SteNicholas/CELEBORN-2005. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Weijie Guo <reswqa@163.com>
### What changes were proposed in this pull request? Support min number of workers to assign slots on for a shuffle. ### Why are the changes needed? PR apache#3039 updated the default value of `celeborn.master.slot.assign.extraSlots` to avoid skew in shuffle stage with less number of reducers. However, it will also affect the stage with large number of reducers, thus not ideal. We are introducing a new config `celeborn.master.slot.assign.minWorkers` which will ensure that shuffle stages with less number of reducers will not cause load imbalance on few nodes. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? NA Closes apache#3297 from s0nskar/min_workers. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: Wang, Fei <fwang12@ebay.com> (cherry picked from commit aceee64) Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request? ### Why are the changes needed? ```java Exception in thread "main" java.lang.NoSuchMethodError: 'boolean org.apache.spark.util.Utils.isLocalMaster(org.apache.spark.SparkConf)' at org.apache.spark.shuffle.celeborn.SparkShuffleManager.executorCores(SparkShuffleManager.java:464) at org.apache.spark.shuffle.celeborn.SparkShuffleManager.<init>(SparkShuffleManager.java:117) at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:75) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:53) at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486) at org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2584) at org.apache.spark.shuffle.ShuffleManager$.create(ShuffleManager.scala:108) at org.apache.spark.SparkEnv.initializeShuffleManager(SparkEnv.scala:226) at org.apache.spark.SparkContext.<init>(SparkContext.scala:589) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:3055 ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? local test Closes apache#3305 from cxzl25/CELEBORN-2023. Lead-authored-by: sychen <sychen@ctrip.com> Co-authored-by: cxzl25 <3898450+cxzl25@users.noreply.github.com> Signed-off-by: 子懿 <ziyi.jxf@antgroup.com> (cherry picked from commit 7eee202) Signed-off-by: 子懿 <ziyi.jxf@antgroup.com>
### What changes were proposed in this pull request? Fix the remaining master sub commands that does not transfer auth header. ### Why are the changes needed? Before, this mistake was not detected by GA. Because the authentication configs was not trasnferred when setting mini celeborn cluster. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT. Closes apache#3303 from turboFei/auth_header_followup. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: 子懿 <ziyi.jxf@antgroup.com> (cherry picked from commit 5a50686) Signed-off-by: 子懿 <ziyi.jxf@antgroup.com>
### What changes were proposed in this pull request? 1. Fix a NPE when reading HDFS files. 2. Change partition manager will generate correct storage info. 3. Add assertions for tier writers. ### Why are the changes needed? Regression for release 0.6. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Cluster tests. Closes apache#3302 from FMX/b2021. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com> (cherry picked from commit 7bde738) Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are the changes needed?
ShuffleClientImpl.updateFileGroupcurrently performs the blockingGetReducerFileGroupRPC from insideConcurrentHashMap.compute. When several reducer tasks touch the same shuffle while that RPC is slow, one task waits in the RPC while peer task threads block on the map reservation node. This turns a single slow file-group lookup into a local executor-side convoy.What changes were proposed in this PR?
computepath with a per-shuffle single-flight load using an in-flightCompletableFuture.How was this PR tested?
build/mvn --no-transfer-progress -pl client -Dtest=ShuffleClientSuiteJ testbuild/mvn --no-transfer-progress -DskipTests spotless:apply