[Test] test with fine grained pr reverted #2612
Closed
YutingWang98 wants to merge 10 commits into apache:branch-0.4 from
Conversation
grep -E 'spark-1' /opt/celeborn/logs/celeborn.log* | grep -E 'WARN|ERROR'
Contributor
Could you please provide more details on how to reproduce this bug?
Contributor
I think the bug is related to #2134; please see the detailed description in #2621 @YutingWang98 @cfmcgrady
waitinfuture added a commit that referenced this pull request on Jul 13, 2024
…eqs should decrement when batchIdSet contains the batchId to avoid duplicate caller of removeBatch"

### What changes were proposed in this pull request?

One of our users reported a data loss issue in #2612. I tried to reproduce the bug with the following setup:

1. Partition data is far larger than `spark.celeborn.client.shuffle.partitionSplit.threshold`, which means splits happen very often.
2. `spark.celeborn.client.shuffle.partitionSplit.threshold` is larger than `celeborn.worker.shuffle.partitionSplit.max`, which means that when a split happens, it is a `HARD_SPLIT`.
3. `celeborn.client.shuffle.batchHandleChangePartition.enabled` is true, which means that when a hard split happens, `LifecycleManager` will commit the splits before the stage finishes.

Configs on the Spark side:

```
spark.celeborn.client.push.maxReqsInFlight.perWorker             | 256
spark.celeborn.client.push.maxReqsInFlight.total                 | 2048
spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled | true
spark.celeborn.client.shuffle.compression.codec                  | zstd
spark.celeborn.client.shuffle.partitionSplit.threshold           | 48m
spark.celeborn.client.spark.fetch.throwsFetchFailure             | true
spark.celeborn.client.spark.push.sort.memory.adaptiveThreshold   | true
spark.celeborn.client.spark.push.sort.memory.threshold           | 512m
spark.celeborn.client.spark.shuffle.writer                       | sort
spark.celeborn.master.endpoints                                  | master-1-1:9097
```

Configs on the Celeborn side:

```
celeborn.metrics.enabled=false
celeborn.replicate.io.numConnectionsPerPeer=24
celeborn.application.heartbeat.timeout=120s
celeborn.worker.storage.dirs=/mnt/disk1,/mnt/disk2
celeborn.network.timeout=2000s
celeborn.ha.enabled=false
celeborn.worker.closeIdleConnections=true
celeborn.worker.monitor.disk.enabled=false
celeborn.worker.flusher.threads=16
celeborn.worker.graceful.shutdown.enabled=true
celeborn.worker.rpc.port=9100
celeborn.worker.push.port=9101
celeborn.worker.fetch.port=9102
celeborn.worker.replicate.port=9103
celeborn.worker.shuffle.partitionSplit.max=10m // this is made to be small
```

My query on 10T TPCDS:

```sql
select
  max(ss_sold_time_sk), max(ss_item_sk), max(ss_customer_sk), max(ss_cdemo_sk),
  max(ss_hdemo_sk), max(ss_addr_sk), max(ss_store_sk), max(ss_promo_sk),
  max(ss_ticket_number), max(ss_quantity), max(ss_wholesale_cost), max(ss_list_price),
  max(ss_sales_price), max(ss_ext_discount_amt), max(ss_ext_sales_price),
  max(ss_ext_wholesale_cost), max(ss_ext_list_price), max(ss_ext_tax),
  max(ss_coupon_amt), max(ss_net_paid), max(ss_net_paid_inc_tax),
  max(ss_net_profit), max(ss_sold_date_sk)
from (
  select * from store_sales
  where ss_sold_date_sk is not null
  distribute by ss_sold_date_sk
) a;
```

After digging into it, I found that the bug was introduced by #2134. That PR added checks in `InFlightRequestTracker#addBatch` and `InFlightRequestTracker#removeBatch` so that `totalInflightReqs` is only incremented/decremented when `batchIdSet` membership actually changes. This conflicts with `ShuffleClientImpl#PushDataRpcResponseCallback#updateLatestPartition`, which calls `addBatch` first and then `removeBatch` with the same batchId. As a result, the call to `addBatch` fails to increment `totalInflightReqs` while the call to `removeBatch` does decrement it, so the retried push is not counted, and `limitZeroInFlight` in `mapperEnd` later returns even though the retried push fails. This PR fixes the bug by reverting #2134.

### Why are the changes needed?

Ditto.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2621 from waitinfuture/1506.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
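The accounting conflict described in the commit message can be sketched with a small model. This is a hypothetical simplification (a single shared batch set and counter, not Celeborn's actual `InFlightRequestTracker`), assuming the post-#2134 behavior of moving the counter only when set membership changes:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the post-#2134 accounting, NOT Celeborn's real class:
// the counter only moves when set membership actually changes.
class TrackerModel {
    final Set<Integer> batchIdSet = ConcurrentHashMap.newKeySet();
    final AtomicInteger totalInflightReqs = new AtomicInteger();

    void addBatch(int batchId) {
        if (batchIdSet.add(batchId)) {       // false when already tracked
            totalInflightReqs.incrementAndGet();
        }
    }

    void removeBatch(int batchId) {
        if (batchIdSet.remove(batchId)) {    // true when it was tracked
            totalInflightReqs.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        TrackerModel t = new TrackerModel();
        t.addBatch(42);                      // original push: counter = 1
        // On HARD_SPLIT, the retry path re-registers the batch for the new
        // partition, then removes it for the old one, with the same batchId:
        t.addBatch(42);                      // already present -> not counted
        t.removeBatch(42);                   // counted -> counter drops to 0
        // The retried push is still in flight, yet the counter reads zero,
        // so a limitZeroInFlight-style wait would return prematurely.
        System.out.println(t.totalInflightReqs.get()); // prints 0
    }
}
```

Under this model, a wait loop that blocks until `totalInflightReqs == 0` can unblock while a retried push is still outstanding, which matches the premature return of `limitZeroInFlight` described above.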
waitinfuture added a commit that referenced this pull request on Jul 13, 2024
(cherry picked from commit 8d0b4cf)
waitinfuture added a commit that referenced this pull request on Jul 13, 2024
(cherry picked from commit 8d0b4cf)
wxplovecc pushed a commit to tongcheng-elong/incubator-celeborn that referenced this pull request on Jul 15, 2024
(cherry picked from commit 8d0b4cf)
Contributor
Author
Sorry about the late reply @waitinfuture @cfmcgrady. Yes, I am no longer seeing the data loss after applying #2621, so I am closing this PR. And thanks for making the fix!
Contributor
Good to know, thanks for verifying! @YutingWang98
What changes were proposed in this pull request?
This is the version used for testing without any fine grained changes, where we would expect no data loss. However, some jobs still lost a few records during shuffle read.
with fine grain change reverted
Suspected to be related to
Both of the two PRs above were included in the tests.
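One way to make the reported record loss observable is to count records on both sides of the shuffle boundary and compare. The following is a generic, hypothetical sketch (not Celeborn or Spark code; all names are illustrative) of that kind of audit:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical audit helper: mappers report every record pushed, reducers
// report every record read back; any shortfall on the read side indicates
// data loss somewhere in the shuffle.
class ShuffleAudit {
    private final AtomicLong pushed = new AtomicLong();
    private final AtomicLong read = new AtomicLong();

    void recordPushed(long n) { pushed.addAndGet(n); }
    void recordRead(long n)   { read.addAndGet(n); }

    // Number of records written by mappers but never seen by reducers.
    long missing() { return pushed.get() - read.get(); }

    public static void main(String[] args) {
        ShuffleAudit audit = new ShuffleAudit();
        audit.recordPushed(1_000_000);       // total records written by mappers
        audit.recordRead(999_997);           // total records seen by reducers
        System.out.println(audit.missing()); // prints 3 -> data loss
    }
}
```

A nonzero `missing()` after the job completes is the "lost a few records during shuffle read" symptom described above.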
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?