
[BUG] Spark jobs frequently fail with "Shuffle data lost for shuffle xx partitionId xx" #2275

Closed
zhengshubin opened this issue Feb 1, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@zhengshubin

What is the bug (with logs or screenshots)?

Celeborn 0.3.2, Spark 3.3.3
Log:
org.apache.kyuubi.KyuubiSQLException:
Error operating ExecuteStatement: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 8.0 failed 4 times,
org.apache.celeborn.common.exception.CelebornIOException:
Shuffle data lost for shuffle 1 partitionId 9!
org.apache.celeborn.client.ShuffleClientImpl.loadFileGroup(ShuffleClientImpl.java:1603)
org.apache.celeborn.client.ShuffleClientImpl.readPartition(ShuffleClientImpl.java:1612)
org.apache.spark.shuffle.celeborn.CelebornShuffleReader.$anonfun$read$1(CelebornShuffleReader.scala:89)
org.apache.spark.shuffle.celeborn.CelebornShuffleReader.$anonfun$read$1$adapted(CelebornShuffleReader.scala:81)
scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
scala.collection.Iterator.foreach(Iterator.scala:943)
scala.collection.Iterator.foreach$(Iterator.scala:943)
scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
org.apache.paimon.spark.commands.WriteIntoPaimonTable.$anonfun$run$6(WriteIntoPaimonTable.scala:141)
org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
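
For context, the stack shows the read path going through the Celeborn Spark client (CelebornShuffleReader calling ShuffleClientImpl). A minimal sketch of the Spark configuration that routes shuffle through Celeborn, assuming 0.3.x property names and placeholder master endpoints (check the Celeborn 0.3.2 docs for the exact keys):

  spark.shuffle.manager                           org.apache.spark.shuffle.celeborn.SparkShuffleManager
  spark.celeborn.master.endpoints                 celeborn-master-1:9097,celeborn-master-2:9097
  # Celeborn shuffle is incompatible with Spark's AQE local shuffle reader, so it is typically disabled
  spark.sql.adaptive.localShuffleReader.enabled   false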

How to reproduce the bug?



zhengshubin added the bug label on Feb 1, 2024
@FMX
Contributor

FMX commented Feb 1, 2024

Hi @zhengshubin, can you provide more details, such as your Celeborn cluster's scale and workload info?
This error indicates that shuffle data was lost. Shuffle data loss in Celeborn occasionally happens when the cluster is under high load.
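
As a general mitigation sketch (key names as documented for Celeborn 0.3.x; verify them against the 0.3.2 configuration reference before use), enabling partition replication and FetchFailure-based stage retry lets Spark tolerate this kind of loss:

  # Push each shuffle partition to two workers so a single worker loss does not lose the data
  spark.celeborn.client.push.replicate.enabled          true
  # Surface lost shuffle data as a Spark FetchFailedException so the stage can be recomputed
  spark.celeborn.client.spark.fetch.throwsFetchFailure  true

Note that replication doubles the push traffic, so it trades some write throughput for durability.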
