
[SPARK-46480][CORE][SQL] Fix NPE when table cache task attempt #44445

Closed

Conversation

ulysses-you
Contributor

What changes were proposed in this pull request?

This PR adds a check: the cached partition is marked as materialized only if the task has neither failed nor been interrupted. It also adds a new method isFailed to TaskContext.
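The guard described above can be sketched as a tiny model. Python is used here purely for illustration (Spark's actual code is Scala), and `TaskContext`, `maybe_mark_materialized`, and the partition dict are hypothetical stand-ins, not Spark's real API:

```python
# Hypothetical model of the fix: mark a cached partition as materialized
# only when the task finished cleanly, so other tasks never read a
# partially written cache entry.

class TaskContext:
    """Minimal stand-in for Spark's TaskContext status flags."""
    def __init__(self, failed=False, interrupted=False):
        self._failed = failed
        self._interrupted = interrupted

    def is_failed(self):       # mirrors the new TaskContext.isFailed()
        return self._failed

    def is_interrupted(self):  # mirrors TaskContext.isInterrupted()
        return self._interrupted

def maybe_mark_materialized(ctx, partition):
    # The check added by this PR: skip marking on failure or interruption.
    if not ctx.is_failed() and not ctx.is_interrupted():
        partition["materialized"] = True
    return partition["materialized"]

# A failed task must not publish its partially built cache entry.
print(maybe_mark_materialized(TaskContext(failed=True), {"materialized": False}))  # False
print(maybe_mark_materialized(TaskContext(), {"materialized": False}))             # True
```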

Why are the changes needed?

Before this PR, when caching a table, a task failure could cause an NPE in other tasks that read the partially materialized cache:

java.lang.NullPointerException
	at java.nio.ByteBuffer.wrap(ByteBuffer.java:396)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.accessors1$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:155)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)

Does this PR introduce any user-facing change?

yes, it's a bug fix

How was this patch tested?

add test

Was this patch authored or co-authored using generative AI tooling?

no

@ulysses-you
Contributor Author

cc @cloud-fan @yaooqinn

Member

@yaooqinn yaooqinn left a comment


LGTM

@cloud-fan
Contributor

which variable can be null?

@ulysses-you
Contributor Author

@cloud-fan according to the stack trace, DefaultCachedBatch.buffers[i] is null.

Contributor

@mridulm mridulm left a comment


Looks reasonable to me.

QQ: Do we want to do this only if the entire iterator has been consumed (and so fully materialized)?

/**
* Returns true if the task has failed.
*/
def isFailed(): Boolean
Member


It would be great if we could add this on the PySpark side as well ...

Contributor Author


It seems the Python TaskContext does not have the isCompleted and isInterrupted methods. Is there a reason for that?

Member


We should have them all :-).

Member


I don't mind doing that separately in a different PR.

Contributor Author


I will create a new PR for it later.

Contributor Author


@HyukjinKwon does PySpark support updating the status from the JVM side? It seems we can only send a snapshot of the status to the Python side.

@@ -275,6 +275,8 @@ private[spark] class TaskContextImpl(
@GuardedBy("this")
override def isCompleted(): Boolean = synchronized(completed)

override def isFailed(): Boolean = synchronized(failureCauseOpt.isDefined)

override def isInterrupted(): Boolean = reasonIfKilled.isDefined
Contributor


I'm trying to reason about the relationship between these 3 flags:

  • isCompleted is false only when the task is interrupted? Or when the task is still running?
  • isFailed is true only if the task completes and fails?
  • isInterrupted can be true before the task completes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isCompleted covers all terminal states: success, failure, and cancellation. If isCompleted is true, the task is no longer running.
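To make the relationship between the three flags concrete, here is a small state model consistent with the explanation above. This is an illustration only (Python, not Spark source); the state names and helper functions are hypothetical:

```python
# Hypothetical model of the three TaskContext flags discussed above:
# isCompleted is true for any terminal state, isFailed only for the
# failure state, and isInterrupted can become true while still running.

RUNNING, SUCCEEDED, FAILED, CANCELLED = "running", "succeeded", "failed", "cancelled"

def is_completed(state):
    # True for every terminal state: success, failure, or cancellation.
    return state in (SUCCEEDED, FAILED, CANCELLED)

def is_failed(state):
    # True only for the failure terminal state.
    return state == FAILED

def is_interrupted(state, kill_requested):
    # A kill can be requested while the task is still running, so this
    # may already be True before is_completed(state) becomes True.
    return kill_requested

for state in (RUNNING, SUCCEEDED, FAILED, CANCELLED):
    print(state, is_completed(state), is_failed(state))
```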

@ulysses-you
Contributor Author

@mridulm yes, to make sure the cached RDD is materialized only after all the tasks have succeeded.

@yaooqinn yaooqinn closed this in 43f7932 Dec 22, 2023
@yaooqinn
Member

Thanks, merged to master according to the affected versions in JIRA.

ulysses-you added a commit to ulysses-you/spark that referenced this pull request Dec 22, 2023

Closes apache#44445 from ulysses-you/fix-cache.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
@ulysses-you
Contributor Author

@yaooqinn There are some conflicts. I created #44457 for branch-3.5

@ulysses-you ulysses-you deleted the fix-cache branch December 22, 2023 05:44
ulysses-you added a commit that referenced this pull request Dec 22, 2023
This PR backports #44445 to branch-3.5.


Closes #44457 from ulysses-you/fix-cache-3.5.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>