
[SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak #43691

Closed
wants to merge 4 commits

Conversation

@xieshuaihu commented Nov 7, 2023

What changes were proposed in this pull request?

Make ArrowBatchIterator implement AutoCloseable and have ArrowConverters.createEmptyArrowBatch() call close() to avoid a memory leak.
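For illustration, here is a minimal, self-contained sketch of the pattern, not Spark's actual code: the BatchIterator class, its eagerly allocated 4 KiB buffer, and the Demo driver are hypothetical stand-ins. Like ArrowBatchIterator, the iterator grabs off-heap Arrow memory in its constructor, so a caller that never drains it must call close() explicitly:

```scala
import org.apache.arrow.memory.{BufferAllocator, RootAllocator}

// Hypothetical stand-in for ArrowBatchIterator: Arrow buffers are allocated
// eagerly in the constructor, so they leak unless the caller invokes close().
class BatchIterator(parent: BufferAllocator)
    extends Iterator[Array[Byte]] with AutoCloseable {
  private val allocator = parent.newChildAllocator("batch", 0, Long.MaxValue)
  private val buf = allocator.buffer(4096) // stands in for the vectors ArrowWriter.create allocates
  private var consumed = false

  override def hasNext: Boolean = !consumed
  override def next(): Array[Byte] = { consumed = true; Array.emptyByteArray }
  override def close(): Unit = { buf.close(); allocator.close() }
}

object Demo extends App {
  val root = new RootAllocator(Long.MaxValue)
  val it = new BatchIterator(root)
  try it.next() finally it.close() // without close(), buf stays allocated forever
  assert(root.getAllocatedMemory == 0)
  root.close()
}
```

In the actual patch, the single next() call in createEmptyArrowBatch is wrapped so that close() runs in a finally block on the driver, where there is no TaskContext completion listener to do the cleanup.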

Why are the changes needed?

ArrowConverters.createEmptyArrowBatch doesn't call super.hasNext, so if TaskContext.get returns None, the memory allocated in ArrowBatchIterator is leaked.

In Spark Connect, createEmptyArrowBatch is called in SparkConnectPlanner (https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L2558) and SparkConnectPlanExecution (https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/SparkConnectPlanExecution.scala#L224), which causes a long-running driver to consume all of the off-heap memory specified by -XX:MaxDirectMemorySize.

This is the exception stack:

```
org.apache.arrow.memory.OutOfMemoryException: Failure allocating buffer.
	at io.netty.buffer.PooledByteBufAllocatorL.allocate(PooledByteBufAllocatorL.java:67)
	at org.apache.arrow.memory.NettyAllocationManager.<init>(NettyAllocationManager.java:77)
	at org.apache.arrow.memory.NettyAllocationManager.<init>(NettyAllocationManager.java:84)
	at org.apache.arrow.memory.NettyAllocationManager$1.create(NettyAllocationManager.java:34)
	at org.apache.arrow.memory.BaseAllocator.newAllocationManager(BaseAllocator.java:354)
	at org.apache.arrow.memory.BaseAllocator.newAllocationManager(BaseAllocator.java:349)
	at org.apache.arrow.memory.BaseAllocator.bufferWithoutReservation(BaseAllocator.java:337)
	at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:315)
	at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:279)
	at org.apache.arrow.vector.BaseValueVector.allocFixedDataAndValidityBufs(BaseValueVector.java:192)
	at org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:338)
	at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:308)
	at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:273)
	at org.apache.spark.sql.execution.arrow.ArrowWriter$.$anonfun$create$1(ArrowWriter.scala:44)
	at scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
	at scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
	at scala.collection.convert.JavaCollectionWrappers$JListWrapper.map(JavaCollectionWrappers.scala:103)
	at org.apache.spark.sql.execution.arrow.ArrowWriter$.create(ArrowWriter.scala:43)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.<init>(ArrowConverters.scala:93)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchWithSchemaIterator.<init>(ArrowConverters.scala:138)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.<init>(ArrowConverters.scala:231)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$.createEmptyArrowBatch(ArrowConverters.scala:229)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.handleSqlCommand(SparkConnectPlanner.scala:2481)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.process(SparkConnectPlanner.scala:2426)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handleCommand(ExecuteThreadRunner.scala:202)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:158)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:132)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:189)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:189)
	at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withContextClassLoader$1(SessionHolder.scala:176)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:178)
	at org.apache.spark.sql.connect.service.SessionHolder.withContextClassLoader(SessionHolder.scala:175)
	at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:188)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:132)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:84)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:228)
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 1069547799, max: 1073741824)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:845)
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:774)
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:721)
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:696)
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
	at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:180)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:137)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:129)
	at io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.newDirectBufferL(PooledByteBufAllocatorL.java:181)
	at io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.directBuffer(PooledByteBufAllocatorL.java:214)
	at io.netty.buffer.PooledByteBufAllocatorL.allocate(PooledByteBufAllocatorL.java:58)
	... 37 more
```

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label Nov 7, 2023
@xieshuaihu changed the title [SPARK-45814][CONNECT][CORE]Make ArrowConverters.createEmptyArrowBatch call hasNext to avoid memory leak [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call hasNext to avoid memory leak Nov 7, 2023
@xieshuaihu (Author)

cc @HyukjinKwon @dongjoon-hyun

@dongjoon-hyun (Member)

cc @sunchao, too

@LuciferYang (Contributor)

hmm... it's a little strange to call super.hasNext in next. How about making ArrowBatchIterator implement AutoCloseable and closing ArrowBatchWithSchemaIterator in a finally block when TaskContext.get is null?

@xieshuaihu (Author)

@LuciferYang This PR is updated; please review again.

@cfmcgrady (Contributor)

The PR title/description should be updated.

Review thread on the changed hunk in ArrowConverters.scala:

```scala
      Iterator.empty, schema, 0L, 0L,
      timeZoneId, errorOnDuplicatedFieldNames, TaskContext.get()) {
    override def hasNext: Boolean = true
  }.next()
}
val emptyBatch = batches.next()
```
A reviewer (Contributor) suggested:
```scala
Utils.tryWithSafeFinally {
  batches.next()
} {
  if (TaskContext.get() == null) { // If taskContext is not null, it can be successfully closed, right?
    batches.close()
  }
}
```
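For context, Utils.tryWithSafeFinally runs its second block as a finally and, if both blocks throw, attaches the finally-block exception to the original one as suppressed, so a failing close() here cannot mask an allocation error raised by next().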

@xieshuaihu (Author):

That's right.

@xieshuaihu changed the title [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call hasNext to avoid memory leak [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak Nov 8, 2023
@LuciferYang (Contributor) left a comment:

+1, LGTM if tests pass

@hvanhovell (Contributor) left a comment:

LGTM - good catch!

@xieshuaihu (Author)

It seems that this test failure is unrelated to this PR.

@LuciferYang (Contributor)

Could you re-trigger the failed one? Making all tasks green is safer. @xieshuaihu

@xieshuaihu (Author)

@LuciferYang all UTs passed

LuciferYang pushed a commit that referenced this pull request Nov 9, 2023

[SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak

Closes #43691 from xieshuaihu/spark-45814.

Authored-by: xieshuaihu <xieshuaihu@agora.io>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit c128f81)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
@LuciferYang (Contributor)

Merged into master and branch-3.5. Thanks @xieshuaihu @hvanhovell @dongjoon-hyun @cfmcgrady

This patch conflicts with branch-3.4. Could you submit a separate PR for branch-3.4? @xieshuaihu

@xieshuaihu (Author)

I will submit a new PR @LuciferYang

@LuciferYang (Contributor)

Thanks @xieshuaihu ~

@HyukjinKwon (Member) left a comment:

LGTM

Inline comment on this hunk of the patch:

```scala
} {
  // If taskContext is null, `batches.close()` should be called to avoid memory leak.
  if (TaskContext.get() == null) {
    batches.close()
```
@HyukjinKwon (Member):
Can we actually add a small test for this? This memory leak should be easy to test by checking whether close() is actually invoked by createEmptyArrowBatch.
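For what it's worth, a sketch of what such a test could look like, with two assumptions flagged: the createEmptyArrowBatch(schema, timeZoneId, errorOnDuplicatedFieldNames) signature is inferred from the reviewed hunks above, and the code is assumed to sit under org.apache.spark.sql so that the private[sql] ArrowUtils.rootAllocator is accessible:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.execution.arrow.ArrowConverters
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.sql.util.ArrowUtils

// On the driver TaskContext.get() is null, which is exactly the path that leaked.
assert(TaskContext.get() == null)

val before = ArrowUtils.rootAllocator.getAllocatedMemory
val schema = StructType(Seq(StructField("id", IntegerType)))
ArrowConverters.createEmptyArrowBatch(schema, "UTC", errorOnDuplicatedFieldNames = true)
// With the fix, close() runs in the finally block and every buffer is freed.
assert(ArrowUtils.rootAllocator.getAllocatedMemory == before, "Arrow memory leaked")
```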
