[SPARK-32715][CORE] Fix memory leak when failed to store pieces of broadcast #29558

LantaoJin · 2020-08-27T12:31:26Z

What changes were proposed in this pull request?

In TorrentBroadcast.scala

L133: if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false))
L137: TorrentBroadcast.blockifyObject(value, blockSize, SparkEnv.get.serializer, compressionCodec)
L147: if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true))

After the original value is saved successfully (TorrentBroadcast.scala: L133), the subsequent blockifyObject() call (L137) or the piece-storing step (L147) may fail. When that happens, there is no opportunity to release the broadcast from memory.

This patch removes all pieces of the broadcast when blockifying fails or when storing any piece of the broadcast fails.
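
The cleanup-on-failure idea can be sketched in isolation. The snippet below is a minimal, self-contained model and not the actual TorrentBroadcast code: `FakeBlockStore`, `writeBlocks`, and the `failAtPiece` knob are hypothetical names invented for illustration, and `grouped` merely stands in for blockifyObject(). It stores the original value first, then stores the pieces inside a try block and removes everything that belongs to the broadcast if any later step throws.

```scala
import scala.collection.mutable

object BroadcastCleanupSketch {
  // Hypothetical in-memory stand-in for a block manager; not Spark's BlockManager.
  final class FakeBlockStore(failAtPiece: Option[Int] = None) {
    private val blocks = mutable.LinkedHashMap.empty[String, Array[Byte]]

    // Returns false to simulate a storage failure for one specific piece.
    def put(blockId: String, bytes: Array[Byte]): Boolean = {
      val isFailingPiece = failAtPiece.exists(i => blockId.endsWith(s"_piece$i"))
      if (isFailingPiece) false
      else { blocks(blockId) = bytes; true }
    }

    // Remove the value block and every piece of the given broadcast.
    def removeBroadcast(broadcastId: Long): Int = {
      val prefix = s"broadcast_$broadcastId"
      val doomed = blocks.keys
        .filter(k => k == prefix || k.startsWith(prefix + "_piece"))
        .toList
      doomed.foreach(k => blocks.remove(k))
      doomed.size
    }

    def size: Int = blocks.size
  }

  // Store the whole value, then its pieces; clean up everything if a later step fails.
  def writeBlocks(
      store: FakeBlockStore,
      broadcastId: Long,
      value: Array[Byte],
      blockSize: Int): Int = {
    val valueId = s"broadcast_$broadcastId"
    if (!store.put(valueId, value)) {
      throw new IllegalStateException(s"Failed to store $valueId")
    }
    try {
      val pieces = value.grouped(blockSize).toVector // stand-in for blockifyObject()
      pieces.zipWithIndex.foreach { case (bytes, i) =>
        if (!store.put(s"${valueId}_piece$i", bytes)) {
          throw new IllegalStateException(s"Failed to store piece $i of $valueId")
        }
      }
      pieces.length
    } catch {
      case t: Throwable =>
        // Without this call, the value and the already-stored pieces would stay in the store.
        store.removeBroadcast(broadcastId)
        throw t
    }
  }
}
```

In this model, a store created with `failAtPiece = Some(1)` ends up empty after `writeBlocks` throws, whereas without the catch block it would still hold the value and piece 0.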

Why are the changes needed?

We use the Spark thrift-server as a long-running service. A bad query submitted a heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We killed the bad query, but the driver's memory usage stayed high and full GCs remained frequent. After investigating the GC dump and logs, we found that the broadcast may leak memory.

2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure)
2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
116G->112G(170G), 184.9121920 secs]
[Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
1: 676531691 72035438432 [B
2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
3: 99551 12018117568 [Ljava.lang.Object;
4: 26570 4349629040 [I
5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
6: 1708819 256299456 [C
7: 2338 179615208 [J
8: 1703669 54517408 java.lang.String
9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
10: 177396 25545024 java.net.URI
...
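
To see how much of the heap the top entries account for, the byte counts in the histogram can be converted to GiB. The following is a small sketch that assumes the usual `rank: #instances #bytes class-name` column layout of a JVM class histogram; `HistogramSketch`, `Entry`, `parse`, and `toGiB` are names made up here.

```scala
object HistogramSketch {
  final case class Entry(rank: Int, instances: Long, bytes: Long, className: String)

  // Matches rows such as "1: 676531691 72035438432 [B".
  private val Row = """\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)\s*""".r

  // Returns None for lines that are not histogram rows (headers, "...", GC summaries).
  def parse(line: String): Option[Entry] = line match {
    case Row(rank, instances, bytes, cls) =>
      Some(Entry(rank.toInt, instances.toLong, bytes.toLong, cls))
    case _ => None
  }

  // Convert a byte count to GiB (2^30 bytes).
  def toGiB(bytes: Long): Double = bytes.toDouble / (1L << 30)
}
```

Applied to the first two rows above, this gives roughly 67 GiB for `[B` and roughly 30 GiB for `UnsafeRow`, which together make up most of the 116.4G heap reported before the collection.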

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested manually. A unit test is hard to write for this case, and the patch is straightforward.

Ngone51 · 2020-08-27T13:24:06Z

Why doesn't the application fail when the exception is raised in TorrentBroadcast.writeBlocks()? Is it because of the thrift-server?

SparkQA · 2020-08-27T15:04:58Z

Test build #127954 has finished for PR 29558 at commit f9314ae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-08-28T01:53:05Z

Why doesn't the application fail when the exception is raised in TorrentBroadcast.writeBlocks()? Is it because of the thrift-server?

Yes. I think in many cases this exception won't fail the whole application. In our case, it happened while executing a query that included a BroadcastExchangeExec, which broadcasts a relation. The exception failed the query execution.

LantaoJin · 2020-09-01T02:49:46Z

Ping @cloud-fan @dongjoon-hyun

dongjoon-hyun · 2020-09-01T02:52:44Z

Thank you for pinging me, @LantaoJin . According to the SPARK-32715 JIRA description, is this a 3.1.0-only bug?

dongjoon-hyun · 2020-09-01T02:58:31Z

core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala

+    } catch {
+      case t: Throwable =>
+        logError(s"Store broadcast $broadcastId fail, remove all pieces of the broadcast")
+        blockManager.removeBroadcast(id, tellMaster = true)


Is this correct? What happens if this is thrown at line 135, @LantaoJin ?

LantaoJin · Sep 1, 2020

We can start the try/catch at L137 if L134 is atomic. But I think L157: blockManager.removeBroadcast(id, tellMaster = true) does no harm even if the exception is thrown from L135. Please correct me if I have misunderstood.

Ngone51 · Sep 3, 2020

Would it have a thread-safety issue, since blockManager is an IsolatedRpcEndpoint?

mridulm · Sep 10, 2020

I agree with @dongjoon-hyun - blockManager.putSingle for broadcastId should be outside the try/catch. The rest looks fine.
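
The argument that the cleanup call does no harm when nothing has been stored yet rests on removal being a no-op for a broadcast without blocks. The sketch below illustrates that property on a plain mutable map. It is only a model under that assumption: `RemoveIsNoOpSketch` and its `removeBroadcast` helper are hypothetical and say nothing definitive about how Spark's own removeBroadcast behaves.

```scala
import scala.collection.mutable

object RemoveIsNoOpSketch {
  // Hypothetical helper: drop every key that belongs to the given broadcast
  // and report how many blocks were removed.
  def removeBroadcast(blocks: mutable.Map[String, Array[Byte]], broadcastId: Long): Int = {
    val prefix = s"broadcast_$broadcastId"
    val matching = blocks.keys
      .filter(k => k == prefix || k.startsWith(prefix + "_"))
      .toList
    matching.foreach(k => blocks.remove(k))
    matching.size
  }
}
```

Calling the helper on an empty map, or a second time for the same broadcast, removes nothing and leaves other broadcasts untouched, so in this model it would not matter whether the first store call sits inside or outside the try block. Keeping it outside, as the reviewers suggest, simply avoids a cleanup call that has nothing to clean.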

LantaoJin · 2020-09-01T03:09:46Z

Thank you for pinging me, @LantaoJin . According to the SPARK-32715 JIRA description, is this a 3.1.0-only bug?

No. I just used the latest version. I will update the JIRA page.

Ngone51

Shall we add a unit test?

Ngone51 · 2020-09-03T11:45:43Z

core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala

+    } catch {
+      case t: Throwable =>
+        logError(s"Store broadcast $broadcastId fail, remove all pieces of the broadcast")
+        blockManager.removeBroadcast(id, tellMaster = true)


Would it have a thread-safety issue, since blockManager is an IsolatedRpcEndpoint?

Ngone51 · 2020-09-03T11:46:47Z

cc @tgravescs @jiangxb1987

LantaoJin · 2020-09-14T12:28:32Z

Would it have a thread-safety issue, since blockManager is an IsolatedRpcEndpoint?

removeBroadcast() should be thread-safe.
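
As a rough illustration of what thread-safe removal means here, the sketch below races several threads that all try to remove the same block from a `java.util.concurrent.ConcurrentHashMap`. This is an assumption-laden model, not a description of BlockManager internals: `ConcurrentRemoveSketch` and `raceRemovals` are hypothetical names, and the map merely stands in for whatever structure tracks the blocks.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object ConcurrentRemoveSketch {
  // Returns how many of the racing removals actually deleted the block.
  def raceRemovals(threads: Int): Int = {
    val blocks = new ConcurrentHashMap[String, Array[Byte]]()
    blocks.put("broadcast_0_piece0", Array[Byte](1, 2, 3))
    val wins = new AtomicInteger(0)
    val pool = Executors.newFixedThreadPool(threads)
    try {
      (1 to threads).foreach { _ =>
        pool.execute(new Runnable {
          override def run(): Unit = {
            // remove() returns the previous value, or null if the key was already gone.
            if (blocks.remove("broadcast_0_piece0") != null) {
              wins.incrementAndGet()
              ()
            }
          }
        })
      }
    } finally {
      pool.shutdown()
      pool.awaitTermination(10, TimeUnit.SECONDS)
    }
    wins.get()
  }
}
```

With a concurrent map, exactly one of the racing calls observes the block and removes it, while the others see it as already gone and do nothing.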

SparkQA · 2020-09-14T14:52:13Z

Test build #128649 has finished for PR 29558 at commit b565c33.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-09-15T00:33:22Z

retest this please

dongjoon-hyun · 2020-09-15T01:18:40Z

Thank you for the update, @LantaoJin .

dongjoon-hyun

+1, LGTM. Thank you, @LantaoJin and @mridulm and @Ngone51 .
Merged to master/3.0/2.4.

Closes #29558 from LantaoJin/SPARK-32715. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7a9b066) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

SparkQA · 2020-09-15T04:09:36Z

Test build #128684 has finished for PR 29558 at commit b565c33.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

[SPARK-32715][CORE] Fix memory leak when failed to store pieces of broadcast

6647592

probot-autolabeler bot added the CORE label Aug 27, 2020

refactor

f9314ae

dongjoon-hyun reviewed Sep 1, 2020

Ngone51 reviewed Sep 3, 2020

address comment

b565c33

mridulm approved these changes Sep 14, 2020

dongjoon-hyun approved these changes Sep 15, 2020

dongjoon-hyun closed this in 7a9b066 Sep 15, 2020
