[SPARK-27021][CORE] Cleanup of Netty event loop group for shuffle chunk fetch requests #23930

attilapiros · 2019-03-01T18:53:05Z

What changes were proposed in this pull request?

Creating a Netty EventLoopGroup creates a new thread pool for handling events. To stop the pool's threads, the event loop group must be shut down. Transport servers and clients already do this properly, for example by calling shutdownGracefully() (for details see the close() methods of TransportClientFactory and TransportServer). However, there is a separate event loop group for shuffle chunk fetch requests. It sits in the pipeline that handles fetch requests (shared between the client and the server), is owned by the TransportContext, and was never shut down.
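The ownership pattern described above can be sketched without any Netty dependency. This is only an illustration under stated assumptions: `OwningContextSketch` and the `sketch-chunk-fetch-handler` thread name are hypothetical, and a plain `ExecutorService` stands in for the Netty `EventLoopGroup`, so `shutdown()` plus `awaitTermination()` plays the role that `shutdownGracefully()` followed by an await would play in the real code.

```java
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a context object that owns a worker pool.
// An ExecutorService plays the role of the Netty EventLoopGroup (assumption).
final class OwningContextSketch implements Closeable {
  static final String THREAD_PREFIX = "sketch-chunk-fetch-handler";

  private final ExecutorService chunkFetchWorkers;

  OwningContextSketch(int numThreads) {
    AtomicInteger counter = new AtomicInteger();
    // Name the threads so that a leak is easy to spot in a thread dump or log.
    this.chunkFetchWorkers = Executors.newFixedThreadPool(numThreads, r -> {
      Thread t = new Thread(r, THREAD_PREFIX + "-" + counter.incrementAndGet());
      t.setDaemon(true);
      return t;
    });
  }

  void submit(Runnable task) {
    chunkFetchWorkers.execute(task);
  }

  boolean isTerminated() {
    return chunkFetchWorkers.isTerminated();
  }

  // The owner of the pool is responsible for shutting it down.
  @Override
  public void close() {
    chunkFetchWorkers.shutdown(); // already queued tasks still run
    try {
      // Wait for the workers to finish instead of returning immediately.
      if (!chunkFetchWorkers.awaitTermination(5, TimeUnit.SECONDS)) {
        chunkFetchWorkers.shutdownNow();
      }
    } catch (InterruptedException e) {
      chunkFetchWorkers.shutdownNow();
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    OwningContextSketch context = new OwningContextSketch(2);
    // try-with-resources guarantees close() runs, so no worker outlives the context.
    try (OwningContextSketch ctx = context) {
      ctx.submit(() ->
          System.out.println("handled on " + Thread.currentThread().getName()));
    }
    System.out.println("terminated: " + context.isTerminated());
  }
}
```

Making the owner `Closeable` lets callers use try-with-resources, so the pool is released on every exit path rather than only when someone remembers to stop it.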

How was this patch tested?

With the existing unit tests.

This leak is present in production systems too, but its effect is most visible in the unit tests.

Checking the core unit test logs before the PR:

$ grep "LEAK IN SUITE" unit-tests.log | grep -o shuffle-chunk-fetch-handler | wc -l
381

And after the PR, without whitelisting in the thread audit and with an extra await after chunkFetchWorkers.shutdownGracefully():

$ grep "LEAK IN SUITE" unit-tests.log | grep -o shuffle-chunk-fetch-handler | wc -l
0
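The kind of check behind these counts can be approximated with a before/after snapshot of live thread names. The sketch below is an assumption-laden illustration, not the logic of ThreadAudit.scala: `ThreadAuditSketch`, its methods, and the `sketch-chunk-fetch-handler-1` thread name are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper: compares live thread names before and after some work.
final class ThreadAuditSketch {

  // Snapshot of the names of all threads that are currently alive.
  static Set<String> liveThreadNames() {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(Thread::isAlive)
        .map(Thread::getName)
        .collect(Collectors.toCollection(TreeSet::new));
  }

  // Names present in the second snapshot but not in the first.
  static Set<String> leaked(Set<String> before, Set<String> after) {
    Set<String> diff = new TreeSet<>(after);
    diff.removeAll(before);
    return diff;
  }

  public static void main(String[] args) throws InterruptedException {
    Set<String> before = liveThreadNames();

    // Simulate a worker thread that nobody shuts down.
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException ignored) {
        // Interrupted: fall through and let the thread exit.
      }
    }, "sketch-chunk-fetch-handler-1");
    worker.setDaemon(true);
    worker.start();
    System.out.println("while running: " + leaked(before, liveThreadNames()));

    // Stop the worker and wait for it, analogous to shutting down and awaiting.
    worker.interrupt();
    worker.join();
    System.out.println("after join: " + leaked(before, liveThreadNames()));
  }
}
```

Waiting for the thread to finish before taking the second snapshot matters: a shutdown that has been requested but not yet completed can still show up as a leak, which is presumably why the measurement above used an extra await.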

SparkQA · 2019-03-01T23:40:51Z

Test build #102920 has finished for PR 23930 at commit 3f3567d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class TransportContext implements Closeable

common/network-shuffle/src/test/java/org/apache/spark/network/sasl/SaslIntegrationSuite.java

core/src/test/scala/org/apache/spark/ThreadAudit.scala

vanzin · 2019-03-05T18:35:04Z

looks good pending tests.

SparkQA · 2019-03-05T19:30:29Z

Test build #103053 has finished for PR 23930 at commit 3b469d7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2019-03-05T20:30:50Z

Merging to master.

…nk fetch requests

Creating an Netty `EventLoopGroup` leads to creating a new Thread pool for handling the events. For stopping the threads of the pool the event loop group should be shut down which is properly done for transport servers and clients by calling for example the `shutdownGracefully()` method (for details see the `close()` method of `TransportClientFactory` and `TransportServer`). But there is a separate event loop group for shuffle chunk fetch requests which is in pipeline for handling fetch request (shared between the client and server) and owned by the `TransportContext` and this was never shut down.

With existing unittest.

This leak is in the production system too but its effect is spiking in the unittest.

Checking the core unittest logs before the PR:

```
$ grep "LEAK IN SUITE" unit-tests.log | grep -o shuffle-chunk-fetch-handler | wc -l
381
```

And after the PR without whitelisting in thread audit and with an extra `await` after the `chunkFetchWorkers.shutdownGracefully()`:

```
$ grep "LEAK IN SUITE" unit-tests.log | grep -o shuffle-chunk-fetch-handler | wc -l
0
```

Closes apache#23930 from attilapiros/SPARK-27021.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 5668c42)

…nk fetch requests

(cherry picked from commit 5668c42)
(cherry picked from commit 0951a21bfd03d499891cbf49295645240840689a)
Change-Id: Ifadc5bd9ea6842e0f21af89a26320978f404c50c
(cherry picked from commit 196bddfda48055a4d3db64250ef813974a46d346)

initial version

3f3567d

vanzin reviewed Mar 4, 2019


applying review comments

3b469d7

vanzin closed this in 5668c42 Mar 5, 2019
