
[SPARK-20848][SQL] Shutdown the pool after reading parquet files #18073

Closed
wants to merge 3 commits

Conversation

viirya (Member) commented May 23, 2017

What changes were proposed in this pull request?

From JIRA: On each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state, and never stopped, which leads to unbounded growth in number of threads.

We should shut down the pool after reading the parquet files.
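As a sketch of the idea only (not the actual ParquetFileFormat code; `readFooters` and the returned strings are illustrative), the fix ties the pool's lifetime to the read:

```scala
import java.util.concurrent.ForkJoinPool

// Illustrative only: each call used to create its own ForkJoinPool and
// never shut it down, so worker threads accumulated across calls.
// Shutting the pool down in a finally block releases them.
def readFooters(files: Seq[String]): Seq[String] = {
  val pool = new ForkJoinPool(8)
  try {
    // ... submit footer-reading work to `pool` here ...
    files.map(f => s"footer-of-$f")
  } finally {
    pool.shutdown() // allow the pool's worker threads to terminate
  }
}
```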

How was this patch tested?

Added a test to ParquetFileFormatSuite.

Please review http://spark.apache.org/contributing.html before opening a pull request.

viirya (Member, Author) commented May 23, 2017

cc @srowen

    parFiles.flatMap { currentFile =>
    val readParquetTaskSupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    parFiles.tasksupport = readParquetTaskSupport
    val footers = parFiles.flatMap { currentFile =>
Member:
You could probably put the shutdown in a finally block to avoid adding this new reference; I think you can rearrange to use the existing one below. It also helps shut the pool down in case of an error. Also, I think you can hold a reference just to the ForkJoinPool in order to shut it down, rather than a reference to the ForkJoinTaskSupport. No big deal.

SparkQA commented May 23, 2017

Test build #77254 has finished for PR 18073 at commit e4940b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member, Author) commented May 24, 2017

cc @cloud-fan @gatorsmile This is a regression in 2.1. I think we may want to include this in 2.2. Please take a look. Thanks.

    val numThreadAfter = Thread.activeCount
    // Hard to test a correct thread number,
    // but it shouldn't increase more than a reasonable number.
    assert(numThreadAfter - numThreadBefore < 20)
Contributor:
After waiting long enough, can we expect this to be 0?

viirya (Member, Author) replied May 24, 2017
It drops to a few (about 3) after waiting long enough. The number returned by Thread.activeCount is only an estimate, so we should not expect it to reach 0.

Contributor:
This looks hacky; can we think of a better way to test it? If not, I suggest removing this test, since the fix is straightforward and we can verify it manually with profiling tools.

viirya (Member, Author):
OK. Let's remove the test.

SparkQA commented May 24, 2017

Test build #77279 has finished for PR 18073 at commit 14e09aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -479,7 +479,8 @@ object ParquetFileFormat extends Logging {
        partFiles: Seq[FileStatus],
        ignoreCorruptFiles: Boolean): Seq[Footer] = {
      val parFiles = partFiles.par
      parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
      val pool = new ForkJoinPool(8)
Contributor:
Would it be better to share one global thread pool? Creating many thread pools may not increase concurrency.

viirya (Member, Author) replied May 24, 2017
The main concern is that if we share a thread pool for parquet reading, we may limit concurrency, as @srowen pointed out in the JIRA.

If we have multiple parquet reads running in parallel, they would share one pool; currently each owns its own pool.
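For illustration, the shared-pool alternative being discussed (hypothetical; this is not what the PR does, and the object name is made up) could look like this. The trade-off is exactly as described: concurrent reads would then contend for the same fixed set of threads.

```scala
import java.util.concurrent.ForkJoinPool

// Hypothetical alternative: one process-wide pool reused by all footer
// reads. It bounds thread growth, but caps total read parallelism at 8
// even when several reads run at once.
object SharedFooterReadPool {
  lazy val pool: ForkJoinPool = new ForkJoinPool(8)
}
```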

viirya (Member, Author):

Not sure whether using a shared pool would change the current behavior.

Contributor:
OK, let's keep the previous behavior.

cloud-fan (Contributor):
@viirya can you test it manually with JVisualVM or another tool and attach a screenshot to this PR? Thanks!

SparkQA commented May 24, 2017

Test build #77296 has finished for PR 18073 at commit 7e57595.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member, Author) commented May 24, 2017

@cloud-fan My dev environment cannot easily run GUI-based tools like jconsole, so I used the command-line tool jvmtop.

Screen shots (the column "#T" is the number of threads):

Before:
[screenshot taken 2017-05-24 9:43 pm]

After:
[screenshot taken 2017-05-24 9:38 pm]

asfgit pushed a commit that referenced this pull request May 24, 2017
## What changes were proposed in this pull request?

From JIRA: On each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state, and never stopped, which leads to unbounded growth in number of threads.

We should shutdown the pool after reading parquet files.

## How was this patch tested?

Added a test to ParquetFileFormatSuite.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18073 from viirya/SPARK-20848.

(cherry picked from commit f72ad30)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request May 24, 2017
cloud-fan (Contributor):
thanks, merging to master/2.2/2.1!

@asfgit asfgit closed this in f72ad30 May 24, 2017
@@ -495,6 +496,8 @@ object ParquetFileFormat extends Logging {
          } else {
            throw new IOException(s"Could not read footer for file: $currentFile", e)
          }
        } finally {
          pool.shutdown()
Member:
Why do we terminate the pool inside flatMap?

Member:
Why not do it outside? For example:

    val parFiles = partFiles.par
    val pool = new ForkJoinPool(8)
    parFiles.tasksupport = new ForkJoinTaskSupport(pool)
    try {
      parFiles.flatMap { currentFile =>
        ...
      }.seq
    } finally {
      pool.shutdown()
    }

Member:
I would have expected this to fail some test, but it didn't...

When you fix this, could you call ThreadUtils.newForkJoinPool instead, to set a better thread name?
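For reference, a thread-naming ForkJoinPool can be built with a custom worker-thread factory. This is only a sketch of the idea behind a helper like ThreadUtils.newForkJoinPool; `namedForkJoinPool` and its signature are illustrative, and Spark's actual helper may differ.

```scala
import java.util.concurrent.{ForkJoinPool, ForkJoinWorkerThread}
import java.util.concurrent.ForkJoinPool.ForkJoinWorkerThreadFactory

// Sketch: give pool workers a recognizable name prefix so they are
// easy to spot in a profiler or thread dump.
def namedForkJoinPool(prefix: String, parallelism: Int): ForkJoinPool = {
  val factory = new ForkJoinWorkerThreadFactory {
    override def newThread(pool: ForkJoinPool): ForkJoinWorkerThread = {
      val t = ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool)
      t.setName(s"$prefix-${t.getPoolIndex}")
      t
    }
  }
  new ForkJoinPool(parallelism, factory, null, false)
}
```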

zsxwing (Member) replied May 24, 2017
> Why not do it outside? For example:

Just realized toSeq is lazy. But shutting down in flatMap is also not correct. NVM, I was wrong.

viirya (Member, Author):

@zsxwing @gatorsmile

I was shutting it down outside at the beginning of this PR; I changed to the current way after @srowen's suggestion.

I initially thought it could be wrong, but it seems fine: I think the tasks are all invoked at the beginning and no more tasks are submitted later, so shutting down inside is OK.

I can submit a follow-up if you still think we need to change it. Thank you.

zsxwing (Member) replied May 24, 2017
I didn't check the details, but I guess the implementation submits tasks one by one. Then it's possible that while the first task is shutting down the pool, some tasks have not yet been submitted.
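The hazard can be sketched directly against the JDK API (illustrative only, not the PR's code): once shutdown() has been called, later submissions to the pool are rejected, which is exactly what could happen to tasks not yet submitted when a task inside flatMap shuts the pool down.

```scala
import java.util.concurrent.{ForkJoinPool, RejectedExecutionException}

// If one parallel task shuts the pool down while the driver is still
// submitting the remaining tasks, those later submissions fail.
val pool = new ForkJoinPool(2)
pool.shutdown()
val rejected =
  try {
    pool.submit(new Runnable { def run(): Unit = () })
    false
  } catch {
    case _: RejectedExecutionException => true
  }
```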

viirya (Member, Author):
OK, we should take a safer approach. Let me submit a follow-up for this. Thanks @zsxwing.

asfgit pushed a commit that referenced this pull request May 25, 2017
… files

## What changes were proposed in this pull request?

This is a follow-up to #18073. Taking a safer approach to shutdown the pool to prevent possible issue. Also using `ThreadUtils.newForkJoinPool` instead to set a better thread name.

## How was this patch tested?

Manually test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18100 from viirya/SPARK-20848-followup.

(cherry picked from commit 6b68d61)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request May 25, 2017
asfgit pushed a commit that referenced this pull request May 25, 2017
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
@viirya viirya deleted the SPARK-20848 branch December 27, 2023 18:20
6 participants