
[SPARK-23457][SQL] Register task completion listeners first in ParquetFileFormat #20619

Closed · wants to merge 3 commits

Conversation

@dongjoon-hyun (Member) commented Feb 15, 2018

What changes were proposed in this pull request?

ParquetFileFormat leaks opened files in some cases. This PR prevents that by registering task completion listeners first, before initialization.

Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
	at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
	at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
	at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
	at 
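
For illustration, a minimal sketch of the reordering. Variable names such as `split`, `hadoopAttemptContext`, and `capacity` come from the surrounding `buildReaderWithPartitionValues` method and are abbreviated here; this is a sketch of the idea, not the literal patch.

  // Old order: initialize first, register the completion listener afterwards.
  // If initialization throws, the listener is never registered and the opened
  // Parquet stream is never closed.
  //
  // New order: register the listener first, then initialize. A failure during
  // initialization is now followed by the listener closing the iterator (and
  // the underlying file) when the task completes.
  val vectorizedReader = new VectorizedParquetRecordReader(
    convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity)
  val iter = new RecordReaderIterator(vectorizedReader)
  taskContext.foreach(_.addTaskCompletionListener(_ => iter.close()))
  vectorizedReader.initialize(split, hadoopAttemptContext)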

How was this patch tested?

Manual. The following test case reproduces the same leakage: setting the vectorized reader's batch size to Int.MaxValue makes the reader fail with an OutOfMemoryError while allocating its columnar batch, after the Parquet file has already been opened, so without the earlier listener registration the open stream is leaked.

  test("SPARK-23457 Register task completion listeners first in ParquetFileFormat") {
    withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> s"${Int.MaxValue}") {
      withTempDir { dir =>
        val basePath = dir.getCanonicalPath
        Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, "first").toString)
        Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, "second").toString)
        val df = spark.read.parquet(
          new Path(basePath, "first").toString,
          new Path(basePath, "second").toString)
        val e = intercept[SparkException] {
          df.collect()
        }
        assert(e.getCause.isInstanceOf[OutOfMemoryError])
      }
    }
  }

@dongjoon-hyun (Member Author)

Hi, @cloud-fan and @gatorsmile.
This is the same kind of opened-file-leak PR, this time for ParquetFileFormat. Could you review it?


// UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
if (parquetReader.isInstanceOf[VectorizedParquetRecordReader] &&
enableVectorizedReader) {
if (enableVectorizedReader) {
Member

Would it be possible to merge this if-statement into the above if-statement?

Member Author

Yep, it looks possible. I'll update it together with any other changes after getting more reviews. Thanks, @kiszk.

Contributor

yea it seems more reasonable to merge this if-else now.
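
For illustration, roughly the merged shape under discussion. This is only a sketch: the initBatch/enableReturningBatches setup calls are assumptions about the surrounding method, not lines quoted from this diff.

  if (enableVectorizedReader) {
    vectorizedReader.initialize(split, hadoopAttemptContext)
    // Assumed columnar-batch setup; the exact calls in ParquetFileFormat may differ.
    vectorizedReader.initBatch(partitionSchema, file.partitionValues)
    if (returningBatch) {
      vectorizedReader.enableReturningBatches()
    }
    // UnsafeRowParquetRecordReader appends the columns internally to avoid another
    // copy, so no separate `if (enableVectorizedReader)` check is needed below.
  } else {
    // Row-based reader path, unchanged.
  }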

@gatorsmile (Member)

}

val iter = new RecordReaderIterator(parquetReader)
taskContext.foreach(_.addTaskCompletionListener(_ => iter.close()))
Member Author

According to the reported leakage, this is too late: if initialization throws before this line is reached, the listener is never registered and the opened stream is never closed.

@SparkQA commented Feb 15, 2018

Test build #87482 has finished for PR 20619 at commit 43f809f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Feb 16, 2018

Moving the registrations to the new (earlier) places looks good to me.

val vectorizedReader = new VectorizedParquetRecordReader(
  convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity)
val recordReaderIterator = new RecordReaderIterator(vectorizedReader)
// Register a task completion listener before `initialization`.
Member

could new VectorizedParquetRecordReader or new RecordReaderIterator fail?

Member Author

Those constructors didn't look heavy to me.

Member

ok

@cloud-fan (Contributor)

can we provide a manual test like the OOM one in your ORC PR?

@dongjoon-hyun (Member Author)

Yep. I'll try this, too, @cloud-fan.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-23390][SQL] Register task completion listerners first in ParquetFileFormat [SPARK-23390][SQL] Register task completion listeners first in ParquetFileFormat Feb 17, 2018
@dongjoon-hyun (Member Author)

The reproducible test case has been added to the PR description, and the code has been updated according to @kiszk's and @cloud-fan's comments.

@cloud-fan (Contributor)

LGTM

@dongjoon-hyun (Member Author) commented Feb 17, 2018

Thank you for the last-minute review before your vacation. I'm lucky. :)

@gatorsmile (Member)

He is already on vacation. : )

@SparkQA commented Feb 17, 2018

Test build #87516 has finished for PR 20619 at commit e08d06c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author) commented Feb 17, 2018

Oh... so it was a review from vacation.

@dongjoon-hyun (Member Author)

Retest this please.

@dongjoon-hyun (Member Author)

The failure is irrelevant to this PR.

org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a sbt.testing.NestedSuiteSelector)

@SparkQA commented Feb 17, 2018

Test build #87518 has finished for PR 20619 at commit e08d06c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

retest this please

@dongjoon-hyun (Member Author)

Thank you for retriggering, @gatorsmile .

@SparkQA commented Feb 17, 2018

Test build #87520 has finished for PR 20619 at commit e08d06c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Feb 17, 2018

Would it be worth adding this JIRA number in a comment, as we did for ORC?

@kiszk (Member) commented Feb 17, 2018

retest this please

@SparkQA commented Feb 17, 2018

Test build #87523 has finished for PR 20619 at commit e08d06c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Feb 17, 2018

Umm, we still see the following exception in the log ...

Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
	at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
	at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:140)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2.apply(OrcFileFormat.scala:197)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2.apply(OrcFileFormat.scala:161)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1834)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2063)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2063)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	... 3 more

@dongjoon-hyun (Member Author)

Yep, @kiszk. @mgaido91 also reported that, so I'm investigating it further.

However, that doesn't mean this approach is not proper. You can see the manual test case examples in the previous ORC-related PR and in this PR. This approach definitely reduces the number of points of failure.

For the remaining issue, I think we may need a different approach in a different code path.

@dongjoon-hyun (Member Author) commented Feb 17, 2018

For the following, I'll create another one.

Would it be worth adding this JIRA number in a comment, as we did for ORC?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-23390][SQL] Register task completion listeners first in ParquetFileFormat [SPARK-23457][SQL] Register task completion listeners first in ParquetFileFormat Feb 17, 2018
@mgaido91 (Contributor)

LGTM

@dongjoon-hyun (Member Author)

Thank you for the review, @mgaido91.

@kiszk (Member) commented Feb 17, 2018

LGTM with one minor comment

@dongjoon-hyun (Member Author)

Thank you, @kiszk. I added SPARK-23390 to the PR description.

Would it be worth adding this JIRA number in a comment, as we did for ORC?

@dongjoon-hyun (Member Author)

Retest this please.

@dongjoon-hyun (Member Author)

Oh, @kiszk. The following actually meant a comment in the code. Sorry, I misunderstood.

Would it be worth adding this JIRA number in a comment, as we did for ORC?

val vectorizedReader = new VectorizedParquetRecordReader(
  convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity)
val iter = new RecordReaderIterator(vectorizedReader)
// SPARK-23457 Register a task completion listener before `initialization`.
Member Author

Now, SPARK-23457 is added.

@SparkQA commented Feb 17, 2018

Test build #87527 has finished for PR 20619 at commit e08d06c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 17, 2018

Test build #87528 has finished for PR 20619 at commit 8bd02d8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

The final failure is irrelevant to this PR.

org.apache.spark.sql.sources.CreateTableAsSelectSuite.(It is not a test it is a sbt.testing.SuiteSelector)

@dongjoon-hyun (Member Author)

Retest this please.

@SparkQA commented Feb 18, 2018

Test build #87533 has finished for PR 20619 at commit 8bd02d8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Feb 18, 2018

retest this please

@SparkQA commented Feb 18, 2018

Test build #87534 has finished for PR 20619 at commit 8bd02d8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Feb 18, 2018

retest this please

@SparkQA commented Feb 18, 2018

Test build #87535 has finished for PR 20619 at commit 8bd02d8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Feb 18, 2018

retest this please.

@SparkQA commented Feb 18, 2018

Test build #87537 has finished for PR 20619 at commit 8bd02d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit asfgit closed this in f5850e7 Feb 20, 2018
@dongjoon-hyun (Member Author)

Thank you all!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-23390 branch February 20, 2018 06:08
@dongjoon-hyun (Member Author)

Hi, @cloud-fan .
Since 2.3 is announced, can we have this in branch-2.3 for Apache Spark 2.3.1?

@cloud-fan (Contributor)

Yea, please go ahead.

@dongjoon-hyun (Member Author)

Thank you, @cloud-fan !
