
[SPARK-27635][SQL] Prevent from splitting too many partitions smaller than row group size in Parquet file format #24527

Closed
wants to merge 1 commit

Conversation

LantaoJin
Contributor

What changes were proposed in this pull request?

The scenario: multiple jobs are submitted concurrently with Spark dynamic allocation enabled. The issue arises when determining RDD partition numbers: when more CPU cores are available, Spark tries to split the RDD into more pieces. But since the files are stored in Parquet format, the Parquet row group is the basic unit for reading data. Splitting the RDD into too many small pieces doesn't make sense: jobs will launch too many partitions and never complete.
[Screenshot: Screen Shot 2019-05-05 at 5 45 15 PM]

Forcing the default parallelism to a fixed number (for example, 200) can work around the issue.
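
For reference, a minimal sketch of that workaround (the session builder and the value 200 are illustrative, not taken from the actual job):

import org.apache.spark.sql.SparkSession

// Sketch only: pin spark.default.parallelism so that a growing number of
// available cores does not shrink the computed split size.
val spark = SparkSession.builder()
  .appName("fixed-parallelism-workaround") // illustrative app name
  .config("spark.default.parallelism", "200")
  .getOrCreate()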

How was this patch tested?

Existing UTs

@AmplabJenkins

Can one of the admins verify this patch?

    fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes // parquet.block.size
  case _ =>
    FilePartition.maxSplitBytes(fsRelation.sparkSession, selectedPartitions)
}
Member


Hi, @LantaoJin. It would be very helpful if you could provide a test case for the following claim.

Splitting the RDD into too many small pieces doesn't make sense: jobs will launch too many partitions and never complete.

Contributor Author

@LantaoJin May 6, 2019


It may be hard to provide a UT. This case only happens in one of our jobs, which uses multiple threads to read from one HDFS folder and write to different target HDFS folders with different filters. With DRA enabled, the job launched nearly 2000 executors with nearly 8000 active tasks. After the job had run for a while, the task number of the filter/scan stages increased from 200 to over 5000, and we saw many log lines like these:

19/04/29 06:13:48 INFO FileSourceScanExec: Planning scan with bin packing, max size: 129026539 bytes, open cost is considered as scanning 4194304 bytes.
19/04/29 06:13:48 INFO FileSourceScanExec: Planning scan with bin packing, max size: 129026539 bytes, open cost is considered as scanning 4194304 bytes.
19/04/29 06:13:49 INFO FileSourceScanExec: Planning scan with bin packing, max size: 129026539 bytes, open cost is considered as scanning 4194304 bytes.

which later changed to:

19/04/29 06:15:49 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4474908 bytes, open cost is considered as scanning 4194304 bytes.
19/04/29 06:16:15 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
19/04/29 06:16:21 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
19/04/29 06:16:23 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
19/04/29 06:16:23 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.

This issue goes away in four cases:

  1. Set "spark.default.parallelism" to a fixed value.
  2. Disable DRA and set num-executors to a low value.
  3. The app cannot acquire enough resources to launch many executors.
  4. Run the jobs one by one instead of in multiple threads.

All of the above prevent the app from requesting too many partitions, because fewer cores are available:

  override def defaultParallelism(): Int = { //  if not set, more resources, more cores
    conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
  }

  def maxSplitBytes(
      sparkSession: SparkSession,
      selectedPartitions: Seq[PartitionDirectory]): Long = {
    val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
    val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
    val defaultParallelism = sparkSession.sparkContext.defaultParallelism
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    val bytesPerCore = totalBytes / defaultParallelism // more cores, less bytesPerCore
    // less bytesPerCore, less maxSplitBytes
    Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  }

  def splitFiles(
      sparkSession: SparkSession,
      file: FileStatus,
      filePath: Path,
      isSplitable: Boolean,
      maxSplitBytes: Long,
      partitionValues: InternalRow): Seq[PartitionedFile] = {
    if (isSplitable) {
      (0L until file.getLen by maxSplitBytes).map { offset => // less maxSplitBytes, more partitions
        val remaining = file.getLen - offset
        val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
        val hosts = getBlockHosts(getBlockLocations(file), offset, size)
        PartitionedFile(partitionValues, filePath.toUri.toString, offset, size, hosts)
      }
    } else {
      Seq(getPartitionedFile(file, filePath, partitionValues))
    }
  }
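
To make the interaction above concrete, here is a small arithmetic sketch (the input size and core counts are assumptions chosen to roughly match the log lines, not values taken from the PR):

// Illustrative only: shows how a large core count drives maxSplitBytes down
// to openCostInBytes, far below a 128 MB Parquet row group.
object SplitSizeSketch {
  def maxSplitBytes(totalBytes: Long, defaultParallelism: Int): Long = {
    val defaultMaxSplitBytes = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes
    val openCostInBytes = 4L * 1024 * 1024        // spark.sql.files.openCostInBytes
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    val totalBytes = 30L * 1024 * 1024 * 1024     // assume ~30 GB of input
    // 200 cores: bytesPerCore (~154 MB) exceeds the 128 MB cap, so splits stay at 128 MB.
    println(maxSplitBytes(totalBytes, 200))       // 134217728
    // ~8000 cores after DRA ramps up: bytesPerCore (~3.8 MB) falls below the open cost,
    // so splits bottom out at 4 MB and each 128 MB row group is cut into ~32 pieces.
    println(maxSplitBytes(totalBytes, 8000))      // 4194304
  }
}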

FilePartition.maxSplitBytes(fsRelation.sparkSession, selectedPartitions)
val maxSplitBytes = relation.fileFormat match {
  case _: ParquetSource =>
    fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes // parquet.block.size
Member


Shouldn't this be something provided by the ParquetSource?

Contributor Author


Actually, ParquetSource here is an alias of ParquetFileFormat:

import org.apache.spark.sql.execution.datasources.parquet.{ParquetFileFormat => ParquetSource}

Contributor Author


Sorry, please ignore. Currently, FileFormat doesn't provide this; how to split is determined in DataSourceScanExec.

Member

@dongjoon-hyun left a comment


Hi, @LantaoJin.

First, I'm not sure about adding ParquetSource-only logic.

Second, Spark computes the split size as follows, so if you set openCostInBytes=128MB you will get the same result as this PR:

min(spark.sql.files.maxPartitionBytes, max(spark.sql.files.openCostInBytes, bytesPerCore))
= min(128MB, max(128MB, bytesPerCore))
= 128MB

Could you try that way? If you agree, I'd like to close this PR and issue as INVALID.
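
For reference, a minimal sketch of applying that suggestion (assuming an existing SparkSession named spark):

// Sketch of the reviewer's suggestion: raise the open cost to 128 MB so that
// min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)) stays at 128 MB
// no matter how small bytesPerCore becomes.
spark.conf.set("spark.sql.files.openCostInBytes", (128L * 1024 * 1024).toString)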

cc @gatorsmile

@LantaoJin
Contributor Author

@dongjoon-hyun Yes, I see. Close.

@LantaoJin closed this Jun 10, 2019
@dongjoon-hyun
Member

Thank you, @LantaoJin.
