[SPARK-24906][SQL] Adaptively enlarge split / partition size for Parq… #21868
Conversation
Thanks for the work, but we probably first need consensus before working on this because this part is pretty performance-sensitive... As @viirya described in the JIRA, I think we need a more general approach than the current fix (for example, I'm not sure that this fix doesn't cause any performance degradation).
val PARQUET_ARRAY_LENGTH = buildConf("spark.sql.parquet.array.length")
  .intConf
  .createWithDefault(ArrayType.defaultConcreteType.defaultSize)
ArrayType.defaultConcreteType is ArrayType(NullType, containsNull = true). I think using this you won't get a reasonable number.
Thanks for your comments. I set the default value to StringType.defaultSize (8). It is only a default; users should configure it according to their real data.
  .reduceOption(_ + _).getOrElse(StringType.defaultSize)
val totalColumnSize = relation.dataSchema.map(_.dataType).map(getTypeLength(_))
  .reduceOption(_ + _).getOrElse(StringType.defaultSize)
val multiplier = totalColumnSize / selectedColumnSize
It seems that here you only get the ratio of the number of selected columns to the number of total columns. The actual type sizes are not taken into consideration.
There are many data types: CalendarIntervalType, StructType, MapType, NullType, UserDefinedType, AtomicType (TimestampType, StringType, HiveStringType, BooleanType, DateType, BinaryType, NumericType), ObjectType, and ArrayType. For an AtomicType, the size is fixed to its defaultSize. For complex types, such as StructType, MapType, and ArrayType, the size is variable, so I make it configurable with a default value. With the data type sizes, the multiplier is not just the ratio of the number of selected columns to the number of all columns, but the ratio of the total size of the selected columns to the total size of all columns.
@viirya As defined in getTypeLength, users can configure the complex types' lengths according to their data statistics, and the length for an AtomicType is determined by its defaultSize. So the multiplier is the ratio of the total length of the selected columns to the total length of all columns.
def getTypeLength(dataType: DataType): Int = {
  if (dataType.isInstanceOf[StructType]) {
    fsRelation.sparkSession.sessionState.conf.parquetStructTypeLength
  } else if (dataType.isInstanceOf[ArrayType]) {
    fsRelation.sparkSession.sessionState.conf.parquetArrayTypeLength
  } else if (dataType.isInstanceOf[MapType]) {
    fsRelation.sparkSession.sessionState.conf.parquetMapTypeLength
  } else {
    dataType.defaultSize
  }
}
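For illustration, here is a minimal, self-contained sketch of how this per-type estimation could feed the multiplier. The schema, column names, and the constant lengths below are hypothetical, and typeLength stands in for the PR's getTypeLength without the session-conf lookup:

import org.apache.spark.sql.types._

// Hypothetical per-type length estimates. The PR reads the complex-type
// lengths from SQLConf; here they are plain constants so the sketch runs standalone.
val structTypeLength = 20
val arrayTypeLength = 20
val mapTypeLength = 20

// Same shape as the PR's getTypeLength, minus the SparkSession lookup.
def typeLength(dataType: DataType): Int = dataType match {
  case _: StructType => structTypeLength
  case _: ArrayType  => arrayTypeLength
  case _: MapType    => mapTypeLength
  case other         => other.defaultSize
}

// A made-up table schema and the columns a query actually selects.
val dataSchema = StructType(Seq(
  StructField("device_id", LongType),
  StructField("event", StringType),
  StructField("payload", MapType(StringType, StringType))))
val requiredSchema = StructType(Seq(StructField("device_id", LongType)))

// Ratio of the estimated width of all columns to the width of the selected
// columns; this is what the split size would be scaled by.
val selectedColumnSize = requiredSchema.map(f => typeLength(f.dataType)).sum
val totalColumnSize = dataSchema.map(f => typeLength(f.dataType)).sum
val multiplier = totalColumnSize / selectedColumnSize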
@viirya Now it also supports ORC. Please help review.
BTW, why does this PR target branch-2.3? I think it should be master.
@maropu If I understand correctly, your concern is about how to calculate the multiplier. Here is my idea. A Parquet file has several row groups, and each row group consists of many columns. Typically we set maxPartitionBytes to 128 MB, which is the same as the block size and the row group size. As recommended by Parquet, the row group size and HDFS block size should be 512 MB or 1 GB; in our environment it is 512 MB.

In DataSourceScanExec, all files are split according to maxPartitionBytes (default 128 MB) and openCostInBytes (4 MB), no matter whether they are columnar or row-based files, and each task reads one split / partition (typically 128 MB). For a row-based file, the bytes read for a split are about 128 MB. But for a columnar file, the bytes read are typically much smaller than 128 MB, because a split / row group consists of several columns and only a few of them are read due to column pruning. As a result, the bytes read by a task are very small.

So I want to make sure a task reads about 128 MB (or whatever value follows from maxPartitionBytes and openCostInBytes), and the idea is to enlarge the split / partition according to how the columns are pruned. I don't think there will be a performance downgrade if the bytes read are approximately equal to maxPartitionBytes, because with this fix the task will read the same data volume whether the file is columnar or row-based.
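As a rough sketch of that idea (not the PR's exact code; the helper and the numbers are illustrative):

// Scale the configured split size by the estimated ratio of total column
// width to selected column width, so a task that reads only the selected
// columns still ingests roughly maxPartitionBytes of data.
def adaptiveSplitSize(maxPartitionBytes: Long, multiplier: Long): Long =
  maxPartitionBytes * math.max(multiplier, 1L)

// With a 128 MB split size and a pruning ratio of 20, each file split grows
// to about 2.5 GB, which decodes to roughly 128 MB of selected-column data.
val enlarged = adaptiveSplitSize(128L * 1024 * 1024, 20L)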
Is this a Parquet-specific issue? e.g., how about ORC?
@maropu Thanks for your comments. ORC can also benefit from this change since ORC is also a columnar file format. Do you think I should add ORC support by changing the line below accordingly?
??? Why does this still target branch-2.3? Is this a backport?
@HyukjinKwon Thanks for your comments. I will submit it to master soon.
Hi @HyukjinKwon, I moved the change to the master branch just now. Please help review.
val selectedColumnSize = requiredSchema.map(_.dataType).map(getTypeLength(_))
  .reduceOption(_ + _).getOrElse(StringType.defaultSize)
val totalColumnSize = relation.dataSchema.map(_.dataType).map(getTypeLength(_))
  .reduceOption(_ + _).getOrElse(StringType.defaultSize)
The type-based estimation is very rough. It is still hard for end users to decide the initial size.
@gatorsmile The goal of this change is not to make it easier for users to set the partition size. Instead, when a user sets the partition size, this change tries its best to make the actual read size close to the configured value. Without this change, when a user sets the partition size to 128 MB, the actual read size may be 1 MB or even smaller because of column pruning.
I think his point is that the estimation is super rough, which I agree with. I am less sure whether we should go ahead or not, partially for this reason as well.
@HyukjinKwon I agree that the estimation is rough, especially for complex types. For AtomicType it works better, and at least it takes column pruning into consideration.
@@ -25,17 +25,16 @@ import java.util.zip.Deflater
import scala.collection.JavaConverters._
import scala.collection.immutable
import scala.util.matching.Regex
Please don't remove these blank lines. Can you revert it back?
@viirya Ok, I added it back
val PARQUET_ARRAY_LENGTH = buildConf("spark.sql.parquet.array.length")
  .doc("Set the default size of array column")
  .intConf
  .createWithDefault(StringType.defaultSize)
This feature introduces quite a few configs; my concern is that it is hard for end users to set them.
val PARQUET_STRUCT_LENGTH = buildConf("spark.sql.parquet.struct.length")
  .doc("Set the default size of struct column")
  .intConf
  .createWithDefault(StringType.defaultSize)
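If the feature landed as proposed, usage would presumably look like the sketch below. The config keys are the ones shown in this PR (they are not in released Spark, and the adaptiveFileSplit key is renamed later in this thread); the byte values are made-up estimates a user would derive from their own data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("adaptive-split-demo").getOrCreate()

// Enable the adaptive file-split behaviour proposed in this PR and provide
// rough per-value sizes for the complex columns of this particular table.
spark.conf.set("spark.sql.parquet.adaptiveFileSplit", "true")
spark.conf.set("spark.sql.parquet.struct.length", "64")  // avg bytes per struct value (estimate)
spark.conf.set("spark.sql.parquet.array.length", "128")  // avg bytes per array value (estimate)
spark.conf.set("spark.sql.parquet.map.length", "256")    // avg bytes per map value (estimate)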
And these configs assume that different storage formats use the same size?
@habren, BTW, just for clarification, you can set a bigger number for spark.sql.files.maxPartitionBytes.
@@ -459,6 +460,29 @@ object SQLConf {
    .intConf
    .createWithDefault(4096)

  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
This configuration doesn't look specific to parquet anymore.
@HyukjinKwon I changed the configuration name just now
@@ -459,6 +460,29 @@ object SQLConf {
    .intConf
    .createWithDefault(4096)

  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
"it's": I would avoid the abbreviation.
I updated it accordingly
val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
  .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
    "columns are needed. So, it's better to make sure that the total size of the selected " +
    "columns is about 128 MB "
It does not sound like this describes what the configuration actually does.
I updated the description just now. Please help to review it again. Thanks a lot @HyukjinKwon
  .intConf
  .createWithDefault(StringType.defaultSize)

val PARQUET_MAP_LENGTH = buildConf("spark.sql.parquet.map.length")
I wouldn't do this. It makes things more complicated, and I would just set a bigger number for maxPartitionBytes.
Yeah, I was thinking that.
@HyukjinKwon @viirya Setting spark.sql.files.maxPartitionBytes explicitly does work. For you or other advanced users, it is convenient to set a bigger maxPartitionBytes.
But for ad-hoc queries, the selected columns differ from query to query, and it is not convenient, or even impossible, for users to set a different maxPartitionBytes for each query, as sketched below.
And for general users (non-advanced users), it is not easy to calculate a proper value of maxPartitionBytes.
You know, in many big companies there may be only one or a few teams familiar with the details of Spark, and they maintain the Spark cluster. Other teams are general users of Spark who care more about their business, such as data warehouse build-up and recommendation algorithms. This feature tries to handle it dynamically even when users are not familiar with Spark.
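For context, the workaround the reviewers suggest amounts to per-query tuning of the existing setting, roughly as below (assuming a SparkSession named spark and the test2 table from the description; the 2 GB value is an arbitrary example a user would have to derive from their own schema):

// Manual alternative: bump the existing split-size config before a heavily
// column-pruned query, then restore the default. The right value depends on
// which columns each query selects, which is the burden this PR tries to remove.
spark.conf.set("spark.sql.files.maxPartitionBytes", 2L * 1024 * 1024 * 1024)
spark.sql("SELECT count(device_id) FROM test2 WHERE date = 20180708 AND hour = '23'").show()
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)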
…uet and Orc. Change parquet to columnar
@HyukjinKwon Yes this is to handle it dynamically.
Can one of the admins verify this patch?
I don't think we should fix this. Basically the dynamic estimation logic is too flaky, and I don't think we need this for the current status. Let's not add it for now. While I am revisiting old PRs, I am trying to suggest closing PRs that look unlikely to be merged. Let me suggest closing this one for now, but please feel free to recreate a PR if you strongly feel this is needed in Spark. No objection.
Please refer to https://issues.apache.org/jira/browse/SPARK-24906 for more details and test results.
For a columnar file format such as Parquet, when Spark SQL reads a table, each split will be 128 MB by default since spark.sql.files.maxPartitionBytes defaults to 128 MB. Even when a user sets it to a larger value, such as 512 MB, a task may read only a few MB or even hundreds of KB, because the table may consist of dozens of columns while the SQL statement needs only a few of them, and Spark prunes the unnecessary columns.
In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes adaptively.
For example, suppose there are 40 columns: 20 are integers and the other 20 are longs. When a query selects one integer column and one long column, maxPartitionBytes should be 20 times larger: (20*4 + 20*8) / (4 + 8) = 20.
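A quick check of that arithmetic using Spark's per-type default sizes (a minimal sketch of the hypothetical 40-column schema above, not code from the PR):

import org.apache.spark.sql.types.{IntegerType, LongType}

val intCols = 20
val longCols = 20
// Total estimated width of all 40 columns vs. the two selected columns.
val totalColumnSize = intCols * IntegerType.defaultSize + longCols * LongType.defaultSize  // 20*4 + 20*8 = 240
val selectedColumnSize = IntegerType.defaultSize + LongType.defaultSize                    // 4 + 8 = 12
val multiplier = totalColumnSize / selectedColumnSize                                      // 240 / 12 = 20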
With this optimization, the number of tasks will be smaller and the job will run faster. More importantly, for a very large cluster (more than 10 thousand nodes), it will relieve the resource manager's scheduling pressure.
Here is the test:
The table named test2 has more than 40 columns, and there are more than 5 TB of data each hour.
When we issue a very simple query
select count(device_id) from test2 where date=20180708 and hour='23'
there are 72176 tasks and the duration of the job is 4.8 minutes.
Most tasks last less than 1 second and read less than 1.5 MB of data.
After the optimization, there are only 1615 tasks and the job lasts only 30 seconds. It is almost 10 times faster.
The median read size is 44.2 MB.
https://issues.apache.org/jira/browse/SPARK-24906