
[SPARK-27384][SQL] File source V2: Prune unnecessary partition columns #24296

Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

When scanning file sources, we can prune unnecessary partition columns while constructing the input partitions, so that:

  1. Less data is transferred from the Driver to the Executors.
  2. Columnar batch readers become easier to implement, since the partition columns are already pruned (a minimal sketch follows).
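
A minimal sketch of the pruning, using a hypothetical helper (the actual change lives in FileScan/FileScanBuilder):

import org.apache.spark.sql.types.StructType

// Sketch only: keep just the partition columns the scan actually reads,
// so the partition values serialized into each input partition shrink.
def prunePartitionSchema(
    partitionSchema: StructType,   // all partition columns of the table
    readColumnNames: Set[String]   // columns required by the query
  ): StructType = {
  StructType(partitionSchema.fields.filter(f => readColumnNames.contains(f.name)))
}

With the pruned schema in hand, a columnar batch reader can append exactly the partition columns the query needs, instead of filtering them per batch.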

How was this patch tested?

Existing unit tests.

@SparkQA

SparkQA commented Apr 4, 2019

Test build #104276 has finished for PR 24296 at commit 88ce4fc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileScanBuilder(

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Apr 4, 2019

Test build #104278 has finished for PR 24296 at commit 88ce4fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileScanBuilder(

@gengliangwang
Member Author

The test failures should be fixed after #24284 is merged.

@@ -40,7 +45,23 @@ abstract class FileScan(
protected def partitions: Seq[FilePartition] = {
val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty)
val maxSplitBytes = FilePartition.maxSplitBytes(sparkSession, selectedPartitions)
val partitionAttributes = fileIndex.partitionSchema.toAttributes
val attributeMap = partitionAttributes.map(a => getAttributeName(a) -> a).toMap
val readPartitionAttributes = readPartitionSchema.toAttributes.map { readAttr =>
Contributor

We should not create an attribute just to use its name. This can be:

readPartitionSchema.map { readField =>
  attributeMap.get(normalize(readField.name)).getOrElse ...
}
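
For illustration, a fleshed-out version of that suggestion might look like the following. This is a sketch reusing attributeMap and readPartitionSchema from the diff above; normalize is assumed to apply the session's case-sensitivity rules, and the error path is hypothetical, not the merged code:

// Sketch only: resolve each read field against the existing partition
// attributes by (normalized) name, instead of building a throwaway attribute.
val readPartitionAttributes = readPartitionSchema.map { readField =>
  attributeMap.getOrElse(normalize(readField.name),
    throw new IllegalArgumentException(
      s"Cannot find partition column ${readField.name}"))
}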

@@ -52,24 +52,24 @@ import org.apache.spark.util.SerializableConfiguration
case class OrcPartitionReaderFactory(
sqlConf: SQLConf,
broadcastedConf: Broadcast[SerializableConfiguration],
resultSchema: StructType,
Contributor

let's update the parameter doc in the classdoc.
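
For example, the classdoc might be extended along these lines (a sketch: only the parameter names come from the diff, the descriptions are assumptions):

/**
 * A factory used to create Orc partition readers.
 *
 * @param sqlConf SQL configuration.
 * @param broadcastedConf Broadcast serializable Hadoop configuration.
 * @param resultSchema Schema of the rows produced by the reader: the read
 *                     data columns followed by the pruned partition columns.
 */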

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104312 has finished for PR 24296 at commit d16dbab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileScanBuilder(

@gengliangwang
Member Author

retest this please.

StructType(fields)
}

// Define as method instead of value, since `requiredSchema` is mutable.
Contributor

But we should not create a set inside the loop body. How about:

def createRequiredNameSet ...
...
val requiredNameSet = createRequiredNameSet ...
val fields = partitionSchema.fields.filter { field => ... }
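
A hedged sketch of that refactor (the wrapping class and field names are assumptions; only createRequiredNameSet and the filter shape come from the suggestion):

import org.apache.spark.sql.types.StructType

// Sketch only: `requiredSchema` is the mutable schema pushed down to the
// builder; `partitionSchema` is the table's partition schema.
class PartitionSchemaPruning(
    partitionSchema: StructType,
    var requiredSchema: StructType) {

  // A method rather than a value, since `requiredSchema` can change.
  private def createRequiredNameSet(): Set[String] =
    requiredSchema.fields.map(_.name).toSet

  // Build the name set once, outside the filter body.
  def readPartitionSchema(): StructType = {
    val requiredNameSet = createRequiredNameSet()
    val fields = partitionSchema.fields.filter(f => requiredNameSet.contains(f.name))
    StructType(fields)
  }
}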

@cloud-fan
Contributor

LGTM except one comment

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104313 has finished for PR 24296 at commit f9c9986.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104317 has finished for PR 24296 at commit f9c9986.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104318 has finished for PR 24296 at commit b6dab15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!
