
[SPARK-27384][SQL] File source V2: Prune unnecessary partition columns #24296

Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

When scanning file sources, we can prune unnecessary partition columns while constructing the input partitions, so that:

  1. Less data is transferred from the Driver to the Executors.
  2. Columnar batch readers become easier to implement, since the partition columns are already pruned (a minimal sketch follows).
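
A minimal sketch of the pruning, using a hypothetical helper (the actual change lives in FileScan/FileScanBuilder):

import org.apache.spark.sql.types.StructType

// Sketch only: keep just the partition columns the scan actually reads,
// so the partition values serialized into each input partition shrink.
def prunePartitionSchema(
    partitionSchema: StructType,   // all partition columns of the table
    readColumnNames: Set[String]   // columns required by the query
  ): StructType = {
  StructType(partitionSchema.fields.filter(f => readColumnNames.contains(f.name)))
}

With the pruned schema in hand, a columnar batch reader can append exactly the partition columns the query needs, instead of filtering them per batch.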

How was this patch tested?

Existing unit tests.

@SparkQA

SparkQA commented Apr 4, 2019

Test build #104276 has finished for PR 24296 at commit 88ce4fc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileScanBuilder(

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Apr 4, 2019

Test build #104278 has finished for PR 24296 at commit 88ce4fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileScanBuilder(

@gengliangwang
Member Author

The test failures should be fixed after #24284 is merged.

@@ -40,7 +45,23 @@ abstract class FileScan(
protected def partitions: Seq[FilePartition] = {
val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty)
val maxSplitBytes = FilePartition.maxSplitBytes(sparkSession, selectedPartitions)
val partitionAttributes = fileIndex.partitionSchema.toAttributes
val attributeMap = partitionAttributes.map(a => getAttributeName(a) -> a).toMap
val readPartitionAttributes = readPartitionSchema.toAttributes.map { readAttr =>
Contributor

We should not create an attribute just to use its name. This can be:

readPartitionSchema.map { readField =>
  attributeMap.get(normalize(readField.name)).getOrElse ...
}
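
For illustration, a fleshed-out version of that suggestion might look like the following. This is a sketch reusing attributeMap and readPartitionSchema from the diff above; normalize is assumed to apply the session's case-sensitivity rules, and the error path is hypothetical, not the merged code:

// Sketch only: resolve each read field against the existing partition
// attributes by (normalized) name, instead of building a throwaway attribute.
val readPartitionAttributes = readPartitionSchema.map { readField =>
  attributeMap.getOrElse(normalize(readField.name),
    throw new IllegalArgumentException(
      s"Cannot find partition column ${readField.name}"))
}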

@@ -52,24 +52,24 @@ import org.apache.spark.util.SerializableConfiguration
case class OrcPartitionReaderFactory(
sqlConf: SQLConf,
broadcastedConf: Broadcast[SerializableConfiguration],
resultSchema: StructType,
Contributor

let's update the parameter doc in the classdoc.
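
For example, the classdoc might be extended along these lines (a sketch: only the parameter names come from the diff, the descriptions are assumptions):

/**
 * A factory used to create Orc partition readers.
 *
 * @param sqlConf SQL configuration.
 * @param broadcastedConf Broadcast serializable Hadoop configuration.
 * @param resultSchema Schema of the rows produced by the reader: the read
 *                     data columns followed by the pruned partition columns.
 */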

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104312 has finished for PR 24296 at commit d16dbab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileScanBuilder(

@gengliangwang
Member Author

retest this please.

StructType(fields)
}

// Define as method instead of value, since `requiredSchema` is mutable.
Contributor

But we should not create a set inside the loop body. How about:

def createRequiredNameSet ...
...
val requiredNameSet = createRequiredNameSet ...
val fields = partitionSchema.fields.filter { field => ... }
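
A hedged sketch of that refactor (the wrapping class and field names are assumptions; only createRequiredNameSet and the filter shape come from the suggestion):

import org.apache.spark.sql.types.StructType

// Sketch only: `requiredSchema` is the mutable schema pushed down to the
// builder; `partitionSchema` is the table's partition schema.
class PartitionSchemaPruning(
    partitionSchema: StructType,
    var requiredSchema: StructType) {

  // A method rather than a value, since `requiredSchema` can change.
  private def createRequiredNameSet(): Set[String] =
    requiredSchema.fields.map(_.name).toSet

  // Build the name set once, outside the filter body.
  def readPartitionSchema(): StructType = {
    val requiredNameSet = createRequiredNameSet()
    val fields = partitionSchema.fields.filter(f => requiredNameSet.contains(f.name))
    StructType(fields)
  }
}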

@cloud-fan
Contributor

LGTM except one comment

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104313 has finished for PR 24296 at commit f9c9986.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104317 has finished for PR 24296 at commit f9c9986.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104318 has finished for PR 24296 at commit b6dab15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!
