
[SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files #11646

Closed
wants to merge 10 commits

Conversation

marmbrus
Contributor

This PR adds a new strategy, FileSourceStrategy, that can be used for planning scans of collections of files that might be partitioned or bucketed.

Compared with the existing planning logic in DataSourceStrategy this version has the following desirable properties:

  • It removes the need to have RDD, broadcastedHadoopConf and other distributed concerns in the public API of org.apache.spark.sql.sources.FileFormat
  • Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
  • It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
  • It natively supports bucketing files into partitions, and thus does not require coalescing / creating a UnionRDD with the correct partitioning.
  • Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm (see the sketch after this list).
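
Below is a minimal, self-contained sketch of the greedy bin-packing idea, for illustration only; the names (FileSplit, packSplits, maxBytesPerPartition) are hypothetical and are not the identifiers used in this PR.

// Hypothetical sketch: greedily pack file splits into partitions so that each
// partition holds roughly `maxBytesPerPartition` bytes. Not this PR's actual code.
case class FileSplit(path: String, start: Long, length: Long)

def packSplits(splits: Seq[FileSplit], maxBytesPerPartition: Long): Seq[Seq[FileSplit]] = {
  val partitions = Seq.newBuilder[Seq[FileSplit]]
  var current = Seq.newBuilder[FileSplit]
  var currentSize = 0L

  // Visiting larger splits first tends to give a tighter packing.
  splits.sortBy(-_.length).foreach { split =>
    if (currentSize > 0 && currentSize + split.length > maxBytesPerPartition) {
      partitions += current.result()
      current = Seq.newBuilder[FileSplit]
      currentSize = 0L
    }
    current += split
    currentSize += split.length
  }
  if (currentSize > 0) partitions += current.result()
  partitions.result()
}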

Currently only a testing source is planned / tested using this strategy. In follow-up PRs we will port the existing formats to this API.

A stub for FileScanRDD is also added, but most methods remain unimplemented.

Other minor cleanups:

  • Partition pruning is pushed into FileCatalog so both the new and old code paths can use this logic. This will also allow future implementations to use indexes or other tricks (e.g. a MySQL metastore)
  • The partitions from the FileCatalog now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
  • Array -> Seq in some internal APIs to avoid unnecessary toArray calls
  • Rename Partition to PartitionDirectory to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@marmbrus
Contributor Author

/cc @cloud-fan @nongli @davies

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52899 has finished for PR 11646 at commit 4f29845.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class PartitionedFile(
    • case class FilePartition(val index: Int, files: Seq[PartitionedFile]) extends Partition
    • class FileScanRDD(
    • case class Partition(values: InternalRow, files: Seq[FileStatus])

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52900 has finished for PR 11646 at commit 7ad3119.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def partitionSpec(schema: Option[StructType]): PartitionSpec
def partitionSpec(): PartitionSpec

def listFiles(filters: Seq[Expression]): Seq[Partition]
Contributor

Is this used to do partition pruning?

Contributor

The name looks confusing: it's called listFiles but returns a Seq[Partition]. Maybe we should call it listPartitions?

Contributor Author

I'll add some comments about the filters. I called it listFiles because it actually enumerates all of the files for each partition internally (unlike prunePartitions, which only selects the partition directories).
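
To make the naming concrete, here is a tiny self-contained conceptual model, not Spark's actual classes (PartitionDir, PartitionFiles, and the predicate are made up for illustration): pruning only selects partition directories by their partition-column values, while listFiles additionally enumerates the files, and therefore the sizes, under each surviving directory.

import java.io.File

// Conceptual stand-ins for illustration; not the PR's types.
case class PartitionDir(values: Map[String, String], path: File)
case class PartitionFiles(values: Map[String, String], files: Seq[File])

def listFilesSketch(
    allPartitions: Seq[PartitionDir],
    pruningPredicate: Map[String, String] => Boolean): Seq[PartitionFiles] = {
  // Prune first: keep only directories whose partition values satisfy the predicate.
  val selected = allPartitions.filter(p => pruningPredicate(p.values))
  // Then enumerate the files (and hence their sizes) under each surviving directory,
  // which is why the real method is named listFiles rather than listPartitions.
  selected.map { p =>
    val files = Option(p.path.listFiles()).map(_.toSeq.filter(_.isFile)).getOrElse(Nil)
    PartitionFiles(p.values, files)
  }
}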

partitionSchema: StructType,
dataSchema: StructType,
filters: Seq[Filter],
options: Map[String, String]): PartitionedFile => Iterator[InternalRow] = {
Contributor Author

@rxin @nongli

Should we make this our own iterator (i.e. with close() but without hasNext())?

Contributor

How can this API support row batch?

Contributor Author

The same way as before (relying on erasure). Before we make this public, we should decide whether we want all external sources to return batches or to have two interfaces.
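
As a rough illustration of the pattern being discussed, the driver builds a function once from the schema and options, and the returned closure is shipped to executors and invoked per file; the sketch below uses simplified stand-in types (PartitionedFileStub, String rows) rather than Spark's internals. Note how a plain Iterator gives no hook for closing the underlying source, which is exactly the close() concern raised above.

import scala.io.Source

// Simplified stand-in for the real PartitionedFile; for illustration only.
case class PartitionedFileStub(path: String, start: Long, length: Long)

// The factory runs on the driver: options are resolved once, up front. The
// returned closure is serialized to executors and called once per file (or split).
def buildReaderSketch(options: Map[String, String]): PartitionedFileStub => Iterator[String] = {
  val delimiter = options.getOrElse("delimiter", ",")
  file => {
    // All per-file work happens lazily inside the closure. A plain Iterator offers
    // no close() for the underlying source, which is the concern raised above.
    Source.fromFile(file.path).getLines().map(_.split(delimiter).mkString(" | "))
  }
}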

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52939 has finished for PR 11646 at commit 65596df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52946 has finished for PR 11646 at commit 35be8d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val scan =
  PhysicalRDD(
    l.output,
    new FileScanRDD(
Contributor

Should we move this part (creating FileScanRDD) into DataSourceScan? We could have a new FileScan for this rule.

Contributor Author

Good idea, that would be cleaner. We can't do that until we get rid of the old code path though (since there are two different types of RDDs we want to create).

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala
@davies
Contributor

davies commented Mar 14, 2016

LGTM

@davies
Contributor

davies commented Mar 14, 2016

Don't forget to create a JIRA for this.

@marmbrus marmbrus changed the title [SPARK-XXXX][SQL] Add a strategy for planning partitioned and bucketed scans of files [SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files Mar 14, 2016
@SparkQA

SparkQA commented Mar 14, 2016

Test build #53123 has finished for PR 11646 at commit 4744b97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class NativeDDLCommand(val sql: String) extends RunnableCommand
    • case class CreateDatabase(
    • case class CreateFunction(
    • case class AlterTableRename(
    • case class AlterTableSetProperties(
    • case class AlterTableUnsetProperties(
    • case class AlterTableSerDeProperties(
    • case class AlterTableStorageProperties(
    • case class AlterTableNotClustered(
    • case class AlterTableNotSorted(
    • case class AlterTableSkewed(
    • case class AlterTableNotSkewed(
    • case class AlterTableNotStoredAsDirs(
    • case class AlterTableSkewedLocation(
    • case class AlterTableAddPartition(
    • case class AlterTableRenamePartition(
    • case class AlterTableExchangePartition(
    • case class AlterTableDropPartition(
    • case class AlterTableArchivePartition(
    • case class AlterTableUnarchivePartition(
    • case class AlterTableSetFileFormat(
    • case class AlterTableSetLocation(
    • case class AlterTableTouch(
    • case class AlterTableCompact(
    • case class AlterTableMerge(
    • case class AlterTableChangeCol(
    • case class AlterTableAddCol(
    • case class AlterTableReplaceCol(
    • case class In(attribute: String, values: Array[Any]) extends Filter

@marmbrus
Contributor Author

Thanks! Merging to master.

@asfgit asfgit closed this in 17eec0a Mar 15, 2016
private[sql] object FileSourceStrategy extends Strategy with Logging {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters, l @ LogicalRelation(files: HadoopFsRelation, _, _))
        if files.fileFormat.toString == "TestFileFormat" =>
Contributor

what does this match for?

Contributor Author

Currently only the test format, but in a follow-up PR we'll add Parquet and eventually remove this check and the old code path.

@cloud-fan
Contributor

Sorry for the late review, LGTM.

/**
* A single file that should be read, along with partition column values that
* need to be prepended to each row. The reading should start at the first
* valid record found after `offset`.
Contributor

offset -> start

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Author: Michael Armbrust <michael@databricks.com>

Closes apache#11646 from marmbrus/fileStrategy.
val splitFiles = selectedPartitions.flatMap { partition =>
  partition.files.flatMap { file =>
    assert(file.getLen != 0)
    (0L to file.getLen by maxSplitBytes).map { offset =>
Contributor

Is it safe to split files ourselves? If it's a text file, how can we guarantee we don't break lines when splitting it?

Contributor

We found that at least ParquetRecordReader and LineRecordReader are designed to support splits that are cut at arbitrary positions. But is this a mandatory convention that applies to all Hadoop record readers?

Contributor

Aww... Unfortunately ORC doesn't obey this convention, and it breaks the ORC version of buildReader() implemented in PR #11936. It seems that we have to have FileFormat tell us how to generate file splits.

Contributor

An example snippet is shown in PR #11936 to illustrate the problem.

One obvious but naive "fix" for this is to include at least one whole file in each partition, but this doesn't work well with large input files.

Contributor

Sorry, false alarm... See here.
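
For background on why cutting a text file at arbitrary byte offsets can be safe, below is a small self-contained sketch of the usual record-reader convention (assumptions: single-byte text, '\n' line endings, one record per line; this follows the spirit of Hadoop's LineRecordReader but is not its actual code): each split except the first discards the first line it encounters, because the previous split reads past its own end to finish that line, so no line is lost or produced twice.

import java.io.RandomAccessFile
import scala.collection.mutable.ArrayBuffer

// Sketch of the convention only, not Hadoop's or Spark's implementation.
def readSplit(path: String, start: Long, length: Long): Seq[String] = {
  val end = start + length
  val in = new RandomAccessFile(path, "r")
  try {
    in.seek(start)
    if (start != 0) in.readLine()     // discard: the previous split owns this line
    val lines = ArrayBuffer[String]()
    var lineStart = in.getFilePointer
    var line = in.readLine()
    // Keep every line that starts at or before `end`, even if it extends past it;
    // lines starting after `end` belong to the next split.
    while (line != null && lineStart <= end) {
      lines += line
      lineStart = in.getFilePointer
      line = in.readLine()
    }
    lines.toSeq
  } finally in.close()
}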

val selectedPartitions = files.location.listFiles(partitionKeyFilters.toSeq)

val filterAttributes = AttributeSet(afterScanFilters)
val requiredExpressions: Seq[NamedExpression] = filterAttributes.toSeq ++ projects
Contributor

What if there is no Project or Filter above the relation? We should read all columns, but here we treat it as if no columns are required.

Contributor Author

I don't think that's how it works. In PhysicalOperation, when there are no projections, we use the output of the child as the list of projections (i.e. all columns).
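
A tiny self-contained illustration of that behavior (simplified stand-ins, not Spark's PhysicalOperation; Column and requiredColumnsSketch are made up): when no Project sits above the relation, the collapsed projection defaults to the relation's full output, so the required columns end up being every column plus whatever the remaining filters reference.

// Simplified model for illustration only.
case class Column(name: String)

def requiredColumnsSketch(
    relationOutput: Seq[Column],
    projects: Option[Seq[Column]],     // None models "no Project above the relation"
    filterColumns: Set[Column]): Seq[Column] = {
  // PhysicalOperation-style collapse: with no explicit projection, fall back to
  // the child's full output, i.e. all columns.
  val effectiveProjects = projects.getOrElse(relationOutput)
  (effectiveProjects ++ filterColumns).distinct
}

// requiredColumnsSketch(Seq(Column("a"), Column("b")), None, Set.empty)
//   returns Seq(Column("a"), Column("b"))  -- all columns are read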
