
[SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path #23383

Closed · wants to merge 31 commits

Conversation

@gengliangwang (Member) commented Dec 26, 2018

What changes were proposed in this pull request?

Create a framework for file source V2 based on the data source V2 API.
As a good example of the framework, this PR also migrates the ORC source: the ORC file source supports both row-based and columnar scans, and its implementation is simpler than Parquet's.

Note: since only the read path of the V2 API is currently done, this framework and migration cover the read path only.
The following scans are supported:

  • Scan ColumnarBatch
  • Scan UnsafeRow
  • Push down filters
  • Push down required columns

Not supported (due to limitations of the data source V2 API):

  • Stats metrics
  • Catalog table
  • Writes

How was this patch tested?

Unit test
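
As a quick illustration of the pushdowns listed above, a query against an ORC table can prune columns and push filters into the scan. This is a hedged sketch: the input path is hypothetical, and the exact explain() output varies by version.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-v2-pushdown-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical input path; assumes an ORC dataset with id and name columns.
val df = spark.read.format("orc").load("/tmp/users.orc")

// Both the column list and the filter can be pushed down into the ORC scan;
// the physical plan should show the pushed filters and a pruned read schema.
df.select("id", "name").where($"id" > 100).explain()
```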

@gengliangwang (Member, Author)

There was an earlier PR for this: #20933. This one migrates ORC using the latest data source V2 API.

@SparkQA commented Dec 26, 2018

Test build #100446 has finished for PR 23383 at commit ed1a7fe.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FallBackFileDataSourceToV1(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • case class FilePartition(index: Int, files: Seq[PartitionedFile])
  • class EmptyPartitionReader[T] extends PartitionReader[T]
  • trait FileDataSourceV2 extends TableProvider with DataSourceRegister
  • class FilePartitionReader[T](
  • abstract class FilePartitionReaderFactory extends PartitionReaderFactory
  • abstract class FileScan(sparkSession: SparkSession,
  • abstract class FileScanBuilder(
  • abstract class FileTable(options: DataSourceOptions, userSpecifiedSchema: Option[StructType])
  • class PartitionRecordReader[T](
  • class PartitionRecordDReaderWithProject[X, T](
  • class OrcDataSourceV2 extends FileDataSourceV2
  • case class OrcPartitionReaderFactory(
  • case class OrcScan(
  • case class OrcScanBuilder(
  • case class OrcTable(options: DataSourceOptions, userSpecifiedSchema: Option[StructType])

@HyukjinKwon (Member)

retest this please

@SparkQA commented Dec 26, 2018

Test build #100454 has finished for PR 23383 at commit ed1a7fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FallBackFileDataSourceToV1(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • case class FilePartition(index: Int, files: Seq[PartitionedFile])
  • class EmptyPartitionReader[T] extends PartitionReader[T]
  • trait FileDataSourceV2 extends TableProvider with DataSourceRegister
  • class FilePartitionReader[T](
  • abstract class FilePartitionReaderFactory extends PartitionReaderFactory
  • abstract class FileScan(sparkSession: SparkSession,
  • abstract class FileScanBuilder(
  • abstract class FileTable(options: DataSourceOptions, userSpecifiedSchema: Option[StructType])
  • class PartitionRecordReader[T](
  • class PartitionRecordDReaderWithProject[X, T](
  • class OrcDataSourceV2 extends FileDataSourceV2
  • case class OrcPartitionReaderFactory(
  • case class OrcScan(
  • case class OrcScanBuilder(
  • case class OrcTable(options: DataSourceOptions, userSpecifiedSchema: Option[StructType])

@SparkQA commented Jan 2, 2019

Test build #100639 has finished for PR 23383 at commit 01c7b07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileTable(
  • case class OrcTable(

@SparkQA commented Jan 3, 2019

Test build #100686 has finished for PR 23383 at commit f925e8f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

retest this please.

@gengliangwang changed the title from [WIP][SPARK-23817][SQL] Migrate ORC file format read path to data source V2 to [SPARK-23817][SQL] Migrate ORC file format read path to data source V2 on Jan 3, 2019
@gengliangwang (Member, Author)

Pending #23387 being merged.

@SparkQA commented Jan 3, 2019

Test build #100693 has finished for PR 23383 at commit f925e8f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

retest this please.

@SparkQA commented Jan 3, 2019

Test build #100697 has finished for PR 23383 at commit e1242a4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FallBackFileDataSourceToV1(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • case class FilePartition(index: Int, files: Seq[PartitionedFile])
  • class EmptyPartitionReader[T] extends PartitionReader[T]
  • trait FileDataSourceV2 extends TableProvider with DataSourceRegister
  • class FilePartitionReader[T](
  • abstract class FilePartitionReaderFactory extends PartitionReaderFactory
  • abstract class FileScan(
  • abstract class FileScanBuilder(fileIndex: PartitioningAwareFileIndex, schema: StructType)
  • abstract class FileTable(
  • class PartitionRecordReader[T](
  • class PartitionRecordDReaderWithProject[X, T](
  • class OrcDataSourceV2 extends FileDataSourceV2
  • case class OrcPartitionReaderFactory(
  • case class OrcScan(
  • case class OrcScanBuilder(
  • case class OrcTable(

@SparkQA commented Jan 3, 2019

Test build #100701 has finished for PR 23383 at commit e1242a4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FallBackFileDataSourceToV1(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • case class FilePartition(index: Int, files: Seq[PartitionedFile])
  • class EmptyPartitionReader[T] extends PartitionReader[T]
  • trait FileDataSourceV2 extends TableProvider with DataSourceRegister
  • class FilePartitionReader[T](
  • abstract class FilePartitionReaderFactory extends PartitionReaderFactory
  • abstract class FileScan(
  • abstract class FileScanBuilder(fileIndex: PartitioningAwareFileIndex, schema: StructType)
  • abstract class FileTable(
  • class PartitionRecordReader[T](
  • class PartitionRecordDReaderWithProject[X, T](
  • class OrcDataSourceV2 extends FileDataSourceV2
  • case class OrcPartitionReaderFactory(
  • case class OrcScan(
  • case class OrcScanBuilder(
  • case class OrcTable(

@gengliangwang (Member, Author)

retest this please.

@SparkQA commented Jan 4, 2019

Test build #100718 has finished for PR 23383 at commit e1242a4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FallBackFileDataSourceToV1(sparkSession: SparkSession) extends Rule[LogicalPlan]
  • case class FilePartition(index: Int, files: Seq[PartitionedFile])
  • class EmptyPartitionReader[T] extends PartitionReader[T]
  • trait FileDataSourceV2 extends TableProvider with DataSourceRegister
  • class FilePartitionReader[T](
  • abstract class FilePartitionReaderFactory extends PartitionReaderFactory
  • abstract class FileScan(
  • abstract class FileScanBuilder(fileIndex: PartitioningAwareFileIndex, schema: StructType)
  • abstract class FileTable(
  • class PartitionRecordReader[T](
  • class PartitionRecordDReaderWithProject[X, T](
  • class OrcDataSourceV2 extends FileDataSourceV2
  • case class OrcPartitionReaderFactory(
  • case class OrcScan(
  • case class OrcScanBuilder(
  • case class OrcTable(

@cloud-fan (Contributor)

retest this please

@SparkQA commented Jan 16, 2019

Test build #101325 has finished for PR 23383 at commit 30e0481.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 17, 2019

Test build #101347 has finished for PR 23383 at commit 6e87532.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DummyReadOnlyFileTable extends Table with SupportsBatchRead

@gengliangwang (Member, Author)

retest this please.

@SparkQA commented Jan 17, 2019

Test build #101359 has finished for PR 23383 at commit 6e87532.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DummyReadOnlyFileTable extends Table with SupportsBatchRead

```scala
// Build a Hadoop configuration from the data source options, then resolve
// and validate the specified root paths (globbing when necessary).
val hadoopConf =
  sparkSession.sessionState.newHadoopConfWithOptions(options.asMap().asScala.toMap)
val rootPathsSpecified = DataSource.checkAndGlobPathIfNecessary(filePaths, hadoopConf,
  checkEmptyGlobPath = true, checkFilesExist = options.checkFilesExist())
```
Contributor:

We should revisit this later. It doesn't make sense to have different file-listing behaviors between read and write.

I'm fine with it as a workaround for now. In the follow-up we can remove it, fix the tests, and accept the behavior changes for DS v2.

Member Author:

Yes, we can always set checkFilesExist to false here.
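
For context, a hedged sketch of the asymmetry discussed here: a read wants to fail fast on missing inputs, while a write target may legitimately not exist yet. The helper below is illustrative only, not the real DataSource.checkAndGlobPathIfNecessary.

```scala
import java.nio.file.{Files, Paths}

// Illustrative stand-in for path validation with a checkFilesExist switch.
def resolvePaths(paths: Seq[String], checkFilesExist: Boolean): Seq[String] = {
  paths.map { p =>
    // On the read path, a missing input is an error; on the write path the
    // output location is typically created later, so the check is skipped.
    if (checkFilesExist && !Files.exists(Paths.get(p)))
      throw new IllegalArgumentException(s"Path does not exist: $p")
    p
  }
}

// resolvePaths(Seq("/data/in.orc"), checkFilesExist = true)  // read path
// resolvePaths(Seq("/data/out"), checkFilesExist = false)    // write path
```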

@cloud-fan (Contributor)

LGTM, merging to master!

@cloud-fan closed this in c0632ce on Jan 17, 2019
@gengliangwang (Member, Author)

@cloud-fan @dongjoon-hyun @gatorsmile Thanks for the review. I will come up with the file write path migration very soon.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ion columns

## What changes were proposed in this pull request?

Currently OrcColumnarBatchReader returns all the partition column values in the batch read.
In data source V2, we can improve it by returning only the required partition column values.

This PR is part of apache#23383. As cloud-fan suggested, it was created as a separate PR to make review easier.

Also, this PR doesn't improve `OrcFileFormat`, since in the method `buildReaderWithPartitionValues` the `requiredSchema` filters out all the partition columns, so we can't know which partition columns are required.

## How was this patch tested?

Unit test

Closes apache#23387 from gengliangwang/refactorOrcColumnarBatch.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
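
A hedged sketch of the idea in this commit: instead of appending every partition value to each batch, the reader materializes only the partition columns that appear in the required schema. The names below are illustrative, not the actual reader internals.

```scala
// Illustrative only: keep just the partition values the scan asked for.
case class Field(name: String)

def requiredPartitionValues(
    requiredFields: Seq[Field],
    partitionFields: Seq[Field],
    partitionValues: Map[String, Any]): Seq[Any] = {
  val required = requiredFields.map(_.name).toSet
  // A partition column is materialized only if the query requires it.
  partitionFields.collect {
    case f if required.contains(f.name) => partitionValues(f.name)
  }
}
```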
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ad path

## What changes were proposed in this pull request?
Create a framework for file source V2 based on the data source V2 API.
As a good example of the framework, this PR also migrates the ORC source: the ORC file source supports both row-based and columnar scans, and its implementation is simpler than Parquet's.

Note: since only the read path of the V2 API is currently done, this framework and migration cover the read path only.
The following scans are supported:
- Scan ColumnarBatch
- Scan UnsafeRow
- Push down filters
- Push down required columns

Not supported (due to limitations of the data source V2 API):
- Stats metrics
- Catalog table
- Writes

## How was this patch tested?

Unit test

Closes apache#23383 from gengliangwang/latest_orcV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ex in the write path

## What changes were proposed in this pull request?

In apache#23383, the file source V2 framework is implemented. In that PR, `FileIndex` is created as a member of `FileTable`, so that we can implement partition pruning like apache@0f9fcab in the future (as the data source V2 catalog is under development, partition pruning was removed from that PR).

However, after the write path of file source V2 was implemented, I found that a simple write creates an unnecessary `FileIndex`, which is required by `FileTable`. This is a sort of regression, and we can see a warning message when writing to ORC files:
```
WARN InMemoryFileIndex: The directory file:/tmp/foo was not found. Was it deleted very recently?
```
This PR makes `FileIndex` a lazy value in `FileTable`, so that we can avoid creating an unnecessary `FileIndex` in the write path.

## How was this patch tested?

Existing unit test

Closes apache#23774 from gengliangwang/moveFileIndexInV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
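
The fix above leans on Scala's lazy val semantics: initialization runs on first access, so a write-only code path never triggers the file listing. A minimal hedged illustration (not the actual FileTable code):

```scala
// Eager member: the (expensive) listing runs as soon as the table is created.
class EagerTable(paths: Seq[String]) {
  val index: Seq[String] = { println("listing files..."); paths }
}

// Lazy member: the listing is deferred until the index is actually read,
// so a pure write never pays for it.
class LazyTable(paths: Seq[String]) {
  lazy val index: Seq[String] = { println("listing files..."); paths }
}

object LazyIndexDemo extends App {
  new EagerTable(Seq("/tmp/foo"))         // prints "listing files..."
  val t = new LazyTable(Seq("/tmp/foo"))  // prints nothing yet
  t.index                                 // now prints "listing files..."
}
```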
mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
…ex in the write path
