[SPARK-27049][SQL] Support handling partition values in the abstraction of file source V2 #23961

gengliangwang · 2019-03-04T16:05:02Z

What changes were proposed in this pull request?

In FileFormat, the method buildReaderWithPartitionValues appends the partition values to the end of each row returned by buildReader, so that data sources like CSV/JSON/AVRO only need to implement buildReader to read a single file, without taking care of partition values.
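
As a rough illustration of that contract, here is a minimal, self-contained sketch. It is a simplified model and not Spark's actual code: `Row`, `PartitionedFile`, `SimpleFileFormat` and `ToyCsv` are hypothetical stand-ins, and the real FileFormat methods work on InternalRow and take more parameters.

```scala
// Simplified model only; these types are stand-ins, not Spark's real classes.
object PartitionValuesSketch {
  type Row = Seq[Any]

  // Hypothetical stand-in for a file split plus its partition values.
  final case class PartitionedFile(path: String, partitionValues: Row)

  trait SimpleFileFormat {
    // Format-specific: read only the data columns of a single file.
    def buildReader(file: PartitionedFile): Iterator[Row]

    // Shared: append the file's partition values to the end of every data row.
    def buildReaderWithPartitionValues(file: PartitionedFile): Iterator[Row] =
      buildReader(file).map(row => row ++ file.partitionValues)
  }

  // A toy format that "reads" two fixed rows from any file.
  object ToyCsv extends SimpleFileFormat {
    def buildReader(file: PartitionedFile): Iterator[Row] =
      Iterator(Seq("a", 1), Seq("b", 2))
  }
}
```

In this model a concrete format such as `ToyCsv` implements only `buildReader`; the partition columns are added by the inherited method.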

This PR proposes to support handling partition values in file source v2 abstraction by:

Having two methods, buildReader and buildReaderWithPartitionValues, in FilePartitionReaderFactory, with exactly the same meaning as they have in FileFormat.
Renaming buildColumnarReader to buildColumnarReaderWithPartitionValues to make the naming consistent.

How was this patch tested?

Existing unit tests.

SparkQA · 2019-03-04T17:51:21Z

Test build #103002 has finished for PR 23961 at commit f2f5668.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class FilePartitionReaderFactory(

gengliangwang · 2019-03-05T03:27:28Z

retest this please.

gengliangwang · 2019-03-05T03:27:33Z

@cloud-fan

SparkQA · 2019-03-05T06:46:45Z

Test build #103021 has finished for PR 23961 at commit f2f5668.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class FilePartitionReaderFactory(

cloud-fan · 2019-03-05T07:33:17Z

...ore/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PartitionRecordReader.scala

@@ -29,13 +29,3 @@ class PartitionRecordReader[T](

  override def close(): Unit = rowReader.close()
 }
-
-class PartitionRecordReaderWithProject[X, T](


Can we make this class append partition values?

I feel it's weird to have these 3 methods in FilePartitionReaderFactory: buildReader, buildReaderWithPartitionValues and buildColumnarReaderWithPartitionValues. They are not symmetrical.

gengliangwang · 2019-03-05

Which class do you mean? I think it is a bit complex to refactor the current ColumnarBatch to support appending partition values. I should create another JIRA for it. What do you think?

cloud-fan · 2019-03-05

These are just different ways to abstract the code: 1) add common methods in the parent class, or 2) create util classes.

In this case I feel 2) is better, because the added methods in the parent class are not symmetrical.
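
For illustration only, option 2) could look roughly like the sketch below: a wrapper reader that appends partition values to whatever an underlying row reader returns. The names `PartitionValuesAppendingReader` and `InMemoryReader` are hypothetical and are not taken from this PR or from #23987. The local `PartitionReader` trait only mirrors the next/get/close shape of a DataSource V2 reader so that the example is self-contained.

```scala
// Sketch under stated assumptions; not the code from this PR or its follow-up.
object PartitionReaderUtilSketch {
  type Row = Seq[Any]

  // Local stand-in mirroring the shape of a DataSource V2 partition reader.
  trait PartitionReader[T] extends java.io.Closeable {
    def next(): Boolean
    def get(): T
  }

  // Hypothetical utility: wraps any row reader and appends partition values.
  final class PartitionValuesAppendingReader(
      underlying: PartitionReader[Row],
      partitionValues: Row) extends PartitionReader[Row] {
    override def next(): Boolean = underlying.next()
    override def get(): Row = underlying.get() ++ partitionValues
    override def close(): Unit = underlying.close()
  }

  // Toy reader over in-memory rows, used only to exercise the wrapper.
  final class InMemoryReader(rows: Seq[Row]) extends PartitionReader[Row] {
    private val it = rows.iterator
    private var current: Row = Seq.empty
    override def next(): Boolean =
      if (it.hasNext) { current = it.next(); true } else false
    override def get(): Row = current
    override def close(): Unit = ()
  }
}
```

With this shape the factory would not need a separate buildReaderWithPartitionValues method; a caller wraps the plain reader where partition columns are required.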

gengliangwang · 2019-03-06T10:09:25Z

Close this one and create #23987

refactor

f2f5668

cloud-fan reviewed Mar 5, 2019

gengliangwang mentioned this pull request Mar 6, 2019

[SPARK-27049][SQL] Create util class to support handling partition values in file source V2 #23987

Closed

gengliangwang closed this Mar 6, 2019
