
[SPARK-27034][SQL] Nested schema pruning for ORC #23943

Closed
wants to merge 14 commits into master from viirya:nested-schema-pruning-orc

Conversation

@viirya (Member) commented Mar 3, 2019

What changes were proposed in this pull request?

Previously, nested schema pruning was supported only for Parquet. This PR proposes to support nested schema pruning for ORC as well.

Note: this only covers ORC v1. For ORC v2, the necessary change is in the schema pruning rule; it is left as a TODO item to keep this review manageable.
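To illustrate what the optimization does at the user level, here is a minimal, hedged sketch (the path is hypothetical and the explain output is abbreviated):

val df = spark.read.format("orc").load("/tmp/contacts")  // hypothetical data with a struct column "name"
df.selectExpr("name.middle").explain()
// Without nested schema pruning, the scan reads the whole struct:
//   ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
// With this patch (and the pruning flag enabled), only the selected leaf is read:
//   ReadSchema: struct<name:struct<middle:string>>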

How was this patch tested?

Added tests.

@SparkQA commented Mar 3, 2019

Test build #102949 has finished for PR 23943 at commit 4e71393.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor)

retest this please

@SparkQA commented Mar 3, 2019

Test build #102950 has finished for PR 23943 at commit 4e71393.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 3, 2019

Test build #102953 has finished for PR 23943 at commit 5f2a73f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 4, 2019

Test build #102960 has finished for PR 23943 at commit f636126.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 4, 2019

Test build #102979 has finished for PR 23943 at commit ab4fbb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -81,7 +82,8 @@ object ParquetSchemaPruning extends Rule[LogicalPlan] {
* Checks to see if the given relation is Parquet and can be pruned.
viirya (Member, Author):

ParquetSchemaPruning can be renamed to something like SchemaPruning, but I'll leave that to a later followup to reduce the diff.

Reviewer (Member):

After finishing reviews, we can rename it as a final commit.

      Row(null, null) ::
      Nil)
  }

  testSchemaPruning("select a single complex field and the partition column") {
    val query = sql("select name.middle, p from contacts")
    checkScan(query, "struct<name:struct<middle:string>>")
    checkAnswer(query.orderBy("id"),
      Row("X.", 1) :: Row("Y.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil)
viirya (Member, Author):

Those tests can't be shared with ORC, because they depend on schema merging.

Reviewer (Member):

Ur, do you mean spark.sql.parquet.mergeSchema is enabled in this test suite? I guess it's disabled by default.

viirya (Member, Author):

Ah, sorry, it is not due to schema merging.

But the inferred schemas for ORC and Parquet are different. We can test it on the current master branch like this:

// withTempPath, makeDataSourceFile, contacts, briefContacts and dataSourceName
// are helpers/fixtures from the schema pruning test suite; File is java.io.File.
withTempPath { dir =>
  val path = dir.getCanonicalPath

  makeDataSourceFile(contacts, new File(path + "/contacts/p=1"))
  makeDataSourceFile(briefContacts, new File(path + "/contacts/p=2"))

  spark.read.format(dataSourceName).load(path + "/contacts").createOrReplaceTempView("contacts")
  spark.sql("select * from contacts").printSchema()
}

When dataSourceName is parquet, the schema is:

root
 |-- id: integer (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- middle: string (nullable = true)
 |    |-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- pets: integer (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- relatives: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- employer: struct (nullable = true)
 |    |-- id: integer (nullable = true)
 |    |-- company: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- address: string (nullable = true)
 |-- p: integer (nullable = true)

For ORC, it is:

root
 |-- id: integer (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- p: integer (nullable = true)

@viirya (Member, Author) commented Mar 4, 2019

I think this is ok for review now. cc @dongjoon-hyun @cloud-fan @dbtsai

@dongjoon-hyun (Member)

Hi, @viirya. Since your PR (#23955) is merged, could you rebase this PR and add the benchmark here? That will show this PR's benefit more clearly.

@viirya (Member, Author) commented Mar 6, 2019

@dongjoon-hyun Thanks. Updated the benchmark result after rebasing on master.

@SparkQA commented Mar 6, 2019

Test build #103074 has finished for PR 23943 at commit ef6576a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ORC (before this PR):

Selection:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column        113            196          89          8.8         113.0       1.0X
Nested column          1316           1639         240          0.8        1315.5       0.1X

ORC (after this PR):

Selection:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column        116            151          36          8.6         116.3       1.0X
Nested column           544            604          31          1.8         544.5       0.2X
dongjoon-hyun (Member) commented Mar 6, 2019:

Nice, 2x faster than before. BTW, Parquet now has a top-level to nested ratio of 10:4 and ORC 10:2. In other words, nested column reads are roughly 2.5 times slower than top-level reads in Parquet and 5 times slower in ORC. I guess there still exists some overhead in this PR (compared with Parquet). Could you optimize more within the current approach?

PARQUET

Selection:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column         88            114          16         11.4          87.5       1.0X
Nested column           201            223          27          5.0         200.5       0.4X

cc @cloud-fan , @gengliangwang , @gatorsmile

viirya (Member, Author):

I think the read performance here is determined more by the underlying format library than by the Spark side. As with Parquet, on the Spark side we provide a correctly pruned nested schema to the ORC library; pruning nested fields while reading the data is done by the ORC library itself. I'm not sure we have much room to optimize on the Spark side for this. Of course, I'm open to any suggestion I'm missing right now.
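For context, a hedged sketch of that division of labor using the ORC core API directly (not this PR's code; the file path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{OrcFile, TypeDescription}

// Spark's side of the work ends at computing the pruned read schema.
val prunedSchema = TypeDescription.fromString("struct<name:struct<middle:string>>")

val reader = OrcFile.createReader(
  new Path("/tmp/contacts.orc"), OrcFile.readerOptions(new Configuration()))
// Given the pruned read schema, the ORC library itself skips the unneeded
// nested fields while reading.
val rows = reader.rows(reader.options.schema(prunedSchema))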

@@ -1540,8 +1540,8 @@ object SQLConf {
       .internal()
       .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
         "satisfying a query. This optimization allows columnar file format readers to avoid " +
-        "reading unnecessary nested column data. Currently Parquet is the only data source that " +
-        "implements this optimization.")
+        "reading unnecessary nested column data. Currently Parquet and ORC are the data sources " +
Reviewer (Member):

Shall we specify ORC v1 instead of ORC, since ORC means v2 by default in Spark 3.0.0?
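For reference, a usage sketch of this internal flag, which is off by default at the time of this PR (the contacts view is as in the repro snippet earlier in the thread):

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
spark.sql("select name.middle from contacts").explain()
// With the flag on, the FileScan node's ReadSchema should shrink to:
//   struct<name:struct<middle:string>>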

import org.apache.spark.sql.types.StructType

abstract class SchemaPruningSuite
extends QueryTest
Reviewer (Member):

indentation? extends uses two space indent.

checkScan(query, "struct<name:struct<first:string,middle:string,last:string>>")
checkAnswer(query.orderBy("id"),
Row("X.", Row("Jane", "X.", "Doe")) ::
Row("Y.", Row("John", "Y.", "Doe")) ::
Reviewer (Member):

We don't need more spaces here.

viirya (Member, Author):

Sure. Automatically edited by the IDE...

"relatives:map<string,struct<first:string,middle:string,last:string>>>")
checkAnswer(query.orderBy("id"),
Row(0, "Doe", "X.", "Jane", null, null, null, "Smith", "Z.", "Susan", 1, "123 Main Street") ::
Row(1, "Doe", "Y.", "John", null, null, null, null, null, null, 3, "321 Wall Street") ::
Reviewer (Member):

ditto.

}

/**
* Overrides this because ORC datasource doesn't support schema merging currently.
dongjoon-hyun (Member) commented Mar 6, 2019:

If you give the final schema as the user-defined schema, it will work.

viirya (Member, Author):

I haven't tried, but I guess it should work with a user-specified schema.
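A hedged sketch of that workaround (the schema is abbreviated from the inferred schemas shown earlier; path is as in the earlier repro snippet):

import org.apache.spark.sql.types._

// Pass the union ("final") schema explicitly instead of relying on inference,
// including the partition column p.
val userSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StructType(Seq(
    StructField("first", StringType),
    StructField("middle", StringType),
    StructField("last", StringType)))),
  StructField("address", StringType),
  StructField("p", IntegerType)))

spark.read.format("orc").schema(userSchema).load(path + "/contacts")
  .createOrReplaceTempView("contacts")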

briefContacts.map { case BriefContact(id, name, address) =>
  BriefContactWithDataPartitionColumn(id, name, address, 2) }

testSchemaPruning("select a single complex field array and its parent struct array") {
Reviewer (Member):

These are all the same test cases, aren't they?

viirya (Member, Author):

Yes, we should be able to use all the same tests for Parquet and ORC once the user-specified schema approach works.

@SparkQA commented Mar 7, 2019

Test build #103121 has finished for PR 23943 at commit 8ac4aed.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • protected case class Val(
  • class FractionTimestampFormatter(timeZone: TimeZone)
  • class PartitionReaderWithPartitionValues(
  • trait LimitExec extends UnaryExecNode
  • case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec
  • trait BaseLimitExec extends LimitExec with CodegenSupport

@dilipbiswal (Contributor)

retest this please

@SparkQA commented Mar 7, 2019

Test build #103127 has finished for PR 23943 at commit 8ac4aed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • protected case class Val(
  • class FractionTimestampFormatter(timeZone: TimeZone)
  • class PartitionReaderWithPartitionValues(
  • trait LimitExec extends UnaryExecNode
  • case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec
  • trait BaseLimitExec extends LimitExec with CodegenSupport

@viirya (Member, Author) commented Mar 9, 2019

@dongjoon-hyun Thanks for your review! By using a user-specified schema, the Parquet and ORC schema pruning suites now use the same test cases.

@SparkQA commented Mar 9, 2019

Test build #103258 has finished for PR 23943 at commit 633c3d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please.

@SparkQA commented Mar 11, 2019

Test build #103286 has finished for PR 23943 at commit 633c3d7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -206,7 +209,7 @@ class OrcFileFormat
            Array.fill(requiredSchema.length)(-1) ++ Range(0, partitionSchema.length)
          batchReader.initialize(fileSplit, taskAttemptContext)
          batchReader.initBatch(
-           reader.getSchema,
+           TypeDescription.fromString(resultSchemaString),
Reviewer (Member):

This is required because we select a subset of the file schema (= reader.getSchema).
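For reference, TypeDescription.fromString parses an ORC schema string, so the batch is laid out by the pruned result schema rather than the full file schema; a small sketch (the schema string is a hypothetical example):

import org.apache.orc.TypeDescription

val resultSchema = TypeDescription.fromString("struct<name:struct<middle:string>,p:int>")
println(resultSchema.getFieldNames)  // [name, p]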

@dongjoon-hyun (Member)

Could you rebase to master once more?

@viirya (Member, Author) commented Mar 11, 2019

@dongjoon-hyun Rebased now. Thanks!

@SparkQA commented Mar 11, 2019

Test build #103309 has finished for PR 23943 at commit 774027c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

              idx
            } else {
              -1
            }
          })
        } else {
          // Do case-insensitive resolution only if in case-insensitive mode
          val caseInsensitiveOrcFieldMap =
            orcFieldNames.zipWithIndex.groupBy(_._1.toLowerCase(Locale.ROOT))
Reviewer (Member):

Since we don't need the old index, shall we remove the obsolete indexes?

-          val caseInsensitiveOrcFieldMap =
-            orcFieldNames.zipWithIndex.groupBy(_._1.toLowerCase(Locale.ROOT))
+          val caseInsensitiveOrcFieldMap = orcFieldNames.groupBy(_.toLowerCase(Locale.ROOT))
           Some(requiredSchema.fieldNames.zipWithIndex.map { case (requiredFieldName, idx) =>
             caseInsensitiveOrcFieldMap
               .get(requiredFieldName.toLowerCase(Locale.ROOT))
               .map { matchedOrcFields =>
                 if (matchedOrcFields.size > 1) {
                   // Need to fail if there is ambiguity, i.e. more than one field is matched.
-                  val matchedOrcFieldsString = matchedOrcFields.map(_._1).mkString("[", ", ", "]")
+                  val matchedOrcFieldsString = matchedOrcFields.mkString("[", ", ", "]")

viirya (Member, Author):

Sure.
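A self-contained sketch of the simplified, index-free resolution suggested above (field names are hypothetical):

import java.util.Locale

val orcFieldNames = Seq("ID", "Name", "name")   // hypothetical ORC file fields
val requiredFieldNames = Seq("id", "name")      // fields required by the query

val caseInsensitiveOrcFieldMap = orcFieldNames.groupBy(_.toLowerCase(Locale.ROOT))

requiredFieldNames.foreach { required =>
  caseInsensitiveOrcFieldMap.get(required.toLowerCase(Locale.ROOT)).foreach { matchedOrcFields =>
    if (matchedOrcFields.size > 1) {
      // Need to fail if there is ambiguity, i.e. more than one field is matched.
      println(s"ambiguous match: ${matchedOrcFields.mkString("[", ", ", "]")}")
    } else {
      println(s"$required -> ${matchedOrcFields.head}")
    }
  }
}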

@SparkQA commented Mar 12, 2019

Test build #103379 has finished for PR 23943 at commit b939d2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Merged to master. Thank you, @viirya.

cc @dbtsai , @cloud-fan , @gatorsmile , @gengliangwang

@dongjoon-hyun (Member)

@viirya. Since this is merged now, could you make the planned followup PR renaming ParquetSchemaPruning?

HyukjinKwon pushed a commit that referenced this pull request Mar 13, 2019
…ning

## What changes were proposed in this pull request?

This is a followup to #23943. It proposes to rename ParquetSchemaPruning to SchemaPruning, as the rule now supports both Parquet and ORC v1.

## How was this patch tested?

Existing tests.

Closes #24077 from viirya/nested-schema-pruning-orc-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
dbtsai pushed a commit that referenced this pull request Mar 13, 2019
…g BM result with EC2

## What changes were proposed in this pull request?

This is a follow-up PR for #23943 to update the benchmark result using an EC2 `r3.xlarge` instance.

## How was this patch tested?

N/A. (Manually compare the diff)

Closes #24078 from dongjoon-hyun/SPARK-27034.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
"reading unnecessary nested column data. Currently Parquet is the only data source that " +
"implements this optimization.")
"reading unnecessary nested column data. Currently Parquet and ORC v1 are the " +
"data sources that implement this optimization.")
.booleanConf
.createWithDefault(false)
Reviewer (Member):

@dbtsai @dongjoon-hyun We turned this flag on by default in the upcoming 3.0 because Apple has tried it in production over the last few months. I am wondering whether that statement also covers ORC nested schema pruning?

Reviewer (Member):

We mainly use it and test it with Parquet.

@viirya viirya deleted the nested-schema-pruning-orc branch December 27, 2023 18:22