
[SPARK-27034][SQL] Nested schema pruning for ORC #23943

Closed
wants to merge 14 commits into master from viirya:nested-schema-pruning-orc

Conversation

@viirya (Member) commented Mar 3, 2019

What changes were proposed in this pull request?

Previously, nested schema pruning was supported only for Parquet. This PR proposes to support nested schema pruning for ORC as well.

Note: this only covers ORC v1. For ORC v2, the necessary change is in the schema pruning rule; it is left as a TODO item to keep this review manageable.
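To illustrate what the optimization does at the user level, here is a minimal, hedged sketch (the path is hypothetical and the explain output is abbreviated):

val df = spark.read.format("orc").load("/tmp/contacts")  // hypothetical data with a struct column "name"
df.selectExpr("name.middle").explain()
// Without nested schema pruning, the scan reads the whole struct:
//   ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
// With this patch (and the pruning flag enabled), only the selected leaf is read:
//   ReadSchema: struct<name:struct<middle:string>>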

How was this patch tested?

Added tests.

@SparkQA commented Mar 3, 2019

Test build #102949 has finished for PR 23943 at commit 4e71393.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor)

retest this please

@SparkQA commented Mar 3, 2019

Test build #102950 has finished for PR 23943 at commit 4e71393.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 3, 2019

Test build #102953 has finished for PR 23943 at commit 5f2a73f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 4, 2019

Test build #102960 has finished for PR 23943 at commit f636126.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 4, 2019

Test build #102979 has finished for PR 23943 at commit ab4fbb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -81,7 +82,8 @@ object ParquetSchemaPruning extends Rule[LogicalPlan] {
* Checks to see if the given relation is Parquet and can be pruned.
viirya (Member, Author):

ParquetSchemaPruning can be renamed to something like SchemaPruning, but I'll leave that to a later followup to reduce the diff.

Reviewer (Member):

After finishing reviews, we can rename it as a final commit.

      Row(null, null) ::
      Nil)
  }

  testSchemaPruning("select a single complex field and the partition column") {
    val query = sql("select name.middle, p from contacts")
    checkScan(query, "struct<name:struct<middle:string>>")
    checkAnswer(query.orderBy("id"),
      Row("X.", 1) :: Row("Y.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil)
viirya (Member, Author):

Those tests can't be shared with ORC, because they depend on schema merging.

Reviewer (Member):

Ur, do you mean spark.sql.parquet.mergeSchema is enabled in this test suite? I guess it's disabled by default.

viirya (Member, Author):

Ah, sorry, it is not due to schema merging.

But the inferred schemas for ORC and Parquet are different. We can test it on the current master branch like this:

// withTempPath, makeDataSourceFile, contacts, briefContacts and dataSourceName
// are helpers/fixtures from the schema pruning test suite; File is java.io.File.
withTempPath { dir =>
  val path = dir.getCanonicalPath

  makeDataSourceFile(contacts, new File(path + "/contacts/p=1"))
  makeDataSourceFile(briefContacts, new File(path + "/contacts/p=2"))

  spark.read.format(dataSourceName).load(path + "/contacts").createOrReplaceTempView("contacts")
  spark.sql("select * from contacts").printSchema()
}

When dataSourceName is parquet, the schema is:

root
 |-- id: integer (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- middle: string (nullable = true)
 |    |-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- pets: integer (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- relatives: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- employer: struct (nullable = true)
 |    |-- id: integer (nullable = true)
 |    |-- company: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- address: string (nullable = true)
 |-- p: integer (nullable = true)

For ORC, it is:

root
 |-- id: integer (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- p: integer (nullable = true)

@viirya (Member, Author) commented Mar 4, 2019

I think this is ok for review now. cc @dongjoon-hyun @cloud-fan @dbtsai

@dongjoon-hyun (Member)

Hi, @viirya. Since your PR (#23955) is merged, could you rebase this PR and add the benchmark here? That will show this PR's benefit more clearly.

@viirya (Member, Author) commented Mar 6, 2019

@dongjoon-hyun Thanks. Updated the benchmark result after rebasing on master.

@SparkQA commented Mar 6, 2019

Test build #103074 has finished for PR 23943 at commit ef6576a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ORC (before this PR):

Selection:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column        113            196          89          8.8         113.0       1.0X
Nested column          1316           1639         240          0.8        1315.5       0.1X

ORC (after this PR):

Selection:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column        116            151          36          8.6         116.3       1.0X
Nested column           544            604          31          1.8         544.5       0.2X
dongjoon-hyun (Member) commented Mar 6, 2019:

Nice, 2x faster than before. BTW, Parquet now has a top-level to nested ratio of 10:4 and ORC 10:2. In other words, nested column reads are roughly 2.5 times slower than top-level reads in Parquet and 5 times slower in ORC. I guess there still exists some overhead in this PR (compared with Parquet). Could you optimize more within the current approach?

PARQUET

Selection:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column         88            114          16         11.4          87.5       1.0X
Nested column           201            223          27          5.0         200.5       0.4X

cc @cloud-fan , @gengliangwang , @gatorsmile

viirya (Member, Author):

I think the read performance here is determined more by the underlying format library than by the Spark side. As with Parquet, on the Spark side we provide a correctly pruned nested schema to the ORC library; pruning nested fields while reading the data is done by the ORC library itself. I'm not sure we have much room to optimize on the Spark side for this. Of course, I'm open to any suggestion I'm missing right now.
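For context, a hedged sketch of that division of labor using the ORC core API directly (not this PR's code; the file path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{OrcFile, TypeDescription}

// Spark's side of the work ends at computing the pruned read schema.
val prunedSchema = TypeDescription.fromString("struct<name:struct<middle:string>>")

val reader = OrcFile.createReader(
  new Path("/tmp/contacts.orc"), OrcFile.readerOptions(new Configuration()))
// Given the pruned read schema, the ORC library itself skips the unneeded
// nested fields while reading.
val rows = reader.rows(reader.options.schema(prunedSchema))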

@@ -1540,8 +1540,8 @@ object SQLConf {
       .internal()
       .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
         "satisfying a query. This optimization allows columnar file format readers to avoid " +
-        "reading unnecessary nested column data. Currently Parquet is the only data source that " +
-        "implements this optimization.")
+        "reading unnecessary nested column data. Currently Parquet and ORC are the data sources " +
Reviewer (Member):

Shall we specify ORC v1 instead of ORC, since ORC means v2 by default in Spark 3.0.0?
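For reference, a usage sketch of this internal flag, which is off by default at the time of this PR (the contacts view is as in the repro snippet earlier in the thread):

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
spark.sql("select name.middle from contacts").explain()
// With the flag on, the FileScan node's ReadSchema should shrink to:
//   struct<name:struct<middle:string>>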

import org.apache.spark.sql.types.StructType

abstract class SchemaPruningSuite
extends QueryTest
Reviewer (Member):

indentation? extends uses two space indent.

checkScan(query, "struct<name:struct<first:string,middle:string,last:string>>")
checkAnswer(query.orderBy("id"),
Row("X.", Row("Jane", "X.", "Doe")) ::
Row("Y.", Row("John", "Y.", "Doe")) ::
Reviewer (Member):

We don't need more spaces here.

viirya (Member, Author):

Sure. Automatically edited by the IDE...

"relatives:map<string,struct<first:string,middle:string,last:string>>>")
checkAnswer(query.orderBy("id"),
Row(0, "Doe", "X.", "Jane", null, null, null, "Smith", "Z.", "Susan", 1, "123 Main Street") ::
Row(1, "Doe", "Y.", "John", null, null, null, null, null, null, 3, "321 Wall Street") ::
Reviewer (Member):

ditto.

}

/**
* Overrides this because ORC datasource doesn't support schema merging currently.
dongjoon-hyun (Member) commented Mar 6, 2019:

If you give the final schema as the user-defined schema, it will work.

viirya (Member, Author):

I haven't tried, but I guess it should work with a user-specified schema.
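A hedged sketch of that workaround (the schema is abbreviated from the inferred schemas shown earlier; path is as in the earlier repro snippet):

import org.apache.spark.sql.types._

// Pass the union ("final") schema explicitly instead of relying on inference,
// including the partition column p.
val userSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StructType(Seq(
    StructField("first", StringType),
    StructField("middle", StringType),
    StructField("last", StringType)))),
  StructField("address", StringType),
  StructField("p", IntegerType)))

spark.read.format("orc").schema(userSchema).load(path + "/contacts")
  .createOrReplaceTempView("contacts")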

briefContacts.map { case BriefContact(id, name, address) =>
  BriefContactWithDataPartitionColumn(id, name, address, 2) }

testSchemaPruning("select a single complex field array and its parent struct array") {
Reviewer (Member):

These are all the same test cases, aren't they?

viirya (Member, Author):

Yes, we should be able to use all the same tests for Parquet and ORC once the user-specified schema approach works.

@SparkQA commented Mar 7, 2019

Test build #103121 has finished for PR 23943 at commit 8ac4aed.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • protected case class Val(
  • class FractionTimestampFormatter(timeZone: TimeZone)
  • class PartitionReaderWithPartitionValues(
  • trait LimitExec extends UnaryExecNode
  • case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec
  • trait BaseLimitExec extends LimitExec with CodegenSupport

@dilipbiswal (Contributor)

retest this please

@SparkQA commented Mar 7, 2019

Test build #103127 has finished for PR 23943 at commit 8ac4aed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • protected case class Val(
  • class FractionTimestampFormatter(timeZone: TimeZone)
  • class PartitionReaderWithPartitionValues(
  • trait LimitExec extends UnaryExecNode
  • case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec
  • trait BaseLimitExec extends LimitExec with CodegenSupport

@viirya (Member, Author) commented Mar 9, 2019

@dongjoon-hyun Thanks for your review! By using a user-specified schema, the Parquet and ORC schema pruning suites now use the same test cases.

@SparkQA commented Mar 9, 2019

Test build #103258 has finished for PR 23943 at commit 633c3d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please.

@SparkQA commented Mar 11, 2019

Test build #103286 has finished for PR 23943 at commit 633c3d7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -206,7 +209,7 @@ class OrcFileFormat
            Array.fill(requiredSchema.length)(-1) ++ Range(0, partitionSchema.length)
          batchReader.initialize(fileSplit, taskAttemptContext)
          batchReader.initBatch(
-           reader.getSchema,
+           TypeDescription.fromString(resultSchemaString),
Reviewer (Member):

This is required because we select a subset of the file schema (= reader.getSchema).
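For reference, TypeDescription.fromString parses an ORC schema string, so the batch is laid out by the pruned result schema rather than the full file schema; a small sketch (the schema string is a hypothetical example):

import org.apache.orc.TypeDescription

val resultSchema = TypeDescription.fromString("struct<name:struct<middle:string>,p:int>")
println(resultSchema.getFieldNames)  // [name, p]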

@dongjoon-hyun (Member)

Could you rebase to master once more?

@viirya (Member, Author) commented Mar 11, 2019

@dongjoon-hyun Rebased now. Thanks!

@SparkQA commented Mar 11, 2019

Test build #103309 has finished for PR 23943 at commit 774027c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

              idx
            } else {
              -1
            }
          })
        } else {
          // Do case-insensitive resolution only if in case-insensitive mode
          val caseInsensitiveOrcFieldMap =
            orcFieldNames.zipWithIndex.groupBy(_._1.toLowerCase(Locale.ROOT))
Reviewer (Member):

Since we don't need the old index, shall we remove the obsolete indexes?

-          val caseInsensitiveOrcFieldMap =
-            orcFieldNames.zipWithIndex.groupBy(_._1.toLowerCase(Locale.ROOT))
+          val caseInsensitiveOrcFieldMap = orcFieldNames.groupBy(_.toLowerCase(Locale.ROOT))
           Some(requiredSchema.fieldNames.zipWithIndex.map { case (requiredFieldName, idx) =>
             caseInsensitiveOrcFieldMap
               .get(requiredFieldName.toLowerCase(Locale.ROOT))
               .map { matchedOrcFields =>
                 if (matchedOrcFields.size > 1) {
                   // Need to fail if there is ambiguity, i.e. more than one field is matched.
-                  val matchedOrcFieldsString = matchedOrcFields.map(_._1).mkString("[", ", ", "]")
+                  val matchedOrcFieldsString = matchedOrcFields.mkString("[", ", ", "]")

viirya (Member, Author):

Sure.
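A self-contained sketch of the simplified, index-free resolution suggested above (field names are hypothetical):

import java.util.Locale

val orcFieldNames = Seq("ID", "Name", "name")   // hypothetical ORC file fields
val requiredFieldNames = Seq("id", "name")      // fields required by the query

val caseInsensitiveOrcFieldMap = orcFieldNames.groupBy(_.toLowerCase(Locale.ROOT))

requiredFieldNames.foreach { required =>
  caseInsensitiveOrcFieldMap.get(required.toLowerCase(Locale.ROOT)).foreach { matchedOrcFields =>
    if (matchedOrcFields.size > 1) {
      // Need to fail if there is ambiguity, i.e. more than one field is matched.
      println(s"ambiguous match: ${matchedOrcFields.mkString("[", ", ", "]")}")
    } else {
      println(s"$required -> ${matchedOrcFields.head}")
    }
  }
}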

@SparkQA commented Mar 12, 2019

Test build #103379 has finished for PR 23943 at commit b939d2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Merged to master. Thank you, @viirya.

cc @dbtsai , @cloud-fan , @gatorsmile , @gengliangwang

@dongjoon-hyun (Member)

@viirya. Since this is merged now, could you make the planned followup PR renaming ParquetSchemaPruning?

HyukjinKwon pushed a commit that referenced this pull request Mar 13, 2019
…ning

## What changes were proposed in this pull request?

This is a followup to #23943. It proposes to rename ParquetSchemaPruning to SchemaPruning, as the rule now supports both Parquet and ORC v1.

## How was this patch tested?

Existing tests.

Closes #24077 from viirya/nested-schema-pruning-orc-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
dbtsai pushed a commit that referenced this pull request Mar 13, 2019
…g BM result with EC2

## What changes were proposed in this pull request?

This is a follow-up PR for #23943 to update the benchmark result using an EC2 `r3.xlarge` instance.

## How was this patch tested?

N/A. (Manually compare the diff)

Closes #24078 from dongjoon-hyun/SPARK-27034.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
"reading unnecessary nested column data. Currently Parquet is the only data source that " +
"implements this optimization.")
"reading unnecessary nested column data. Currently Parquet and ORC v1 are the " +
"data sources that implement this optimization.")
.booleanConf
.createWithDefault(false)
Reviewer (Member):

@dbtsai @dongjoon-hyun We turned this flag on by default in the upcoming 3.0 because Apple has tried it in production over the last few months. I am wondering whether that statement also covers ORC nested schema pruning?

Reviewer (Member):

We mainly use it and test it with Parquet.

@viirya viirya deleted the nested-schema-pruning-orc branch December 27, 2023 18:22