Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27701][SQL] Extend NestedColumnAliasing to general nested field cases including GetArrayStructField #24599

Closed
wants to merge 9 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented May 14, 2019

What changes were proposed in this pull request?

NestedColumnAliasing rule covers GetStructField only, currently. It means that some nested field extraction expressions aren't pruned. For example, if only accessing a nested field in an array of struct (GetArrayStructFields), this column isn't pruned.

This patch extends the rule to cover general nested field cases, including GetArrayStructFields.

How was this patch tested?

Added tests.

@SparkQA
Copy link

SparkQA commented May 14, 2019

Test build #105381 has finished for PR 24599 at commit a351e82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented May 14, 2019

cc @dongjoon-hyun @dbtsai @cloud-fan

@@ -25,7 +25,7 @@ import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.RuleExecutor
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class NestedColumnAliasingSuite extends SchemaPruningTest {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There still are many usage of GetStructField in this test suite. Maybe make a minor PR to rewrite them.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Jun 9, 2019

Test build #106314 has finished for PR 24599 at commit a351e82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya . Since this is important improvement, could you add a benchmark case to NestedSchemaPruningBenchmark? Also, please enumerate some newly support examples explicitly instead of more nested field cases in the PR description (at least).

@viirya viirya changed the title [SPARK-27701][SQL] Extend NestedColumnAliasing to more nested field cases [SPARK-27701][SQL] Extend NestedColumnAliasing to general nested field cases like GetArrayStructField and chain of ExtractValue Jun 10, 2019
@viirya viirya changed the title [SPARK-27701][SQL] Extend NestedColumnAliasing to general nested field cases like GetArrayStructField and chain of ExtractValue [SPARK-27701][SQL] Extend NestedColumnAliasing to general nested field cases including GetArrayStructField Jun 10, 2019
@viirya
Copy link
Member Author

viirya commented Jun 10, 2019

@dongjoon-hyun Thanks for looking into this.

I will add the benchmark case. The PR title and description were updated.

@SparkQA
Copy link

SparkQA commented Jun 10, 2019

Test build #106347 has finished for PR 24599 at commit 6fc422c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 10, 2019

Test build #106353 has finished for PR 24599 at commit fb821c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 11, 2019

Test build #106366 has finished for PR 24599 at commit 4c8d371.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 11, 2019

Test build #106383 has finished for PR 24599 at commit 3075f8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 11, 2019

Test build #106384 has finished for PR 24599 at commit 64a98fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the master branch, it seems that there is a regression only in Orc (v1). I verified that Parquet/OrcV2 are consistent in master branch. This PR (@viirya ) is irrelevant to that.
cc @gatorsmile

@dongjoon-hyun
Copy link
Member

Hi, @viirya . I made a benchmark result PR to you. Could you review and merge that?

@@ -51,7 +55,7 @@ abstract class NestedSchemaPruningBenchmark extends SqlBasedBenchmark {
withTempPath { dir =>
val path = dir.getCanonicalPath

Seq(1, 2).foreach { i =>
Seq(1, 2, 3).foreach { i =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@viirya
Copy link
Member Author

viirya commented Jun 12, 2019

Thanks @dongjoon-hyun! Merged the benchmark results now.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you so much, @viirya .
The last commit is .txt-file only updates about benchmark result.
Merged to master.

@SparkQA
Copy link

SparkQA commented Jun 12, 2019

Test build #106398 has finished for PR 24599 at commit 3aab8bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
…d cases including GetArrayStructField

## What changes were proposed in this pull request?

`NestedColumnAliasing` rule covers `GetStructField` only, currently. It means that some nested field extraction expressions aren't pruned. For example, if only accessing a nested field in an array of struct (`GetArrayStructFields`), this column isn't pruned.

This patch extends the rule to cover general nested field cases, including `GetArrayStructFields`.
## How was this patch tested?

Added tests.

Closes apache#24599 from viirya/nested-pruning-extract-value.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@viirya viirya deleted the nested-pruning-extract-value branch December 27, 2023 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants