[SPARK-27633][SQL] Remove redundant aliases in NestedColumnAliasing #24525

Closed
wants to merge 9 commits into from

viirya
Member

@viirya viirya commented May 5, 2019

What changes were proposed in this pull request?

In the NestedColumnAliasing rule, we create aliases for nested field accesses in the project list. We already handle the case where a top-level parent field and nested fields under it are both accessed: in that case, we don't create the aliases because they are redundant.

There is another case that we don't currently handle: a nested parent field and nested fields under it are both accessed. We don't need to create aliases in this case either.
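
To make the second case concrete, here is a minimal sketch. It uses a toy expression model rather than Catalyst: Attr, GetField, subExprs, and needsAlias are hypothetical names for illustration, not Spark classes or the code in this patch, and the comparison is plain equality rather than semanticEquals.

```scala
object RedundantAliasSketch {
  // Toy stand-ins for Catalyst expressions (assumption: not Spark classes).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class GetField(child: Expr, field: String) extends Expr

  // All sub-expressions of `e`, including `e` itself.
  def subExprs(e: Expr): List[Expr] = e match {
    case a: Attr                => List(a)
    case g @ GetField(child, _) => g :: subExprs(child)
  }

  // An accessor needs its own alias only when none of its ancestors
  // (its child, the child's child, ...) is itself among the accessed fields.
  def needsAlias(e: GetField, accessed: Seq[Expr]): Boolean =
    !subExprs(e.child).exists(s => accessed.contains(s))

  val a: Attr = Attr("a")
  val ab: GetField = GetField(a, "b")    // a.b
  val abc: GetField = GetField(ab, "c")  // a.b.c
  val abd: GetField = GetField(ab, "d")  // a.b.d
}
```

In this model, when a.b and a.b.c are both accessed, needsAlias(abc, Seq(ab, abc)) is false because a.b already covers a.b.c, while needsAlias(ab, Seq(ab, abc)) is still true. When only a.b.c and a.b.d are accessed, both still need aliases.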

How was this patch tested?

Added test.

@SparkQA

SparkQA commented May 5, 2019

Test build #105128 has finished for PR 24525 at commit f24b1d5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 5, 2019

Test build #105129 has finished for PR 24525 at commit 578663c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 6, 2019

cc @dongjoon-hyun @cloud-fan

@SparkQA

SparkQA commented May 6, 2019

Test build #105136 has finished for PR 24525 at commit a009d3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented May 8, 2019

Test build #105267 has finished for PR 24525 at commit a009d3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 14, 2019

ping @dongjoon-hyun

@dongjoon-hyun
Member

dongjoon-hyun commented May 14, 2019

@viirya. Actually, I've looked into this PR several times since the first day, but I still can't be sure about this approach.

My main concern is nestedFields.forall(f => f == n || child.find(_.semanticEquals(f)).isEmpty). semanticEquals is not a robust choice for struct-related operations: it erases a lot of information, and that frequently bites us in corner cases.

I'm interested in this PR and am testing it. Could you also test it more thoroughly once more?

@viirya
Member Author

viirya commented May 15, 2019

Thanks for the review, @dongjoon-hyun.

I understand your concerns. Thanks for testing, too. If you find any cases that you think might be problematic, could you share them?

In this case, the intent is to check whether the child of a nested field accessor is already present. Do you think it would be more robust to compare them with exact equality rather than semantic equality?
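
As a rough illustration of the difference between the two kinds of comparison, the sketch below uses a toy model in which a "canonical" form drops cosmetic names and keeps only ids and ordinals. This is an assumption made for the example: Attr, GetField, canonical, and semanticEq are hypothetical, and which details Catalyst's own canonicalization drops is not specified in this thread.

```scala
object EqualitySketch {
  sealed trait Expr
  // `id` identifies the column; `name` is how the query happened to spell it.
  final case class Attr(id: Int, name: String) extends Expr
  // `ordinal` selects the struct field; `name` is optional display information.
  final case class GetField(child: Expr, ordinal: Int, name: Option[String]) extends Expr

  // Assumed canonical form: erase names, keep ids and ordinals.
  def canonical(e: Expr): Expr = e match {
    case Attr(id, _)             => Attr(id, "")
    case GetField(c, ordinal, _) => GetField(canonical(c), ordinal, None)
  }

  // Toy "semantic" equality: compare canonical forms.
  def semanticEq(x: Expr, y: Expr): Boolean = canonical(x) == canonical(y)

  val named: Expr = GetField(Attr(1, "a"), 0, Some("b"))
  val unnamed: Expr = GetField(Attr(1, "A"), 0, None)
  val otherField: Expr = GetField(Attr(1, "a"), 1, Some("b"))
}
```

Here named and unnamed differ under exact equality but match under semanticEq, while otherField matches neither. Exact comparison is stricter, so it may miss redundant accessors that differ only cosmetically; the semantic comparison accepts more matches because it discards that information.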

val nestedRelation = LocalRelation('a.struct('b.struct('c.int,
  'd.struct('f.int, 'g.int), 'e.int)))

val first = GetStructField('a, 0, Some("b"))
Contributor

Can we use the expression DSL to write the tests?

Member Author

It seems only Add has a DSL? I don't see one for GetStructField. I replaced Add.

Member Author

Got it. We have getField.

@SparkQA

SparkQA commented May 21, 2019

Test build #105622 has finished for PR 24525 at commit 412401b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2019

Test build #105649 has finished for PR 24525 at commit 3fd4cc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Aug 8, 2019

@cloud-fan @dongjoon-hyun Please let me know if you have more comments or thoughts on this. Thanks.

@dongjoon-hyun
Member

@viirya. I'm still not sure about this PR for the same reason as before, but I don't have the counterexample you asked for. Sorry about that. If the other committers think it's okay, I won't object to this PR. Thanks for working on the nested column improvement, @viirya.

cc @cloud-fan , @dbtsai , @gatorsmile

@SparkQA

SparkQA commented Aug 8, 2019

Test build #108795 has finished for PR 24525 at commit 4ff4112.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Aug 8, 2019

retest this please.

@viirya
Member Author

viirya commented Aug 8, 2019

@dongjoon-hyun I see. Thanks for the comment. Would you be more comfortable with this if I changed semanticEquals to an exact comparison? Would that resolve your concern?

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 8, 2019

It's just been my personal concern, @viirya. You may not need to revise your PR. Let's wait and get some advice from the seniors.

@SparkQA

SparkQA commented Aug 9, 2019

Test build #108845 has finished for PR 24525 at commit 4ff4112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Aug 15, 2019

cc @cloud-fan Do you have more thoughts on this change?

case a @ GetArrayStructFields(child, _, _, _, _) =>
  nestedFields.forall(f => f == a || child.find(_.semanticEquals(f)).isEmpty)
case n @ GetStructField(child, _, _) =>
  nestedFields.forall(f => f == n || child.find(_.semanticEquals(f)).isEmpty)
Member

Can we merge the two cases into one? For example:

          case e @ (_: GetStructField | _: GetArrayStructFields) =>
            nestedFields.forall(f => f == e || e.children.head.find(_.semanticEquals(f)).isEmpty)

Member

Also, it seems the added test below passes without f == n? If so, can you add a test for it?

Member Author

@maropu Thanks. I will address both points.

Member Author

I think the f == a condition is not necessary now. It must have been left over from a previous commit.

Member

Thanks for checking! One more comment: can we avoid the length(nestedFields)^2 loop in https://github.com/apache/spark/pull/24525/files#diff-43334bab9616cc53e8797b9afa9fc7aaR129?

Member Author

Sorry for the late reply.

I thought about this. We only need to check a nested field accessor against other accessors that have fewer ExtractValue levels. For example, we don't need to look at a.b.c, a.b.c.d, etc. to decide whether a.b is redundant.

However, to do this we first need to collect the number of ExtractValue levels for each accessor. Is it worth it?
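
A rough sketch of that idea, on a toy model rather than Catalyst: Attr, GetField, depth, hasAncestor, and prune are hypothetical names, depth stands in for the number of ExtractValue levels, and comparison is plain equality. Each accessor's depth is computed once, and an accessor is only compared against strictly shallower ones.

```scala
object DepthPruningSketch {
  // Toy stand-ins for Catalyst expressions (assumption: not Spark classes).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class GetField(child: Expr, field: String) extends Expr

  // Number of field extractions wrapped around the root attribute.
  def depth(e: Expr): Int = e match {
    case _: Attr            => 0
    case GetField(child, _) => 1 + depth(child)
  }

  // True when `ancestor` appears somewhere strictly below `e`.
  def hasAncestor(e: Expr, ancestor: Expr): Boolean = e match {
    case _: Attr            => false
    case GetField(child, _) => child == ancestor || hasAncestor(child, ancestor)
  }

  // Keep an accessor only if no strictly shallower accessor is its ancestor.
  // Deeper or equally deep accessors are skipped by the `fd < d` guard.
  def prune(accessed: Seq[Expr]): Seq[Expr] = {
    val withDepth = accessed.map(e => (e, depth(e)))
    withDepth.collect {
      case (e, d) if !withDepth.exists { case (f, fd) => fd < d && hasAncestor(e, f) } => e
    }
  }

  val a: Expr = Attr("a")
  val ab: Expr = GetField(a, "b")
  val abc: Expr = GetField(ab, "c")
  val abcd: Expr = GetField(abc, "d")
  val ax: Expr = GetField(a, "x")
}
```

With Seq(ab, abc, abcd, ax), prune keeps a.b and a.x. The pass is still quadratic in the worst case; the depth guard only skips the tree walk for pairs that cannot match, which is the extra bookkeeping the comment above is weighing.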

Member

I see. Yeah, I think it's OK as it is. Thanks for checking.

@SparkQA

SparkQA commented Aug 20, 2019

Test build #109372 has finished for PR 24525 at commit 7a18a0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

github-actions bot commented Mar 2, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 2, 2020
@viirya
Member Author

viirya commented Mar 2, 2020

retest this please

@SparkQA

SparkQA commented Mar 2, 2020

Test build #119144 has finished for PR 24525 at commit 7a18a0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot closed this Mar 3, 2020
@HyukjinKwon HyukjinKwon reopened this Mar 3, 2020
@HyukjinKwon HyukjinKwon removed the Stale label Mar 3, 2020
@HyukjinKwon
Member

retest this please

@viirya
Member Author

viirya commented Mar 3, 2020

@HyukjinKwon Thanks for reopening this. I think I need to sync it up with the latest changes.

@SparkQA

SparkQA commented Mar 3, 2020

Test build #119199 has finished for PR 24525 at commit 7a18a0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2020

Test build #119206 has finished for PR 24525 at commit a027946.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 3, 2020

retest this please

@viirya
Member Author

viirya commented Mar 3, 2020

retest this please

@SparkQA

SparkQA commented Mar 3, 2020

Test build #119213 has finished for PR 24525 at commit a027946.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Mar 4, 2020

ping @cloud-fan

@HyukjinKwon
Member

@viirya, can you rebase this please? I merged a couple of your PRs, and it seems that caused conflicts.

@viirya
Member Author

viirya commented Jun 13, 2020

@HyukjinKwon rebased, thanks.

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123950 has finished for PR 24525 at commit 40521aa.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 13, 2020

retest this please

@SparkQA

SparkQA commented Jun 13, 2020

Test build #123974 has finished for PR 24525 at commit 40521aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 14, 2020

Test build #124009 has finished for PR 24525 at commit fb1d3f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@viirya viirya deleted the SPARK-27633 branch December 27, 2023 18:23
6 participants