[SPARK-4226] [SQL] Add Exists/In support for where clause #4812

Closed
wants to merge 11 commits into from


chenghao-intel
Contributor

Rewrite [NOT] EXISTS and [NOT] IN as left semi joins to support subqueries in the WHERE clause.

SELECT * FROM src b WHERE [NOT] EXISTS
  (SELECT a.key 
  FROM src a 
  WHERE a.key = b.key AND a.value > 'val_2'
  )

And
SELECT * FROM src b WHERE b.key [NOT] IN
  (SELECT a.key 
  FROM src a 
  WHERE a.key = b.key AND a.value > 'val_2'
  )

Some features are not yet supported and will be addressed in follow-up PRs:

  • WHERE [NOT] EXISTS (subquery) AND otherCondition
  • WHERE key [NOT] IN (subquery) AND otherCondition
  • HAVING AGGREGATION [NOT] IN / [NOT] EXISTS (subquery)
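
The rewrite described above can be checked on a small table. The following is a minimal sketch, not the PR's implementation: it assumes Python's `sqlite3` module as a stand-in engine, a tiny hypothetical `src` table, and a hand-written `left_semi_join` helper (SQLite has no `LEFT SEMI JOIN` syntax). It compares the EXISTS and NOT EXISTS queries against semi-join and anti-join semantics.

```python
import sqlite3

# Hypothetical sample data standing in for the `src` table (key, value).
ROWS = [(1, "val_1"), (2, "val_2"), (3, "val_3"), (3, "val_3"), (4, "val_4")]


def left_semi_join(left, right, cond, keep_matches=True):
    """Keep a left row when the existence of a matching right row
    equals keep_matches (True: semi join, False: anti join)."""
    return [l for l in left if any(cond(l, r) for r in right) == keep_matches]


def run_exists(negated):
    """Run the [NOT] EXISTS query from the description in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (key INTEGER, value TEXT)")
    conn.executemany("INSERT INTO src VALUES (?, ?)", ROWS)
    sql = (
        "SELECT * FROM src b WHERE {} EXISTS "
        "(SELECT a.key FROM src a WHERE a.key = b.key AND a.value > 'val_2')"
    ).format("NOT" if negated else "")
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


def correlated(b, a):
    # Same condition as the subquery's WHERE clause.
    return a[0] == b[0] and a[1] > "val_2"


exists_rows = run_exists(negated=False)
not_exists_rows = run_exists(negated=True)
semi_rows = left_semi_join(ROWS, ROWS, correlated, keep_matches=True)
anti_rows = left_semi_join(ROWS, ROWS, correlated, keep_matches=False)
```

Note that the duplicated left row with key 3 appears once per occurrence in the result, not once per matching right row, which is the property that distinguishes a semi join from an inner join. The sample data contains no NULLs, so this sketch says nothing about the NULL handling of NOT IN.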

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28058 has finished for PR 4812 at commit 3d6934c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LeftSemiType extends JoinType
    • case class Exists(left: LogicalPlan, right: LogicalPlan, exist: Boolean) extends BinaryNode

@ravipesala
Contributor

@chenghao-intel Thank you for the implementation. Here are my observations.
The implementation looks simple, but it comes with a lot of limitations. Consider a query like the one below:

select C from R1 where exists (Select B from R2 where R1.X = R2.Y)

With your implementation, I guess it would be converted as follows:

select C from R1 left semi join R2 on  R1.X = R2.Y

But that is not syntactically correct. It is supposed to be converted as follows:

select C
from R1 left semi join
(select B, R2.Y as sq1_col0 from R2) sq1
on R1.X = sq1.sq1_col0

The EXISTS and IN implementations should be similar: adding EXISTS support to the parser should be enough, and the rest of the implementation is almost the same. Besides the case above, there are a lot of other scenarios that need to be handled for subquery expressions. Please refer to https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf.
I am waiting for my code to be merged so that I can add all the remaining features on top of it.

@chenghao-intel
Contributor Author

So syntactically, what's the difference between

select C from R1 left semi join R2 on  R1.X = R2.Y

and

select C
from R1 left semi join
(select B, R2.Y as sq1_col0 from R2) sq1
on R1.X = sq1.sq1_col0

We never select any values from R2, right?

@chenghao-intel
Contributor Author

Sorry, I meant semantically.

@chenghao-intel
Contributor Author

@marmbrus, any comment on this?

@ravipesala
Contributor

@chenghao-intel Sorry for the late reply. I think semantically it looks fine.
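
The equivalence of the two semi-join forms discussed above can be illustrated with a small sketch. It assumes hypothetical rows for `R1` (columns `C`, `X`) and `R2` (columns `B`, `Y`) and a hand-written `left_semi_join` helper in Python. It is not Spark code, and it only covers this simple case where the subquery is a plain projection of `R2`.

```python
# Hypothetical rows: R1 has columns (C, X), R2 has columns (B, Y).
R1 = [{"C": "c1", "X": 1}, {"C": "c2", "X": 2}, {"C": "c3", "X": 3}]
R2 = [{"B": "b1", "Y": 1}, {"B": "b2", "Y": 1}, {"B": "b3", "Y": 3}]


def left_semi_join(left, right, cond):
    """Keep each left row that has at least one matching right row."""
    return [l for l in left if any(cond(l, r) for r in right)]


# Form 1: select C from R1 left semi join R2 on R1.X = R2.Y
direct = [
    row["C"]
    for row in left_semi_join(R1, R2, lambda l, r: l["X"] == r["Y"])
]

# Form 2: semi join against the projected subquery
# (select B, R2.Y as sq1_col0 from R2) sq1 on R1.X = sq1.sq1_col0
sq1 = [{"B": r["B"], "sq1_col0": r["Y"]} for r in R2]
via_subquery = [
    row["C"]
    for row in left_semi_join(R1, sq1, lambda l, r: l["X"] == r["sq1_col0"])
]
```

Both forms keep the same `R1` rows here because a semi join never outputs columns from its right side, so renaming or projecting `R2`'s columns changes nothing in the result.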

@adrian-wang
Contributor

LGTM

@scwf
Contributor

scwf commented Mar 30, 2015

Hi @chenghao-intel, can you rebase this PR?

// TODO add IN and NOT IN
case whereExpr =>
Filter(nodeToExpr(whereExpr), relations)
}
Contributor

It seems this does not support SQL with both predicates and EXISTS in the WHERE clause:

select * 
from src b 
where 
(not exists 
  (select a.key 
  from src a 
  where b.value = a.value  and a.key = b.key and a.value > 'val_2'
  )
) and b.key > 1
;

@liancheng
Copy link
Contributor

I think we can use PhysicalOperation to make this rule more general: extract all filters first, deal with all Exists, and then apply the other filters back.
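
The three steps suggested above (extract the filters, handle the EXISTS predicates, reapply the remaining filters) can be sketched as follows. This is only an illustration under stated assumptions: it does not use Catalyst's `PhysicalOperation`, the `split_conjuncts` and `evaluate` helpers and the dictionary encoding of the WHERE clause are hypothetical, and Python's `sqlite3` module serves as a reference engine for the query from the review comment.

```python
import sqlite3

# Hypothetical sample data standing in for the `src` table (key, value).
ROWS = [(0, "val_0"), (1, "val_1"), (2, "val_2"), (3, "val_3"), (5, "val_5")]


def split_conjuncts(conjuncts):
    """Step 1: separate subquery predicates from ordinary filters."""
    subqueries = [c for c in conjuncts if c["kind"] == "exists"]
    filters = [c for c in conjuncts if c["kind"] == "filter"]
    return subqueries, filters


def evaluate(left, right, conjuncts):
    subqueries, filters = split_conjuncts(conjuncts)
    rows = left
    # Step 2: each EXISTS becomes a semi join, each NOT EXISTS an anti join.
    for sq in subqueries:
        rows = [
            l for l in rows
            if any(sq["cond"](l, r) for r in right) == sq["positive"]
        ]
    # Step 3: apply the remaining plain filters on top.
    for f in filters:
        rows = [l for l in rows if f["pred"](l)]
    return rows


# WHERE (NOT EXISTS (...)) AND b.key > 1, encoded as a list of conjuncts.
where_clause = [
    {
        "kind": "exists",
        "positive": False,
        "cond": lambda b, a: b[1] == a[1] and a[0] == b[0] and a[1] > "val_2",
    },
    {"kind": "filter", "pred": lambda b: b[0] > 1},
]
rewritten = evaluate(ROWS, ROWS, where_clause)

# Reference result: run the original query in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (key INTEGER, value TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", ROWS)
reference = conn.execute(
    "SELECT * FROM src b WHERE (NOT EXISTS "
    "(SELECT a.key FROM src a "
    "WHERE b.value = a.value AND a.key = b.key AND a.value > 'val_2')) "
    "AND b.key > 1"
).fetchall()
conn.close()
```

The split is only valid for top-level conjuncts joined by AND; a subquery predicate nested under OR or NOT would need different handling and is outside this sketch.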

@chenghao-intel
Contributor Author

Thank you @liancheng @scwf for the review. I'd like to support subqueries combined with filters in the WHERE clause in another PR. We can probably do that after the IN subquery feature has been added.

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29829 has finished for PR 4812 at commit 81877a5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LeftSemiType extends JoinType
    • case class Exists(left: LogicalPlan, right: LogicalPlan, exist: Boolean) extends BinaryNode
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 8, 2015

Test build #29830 has finished for PR 4812 at commit b36b906.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LeftSemiType extends JoinType
    • case class Exists(left: LogicalPlan, right: LogicalPlan, exist: Boolean) extends BinaryNode
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 12, 2015

Test build #30113 has finished for PR 4812 at commit de529db.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LeftSemiType extends JoinType
    • case class Exists(
    • case class InSubquery(left: SubqueryConjunction, right: LogicalPlan, positive: Boolean)
    • case class SubqueryConjunction(child: LogicalPlan,
  • This patch does not change any dependencies.

@chenghao-intel chenghao-intel changed the title [SPARK-4226] [SQL] Add Exists support for where clause [SPARK-4226] [SQL] Add Exists/In support for where clause Apr 12, 2015
@SparkQA

SparkQA commented Apr 12, 2015

Test build #30114 has finished for PR 4812 at commit beb4e21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LeftSemiType extends JoinType
    • case class Exists(
    • case class InSubquery(left: SubqueryConjunction, right: LogicalPlan, positive: Boolean)
    • case class SubqueryConjunction(child: LogicalPlan,
  • This patch does not change any dependencies.

@marmbrus
Contributor

marmbrus commented Sep 3, 2015

Hi @chenghao-intel, thanks for working on this. It seems like this branch has gone stale and there are some questions about the implementation. Can we close this issue for now and discuss design on the JIRA?

7 participants