[SPARK-8654][SQL] Fix Analysis exception when using NULL IN (...) #8983
In the analysis phase, while processing the rules for the IN predicate, we compare the in-list types to the LHS expression type and generate a cast operation if necessary. In the case of NULL [NOT] IN expr1, we end up generating a cast from the in-list types to NULL, such as cast (1 as NULL), which is not a valid cast. The fix is to not generate such a cast if the LHS type is NullType; instead, we translate the expression to Literal(Null).
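The rewrite described above can be sketched without Catalyst. This is only an illustration under stated assumptions: `Expr`, `Lit`, `In`, the `DataType` objects, and `rewriteNullIn` are hypothetical stand-ins defined here, not Spark's classes. Only the idea comes from the description: when the left side has null type, replace the predicate with a boolean-typed null literal instead of casting the in-list.

```scala
object NullInSketch {
  // Hypothetical, simplified stand-ins for Catalyst's data types.
  sealed trait DataType
  case object NullType extends DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  case object BooleanType extends DataType

  // Hypothetical, simplified expression tree (not Catalyst's Expression).
  sealed trait Expr { def dataType: DataType }
  final case class Lit(value: Any, dataType: DataType) extends Expr
  final case class In(value: Expr, list: Seq[Expr]) extends Expr {
    val dataType: DataType = BooleanType
  }

  // If the left side is an untyped NULL, replace the whole predicate with a
  // boolean-typed NULL literal rather than casting the list to NullType.
  def rewriteNullIn(e: Expr): Expr = e match {
    case In(v, _) if v.dataType == NullType => Lit(null, BooleanType)
    case other => other
  }
}
```

In this sketch the in-list is dropped entirely, so its element types are never inspected; whether incompatible list types should raise an error is a separate question.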
val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), "a")() :: Nil,
  LocalRelation()
)
assertAnalysisSuccess(plan, false)
Why change the default value of caseSensitive?
Thanks for reviewing the code, Wenchen. I was trying to model the test case on what was in the JIRA, which did a caseInsensitiveAnalyze. I have fixed it now.
 */
object InConversion extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
    // Skip nodes who's children have not been resolved yet.
    case e if !e.childrenResolved => e

    case i @ In(a, b) if (a.dataType == NullType) =>
      Literal.create(null, BooleanType)
Instead of just casting null to boolean, can we come up with a better idea based on the data types of b?
Ah, sorry for my mistake. I thought you were casting a to boolean type, but actually you just turn the result into a boolean null.
Can you point to a Hive doc that says something like "if the value is null, the In operation will always return null"? I think we should follow Hive semantics here.
Thanks! Do we need to look at the in-list types in this case? The in-list types could be literals of different types, right? For example, NULL not in (1, 'a'). Since the result of the IN predicate is a boolean type, I thought it would be safe to transform it to
Thanks a lot in advance for your help.
if you change the
Thanks, Wenchen. You are right that not all types can be cast to boolean. However, in this case we are not trying to cast the in-list types to the LHS type (null type in our case), because we know this is a special case where the predicate always evaluates to NULL. That is why we simply transform the IN predicate into a NULL literal and drop the in-list altogether.
== Parsed Logical Plan ==
== Analyzed Logical Plan ==
Please let me know what you think. If you have a test case in mind that would exhibit a problem, I would like to try it out. Thanks a lot for your help.
Hi Wenchen, here is the link I could find, though it is a bit confusing about the equality operator. To test it out, I tried the following queries on Hive:
hive> select * from tnull ;
hive> select null = 1 from tnull ;
Please let me know what you think, and thanks again for your help.
can you try
hive> select null in (1,2,null) from tnull;
I checked our implementation; we should return null for this case. LGTM then.
BTW, you can send another PR to add a rule in
Can you please help clarify? Are you referring to the case when one of the values in the
Ran the following two queries on Hive
We have already taken care of the case where the LHS type is null. Let me know what you think.
Looks like our implementation is wrong... We return null if any value in the list is null.
@cloud-fan, please confirm my understanding of the code (I am fairly new to the codebase :-). To confirm whether there is an issue, I ran the following two queries again. The output looks
select * from inttab where 1 in (1,2,NULL)
== Parsed Logical Plan ==
== Analyzed Logical Plan ==
== Optimized Logical Plan ==
== Physical Plan ==
Code Generation: true
select * from inttab where 1 in (NULL,1,2)
== Parsed Logical Plan ==
== Analyzed Logical Plan ==
== Optimized Logical Plan ==
== Physical Plan ==
Code Generation: true
Please let me know your thoughts.
OK, I was misled by the imperative code style, sorry for that... This PR make
Thanks a lot, @cloud-fan. Sure, I will look into it. When you say another PR,
Asking as I am new to the process. One other question: what is the process to get this change integrated? Do I need to initiate any action from my end?
Yup, another JIRA please. You need to ask a committer such as @marmbrus to review your PR to get it merged. My final thoughts on this PR:
Very good point, thanks. Actually, Hive reports an error in this case.
hive> select * from tnull where array(2,3) in (1, array(2,3));
I am not sure what the right thing to do is here. Any comments, @marmbrus?
Let's follow Hive.
Thanks a lot, Michael, for looking into this. I debugged Hive to understand the /**
(case 1) expr in (expr1,... exprN)
(case 2) Type conversion semantics.
Our behaviour seems to match that of MySQL more closely at present. Do we want to change this?
if
According to my reading of the SQL Standard, NULL IN (expr1, ...) should always evaluate to NULL. Here is my reasoning:
The 2011 SQL Standard, part 2, section 8.4 (in predicate), syntax rule 5 says that expr IN (expr1, ...) is equivalent to expr = ANY (expr1, ...)
Section 8.9 (quantified comparison predicate), general rule 2, subrules (c) and (d), say that expr = ANY (expr1, ...) evaluates to the following:
TRUE if (expr = exprN) is TRUE for at least one of the expressions on the right side
FALSE if the right side is an empty list or if (expr = exprN) is FALSE for every exprN on the right side
UNKNOWN (NULL) otherwise
Since (NULL = exprN) is always UNKNOWN and since an IN list must be non-empty (see the BNF in section 8.4), it follows that NULL IN (expr1, ...) always evaluates to UNKNOWN (NULL).
So Dilip's transformation of NULL IN (expr1, ...) -> NULL looks correct to me. There is no need to cast the expressions on the right side to a common type. That is, not unless you want to raise syntax errors in situations where there is no implicit conversion to a common type. As the following examples show, Postgres, MySQL, and Derby all exhibit the correct Standard behavior.
Thanks,
MySQL behavior:
mysql> SELECT NULL IN (1, 2, 3);
mysql> SELECT NULL IN (1, 2, NULL);
mysql> SELECT NULL IN ();
Postgres behavior:
mydb=# SELECT NULL IN (1, 2, 3);
?column?(1 row)
mydb=# SELECT NULL IN (1, 2, NULL);
?column?(1 row)
mydb=# SELECT NULL IN ();
Derby behavior:
ij> VALUES CAST (NULL AS INT) IN (1, 2, 3);
1NULL 1 row selected
1NULL 1 row selected
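The three-valued rule quoted above can be written out as a small sketch. It is an illustration only, assuming integer operands and modeling UNKNOWN (NULL) as `None`; the names `ThreeValuedIn`, `sqlEq`, and `sqlIn` are hypothetical and are not part of Spark or of any database.

```scala
object ThreeValuedIn {
  // SQL truth values modeled as Option[Boolean]; None stands for UNKNOWN (NULL).
  type Truth = Option[Boolean]

  // SQL equality: UNKNOWN if either operand is NULL.
  def sqlEq(a: Option[Int], b: Option[Int]): Truth =
    for (x <- a; y <- b) yield x == y

  // expr IN (list), evaluated as expr = ANY (list):
  // TRUE if any comparison is TRUE, FALSE if every comparison is FALSE
  // (which also covers an empty list), UNKNOWN otherwise.
  def sqlIn(expr: Option[Int], list: Seq[Option[Int]]): Truth = {
    val results = list.map(sqlEq(expr, _))
    if (results.contains(Some(true))) Some(true)
    else if (results.forall(_ == Some(false))) Some(false)
    else None
  }
}
```

With a NULL left side every comparison is UNKNOWN, so for any non-empty list the result is `None`, which matches the argument that NULL IN (expr1, ...) always evaluates to NULL.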
I think we are all in agreement that
The only question here is whether we should throw an error when the stuff in
Hi Michael, Postgres and Derby raise an error if the expressions in the IN list can't be implicitly cast to a common type. MySQL is more forgiving. Thanks,
MySQL:
mysql> SELECT NULL IN ( 1, 'abc' );
Postgres:
mydb=# SELECT NULL IN ( 1, 'abc' );
Derby:
ij> VALUES CAST (NULL AS INT) IN (1, 'abc' );
I don't find any guidance in the Standard for what should be done if the left side of the IN operator is an untyped NULL literal. Technically, there is no such thing in the Standard. The NULL needs to be cast to a legal type. Section 8.4 provides no guidance about the type correspondence of the IN list values. However, section 8.9 implies that the IN list is equivalent to the result of a subquery, which means that we must be able to cast all of the values on the right side to a common type. Thanks,
Thank you @marmbrus @rick-ibm @cloud-fan. I checked the behavior of DB2. It also raises an error if the in-list types are not compatible.
db2 => select * from f1 where NULL in (1, true)
I am studying the code now to figure out how to detect this and raise an error.
@cloud-fan
findWiderCommonType(inTypes) match {
  case Some(finalDataType) => Literal.create(null, BooleanType)
  case None => i
}
Instead of returning a literal null, I think we should just widen the types, as that does fix the bug (we can add a rule in Optimizer to return null for this case).
The code can be:
case i @ In(a, b) if b.exists(_.dataType != a.dataType) =>
  findWiderCommonType(i.children.map(_.dataType)) match {
    case Some(finalDataType) => i.withNewChildren(i.children.map(Cast(_, finalDataType)))
    case None => i
  }
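To make the widening idea concrete, here is a toy sketch of finding a common type across the left side and the in-list. The widening rules below are assumptions chosen for illustration and are not Catalyst's actual `findWiderCommonType` behavior; `WidenSketch`, `widen`, and `commonType` are hypothetical names.

```scala
object WidenSketch {
  // Hypothetical, simplified stand-ins for Catalyst's data types.
  sealed trait DataType
  case object NullType extends DataType
  case object IntegerType extends DataType
  case object DoubleType extends DataType
  case object StringType extends DataType

  // Toy pairwise widening (an assumption, not Catalyst's real rules):
  // equal types stay as they are, NullType widens to anything, and
  // IntegerType widens to DoubleType. Everything else has no common type.
  def widen(a: DataType, b: DataType): Option[DataType] = (a, b) match {
    case (x, y) if x == y => Some(x)
    case (NullType, other) => Some(other)
    case (other, NullType) => Some(other)
    case (IntegerType, DoubleType) | (DoubleType, IntegerType) => Some(DoubleType)
    case _ => None
  }

  // Fold the pairwise rule over all types; None means no common type exists.
  def commonType(types: Seq[DataType]): Option[DataType] =
    types.foldLeft(Option[DataType](NullType)) { (acc, t) => acc.flatMap(widen(_, t)) }
}
```

In this toy model a `None` result corresponds to the `case None => i` branch above, where the expression is left unchanged for a later step to reject, which is roughly the situation in which Postgres, Derby, and DB2 report an error.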
@cloud-fan
I only found one rule in
Anyway, it should be another PR, and this PR LGTM.
Thanks, merging to master.
Thanks a lot, @marmbrus. Many thanks to @cloud-fan for his help.
test this please
Sorry, this never passed tests and broke something. I'm going to revert. Please reopen the PR.
This reverts commit dcbd58a from apache#8983 Author: Michael Armbrust <michael@databricks.com> Closes apache#9034 from marmbrus/revert8654.
@marmbrus, sorry about that. Is there a way I can look at the list of failures?
and it reported success. But this is my first time, so I may not have the right configuration.
Just reopen this PR and we will trigger a test on our Jenkins for it. For a local test, you can do
@cloud-fan
Opening a new one is also OK.