Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8654][SQL] Fix Analysis exception when using NULL IN (...) #8983

Closed
wants to merge 6 commits into from

Conversation

dilipbiswal
Copy link
Contributor

In the analysis phase , while processing the rules for IN predicate, we
compare the in-list types to the lhs expression type and generate
cast operation if necessary. In the case of NULL [NOT] IN expr1 , we end up
generating cast between in list types to NULL like cast (1 as NULL) which
is not a valid cast.

The fix is to not generate such a cast if the lhs type is a NullType instead
we translate the expression to Literal(Null).

In the analysis phase , while processing the rules for IN predicate, we
compare the in-list types to the lhs expression type and generate
cast operation if necessary. In the case of NULL [NOT] IN expr1 , we end up
generating cast between in list types to NULL like cast (1 as NULL) which
is not a valid cast.

The fix is to not generate such a cast if the lhs type is a NullType instead
we translate the expression to Literal(Null).
val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), "a")() :: Nil,
LocalRelation()
)
assertAnalysisSuccess(plan, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change the default value of caseSensitive?

@dilipbiswal
Copy link
Contributor Author

Thanks for reviewing the code Wenchen. I was trying to model the test case based on what was put in the JIRA which did a caseInsensitiveAnalyze. I have fixed it now.

*/
object InConversion extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
// Skip nodes who's children have not been resolved yet.
case e if !e.childrenResolved => e

case i @ In(a, b) if (a.dataType == NullType) =>
Literal.create(null, BooleanType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of just casting null to boolean, can we come up with a better idea according to the data types of b?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry for my mistake. I thought you are casting the a to boolean type, but actually you just turn the result to boolean null.

Can you reference to a hive doc that says something like "if the value is null, the In operation will always return null"? I think we should follow hive semantic here.

@dilipbiswal
Copy link
Contributor Author

Thanks !!

Do we need to look at the in list types in this case ? The in list types could be literals of different types , right ? for example NULL not in (1, 'a')

Since the result of IN predicate is a boolean type, i thought it would be safe to transform it to
a Null literal of boolean type. Are you thinking of a case where this would not work ?

Thanks a lot in advance for your help.

@cloud-fan
Copy link
Contributor

if you change the value type to boolean, then we have to change each element in list to boolean type too, which maybe dangerous as a lot of types can't be casted to boolean. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions

@dilipbiswal
Copy link
Contributor Author

Thanks Wenchen. You are right that not all types can be casted to boolean.

However, in this case, we are not trying to cast the in list types to the LHS type (null type in our case) as we know that this is a special case predicate would always evaluate to NULL. That is why we are simply transforming the in predicate to NULL one and dropping the in list altogether.

== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter NOT null IN (1,2,3,4) => original one
'UnresolvedRelation [inttab], None

== Analyzed Logical Plan ==
c1: int
Project [c1#0]
Filter NOT null => rewritten one
Subquery inttab
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

Please let me know what you think .. If you have a test case in mind that would exhibit a problem then i would like to try it out.

Thanks a lot for your help.

@dilipbiswal
Copy link
Contributor Author

Hi Wenchen,

Here is the link i could find where its a bit confusing on the equality operator.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-RelationalOperators

In order to test it out , i tried the queries on hive like following ..

hive> select * from tnull ;
OK
1
2
NULL

hive> select null = 1 from tnull ;
OK
NULL
NULL
NULL
Time taken: 0.118 seconds, Fetched: 3 row(s)
hive> select null in (1,2) from tnull ;
OK
NULL
NULL
NULL

Please let me know what you think and thanks again for your help.

@cloud-fan
Copy link
Contributor

can you try select null in (1,2,null)? I wanna make sure hive doesn't execute the In operation when value is null type.

@dilipbiswal
Copy link
Contributor Author

hive> select null in (1,2,null) from tnull;
OK
NULL
NULL
NULL
Time taken: 0.139 seconds, Fetched: 3 row(s)

@cloud-fan
Copy link
Contributor

I checked our implementation, we should return null for this case.

LGTM then.

@cloud-fan
Copy link
Contributor

btw you can send another PR to add a rule in Optimizer so that if value is literal null, we can just return null.

@dilipbiswal
Copy link
Contributor Author

Can you please help clarify ? Are you referring to the case when one of the value in the
in list is literal null like 1 in (1, NULL) ? If so, i don't think we can evaluate this to NULL...

Ran the following two queries on hive
hive> select 1 in (1,NULL) from tnull;
OK
true
true
true
Time taken: 1.21 seconds, Fetched: 3 row(s)
hive> select 1 in (NULL,1) from tnull;
OK
true
true
true
Time taken: 0.168 seconds, Fetched: 3 row(s)

We have already taken care of the case LHS type is null. Let me know what you think.

@cloud-fan
Copy link
Contributor

Looks like our implementation is wrong.... We return null if any value in the list is null.
Would you like to send another PR to fix it?

@dilipbiswal
Copy link
Contributor Author

@cloud-fan , please confirm my understanding of the code (fairly new to the codebase..:-)
In the code we go through the entire in list and run evaluate the expression flagging hasNull. But
we continue with next items and return true if we see a match. If we haven't seen it then we look at the hasNull flag and return Null or False.

To confirm if there is a issue, i tried to run the following two queries again. The output looks
ok to me..

select * from inttab where 1 in (1,2,NULL)
var2: org.apache.spark.sql.DataFrame = [c1: int]
+---+
| c1|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
+---+

== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter 1 IN (1,2,null)
'UnresolvedRelation [inttab], None

== Analyzed Logical Plan ==
c1: int
Project [c1#0]
Filter 1 IN (cast(1 as int),cast(2 as int),cast(null as int))
Subquery inttab
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

== Optimized Logical Plan ==
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

== Physical Plan ==
Scan PhysicalRDD[c1#0]

Code Generation: true

select * from inttab where 1 in (NULL,1,2)
var2: org.apache.spark.sql.DataFrame = [c1: int]
+---+
| c1|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
+---+

== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter 1 IN (null,1,2)
'UnresolvedRelation [inttab], None

== Analyzed Logical Plan ==
c1: int
Project [c1#0]
Filter 1 IN (cast(null as int),cast(1 as int),cast(2 as int))
Subquery inttab
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

== Optimized Logical Plan ==
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

== Physical Plan ==
Scan PhysicalRDD[c1#0]

Code Generation: true

Please let me know your thoughts ..

@cloud-fan
Copy link
Contributor

ok misleaded by the imperative code style, sorry for that...
So we will return null if the key is null, or we can't find a value in list matching our key, and there is null value inside list.

This PR make In operation return null when key is NullType, which is right. But I think we can do it further by returning null for any type of key which is literal null(probably another PR).

@dilipbiswal
Copy link
Contributor Author

Thanks a LOT @cloud-fan. Sure.. i will look into it. When you say another PR,
do we mean another JIRA ?

Asking as i am new to the process.

One other question.. what is the process to get this change integrated ? Do i need to initiate any action from my end ?

@cloud-fan
Copy link
Contributor

yup, another JIRA please.

You need ask some committers like @marmbrus to review your PR to get it merged .

My final thoughts on this PR:
When the type is confilct, for example null in (true, array(2,3)), should we return null or report type conflict error during analysis? What hive's behavior for this case?

@dilipbiswal
Copy link
Contributor Author

Very good point..Thanks.. Actually Hive reports an error in this case.

hive> select * from tnull where array(2,3) in (1, array(2,3));
FAILED: SemanticException [Error 10014]: Line 1:37 Wrong arguments '3': The arguments for IN should be the same type! Types are: {array IN (int, array)}

I am not sure what is the right thing to do here.

Any comments @marmbrus ?

@marmbrus
Copy link
Contributor

marmbrus commented Oct 6, 2015

Lets follow hive.

@dilipbiswal
Copy link
Contributor Author

@marmbrus

Thanks a lot michael for looking into this. I debugged hive to understand the
behaviour and would like to share my findings. I wanted to make sure we are doing
the right thing here. Here are the comments at the top of GenericUDFIn() in hive.

/**

  • GenericUDFIn
    *
  • Example usage:
  • SELECT key FROM src WHERE key IN ("238", "1");
    *
  • From MySQL page on IN(): To comply with the SQL standard, IN returns NULL
  • not only if the expression on the left hand side is NULL, but also if no
  • match is found in the list and one of the expressions in the list is NULL.
    *
  • Also noteworthy: type conversion behavior is different from MySQL. With
  • expr IN expr1, expr2... in MySQL, exprN will each be converted into the same
  • type as expr. In the Hive implementation, all expr(N) will be converted into
  • a common type for conversion consistency with other UDF's, and to prevent
  • conversions from a big type to a small type (e.g. int to tinyint)
    */

(case 1) expr in (expr1,... exprN)

  • 1.1 Per sql standard if expr is NULL then IN should return NULL
    (with this PR we are attempting to achieve this)
  • 1.2 if any of the expression in the right hand side is NULL and also
    no match is found in the list then IN should also return NULL.
    We also enforce this semantics by
    implementation

(case 2) Type conversion semantics.

  • 2.1 In MySQL all the expressions in the right hand side is converted
    to left hand side type. Our behaviour matches this semantics. I am
    not sure if this is the standard though.
  • 2.2 In Hive, they seem to find a common type (probably larger type) and
    promote both left hand side and right hand side to that common type.
    I believe this is where it throws the SemanticException.

Our behaviour seems match that of MySql more at the present time. Do we want to change this ?
also about case 1 , it is not clear from the hive comments on what they intended to
do vs what is the external behaviour. Please let me know what you think.

@cloud-fan
Copy link
Contributor

if null in (true, array(2,3)) throws exception in hive, then we probably should first cast key and all values in list to a common type. If there is no such a common type, throw exception.

@rick-ibm
Copy link
Contributor

rick-ibm commented Oct 7, 2015

According to my reading of the SQL Standard,

NULL IN (expr1, ...)

should always evaluate to NULL. Here is my reasoning:

The 2011 SQL Standard, part 2, section 8.4 (in predicate), syntax rule 5 says that

expr IN (expr1, ...)

is equivalent to

expr = ANY (expr1, ...)

Section 8.9 (quantified comparison predicate), general rule 2, subrules (c) and (d), say that

expr = ANY (expr1, ...)

evaluates to the following:

TRUE if (expr = exprN) is TRUE for at least one of the expressions on the right side

FALSE if the right side is an empty list or if (expr = exprN) is FALSE for every exprN on the right side

UNKNOWN (NULL) otherwise

Since (NULL = exprN) is always UNKNOWN and since an IN list must be non-empty (see the BNF in section 8.4), it follows that

NULL IN (expr1, ...)

always evaluates to UNKNOWN (NULL). So Dilip's transformation of

NULL IN (expr1, ...) -> NULL

looks correct to me. There is no need to cast the expressions on the right side to a common type. That is, not unless you want to raise syntax errors in situations where there is no implicit conversion to a common type.

As the following examples show, Postgres, MySQL, and Derby all exhibit the correct Standard behavior.

Thanks,
-Rick


MySQL behavior:

mysql> SELECT NULL IN (1, 2, 3);
SELECT NULL IN (1, 2, 3);
+-------------------+
| NULL IN (1, 2, 3) |
+-------------------+
| NULL |
+-------------------+
1 row in set (0.00 sec)

mysql> SELECT NULL IN (1, 2, NULL);
SELECT NULL IN (1, 2, NULL);
+----------------------+
| NULL IN (1, 2, NULL) |
+----------------------+
| NULL |
+----------------------+
1 row in set (0.00 sec)

mysql> SELECT NULL IN ();
SELECT NULL IN ();
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1


Postgres behavior:

mydb=# SELECT NULL IN (1, 2, 3);
SELECT NULL IN (1, 2, 3);

?column?

(1 row)

mydb=# SELECT NULL IN (1, 2, NULL);
SELECT NULL IN (1, 2, NULL);

?column?

(1 row)

mydb=# SELECT NULL IN ();
SELECT NULL IN ();
ERROR: syntax error at or near ")"
LINE 1: SELECT NULL IN ();


Derby behavior:

ij> VALUES CAST (NULL AS INT) IN (1, 2, 3);

1

NULL

1 row selected
ij> VALUES CAST (NULL AS INT) IN (1, 2, CAST (NULL AS INT));

1

NULL

1 row selected
ij> VALUES CAST (NULL AS INT) IN ();
ERROR 42X01: Syntax error: Encountered ")" at line 1, column 31.

@marmbrus
Copy link
Contributor

marmbrus commented Oct 7, 2015

I think we are all in agreement that null IN (...) should return null.

The only question here is if we should through an error when the stuff in (...) can be coerced to a common type. It seems to me that the user must be confused if they are attempting to compare a value with an incompatible set of options. It would be good to see what other systems do here.

@rick-ibm
Copy link
Contributor

rick-ibm commented Oct 7, 2015

Hi Michael,

Postgres and Derby raise an error if the expressions in the IN list can't be implicitly cast to a common type. MySQL is more forgiving.

Thanks,
-Rick


MySQL:

mysql> SELECT NULL IN ( 1, 'abc' );
SELECT NULL IN ( 1, 'abc' );
+----------------------+
| NULL IN ( 1, 'abc' ) |
+----------------------+
| NULL |
+----------------------+
1 row in set (0.00 sec)


Postgres:

mydb=# SELECT NULL IN ( 1, 'abc' );
SELECT NULL IN ( 1, 'abc' );
ERROR: invalid input syntax for integer: "abc"
LINE 1: SELECT NULL IN ( 1, 'abc' );


Derby:

ij> VALUES CAST (NULL AS INT) IN (1, 'abc' );
ERROR 42818: Comparisons between 'INTEGER' and 'CHAR (UCS_BASIC)' are not supported. Types must be comparable. String types must also have matching collation. If collation does not match, a possible solution is to cast operands to force them to the default collation (e.g. SELECT tablename FROM sys.systables WHERE CAST(tablename AS VARCHAR(128)) = 'T1')

@rick-ibm
Copy link
Contributor

rick-ibm commented Oct 7, 2015

I don't find any guidance in the Standard for what should be done if the left side of the IN operator is an untyped NULL literal. Technically, there is no such thing in the Standard. The NULL needs to be cast to a legal type.

Section 8.4 provides no guidance about the type correspondence of the IN list values. However, section 8.9 implies that the IN list is equivalent to the result of a subquery, which means that we must be able to cast all of the values on the right side to a common type.

Thanks,
-Rick

@dilipbiswal
Copy link
Contributor Author

Thank you @marmbrus @rick-ibm @cloud-fan

I checked the behavior of db2. It also raises an error if the in list types are not compatible.

db2 => select * from f1 where NULL in (1, true)
SQL0401N The data types of the operands for the operation "IN" are not
compatible or comparable. SQLSTATE=42818

I am studying the code now to figure out how to detect this and raise an error.

@dilipbiswal
Copy link
Contributor Author

@cloud-fan
Hi Wenchen, can you please take a look at the changes and let me know what you think..

findWiderCommonType(inTypes) match {
case Some(finalDataType) => Literal.create(null, BooleanType)
case None => i
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of returning literal null, I think we should just wider the types, as it does fix the bug(we can add a rule in Optimizer to return null for this case).

the code can be:

case i @ In(a, b) if b.exists(_.dataType != a.dataType) =>
  findWiderCommonType(i.children.map(_.dataType)) match {
    case Some(finalDataType) => i.withNewChildren(i.children.map(Cast(_, finalDataType)))
    case None => i
  }

@dilipbiswal
Copy link
Contributor Author

@cloud-fan
Thanks a lot. I have implemented the review comments. Please take a look. I looked at the optimizer code,. we already seem to be transforming NULL in (...) to "Filter null" in ConstantFolding rule. So we should be okay here , right ?

@cloud-fan
Copy link
Contributor

I only found one rule in Optimizer to optimize In to InSet when the values in list are all literals.
I think it's better to add a case in NullPropagation to optimize In with literal null key to literal null even the list are not all literals.

Anyway it should be another PR and this PR LGTM.

@marmbrus
Copy link
Contributor

marmbrus commented Oct 8, 2015

Thanks, merging to master.

@asfgit asfgit closed this in dcbd58a Oct 8, 2015
@dilipbiswal
Copy link
Contributor Author

Thanks a lot @marmbrus .

Many thanks to @cloud-fan for his help.

@marmbrus
Copy link
Contributor

marmbrus commented Oct 8, 2015

test this please

@marmbrus
Copy link
Contributor

marmbrus commented Oct 8, 2015

Sorry, this never passed tests and broke something. I'm going to revert. Please reopen the PR.

ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 8, 2015
This reverts commit dcbd58a from apache#8983

Author: Michael Armbrust <michael@databricks.com>

Closes apache#9034 from marmbrus/revert8654.
@dilipbiswal
Copy link
Contributor Author

@marmbrus .. sorry about it. Is there a way i can look at the list of failures ?
I had run :
build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0 -Dmaven.test.failure.ignore=true test

and it reported success. But then this is my first time.. so may not have right configuration.

@cloud-fan
Copy link
Contributor

just reopen this PR and we will trigger a test on our jenkins for it.

For local test, you can do ./build/sbt catalyst/test

@dilipbiswal
Copy link
Contributor Author

@cloud-fan
I didn't see a re-open option on this pull request. Do i have to create a new pull request ?
Please let me know ..

@cloud-fan
Copy link
Contributor

open a new one is also OK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants