[SPARK-8654][SQL] Fix Analysis exception when using NULL IN (...) #8983

dilipbiswal · 2015-10-05T18:49:07Z

In the analysis phase , while processing the rules for IN predicate, we
compare the in-list types to the lhs expression type and generate
cast operation if necessary. In the case of NULL [NOT] IN expr1 , we end up
generating cast between in list types to NULL like cast (1 as NULL) which
is not a valid cast.

The fix is to not generate such a cast if the lhs type is a NullType instead
we translate the expression to Literal(Null).

In the analysis phase , while processing the rules for IN predicate, we compare the in-list types to the lhs expression type and generate cast operation if necessary. In the case of NULL [NOT] IN expr1 , we end up generating cast between in list types to NULL like cast (1 as NULL) which is not a valid cast. The fix is to not generate such a cast if the lhs type is a NullType instead we translate the expression to Literal(Null).

cloud-fan · 2015-10-05T19:00:27Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala

+    val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), "a")() :: Nil,
+      LocalRelation()
+    )
+    assertAnalysisSuccess(plan, false)


why change the default value of caseSensitive?

dilipbiswal · 2015-10-05T19:26:57Z

Thanks for reviewing the code Wenchen. I was trying to model the test case based on what was put in the JIRA which did a caseInsensitiveAnalyze. I have fixed it now.

cloud-fan · 2015-10-05T19:54:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala

   */
  object InConversion extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
      // Skip nodes who's children have not been resolved yet.
      case e if !e.childrenResolved => e

+      case i @ In(a, b) if (a.dataType == NullType) =>
+        Literal.create(null, BooleanType)


instead of just casting null to boolean, can we come up with a better idea according to the data types of b?

Ah sorry for my mistake. I thought you are casting the a to boolean type, but actually you just turn the result to boolean null.

Can you reference to a hive doc that says something like "if the value is null, the In operation will always return null"? I think we should follow hive semantic here.

dilipbiswal · 2015-10-05T20:51:13Z

Thanks !!

Do we need to look at the in list types in this case ? The in list types could be literals of different types , right ? for example NULL not in (1, 'a')

Since the result of IN predicate is a boolean type, i thought it would be safe to transform it to
a Null literal of boolean type. Are you thinking of a case where this would not work ?

Thanks a lot in advance for your help.

cloud-fan · 2015-10-05T22:43:13Z

if you change the value type to boolean, then we have to change each element in list to boolean type too, which maybe dangerous as a lot of types can't be casted to boolean. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions

dilipbiswal · 2015-10-05T23:23:25Z

Thanks Wenchen. You are right that not all types can be casted to boolean.

However, in this case, we are not trying to cast the in list types to the LHS type (null type in our case) as we know that this is a special case predicate would always evaluate to NULL. That is why we are simply transforming the in predicate to NULL one and dropping the in list altogether.

== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter NOT null IN (1,2,3,4) => original one
'UnresolvedRelation [inttab], None

== Analyzed Logical Plan ==
c1: int
Project [c1#0]
Filter NOT null => rewritten one
Subquery inttab
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

Please let me know what you think .. If you have a test case in mind that would exhibit a problem then i would like to try it out.

Thanks a lot for your help.

dilipbiswal · 2015-10-06T00:50:06Z

Hi Wenchen,

Here is the link i could find where its a bit confusing on the equality operator.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-RelationalOperators

In order to test it out , i tried the queries on hive like following ..

hive> select * from tnull ;
OK
1
2
NULL

hive> select null = 1 from tnull ;
OK
NULL
NULL
NULL
Time taken: 0.118 seconds, Fetched: 3 row(s)
hive> select null in (1,2) from tnull ;
OK
NULL
NULL
NULL

Please let me know what you think and thanks again for your help.

cloud-fan · 2015-10-06T00:58:45Z

can you try select null in (1,2,null)? I wanna make sure hive doesn't execute the In operation when value is null type.

dilipbiswal · 2015-10-06T01:03:05Z

hive> select null in (1,2,null) from tnull;
OK
NULL
NULL
NULL
Time taken: 0.139 seconds, Fetched: 3 row(s)

cloud-fan · 2015-10-06T01:25:12Z

I checked our implementation, we should return null for this case.

LGTM then.

cloud-fan · 2015-10-06T01:27:30Z

btw you can send another PR to add a rule in Optimizer so that if value is literal null, we can just return null.

dilipbiswal · 2015-10-06T03:46:10Z

Can you please help clarify ? Are you referring to the case when one of the value in the
in list is literal null like 1 in (1, NULL) ? If so, i don't think we can evaluate this to NULL...

Ran the following two queries on hive
hive> select 1 in (1,NULL) from tnull;
OK
true
true
true
Time taken: 1.21 seconds, Fetched: 3 row(s)
hive> select 1 in (NULL,1) from tnull;
OK
true
true
true
Time taken: 0.168 seconds, Fetched: 3 row(s)

We have already taken care of the case LHS type is null. Let me know what you think.

cloud-fan · 2015-10-06T17:12:26Z

Looks like our implementation is wrong.... We return null if any value in the list is null.
Would you like to send another PR to fix it?

dilipbiswal · 2015-10-06T18:05:04Z

@cloud-fan , please confirm my understanding of the code (fairly new to the codebase..:-)
In the code we go through the entire in list and run evaluate the expression flagging hasNull. But
we continue with next items and return true if we see a match. If we haven't seen it then we look at the hasNull flag and return Null or False.

To confirm if there is a issue, i tried to run the following two queries again. The output looks
ok to me..

select * from inttab where 1 in (1,2,NULL)
var2: org.apache.spark.sql.DataFrame = [c1: int]
+---+
| c1|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
+---+

== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter 1 IN (1,2,null)
'UnresolvedRelation [inttab], None

== Analyzed Logical Plan ==
c1: int
Project [c1#0]
Filter 1 IN (cast(1 as int),cast(2 as int),cast(null as int))
Subquery inttab
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

== Optimized Logical Plan ==
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

== Physical Plan ==
Scan PhysicalRDD[c1#0]

Code Generation: true

select * from inttab where 1 in (NULL,1,2)
var2: org.apache.spark.sql.DataFrame = [c1: int]
+---+
| c1|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
+---+

== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter 1 IN (null,1,2)
'UnresolvedRelation [inttab], None

== Analyzed Logical Plan ==
c1: int
Project [c1#0]
Filter 1 IN (cast(null as int),cast(1 as int),cast(2 as int))
Subquery inttab
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

== Optimized Logical Plan ==
LogicalRDD [c1#0], MapPartitionsRDD[4] at rddToDataFrameHolder at :26

== Physical Plan ==
Scan PhysicalRDD[c1#0]

Code Generation: true

Please let me know your thoughts ..

cloud-fan · 2015-10-06T18:22:08Z

ok misleaded by the imperative code style, sorry for that...
So we will return null if the key is null, or we can't find a value in list matching our key, and there is null value inside list.

This PR make In operation return null when key is NullType, which is right. But I think we can do it further by returning null for any type of key which is literal null(probably another PR).

dilipbiswal · 2015-10-06T18:38:24Z

Thanks a LOT @cloud-fan. Sure.. i will look into it. When you say another PR,
do we mean another JIRA ?

Asking as i am new to the process.

One other question.. what is the process to get this change integrated ? Do i need to initiate any action from my end ?

cloud-fan · 2015-10-06T19:46:46Z

yup, another JIRA please.

You need ask some committers like @marmbrus to review your PR to get it merged .

My final thoughts on this PR:
When the type is confilct, for example null in (true, array(2,3)), should we return null or report type conflict error during analysis? What hive's behavior for this case?

dilipbiswal · 2015-10-06T21:15:16Z

Very good point..Thanks.. Actually Hive reports an error in this case.

hive> select * from tnull where array(2,3) in (1, array(2,3));
FAILED: SemanticException [Error 10014]: Line 1:37 Wrong arguments '3': The arguments for IN should be the same type! Types are: {array IN (int, array)}

I am not sure what is the right thing to do here.

Any comments @marmbrus ?

marmbrus · 2015-10-06T22:09:03Z

Lets follow hive.

dilipbiswal · 2015-10-07T01:07:26Z

@marmbrus

Thanks a lot michael for looking into this. I debugged hive to understand the
behaviour and would like to share my findings. I wanted to make sure we are doing
the right thing here. Here are the comments at the top of GenericUDFIn() in hive.

/**

GenericUDFIn
*
Example usage:
SELECT key FROM src WHERE key IN ("238", "1");
*
From MySQL page on IN(): To comply with the SQL standard, IN returns NULL
not only if the expression on the left hand side is NULL, but also if no
match is found in the list and one of the expressions in the list is NULL.
*
Also noteworthy: type conversion behavior is different from MySQL. With
expr IN expr1, expr2... in MySQL, exprN will each be converted into the same
type as expr. In the Hive implementation, all expr(N) will be converted into
a common type for conversion consistency with other UDF's, and to prevent
conversions from a big type to a small type (e.g. int to tinyint)
*/

(case 1) expr in (expr1,... exprN)

1.1 Per sql standard if expr is NULL then IN should return NULL
(with this PR we are attempting to achieve this)
1.2 if any of the expression in the right hand side is NULL and also
no match is found in the list then IN should also return NULL.
We also enforce this semantics by
implementation

(case 2) Type conversion semantics.

2.1 In MySQL all the expressions in the right hand side is converted
to left hand side type. Our behaviour matches this semantics. I am
not sure if this is the standard though.
2.2 In Hive, they seem to find a common type (probably larger type) and
promote both left hand side and right hand side to that common type.
I believe this is where it throws the SemanticException.

Our behaviour seems match that of MySql more at the present time. Do we want to change this ?
also about case 1 , it is not clear from the hive comments on what they intended to
do vs what is the external behaviour. Please let me know what you think.

cloud-fan · 2015-10-07T04:56:33Z

if null in (true, array(2,3)) throws exception in hive, then we probably should first cast key and all values in list to a common type. If there is no such a common type, throw exception.

rick-ibm · 2015-10-07T16:03:33Z

According to my reading of the SQL Standard,

NULL IN (expr1, ...)

should always evaluate to NULL. Here is my reasoning:

The 2011 SQL Standard, part 2, section 8.4 (in predicate), syntax rule 5 says that

expr IN (expr1, ...)

is equivalent to

expr = ANY (expr1, ...)

Section 8.9 (quantified comparison predicate), general rule 2, subrules (c) and (d), say that

expr = ANY (expr1, ...)

evaluates to the following:

TRUE if (expr = exprN) is TRUE for at least one of the expressions on the right side

FALSE if the right side is an empty list or if (expr = exprN) is FALSE for every exprN on the right side

UNKNOWN (NULL) otherwise

Since (NULL = exprN) is always UNKNOWN and since an IN list must be non-empty (see the BNF in section 8.4), it follows that

NULL IN (expr1, ...)

always evaluates to UNKNOWN (NULL). So Dilip's transformation of

NULL IN (expr1, ...) -> NULL

looks correct to me. There is no need to cast the expressions on the right side to a common type. That is, not unless you want to raise syntax errors in situations where there is no implicit conversion to a common type.

As the following examples show, Postgres, MySQL, and Derby all exhibit the correct Standard behavior.

Thanks,
-Rick

MySQL behavior:

mysql> SELECT NULL IN (1, 2, 3);
SELECT NULL IN (1, 2, 3);
+-------------------+
| NULL IN (1, 2, 3) |
+-------------------+
| NULL |
+-------------------+
1 row in set (0.00 sec)

mysql> SELECT NULL IN (1, 2, NULL);
SELECT NULL IN (1, 2, NULL);
+----------------------+
| NULL IN (1, 2, NULL) |
+----------------------+
| NULL |
+----------------------+
1 row in set (0.00 sec)

mysql> SELECT NULL IN ();
SELECT NULL IN ();
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1

Postgres behavior:

mydb=# SELECT NULL IN (1, 2, 3);
SELECT NULL IN (1, 2, 3);

?column?

(1 row)

mydb=# SELECT NULL IN (1, 2, NULL);
SELECT NULL IN (1, 2, NULL);

?column?

(1 row)

mydb=# SELECT NULL IN ();
SELECT NULL IN ();
ERROR: syntax error at or near ")"
LINE 1: SELECT NULL IN ();

Derby behavior:

ij> VALUES CAST (NULL AS INT) IN (1, 2, 3);

1

NULL

1 row selected
ij> VALUES CAST (NULL AS INT) IN (1, 2, CAST (NULL AS INT));

1

NULL

1 row selected
ij> VALUES CAST (NULL AS INT) IN ();
ERROR 42X01: Syntax error: Encountered ")" at line 1, column 31.

marmbrus · 2015-10-07T17:15:49Z

I think we are all in agreement that null IN (...) should return null.

The only question here is if we should through an error when the stuff in (...) can be coerced to a common type. It seems to me that the user must be confused if they are attempting to compare a value with an incompatible set of options. It would be good to see what other systems do here.

rick-ibm · 2015-10-07T17:33:53Z

Hi Michael,

Postgres and Derby raise an error if the expressions in the IN list can't be implicitly cast to a common type. MySQL is more forgiving.

Thanks,
-Rick

MySQL:

mysql> SELECT NULL IN ( 1, 'abc' );
SELECT NULL IN ( 1, 'abc' );
+----------------------+
| NULL IN ( 1, 'abc' ) |
+----------------------+
| NULL |
+----------------------+
1 row in set (0.00 sec)

Postgres:

mydb=# SELECT NULL IN ( 1, 'abc' );
SELECT NULL IN ( 1, 'abc' );
ERROR: invalid input syntax for integer: "abc"
LINE 1: SELECT NULL IN ( 1, 'abc' );

Derby:

ij> VALUES CAST (NULL AS INT) IN (1, 'abc' );
ERROR 42818: Comparisons between 'INTEGER' and 'CHAR (UCS_BASIC)' are not supported. Types must be comparable. String types must also have matching collation. If collation does not match, a possible solution is to cast operands to force them to the default collation (e.g. SELECT tablename FROM sys.systables WHERE CAST(tablename AS VARCHAR(128)) = 'T1')

rick-ibm · 2015-10-07T18:08:11Z

I don't find any guidance in the Standard for what should be done if the left side of the IN operator is an untyped NULL literal. Technically, there is no such thing in the Standard. The NULL needs to be cast to a legal type.

Section 8.4 provides no guidance about the type correspondence of the IN list values. However, section 8.9 implies that the IN list is equivalent to the result of a subquery, which means that we must be able to cast all of the values on the right side to a common type.

Thanks,
-Rick

dilipbiswal · 2015-10-07T18:34:18Z

Thank you @marmbrus @rick-ibm @cloud-fan

I checked the behavior of db2. It also raises an error if the in list types are not compatible.

db2 => select * from f1 where NULL in (1, true)
SQL0401N The data types of the operands for the operation "IN" are not
compatible or comparable. SQLSTATE=42818

I am studying the code now to figure out how to detect this and raise an error.

dilipbiswal · 2015-10-08T03:03:26Z

@cloud-fan
Hi Wenchen, can you please take a look at the changes and let me know what you think..

cloud-fan · 2015-10-08T04:37:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala

+        findWiderCommonType(inTypes) match {
+          case Some(finalDataType) => Literal.create(null, BooleanType)
+          case None => i
+        }


instead of returning literal null, I think we should just wider the types, as it does fix the bug(we can add a rule in Optimizer to return null for this case).

the code can be:

case i @ In(a, b) if b.exists(_.dataType != a.dataType) => findWiderCommonType(i.children.map(_.dataType)) match { case Some(finalDataType) => i.withNewChildren(i.children.map(Cast(_, finalDataType))) case None => i }

dilipbiswal · 2015-10-08T09:27:18Z

@cloud-fan
Thanks a lot. I have implemented the review comments. Please take a look. I looked at the optimizer code,. we already seem to be transforming NULL in (...) to "Filter null" in ConstantFolding rule. So we should be okay here , right ?

cloud-fan · 2015-10-08T17:31:48Z

I only found one rule in Optimizer to optimize In to InSet when the values in list are all literals.
I think it's better to add a case in NullPropagation to optimize In with literal null key to literal null even the list are not all literals.

Anyway it should be another PR and this PR LGTM.

marmbrus · 2015-10-08T17:40:18Z

Thanks, merging to master.

dilipbiswal · 2015-10-08T18:25:46Z

Thanks a lot @marmbrus .

Many thanks to @cloud-fan for his help.

marmbrus · 2015-10-08T20:43:54Z

test this please

marmbrus · 2015-10-08T20:45:58Z

Sorry, this never passed tests and broke something. I'm going to revert. Please reopen the PR.

This reverts commit dcbd58a from apache#8983 Author: Michael Armbrust <michael@databricks.com> Closes apache#9034 from marmbrus/revert8654.

dilipbiswal · 2015-10-08T20:54:31Z

@marmbrus .. sorry about it. Is there a way i can look at the list of failures ?
I had run :
build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0 -Dmaven.test.failure.ignore=true test

and it reported success. But then this is my first time.. so may not have right configuration.

cloud-fan · 2015-10-08T20:56:39Z

just reopen this PR and we will trigger a test on our jenkins for it.

For local test, you can do ./build/sbt catalyst/test

dilipbiswal · 2015-10-08T21:08:07Z

@cloud-fan
I didn't see a re-open option on this pull request. Do i have to create a new pull request ?
Please let me know ..

cloud-fan · 2015-10-08T21:15:12Z

open a new one is also OK.

cloud-fan reviewed Oct 5, 2015
View reviewed changes

Fixing testcase based on review comments

acbd6a6

cloud-fan reviewed Oct 5, 2015
View reviewed changes

dilipbiswal added 2 commits October 7, 2015 19:32

Raise an error if the inlist contains incompatible types

d33e786

Fix scala style check failures

ac9265d

cloud-fan reviewed Oct 8, 2015
View reviewed changes

Apply code review comments

cf976a6

Fix the comment

6bfdf2f

asfgit closed this in dcbd58a Oct 8, 2015

marmbrus mentioned this pull request Oct 8, 2015

Revert [SPARK-8654] [SQL] Fix Analysis exception when using NULL IN #9034

Closed

liancheng mentioned this pull request Oct 8, 2015

[SPARK-8848] [SQL] Refactors Parquet write path to follow parquet-format #8988

Closed

[SPARK-8654][SQL] Fix Analysis exception when using NULL IN (...) #8983

[SPARK-8654][SQL] Fix Analysis exception when using NULL IN (...) #8983

Conversation

dilipbiswal commented Oct 5, 2015

cloud-fan Oct 5, 2015

Choose a reason for hiding this comment

dilipbiswal commented Oct 5, 2015

cloud-fan Oct 5, 2015

Choose a reason for hiding this comment

cloud-fan Oct 5, 2015

Choose a reason for hiding this comment

dilipbiswal commented Oct 5, 2015

cloud-fan commented Oct 5, 2015

dilipbiswal commented Oct 5, 2015

dilipbiswal commented Oct 6, 2015

cloud-fan commented Oct 6, 2015

dilipbiswal commented Oct 6, 2015

cloud-fan commented Oct 6, 2015

cloud-fan commented Oct 6, 2015

dilipbiswal commented Oct 6, 2015

cloud-fan commented Oct 6, 2015

dilipbiswal commented Oct 6, 2015

cloud-fan commented Oct 6, 2015

dilipbiswal commented Oct 6, 2015

cloud-fan commented Oct 6, 2015

dilipbiswal commented Oct 6, 2015

marmbrus commented Oct 6, 2015

dilipbiswal commented Oct 7, 2015

cloud-fan commented Oct 7, 2015

rick-ibm commented Oct 7, 2015

?column?

?column?

1

1

marmbrus commented Oct 7, 2015

rick-ibm commented Oct 7, 2015

rick-ibm commented Oct 7, 2015

dilipbiswal commented Oct 7, 2015

dilipbiswal commented Oct 8, 2015

cloud-fan Oct 8, 2015

Choose a reason for hiding this comment

dilipbiswal commented Oct 8, 2015

cloud-fan commented Oct 8, 2015

marmbrus commented Oct 8, 2015

dilipbiswal commented Oct 8, 2015

marmbrus commented Oct 8, 2015

marmbrus commented Oct 8, 2015

dilipbiswal commented Oct 8, 2015

cloud-fan commented Oct 8, 2015

dilipbiswal commented Oct 8, 2015

cloud-fan commented Oct 8, 2015