Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType. #41162

Conversation

liukuijian8040
Copy link

@liukuijian8040 liukuijian8040 commented May 13, 2023

What changes were proposed in this pull request?

See SPARK-43491.
The query results of in ('00') and = '00' are inconsistent.
image

We do this work to ensure when dataTypes of elements in In expression are the same, it will behaviour as same as BinaryComparison expression like EqualTo when the switch is open(spark.sql.legacy.inExpressionCompatibleWithEqualTo.enabled=true).

// test data and content: 
// test.json
// {"name":"Michael","age":0}
spark.read().json("examples/src/main/resources/test.json").createOrReplaceTempView("t");

Before change (see Filter node in Analyzed Logical Plan)

spark.sql("select * from t where age in ('00')").explain(true);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter 'age IN (00)
   +- 'UnresolvedRelation [t], [], false

== Analyzed Logical Plan ==
age: bigint, name: string
Project [age#7L, name#8]
+- Filter cast(age#7L as string) IN (cast(00 as string))
   +- SubqueryAlias t
	  +- Relation[age#7L,name#8] json

== Optimized Logical Plan ==
Filter (isnotnull(age#7L) AND (cast(age#7L as string) = 00))
+- Relation[age#7L,name#8] json

== Physical Plan ==
*(1) Filter (isnotnull(age#7L) AND (cast(age#7L as string) = 00))
+- FileScan json [age#7L,name#8] Batched: false, DataFilters: [isnotnull(age#7L), (cast(age#7L as string) = 00)], Format: JSON, Location: InMemoryFileIndex[file:/D:/code/spark/examples/src/main/resources/test.json], PartitionFilters: [], PushedFilters: [IsNotNull(age)], ReadSchema: struct<age:bigint,name:string>

+---+----+
|age|name|
+---+----+
+---+----+

After change (see Filter node in Analyzed Logical Plan)

spark.sql("select * from t where age in ('00')").explain(true);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter 'age IN (00)
   +- 'UnresolvedRelation [t], [], false

== Analyzed Logical Plan ==
age: bigint, name: string
Project [age#7L, name#8]
+- Filter cast(age#7L as bigint) IN (cast(00 as bigint))
   +- SubqueryAlias t
	  +- Relation[age#7L,name#8] json

== Optimized Logical Plan ==
Filter (isnotnull(age#7L) AND (age#7L = 0))
+- Relation[age#7L,name#8] json

== Physical Plan ==
*(1) Filter (isnotnull(age#7L) AND (age#7L = 0))
+- FileScan json [age#7L,name#8] Batched: false, DataFilters: [isnotnull(age#7L), (age#7L = 0)], Format: JSON, Location: InMemoryFileIndex[file:/D:/code/spark/examples/src/main/resources/test.json], PartitionFilters: [], PushedFilters: [IsNotNull(age), EqualTo(age,0)], ReadSchema: struct<age:bigint,name:string>

+---+-------+
|age|   name|
+---+-------+
|  0|Michael|
+---+-------+

Why are the changes needed?

The query results of Spark SQL and Hive SQL are inconsistent with same sql. Spark SQL calculates 0 in ('00') as false in 3.1.1, which act different from = keyword, but Hive calculates true in 3.1.0 and false in 2.3.3. Hive has fixed the in keyword in 3.1.0, but SparkSQL does not.
for example, this two query sql should have same result, how ever, the query result is different:

scala> spark.sql("select 1 as test where 0 in ('00')").show;
+----+
|test|
+----+
+----+


scala> spark.sql("select 1 as test where 0 = '00'").show;
+----+                                                                          
|test|
+----+
|   1|
+----+

hive 2.3.3
image

hive 3.1.0
image

Does this PR introduce any user-facing change?

We add a switch to support In expression compatible with EqualTo expression with false as default value, to make sure it will not change default behavior of Spark SQL.

How was this patch tested?

By set spark.sql.legacy.inExpressionCompatibleWithEqualTo.enabled=true/false, to check whether the analyzed logical plan will cast expression as expected. For true, it will generate same Cast logical plan as EqualTo, and false will keep the old Cast logical plan solution.

@github-actions github-actions bot added the SQL label May 13, 2023
@liukuijian8040 liukuijian8040 changed the title [SPARK-43491] In expression should act as same as EqualTo when elements in IN expression have same DataType. [SPARK-43491][SQL] In expression should act as same as EqualTo when elements in IN expression have same DataType. May 13, 2023
@liukuijian8040
Copy link
Author

@cloud-fan @wzhfy , please help review this pr, thanks.

@liukuijian8040
Copy link
Author

gentle ping @cloud-fan

@wzhfy
Copy link
Contributor

wzhfy commented May 18, 2023

I also think that the different results between 0 in ('00') and 0 = '00' are confusing, and seems hive already fixed this problem.
Could you also take a look? @cloud-fan @MaxGekk

|}
""".stripMargin
val codeElseIf =
if (!java.lang.Boolean.parseBoolean(x.isNull.toString)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add a comment here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x.null is a ExprCode type. and toString returns code string. So, there are three cases:
a. code is a variable name of the Boolean type, for example, if (varArg_0)
b. code is false, if (false)
c. code is true, if (true)
The else if branch does not need to be generated when isNull is set to true.

image

@cloud-fan
Copy link
Contributor

I think this is indeed an issue, but it seems a bit weird to special-case the 1-element-in-list case. Thoughts? @gengliangwang @srielau

@cloud-fan
Copy link
Contributor

BTW can we also check the behavior in other databases like mysql, postgres, oracle, etc.?

@liukuijian8040
Copy link
Author

liukuijian8040 commented May 18, 2023

quickly check behavior in mysql, and in ('00') has same query result with = '00' . @cloud-fan
image
and also postgres behavior is consistent.
image

@liukuijian8040
Copy link
Author

cc @cloud-fan @srielau please for more other thought of this pr?

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 16, 2023
@github-actions github-actions bot closed this Oct 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants