[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection #25754

WeichenXu123 · 2019-09-11T08:18:46Z

What changes were proposed in this pull request?

Calling Column.isInCollection() with a large collection generates an expression with one child expression per element. This makes the analyzer and optimizer take a long time to run.
This PR changes isInCollection() to generate an InSet expression directly, which avoids creating that many child expressions.
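
To illustrate why the number of child expressions matters, here is a minimal toy model in plain Scala. The names `Expr`, `Attr`, `Lit`, `InList`, `InSetLike`, and `TreeSize` are hypothetical stand-ins invented for this sketch and are not Catalyst's real classes. It only assumes that every pass over the plan visits each node of the expression tree once.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst's real expression classes.
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { val children: Seq[Expr] = Nil }
case class Lit(value: Any) extends Expr { val children: Seq[Expr] = Nil }

// One child expression per candidate value.
case class InList(value: Expr, list: Seq[Expr]) extends Expr {
  val children: Seq[Expr] = value +: list
}

// Candidate values are kept as plain data, so the node has a single child.
case class InSetLike(child: Expr, hset: Set[Any]) extends Expr {
  val children: Seq[Expr] = Seq(child)
}

object TreeSize {
  // Count the nodes that a full traversal of the tree would visit.
  def countNodes(e: Expr): Int = 1 + e.children.map(countNodes).sum

  def main(args: Array[String]): Unit = {
    val values = (0 until 10000).toList
    val asList = InList(Attr("id"), values.map(Lit(_)))
    val asSet = InSetLike(Attr("id"), values.toSet[Any])
    println(s"InList nodes: ${countNodes(asList)}")
    println(s"InSetLike nodes: ${countNodes(asSet)}")
  }
}
```

In this model the list form grows by one node per element, while the set form stays at two nodes regardless of collection size.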

Why are the changes needed?

Column.isInCollection() with a large collection can become a bottleneck when running SQL queries.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Benchmarked manually in spark-shell:

import org.apache.spark.sql.functions.col

// Time how long it takes to plan and explain a query with two isInCollection() filters.
def testExplainTime(collectionSize: Int): Unit = {
  val df = spark.range(10).withColumn("id2", col("id") + 1)
  val list = Range(0, collectionSize).toList
  val startTime = System.currentTimeMillis()
  df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
  val elapsedTime = System.currentTimeMillis() - startTime
  println(s"cost time: ${elapsedTime}ms")
}

Results for collection sizes 5, 10, 100, 1000, and 10000:

collection size | explain time (before) | explain time (after)
------ | ------ | ------
5 | 26ms | 29ms
10 | 30ms | 48ms
100 | 104ms | 50ms
1000 | 1202ms | 58ms
10000 | 10012ms | 523ms
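
As a rough reading aid, the ratio of "before" to "after" can be computed from the table. The sketch below is plain Scala with hypothetical names (`Measurement`, `ExplainSpeedup`); the numbers are copied from the table and come from a single manual run, so small differences at sizes 5 and 10 may simply be noise.

```scala
// Hypothetical helper for reading the table; not part of the patch.
case class Measurement(size: Int, beforeMs: Int, afterMs: Int) {
  // Values above 1.0 mean the "after" run was faster.
  def speedup: Double = beforeMs.toDouble / afterMs
}

object ExplainSpeedup {
  // Rows copied from the benchmark table.
  val results: Seq[Measurement] = Seq(
    Measurement(5, 26, 29),
    Measurement(10, 30, 48),
    Measurement(100, 104, 50),
    Measurement(1000, 1202, 58),
    Measurement(10000, 10012, 523)
  )

  def main(args: Array[String]): Unit =
    results.foreach { r =>
      println(f"size ${r.size}%6d: ${r.speedup}%.2fx")
    }
}
```

The ratio is below 1 for the two smallest sizes and roughly 20 for sizes 1000 and 10000.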

sql/core/src/main/scala/org/apache/spark/sql/Column.scala

SparkQA · 2019-09-11T11:52:38Z

Test build #110470 has finished for PR 25754 at commit 5d79149.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-12T13:25:28Z

Test build #110510 has finished for PR 25754 at commit ab3e5d4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-12T22:12:36Z

Test build #110534 has finished for PR 25754 at commit b60cd94.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2019-09-13T00:17:45Z

LGTM after minor changes in test cases.

Thanks! Merged to master.

…arge size collection

### What changes were proposed in this pull request?

The `Column.isInCollection()` with a large size collection will generate an expression with large size children expressions. This makes the analyzer and optimizer take a long time to run.
In this PR, the `isInCollection()` function directly generates an `InSet` expression, which avoids generating too many children expressions.

### Why are the changes needed?

`Column.isInCollection()` with a large size collection sometimes becomes a bottleneck when running SQL.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually benchmark it in spark-shell:

```
def testExplainTime(collectionSize: Int) = {
  val df = spark.range(10).withColumn("id2", col("id") + 1)
  val list = Range(0, collectionSize).toList
  val startTime = System.currentTimeMillis()
  df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
  val elapsedTime = System.currentTimeMillis() - startTime
  println(s"cost time: ${elapsedTime}ms")
}
```

Then test on collection size 5, 10, 100, 1000, 10000, test result is:

collection size | explain time (before) | explain time (after)
------ | ------ | ------
5 | 26ms | 29ms
10 | 30ms | 48ms
100 | 104ms | 50ms
1000 | 1202ms | 58ms
10000 | 10012ms | 523ms

Closes apache#25754 from WeichenXu123/improve_in_collection.

Lead-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

…n.isInCollection() with a large size collection"

### What changes were proposed in this pull request?

This reverts commit 5631a96.

Closes #28328

### Why are the changes needed?

The PR #25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (the default):

```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```

The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":

```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

### Does this PR introduce any user-facing change?

Yes

### How was this patch tested?

```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #28388 from MaxGekk/fix-isInCollection-revert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…n.isInCollection() with a large size collection"

### What changes were proposed in this pull request?

This reverts commit 5631a96.

Closes #28328

### Why are the changes needed?

The PR #25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (the default):

```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```

The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":

```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

### Does this PR introduce any user-facing change?

Yes

### How was this patch tested?

```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #28388 from MaxGekk/fix-isInCollection-revert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b7cabc8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…f `isInCollection`

### What changes were proposed in this pull request?

- Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`.
- Test different switch thresholds in the `isInCollection: Scala Collection` test.

### Why are the changes needed?

To prevent regressions like the one introduced by #25754 and reverted by #28388.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

By existing and new tests in `ColumnExpressionSuite`.

Closes #28405 from MaxGekk/test-isInCollection.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…f `isInCollection`

### What changes were proposed in this pull request?

- Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`.
- Test different switch thresholds in the `isInCollection: Scala Collection` test.

### Why are the changes needed?

To prevent regressions like the one introduced by #25754 and reverted by #28388.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

By existing and new tests in `ColumnExpressionSuite`.

Closes #28405 from MaxGekk/test-isInCollection.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9164865)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

init pr

5d79149

cloud-fan reviewed Sep 11, 2019

sql/core/src/main/scala/org/apache/spark/sql/Column.scala

WeichenXu123 changed the title ~~[WIP][SPARK-29048] Improve performance on Column.isInCollection() with a large size collection~~ [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection Sep 11, 2019

gatorsmile reviewed Sep 11, 2019

sql/core/src/main/scala/org/apache/spark/sql/Column.scala

dongjoon-hyun added the SQL label Sep 11, 2019

address comments

ab3e5d4

Update ColumnExpressionSuite.scala

b60cd94

gatorsmile closed this in 5631a96 Sep 13, 2019

dongjoon-hyun mentioned this pull request Apr 24, 2020

[SPARK-31553][SQL] Fix isInCollection for collection sizes above the optimisation threshold #28328

Closed

MaxGekk mentioned this pull request Apr 28, 2020

[SPARK-31553][SQL] Revert "[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection" #28388

Closed

MaxGekk mentioned this pull request Apr 29, 2020

[SPARK-31553][SQL][TESTS][FOLLOWUP] Tests for collection elem types of isInCollection #28405

Closed
