Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection #25754

Closed

Conversation

WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Sep 11, 2019

What changes were proposed in this pull request?

The Column.isInCollection() with a large size collection will generate an expression with large size children expressions. This make analyzer and optimizer take a long time to run.
In this PR, in isInCollection() function, directly generate InSet expression, avoid generating too many children expressions.

Why are the changes needed?

Column.isInCollection() with a large size collection sometimes become a bottleneck when running sql.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually benchmark it in spark-shell:

def testExplainTime(collectionSize: Int) = {
        val df = spark.range(10).withColumn("id2", col("id") + 1)
        val list = Range(0, collectionSize).toList
        val startTime = System.currentTimeMillis() 
        df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
        val elapsedTime = System.currentTimeMillis() - startTime
        println(s"cost time: ${elapsedTime}ms")
}

Then test on collection size 5, 10, 100, 1000, 10000, test result is:

collection size explain time (before) explain time (after)
5 26ms 29ms
10 30ms 48ms
100 104ms 50ms
1000 1202ms 58ms
10000 10012ms 523ms

@SparkQA
Copy link

SparkQA commented Sep 11, 2019

Test build #110470 has finished for PR 25754 at commit 5d79149.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 changed the title [WIP][SPARK-29048] Improve performance on Column.isInCollection() with a large size collection [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection Sep 11, 2019
@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110510 has finished for PR 25754 at commit ab3e5d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110534 has finished for PR 25754 at commit b60cd94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM after minor changes in test cases.

Thanks! Merged to master.

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
…arge size collection

### What changes were proposed in this pull request?
The `Column.isInCollection()` with a large size collection will generate an expression with large size children expressions. This make analyzer and optimizer take a long time to run.
In this PR, in `isInCollection()` function, directly generate `InSet` expression, avoid generating too many children expressions.

### Why are the changes needed?
`Column.isInCollection()` with a large size collection sometimes become a bottleneck when running sql.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually benchmark it in spark-shell:
```
def testExplainTime(collectionSize: Int) = {
        val df = spark.range(10).withColumn("id2", col("id") + 1)
        val list = Range(0, collectionSize).toList
        val startTime = System.currentTimeMillis()
        df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
        val elapsedTime = System.currentTimeMillis() - startTime
        println(s"cost time: ${elapsedTime}ms")
}
```
Then test on collection size 5, 10, 100, 1000, 10000, test result is:

collection size | explain time (before) | explain time (after)
------ | ------ | ------
5 | 26ms | 29ms
10 | 30ms | 48ms
100 | 104ms | 50ms
1000 | 1202ms | 58ms
10000 | 10012ms | 523ms

Closes apache#25754 from WeichenXu123/improve_in_collection.

Lead-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
cloud-fan pushed a commit that referenced this pull request Apr 28, 2020
…n.isInCollection() with a large size collection"

### What changes were proposed in this pull request?
This reverts commit 5631a96.

Closes #28328

### Why are the changes needed?
The PR  #25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold`is set to 10 (by default):
```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```
The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":
```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #28388 from MaxGekk/fix-isInCollection-revert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 28, 2020
…n.isInCollection() with a large size collection"

### What changes were proposed in this pull request?
This reverts commit 5631a96.

Closes #28328

### Why are the changes needed?
The PR  #25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold`is set to 10 (by default):
```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```
The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":
```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #28388 from MaxGekk/fix-isInCollection-revert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b7cabc8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 30, 2020
…f `isInCollection`

### What changes were proposed in this pull request?
- Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`.
- Test different switch thresholds in the `isInCollection: Scala Collection` test.

### Why are the changes needed?
To prevent regressions like introduced by #25754 and reverted by #28388

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing and new tests in `ColumnExpressionSuite`

Closes #28405 from MaxGekk/test-isInCollection.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 30, 2020
…f `isInCollection`

### What changes were proposed in this pull request?
- Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`.
- Test different switch thresholds in the `isInCollection: Scala Collection` test.

### Why are the changes needed?
To prevent regressions like introduced by #25754 and reverted by #28388

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing and new tests in `ColumnExpressionSuite`

Closes #28405 from MaxGekk/test-isInCollection.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9164865)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants