Skip to content

[SPARK-49193][SQL] Improve the performance of RowSetUtils.toColumnBasedSet#47699

Closed
wangyum wants to merge 3 commits intoapache:masterfrom
wangyum:toColumnBasedSet
Closed

[SPARK-49193][SQL] Improve the performance of RowSetUtils.toColumnBasedSet#47699
wangyum wants to merge 3 commits intoapache:masterfrom
wangyum:toColumnBasedSet

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Aug 11, 2024

What changes were proposed in this pull request?

Replace while loop with foreach in RowSetUtils.toTColumn.

Why are the changes needed?

Improve the performance of RowSetUtils.toColumnBasedSet:
image

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test.

import org.apache.hive.service.rpc.thrift.TProtocolVersion
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select id, cast(id as string), cast(id as timestamp) from range(200000)")
val dataTypes = df.schema.fields.map(_.dataType)
val rows = df.collect().toList
val start1 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V11, HiveResult.getTimeFormatters)
val start2 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V5, HiveResult.getTimeFormatters)
val start3 = System.currentTimeMillis()
println(s"toColumnBasedSet time: ${start2 - start1}, toRowBasedSet time: ${start3 - start2}")

Before this PR:

toColumnBasedSet time: 17307, toRowBasedSet time: 71

After this PR:

toColumnBasedSet time: 128, toRowBasedSet time: 70

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Aug 11, 2024
val values = new java.util.ArrayList[String](rowSize)
while (i < rowSize) {
val row = rows(i)
rows.foreach { row =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@wangyum wangyum Aug 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the Array's performance is much better than Seq.

toColumnBasedSet time: 138, toRowBasedSet time: 73

@wangyum wangyum changed the title [WIP] Improve the performance of RowSetUtils.toColumnBasedSet [SPARK-49193] Improve the performance of RowSetUtils.toColumnBasedSet Aug 11, 2024
Copy link
Member

@yaooqinn yaooqinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wangyum
Copy link
Member Author

wangyum commented Aug 11, 2024

Another test:

import org.apache.hive.service.rpc.thrift.TProtocolVersion
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select id, cast(id as string), cast(id as timestamp),  cast(id as decimal(18, 0)) from range(20000000)")
val dataTypes = df.schema.fields.map(_.dataType)
val rows = df.collect().toList
val start1 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes,
  TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V11, HiveResult.getTimeFormatters)
val start2 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes,
  TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V5, HiveResult.getTimeFormatters)
val start3 = System.currentTimeMillis()
println(s"toColumnBasedSet time: ${start2 - start1}, toRowBasedSet time: ${start3 - start2}")

Result:

toColumnBasedSet time: 40678, toRowBasedSet time: 90844

@HyukjinKwon HyukjinKwon changed the title [SPARK-49193] Improve the performance of RowSetUtils.toColumnBasedSet [SPARK-49193][SQL] Improve the performance of RowSetUtils.toColumnBasedSet Aug 11, 2024
@HyukjinKwon
Copy link
Member

The docmumentation build, you might need to sync your branch and master branch to the latest.

Merged to master.

@wangyum wangyum deleted the toColumnBasedSet branch August 11, 2024 12:43
yaooqinn pushed a commit that referenced this pull request Aug 12, 2024
…edSet

### What changes were proposed in this pull request?

Replace `while` loop with `foreach` in `RowSetUtils.toTColumn`.

### Why are the changes needed?

Improve the performance of `RowSetUtils.toColumnBasedSet`:
<img width="1196" alt="image" src="https://github.com/user-attachments/assets/f481de39-e0bf-41c5-8fee-09dc1a93c4e1">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.
```scala
import org.apache.hive.service.rpc.thrift.TProtocolVersion
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select id, cast(id as string), cast(id as timestamp) from range(200000)")
val dataTypes = df.schema.fields.map(_.dataType)
val rows = df.collect().toList
val start1 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V11, HiveResult.getTimeFormatters)
val start2 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V5, HiveResult.getTimeFormatters)
val start3 = System.currentTimeMillis()
println(s"toColumnBasedSet time: ${start2 - start1}, toRowBasedSet time: ${start3 - start2}")
```

Before this PR:
```
toColumnBasedSet time: 17307, toRowBasedSet time: 71
```

After this PR:
```
toColumnBasedSet time: 128, toRowBasedSet time: 70
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47699 from wangyum/toColumnBasedSet.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 567d58c)
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request Aug 12, 2024
…edSet

### What changes were proposed in this pull request?

Replace `while` loop with `foreach` in `RowSetUtils.toTColumn`.

### Why are the changes needed?

Improve the performance of `RowSetUtils.toColumnBasedSet`:
<img width="1196" alt="image" src="https://github.com/user-attachments/assets/f481de39-e0bf-41c5-8fee-09dc1a93c4e1">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.
```scala
import org.apache.hive.service.rpc.thrift.TProtocolVersion
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select id, cast(id as string), cast(id as timestamp) from range(200000)")
val dataTypes = df.schema.fields.map(_.dataType)
val rows = df.collect().toList
val start1 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V11, HiveResult.getTimeFormatters)
val start2 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V5, HiveResult.getTimeFormatters)
val start3 = System.currentTimeMillis()
println(s"toColumnBasedSet time: ${start2 - start1}, toRowBasedSet time: ${start3 - start2}")
```

Before this PR:
```
toColumnBasedSet time: 17307, toRowBasedSet time: 71
```

After this PR:
```
toColumnBasedSet time: 128, toRowBasedSet time: 70
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47699 from wangyum/toColumnBasedSet.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 567d58c)
Signed-off-by: Kent Yao <yao@apache.org>
@yaooqinn
Copy link
Member

I backported this to branch-3.5 and branch-3.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants