[SPARK-49193][SQL] Improve the performance of RowSetUtils.toColumnBasedSet by wangyum · Pull Request #47699 · apache/spark

wangyum · 2024-08-11T01:17:12Z

What changes were proposed in this pull request?

Replace while loop with foreach in RowSetUtils.toTColumn.

Why are the changes needed?

Improve the performance of RowSetUtils.toColumnBasedSet.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test.

import org.apache.hive.service.rpc.thrift.TProtocolVersion
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select id, cast(id as string), cast(id as timestamp) from range(200000)")
val dataTypes = df.schema.fields.map(_.dataType)
val rows = df.collect().toList
val start1 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V11, HiveResult.getTimeFormatters)
val start2 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V5, HiveResult.getTimeFormatters)
val start3 = System.currentTimeMillis()
println(s"toColumnBasedSet time: ${start2 - start1}, toRowBasedSet time: ${start3 - start2}")

Before this PR:

toColumnBasedSet time: 17307, toRowBasedSet time: 71

After this PR:

toColumnBasedSet time: 128, toRowBasedSet time: 70
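
The manual test above takes wall-clock timestamps by hand around each call. A minimal sketch of a reusable timing helper follows. `TimingSketch` and `timed` are hypothetical names, not part of Spark, and a single un-warmed run like this gives only a rough measurement, not a rigorous benchmark.

```scala
// Illustrative only: this helper is an assumption, not Spark code.
object TimingSketch {
  // Evaluates `body` once and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

Each `RowSetUtils.toTRowSet(...)` call in the test could be wrapped in `TimingSketch.timed { ... }` to get its elapsed time without tracking `start1`, `start2`, and `start3` separately.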

Was this patch authored or co-authored using generative AI tooling?

No.

HyukjinKwon · 2024-08-11T01:21:48Z

sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/RowSetUtils.scala

        val values = new java.util.ArrayList[String](rowSize)
-        while (i < rowSize) {
-          val row = rows(i)
+        rows.foreach { row =>


Hmmm generally while is faster than foreach (https://github.com/databricks/scala-style-guide?tab=readme-ov-file#traversal-and-zipwithindex)

It seems that Array performs much better than Seq here:

toColumnBasedSet time: 138, toRowBasedSet time: 73
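
A likely explanation for these numbers, assuming `rows` is backed by a `List` as in the manual test (`df.collect().toList`): `rows(i)` on a linked list walks `i` nodes, so an index-based `while` loop does quadratic work overall, while `foreach` visits each element once. The sketch below illustrates the two traversal styles. `TraversalSketch` and its methods are hypothetical names for illustration, not the actual `RowSetUtils` code.

```scala
// Illustrative only: these helpers are assumptions, not part of Spark.
object TraversalSketch {
  // Index-based loop. On a List, rows(i) is O(i), so the whole loop is O(n^2).
  // On an Array or other indexed sequence, rows(i) is O(1) and the loop is O(n).
  def collectByIndex(rows: Seq[Long]): java.util.ArrayList[String] = {
    val rowSize = rows.length
    val values = new java.util.ArrayList[String](rowSize)
    var i = 0
    while (i < rowSize) {
      values.add(rows(i).toString)
      i += 1
    }
    values
  }

  // foreach visits each element exactly once, whatever the Seq implementation: O(n).
  def collectByForeach(rows: Seq[Long]): java.util.ArrayList[String] = {
    val values = new java.util.ArrayList[String](rows.length)
    rows.foreach(row => values.add(row.toString))
    values
  }
}
```

Both methods return the same values. They differ only in how the cost grows with the input's collection type, which is consistent with the observation that passing an `Array` instead of a `List` also removes the slowdown.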

yaooqinn

LGTM

wangyum · 2024-08-11T06:27:47Z

Another test:

import org.apache.hive.service.rpc.thrift.TProtocolVersion
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select id, cast(id as string), cast(id as timestamp),  cast(id as decimal(18, 0)) from range(20000000)")
val dataTypes = df.schema.fields.map(_.dataType)
val rows = df.collect().toList
val start1 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes,
  TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V11, HiveResult.getTimeFormatters)
val start2 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes,
  TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V5, HiveResult.getTimeFormatters)
val start3 = System.currentTimeMillis()
println(s"toColumnBasedSet time: ${start2 - start1}, toRowBasedSet time: ${start3 - start2}")

Result:

toColumnBasedSet time: 40678, toRowBasedSet time: 90844

HyukjinKwon · 2024-08-11T08:32:48Z

For the documentation build, you might need to sync your branch with the latest master branch.

Merged to master.

…edSet

### What changes were proposed in this pull request?

Replace `while` loop with `foreach` in `RowSetUtils.toTColumn`.

### Why are the changes needed?

Improve the performance of `RowSetUtils.toColumnBasedSet`:

<img width="1196" alt="image" src="https://github.com/user-attachments/assets/f481de39-e0bf-41c5-8fee-09dc1a93c4e1">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

```scala
import org.apache.hive.service.rpc.thrift.TProtocolVersion
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select id, cast(id as string), cast(id as timestamp) from range(200000)")
val dataTypes = df.schema.fields.map(_.dataType)
val rows = df.collect().toList
val start1 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V11, HiveResult.getTimeFormatters)
val start2 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V5, HiveResult.getTimeFormatters)
val start3 = System.currentTimeMillis()
println(s"toColumnBasedSet time: ${start2 - start1}, toRowBasedSet time: ${start3 - start2}")
```

Before this PR:

```
toColumnBasedSet time: 17307, toRowBasedSet time: 71
```

After this PR:

```
toColumnBasedSet time: 128, toRowBasedSet time: 70
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47699 from wangyum/toColumnBasedSet.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 567d58c)
Signed-off-by: Kent Yao <yao@apache.org>

yaooqinn · 2024-08-12T12:18:12Z

I backported this to branch-3.5 and branch-3.4

wangyum added 2 commits August 10, 2024 17:55

Improve toColumnBasedSet

73f4b3d

fix

c1b98f5

github-actions bot added the SQL label Aug 11, 2024

HyukjinKwon reviewed Aug 11, 2024

wangyum changed the title ~~[WIP] Improve the performance of RowSetUtils.toColumnBasedSet~~ [SPARK-49193] Improve the performance of RowSetUtils.toColumnBasedSet Aug 11, 2024

yaooqinn reviewed Aug 11, 2024

yaooqinn approved these changes Aug 11, 2024

fix

a959b33

HyukjinKwon approved these changes Aug 11, 2024

HyukjinKwon changed the title ~~[SPARK-49193] Improve the performance of RowSetUtils.toColumnBasedSet~~ [SPARK-49193][SQL] Improve the performance of RowSetUtils.toColumnBasedSet Aug 11, 2024

HyukjinKwon closed this in 567d58c Aug 11, 2024

wangyum deleted the toColumnBasedSet branch August 11, 2024 12:43

wangyum wants to merge 3 commits into apache:master from wangyum:toColumnBasedSet
