
[SPARK-47085][SQL] reduce the complexity of toTRowSet from n^2 to n #45155

Closed
wants to merge 2 commits

Conversation

igreenfield

What changes were proposed in this pull request?

Reduce the complexity of RowSetUtils.toTRowSet from O(n^2) to O(n).
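
To illustrate the pattern (a simplified sketch, not the actual RowSetUtils code; `TRow` here is a hypothetical stand-in for the generated Thrift class):

```scala
// Hypothetical stand-in for the Thrift TRow class, for illustration only.
case class TRow(values: Seq[Any])

// Before: index-based access. On a linked List, rows(i) walks i nodes,
// so the loop performs roughly n^2/2 node visits in total.
def toTRowsQuadratic(rows: Seq[Seq[Any]]): java.util.List[TRow] = {
  val n = rows.length
  val tRows = new java.util.ArrayList[TRow](n)
  var i = 0
  while (i < n) {
    tRows.add(TRow(rows(i)))
    i += 1
  }
  tRows
}

// After: a single traversal, O(n) for any Seq implementation.
def toTRowsLinear(rows: Seq[Seq[Any]]): java.util.List[TRow] = {
  val tRows = new java.util.ArrayList[TRow](rows.length)
  rows.foreach(row => tRows.add(TRow(row)))
  tRows
}
```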

Why are the changes needed?

The quadratic conversion causes noticeable performance issues when fetching large result sets.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests, plus manual testing on AWS EMR.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Feb 18, 2024
@yaooqinn
Member

Could we share some performance results in this section? It appears that replacing the while-loop with map/foreach is the only change made.

@igreenfield
Author

@yaooqinn Yes, this reduces the complexity to O(n), as Seq.apply is O(n) here.
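
For illustration (my own snippet, not from the PR), indexing every element of a `List` in a loop is quadratic because each `apply` restarts from the head:

```scala
// List.apply(i) traverses i cons cells from the head, so indexing every
// element of an n-element List touches 0 + 1 + ... + (n-1) ≈ n^2/2 cells.
val xs: List[Int] = List.tabulate(100000)(identity)
val n = xs.length // length is itself O(n) on a List, so hoist it out of the loop
var sum = 0L
var i = 0
while (i < n) {
  sum += xs(i) // O(i) per call, O(n^2) overall
  i += 1
}
```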

@yaooqinn yaooqinn closed this in 91dfc31 Feb 19, 2024
@yaooqinn
Member

Thanks, @igreenfield, merged to master.

Can you make backport PRs to the 3.5 and 3.4 branches?

@igreenfield
Author

@yaooqinn Yes, I will create them.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @igreenfield and all.

This could be a good improvement. However, the Apache Spark community has a backporting policy that allows bug fixes only.

[Screenshot: Apache Spark backporting policy]

If you really want to backport this, please change the issue type to BUG and provide some background on why this is a regression in Apache Spark 3.4.0.

@yaooqinn
Member

Thank you @dongjoon-hyun.

Hi @igreenfield, can you help update the JIRA side as @dongjoon-hyun suggested?

@igreenfield
Author

Hi @yaooqinn @dongjoon-hyun Done.

@dongjoon-hyun
Member

Unfortunately, this part is still missing, @igreenfield:

> provide some background on why this is a regression in Apache Spark 3.4.0.

@mridulm
Contributor

mridulm commented Feb 21, 2024

Does this PR actually result in an improvement? Seq.apply is expensive only if it is not an indexed Seq.
The change itself is reasonable, but it looks like current usages pass in an IndexedSeq, and so should not be expensive?
If yes, let us not backport it.

Looks like I was wrong about this ... the list is created via ::, sigh.
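
A quick way to see this in a REPL (hypothetical snippet, not Spark code):

```scala
val iter = Iterator.range(0, 10)
val rows = iter.take(5).toList   // an immutable.List, i.e. a chain of :: cells
rows.isInstanceOf[IndexedSeq[_]] // false: rows(i) costs O(i), not O(1)
```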

@dongjoon-hyun
Member

So, we are good, @mridulm? 😄

```
@@ -81,13 +75,11 @@ object RowSetUtils {
    val tRowSet = new TRowSet(startRowOffSet, new java.util.ArrayList[TRow](rowSize))
    var i = 0
    val columnSize = schema.length
    val tColumns = new java.util.ArrayList[TColumn](columnSize)
```
Member


This silently reverts SPARK-46328.

```diff
@@ -116,7 +116,7 @@ private[hive] class SparkExecuteStatementOperation(
     val offset = iter.getPosition
     val rows = iter.take(maxRows).toList
     log.debug(s"Returning result set with ${rows.length} rows from offsets " +
-      s"[${iter.getFetchStart}, ${offset}) with $statementId")
+      s"[${iter.getFetchStart}, ${iter.getPosition}) with $statementId")
```
Member


Is this change correct? `offset` is evaluated before iterating `iter`, so it is the real offset from which we consume.

Author


This is what it was before the other PR, and it prints the correct data.

```diff
@@ -116,7 +116,7 @@ private[hive] class SparkExecuteStatementOperation(
     val offset = iter.getPosition
     val rows = iter.take(maxRows).toList
```
Member

@pan3793 pan3793 Feb 22, 2024


The real matter is that toList returns a linked list. From the Scala List documentation:

> Performance
> Time: List has O(1) prepend and head/tail access. Most other operations are O(n) on the number of elements in the list. This includes the index-based lookup of elements, length, append and reverse.

The simplest approach to address the performance issue is:

```diff
-    val rows = iter.take(maxRows).toList
+    val rows = iter.take(maxRows).toIndexedSeq
```

which reduces the time cost of rows(i) from O(n) to O(1).
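
As a standalone illustration of that one-line change (the names here are mine, not Spark's):

```scala
// toIndexedSeq materializes the iterator into an immutable IndexedSeq
// (a Vector by default), whose apply is effectively constant time.
val data = Iterator.range(0, 1000000)
val rows = data.toIndexedSeq
val last = rows(999999) // near-O(1) lookup instead of a million-node walk
```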

Author


I think a method's complexity should not depend on the input structure, and the current situation proves it.

Contributor

@mridulm mridulm Feb 22, 2024


These are private methods, and we can simply require toRowBasedSet to take an IndexedSeq instead of Seq as proposed by @pan3793, right?
I would have actually preferred this, as it minimizes the change (though the reformulation in this PR is OK as well).
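
A sketch of that alternative (hypothetical, simplified signatures; `TRow` stands in for the real Thrift class):

```scala
object RowBasedSetSketch {
  // Stand-in for the generated Thrift class, for illustration only.
  case class TRow(values: Seq[Any])

  // Requiring IndexedSeq rather than Seq encodes the O(1)-apply assumption
  // in the type, so a caller can no longer pass a linked List by accident.
  private def toRowBasedSet(rows: IndexedSeq[Seq[Any]]): java.util.List[TRow] = {
    val tRows = new java.util.ArrayList[TRow](rows.length)
    var i = 0
    while (i < rows.length) {  // the original while loop stays as-is
      tRows.add(TRow(rows(i))) // effectively O(1) on an IndexedSeq
      i += 1
    }
    tRows
  }
}
```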

```diff
-    val tRows = new java.util.ArrayList[TRow](rowSize)
-    while (i < rowSize) {
-      val row = rows(i)
+    val tRows = rows.map { row =>
```
Contributor


Per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, we need to use a while loop for performance-sensitive code.
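
For reference, the trade-off the style guide describes (my own sketch, not the merged code): both forms below are O(n), but the closure-based one pays per-element dispatch in hot paths:

```scala
// Closure-based traversal: concise and O(n), but every element goes
// through a Function1 call, which can matter in performance-sensitive code.
def sumForeach(xs: Array[Long]): Long = {
  var s = 0L
  xs.foreach(s += _)
  s
}

// While-loop traversal: same O(n), with no closure allocation or dispatch;
// this is the form the style guide recommends for hot paths.
def sumWhile(xs: Array[Long]): Long = {
  var s = 0L
  var i = 0
  while (i < xs.length) {
    s += xs(i)
    i += 1
  }
  s
}
```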
