
[SPARK-47085][SQL] reduce the complexity of toTRowSet from n^2 to n #45155

Closed
wants to merge 2 commits

Conversation

igreenfield

What changes were proposed in this pull request?

Reduce the complexity of RowSetUtils.toTRowSet from O(n^2) to O(n).
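
To illustrate the pattern (a simplified sketch, not the actual RowSetUtils code; `TRow` here is a hypothetical stand-in for the generated Thrift class):

```scala
// Hypothetical stand-in for the Thrift TRow class, for illustration only.
case class TRow(values: Seq[Any])

// Before: index-based access. On a linked List, rows(i) walks i nodes,
// so the loop performs roughly n^2/2 node visits in total.
def toTRowsQuadratic(rows: Seq[Seq[Any]]): java.util.List[TRow] = {
  val n = rows.length
  val tRows = new java.util.ArrayList[TRow](n)
  var i = 0
  while (i < n) {
    tRows.add(TRow(rows(i)))
    i += 1
  }
  tRows
}

// After: a single traversal, O(n) for any Seq implementation.
def toTRowsLinear(rows: Seq[Seq[Any]]): java.util.List[TRow] = {
  val tRows = new java.util.ArrayList[TRow](rows.length)
  rows.foreach(row => tRows.add(TRow(row)))
  tRows
}
```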

Why are the changes needed?

The quadratic conversion causes noticeable performance issues when fetching large result sets.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests, plus manual testing on AWS EMR.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Feb 18, 2024
@yaooqinn
Member

Could we share some performance results in this section? It appears that replacing the while-loop with map/foreach is the only change made.

@igreenfield
Author

@yaooqinn Yes, this reduces the complexity to O(n), as Seq.apply is O(n) here.
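
For illustration (my own snippet, not from the PR), indexing every element of a `List` in a loop is quadratic because each `apply` restarts from the head:

```scala
// List.apply(i) traverses i cons cells from the head, so indexing every
// element of an n-element List touches 0 + 1 + ... + (n-1) ≈ n^2/2 cells.
val xs: List[Int] = List.tabulate(100000)(identity)
val n = xs.length // length is itself O(n) on a List, so hoist it out of the loop
var sum = 0L
var i = 0
while (i < n) {
  sum += xs(i) // O(i) per call, O(n^2) overall
  i += 1
}
```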

@yaooqinn yaooqinn closed this in 91dfc31 Feb 19, 2024
@yaooqinn
Member

Thanks, @igreenfield, merged to master.

Can you make backport PRs to the 3.5 and 3.4 branches?

@igreenfield
Author

@yaooqinn Yes, I will create them.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @igreenfield and all.

This could be a good improvement. However, the Apache Spark community has a backporting policy that allows bug fixes only.

[Screenshot: Apache Spark backporting policy]

If you really want to backport this, please change the issue type to BUG and provide some background on why this is a regression in Apache Spark 3.4.0.

@yaooqinn
Member

Thank you @dongjoon-hyun.

Hi @igreenfield, can you help update the JIRA side as @dongjoon-hyun suggested?

@igreenfield
Author

Hi @yaooqinn @dongjoon-hyun Done.

@dongjoon-hyun
Member

Unfortunately, this part is still missing, @igreenfield:

> provide some background on why this is a regression in Apache Spark 3.4.0.

@mridulm
Contributor

mridulm commented Feb 21, 2024

Does this PR actually result in an improvement? Seq.apply is expensive only if it is not an indexed Seq.
The change itself is reasonable, but it looks like current usages pass in an IndexedSeq, and so should not be expensive?
If yes, let us not backport it.

Looks like I was wrong about this ... the list is created via ::, sigh.
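
A quick way to see this in a REPL (hypothetical snippet, not Spark code):

```scala
val iter = Iterator.range(0, 10)
val rows = iter.take(5).toList   // an immutable.List, i.e. a chain of :: cells
rows.isInstanceOf[IndexedSeq[_]] // false: rows(i) costs O(i), not O(1)
```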

@dongjoon-hyun
Member

So, we are good, @mridulm? 😄

```
@@ -81,13 +75,11 @@ object RowSetUtils {
    val tRowSet = new TRowSet(startRowOffSet, new java.util.ArrayList[TRow](rowSize))
    var i = 0
    val columnSize = schema.length
    val tColumns = new java.util.ArrayList[TColumn](columnSize)
```
Member


This silently reverts SPARK-46328.

```diff
@@ -116,7 +116,7 @@ private[hive] class SparkExecuteStatementOperation(
     val offset = iter.getPosition
     val rows = iter.take(maxRows).toList
     log.debug(s"Returning result set with ${rows.length} rows from offsets " +
-      s"[${iter.getFetchStart}, ${offset}) with $statementId")
+      s"[${iter.getFetchStart}, ${iter.getPosition}) with $statementId")
```
Member


Is this change correct? `offset` is evaluated before iterating `iter`, so it is the real offset from which we consume.

Author


This is what it was before the other PR, and it prints the correct data.

```diff
@@ -116,7 +116,7 @@ private[hive] class SparkExecuteStatementOperation(
     val offset = iter.getPosition
     val rows = iter.take(maxRows).toList
```
Member

@pan3793 pan3793 Feb 22, 2024


The real matter is that toList returns a linked list. From the Scala List documentation:

> Performance
> Time: List has O(1) prepend and head/tail access. Most other operations are O(n) on the number of elements in the list. This includes the index-based lookup of elements, length, append and reverse.

The simplest approach to address the performance issue is:

```diff
-    val rows = iter.take(maxRows).toList
+    val rows = iter.take(maxRows).toIndexedSeq
```

which reduces the time cost of rows(i) from O(n) to O(1).
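
As a standalone illustration of that one-line change (the names here are mine, not Spark's):

```scala
// toIndexedSeq materializes the iterator into an immutable IndexedSeq
// (a Vector by default), whose apply is effectively constant time.
val data = Iterator.range(0, 1000000)
val rows = data.toIndexedSeq
val last = rows(999999) // near-O(1) lookup instead of a million-node walk
```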

Author


I think a method's complexity should not depend on the input structure, and the current situation proves it.

Contributor

@mridulm mridulm Feb 22, 2024


These are private methods, and we can simply require toRowBasedSet to take an IndexedSeq instead of Seq as proposed by @pan3793, right?
I would have actually preferred this, as it minimizes the change (though the reformulation in this PR is OK as well).
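
A sketch of that alternative (hypothetical, simplified signatures; `TRow` stands in for the real Thrift class):

```scala
object RowBasedSetSketch {
  // Stand-in for the generated Thrift class, for illustration only.
  case class TRow(values: Seq[Any])

  // Requiring IndexedSeq rather than Seq encodes the O(1)-apply assumption
  // in the type, so a caller can no longer pass a linked List by accident.
  private def toRowBasedSet(rows: IndexedSeq[Seq[Any]]): java.util.List[TRow] = {
    val tRows = new java.util.ArrayList[TRow](rows.length)
    var i = 0
    while (i < rows.length) {  // the original while loop stays as-is
      tRows.add(TRow(rows(i))) // effectively O(1) on an IndexedSeq
      i += 1
    }
    tRows
  }
}
```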

```diff
-    val tRows = new java.util.ArrayList[TRow](rowSize)
-    while (i < rowSize) {
-      val row = rows(i)
+    val tRows = rows.map { row =>
```
Contributor


Per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, we need to use a while loop for performance-sensitive code.
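
For reference, the trade-off the style guide describes (my own sketch, not the merged code): both forms below are O(n), but the closure-based one pays per-element dispatch in hot paths:

```scala
// Closure-based traversal: concise and O(n), but every element goes
// through a Function1 call, which can matter in performance-sensitive code.
def sumForeach(xs: Array[Long]): Long = {
  var s = 0L
  xs.foreach(s += _)
  s
}

// While-loop traversal: same O(n), with no closure allocation or dispatch;
// this is the form the style guide recommends for hot paths.
def sumWhile(xs: Array[Long]): Long = {
  var s = 0L
  var i = 0
  while (i < xs.length) {
    s += xs(i)
    i += 1
  }
  s
}
```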
