[SPARK-47085][SQL] reduce the complexity of toTRowSet from n^2 to n #45155
Conversation
Could we share some performance results in this section? It appears that replacing the
@yaooqinn yes, this reduces the complexity to O(n), as Seq.apply is O(n)
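The quadratic cost comes from index-based access on a linked list. A small standalone sketch (illustrative names, not the PR's code) makes the difference concrete:

```scala
// Illustrative sketch (not Spark code): List.apply(i) walks i cons cells,
// so a loop that reads rows(i) for every i does O(n^2) work in total.
// The same loop over an IndexedSeq such as Vector is O(n) overall.
object IndexingCost {
  def sumByIndex(rows: Seq[Int]): Long = {
    var acc = 0L
    var i = 0
    while (i < rows.length) { // rows(i) costs O(i) on a List, O(1) on a Vector
      acc += rows(i)
      i += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val n = 2000
    val linked: Seq[Int] = List.tabulate(n)(identity)   // linked list
    val indexed: Seq[Int] = Vector.tabulate(n)(identity) // indexed
    // Same result either way; only the time complexity differs.
    assert(sumByIndex(linked) == sumByIndex(indexed))
    println(sumByIndex(indexed)) // prints 1999000
  }
}
```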
Thanks, @igreenfield, merged to master. Can you make backport PRs to the 3.5 and 3.4 branches?
@yaooqinn Yes, I will create them.
Hi, @igreenfield and all.
This could be a good improvement. However, the Apache Spark community has a backporting policy that allows bug fixes only.
If you really want to backport this, please change the issue type to BUG and provide some background on why this is a regression at Apache Spark 3.4.0.
Thank you @dongjoon-hyun. Hi @igreenfield, can you help update the jira side as @dongjoon-hyun suggested?
Hi @yaooqinn @dongjoon-hyun Done. |
Unfortunately, this part is still missing, @igreenfield .
Does this PR actually result in an improvement? Looks like I was wrong about this ... the list is created via
So, we are good, @mridulm? 😄
@@ -81,13 +75,11 @@ object RowSetUtils {
    val tRowSet = new TRowSet(startRowOffSet, new java.util.ArrayList[TRow](rowSize))
    var i = 0
    val columnSize = schema.length
    val tColumns = new java.util.ArrayList[TColumn](columnSize)
This silently reverts SPARK-46328.
@@ -116,7 +116,7 @@ private[hive] class SparkExecuteStatementOperation(
      val offset = iter.getPosition
      val rows = iter.take(maxRows).toList
      log.debug(s"Returning result set with ${rows.length} rows from offsets " +
-       s"[${iter.getFetchStart}, ${offset}) with $statementId")
+       s"[${iter.getFetchStart}, ${iter.getPosition}) with $statementId")
Is this change correct? The offset is evaluated before iterating iter, which is the real offset from which we consume.
This is what it was before the other PR, and it prints the correct data.
@@ -116,7 +116,7 @@ private[hive] class SparkExecuteStatementOperation(
      val offset = iter.getPosition
      val rows = iter.take(maxRows).toList
The real matter is that toList returns a linked list. Per the Scala documentation on List:

Performance
Time: List has O(1) prepend and head/tail access. Most other operations are O(n) on the number of elements in the list. This includes the index-based lookup of elements, length, append and reverse.
The simplest approach to address the performance issue is
- val rows = iter.take(maxRows).toList
+ val rows = iter.take(maxRows).toIndexedSeq
which reduces the time cost of rows(i) from O(n) to O(1).
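A self-contained sketch of this one-line fix, with a stand-in iterator in place of the real result set (the names here are illustrative, not Spark's):

```scala
// Illustrative sketch of the suggested one-line change: materialize the
// fetched rows as an IndexedSeq so that later index-based access is O(1).
object FetchSketch {
  def main(args: Array[String]): Unit = {
    val maxRows = 5
    val iter = Iterator.from(0) // stand-in for the operation's row iterator

    // Before: iter.take(maxRows).toList        -> linked list, rows(i) is O(i)
    // After:  iter.take(maxRows).toIndexedSeq  -> effectively O(1) rows(i)
    val rows: IndexedSeq[Int] = iter.take(maxRows).toIndexedSeq

    println(rows.mkString(",")) // prints 0,1,2,3,4
  }
}
```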
I think a method's complexity should not depend on the structure of its input, and the current situation proves it.
These are private methods, and we can simply require toRowBasedSet to take an IndexedSeq instead of a Seq, as proposed by @pan3793, right?
I would actually have preferred this, as it minimizes the change (though the reformulation in this PR is ok as well).
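A hedged sketch of what tightening a private signature to IndexedSeq could look like (rowLengths below is a simplified stand-in, not the actual toRowBasedSet):

```scala
// Illustrative sketch: requiring IndexedSeq at the boundary documents and
// enforces that rows(i) inside the method is O(1); a caller holding a
// List must convert explicitly before calling.
object SignatureSketch {
  private def rowLengths(rows: IndexedSeq[String]): IndexedSeq[Int] = {
    val out = new Array[Int](rows.length)
    var i = 0
    while (i < rows.length) { // O(1) indexed access guaranteed by the type
      out(i) = rows(i).length
      i += 1
    }
    out.toIndexedSeq
  }

  def main(args: Array[String]): Unit = {
    val rows = List("a", "bb", "ccc")
    // rowLengths(rows) would not compile; the List -> IndexedSeq
    // conversion has to be explicit at the call site:
    println(rowLengths(rows.toIndexedSeq).mkString(",")) // prints 1,2,3
  }
}
```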
-    val tRows = new java.util.ArrayList[TRow](rowSize)
-    while (i < rowSize) {
-      val row = rows(i)
+    val tRows = rows.map { row =>
Per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, we need to use a while loop for performance-sensitive code.
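A minimal sketch contrasting the two traversal styles under discussion (the real TRow/TColumn types are replaced with plain Ints to keep it self-contained):

```scala
import java.util.ArrayList

// Illustrative sketch: the map-based version allocates a closure per
// element, while the style guide's while loop over an IndexedSeq does not.
// Both build the same java.util.ArrayList.
object TraversalStyles {
  def main(args: Array[String]): Unit = {
    val rows: IndexedSeq[Int] = Vector(1, 2, 3)

    // Functional style, as written in the PR:
    val viaMap = new ArrayList[Int](rows.length)
    rows.map(r => r * 2).foreach(viaMap.add)

    // While-loop style, as the Databricks Scala style guide recommends
    // for performance-sensitive code:
    val viaWhile = new ArrayList[Int](rows.length)
    var i = 0
    while (i < rows.length) {
      viaWhile.add(rows(i) * 2)
      i += 1
    }

    assert(viaMap == viaWhile)
    println(viaWhile) // prints [2, 4, 6]
  }
}
```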
What changes were proposed in this pull request?
Reduce the complexity of RowSetUtils.toTRowSet from O(n^2) to O(n).
Why are the changes needed?
This causes performance issues.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tests + test manually on AWS EMR
Was this patch authored or co-authored using generative AI tooling?
No