New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-42669][CONNECT] Short circuit local relation RPCs #40782
Conversation
The failure GA is unrelated to this PR. |
sorry, |
So this happens when
? |
// Short circuit local relation RPCs | ||
val localRelation = plan.getRoot.getLocalRelation | ||
new SparkResult( | ||
Seq.empty[proto.ExecutePlanResponse].iterator.asJava, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if you can populate the data from the LocalRelation into an instance of proto.ExecutePlanResponse
thus we do not need to branch the code in SparkResult
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM.
It happens when the root plan is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The code changes are reduced much less now.
reader.loadNextBatch() | ||
val batch = proto.ExecutePlanResponse.ArrowBatch | ||
.newBuilder() | ||
.setRowCount(reader.getVectorSchemaRoot.getRowCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to set row count? You can just modify SparkResult to make the assert a bit more lenient.
val localRelation = plan.getRoot.getLocalRelation | ||
val response = proto.ExecutePlanResponse.newBuilder() | ||
val reader = new ArrowStreamReader(localRelation.getData.newInput(), allocator) | ||
reader.loadNextBatch() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to close the reader, otherwise the vector root is never cleaned up (memory leak).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. I forgot it.
@hvanhovell The failed GA is unrelated to this PR. |
.newBuilder() | ||
.setData(localRelation.getData) | ||
.build() | ||
response.setArrowBatch(batch).setIsLocalBuilt(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? I don't think we should change the protocol for this.
assert(root.getRowCount == response.getArrowBatch.getRowCount) // HUH! | ||
assert( | ||
response.getIsLocalBuilt || | ||
root.getRowCount == response.getArrowBatch.getRowCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just make the row count optional. It is kinda weird it is in the protocol anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You means change the protocol that make the row count optional ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error: Field "1" on message "ArrowBatch" moved from outside to inside a oneof.
[8](https://github.com/beliefer/spark/actions/runs/4761585011/jobs/8462987474#step:5:9)
Error: buf found 1 breaking changes.
It seems 1 breaking change. @hvanhovell
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess adding the row count as optional to LocalRelation
message is more useful.
The server can check if the expected number of rows are provided, and we can reuse it here to set the row count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell What do you think about my idea if making the row count optional causes the breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell Could you take a close look ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to implement this in Python client, but I'm not sure whether this is useful or not.
We need at least one roundtrip to retrieve some configs to provide a right result, e.g., spark.sql.session.timeZone
.
The test below won't pass without retrieving the config:
https://github.com/apache/spark/blob/d7712776bf28d7c9c0d523624bec70a4bc2d97a0/python/pyspark/sql/tests/test_arrow.py#L335-L358
I'm not sure it's the same in Scala client, though.
Just FYI.
After some more investigation, I realized we need to retrieve some configs in some cases anyway.
I think we should go ahead with this. @beliefer
assert(root.getRowCount == response.getArrowBatch.getRowCount) // HUH! | ||
assert( | ||
response.getIsLocalBuilt || | ||
root.getRowCount == response.getArrowBatch.getRowCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess adding the row count as optional to LocalRelation
message is more useful.
The server can check if the expected number of rows are provided, and we can reuse it here to set the row count.
Because there are difference suggestion from @hvanhovell and @ueshin, I don't know how to continue this job. |
@ueshin @hvanhovell Recently, #41064 added the rowCount statistics to |
cc @HyukjinKwon |
ping @hvanhovell cc @ueshin |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
@hvanhovell Do we need this change? |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Operations on
LocalRelation
can mostly be done locally (without sending RPCs).We should leverage this.
Why are the changes needed?
Avoid sending RPCs for
LocalRelation
.Does this PR introduce any user-facing change?
'No'.
New feature.
How was this patch tested?
Exists test cases.