[SPARK-29248][SQL] Provide number of partitions when creating v2 data writer factory #26591
Conversation
@rdblue @cloud-fan, this PR contains the latest feedback from #25990. Sorry for closing the other PR and opening a new one! 😓
ok to test
Test build #114068 has finished for PR 26591 at commit
retest this please |
Test build #114139 has finished for PR 26591 at commit
retest this please |
Test build #114153 has finished for PR 26591 at commit
I keep hitting errors when merging this PR. @HyukjinKwon, do you have any clue?
I tried to update the PR title and description, but had no luck. I'm going to merge it using GitHub directly.
oh, seems like the email address is invalid
@edrevo can you set up a different email for your git and rebase your commit?
I think it is the
rebased
Test build #114239 has finished for PR 26591 at commit
thanks, merging to master!
@edrevo please open a new PR to add logical write info, thanks!
will do! many thanks for the patience you've had with me and for steering me in the right direction with the changes.
@edrevo, @cloud-fan, what is the intended purpose of LogicalWriteInfo?
To make the API more type-safe. I've seen implementations checking if… We can discuss more after @edrevo opens the PR.
A real example: when @edrevo was adding the numPartitions info using…
What changes were proposed in this pull request?
When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions.
However, when someone is implementing a WriteBuilder we only pass them the schema, not the number of partitions. This is an asymmetric developer experience.
This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory and carries the number of partitions of the data that is going to be written.
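The shape of the new hook can be sketched as follows. This is a hedged, standalone illustration of the interfaces named in the PR description (PhysicalWriteInfo, createBatchWriterFactory), not a copy of Spark's source; the real interfaces carry more methods and live in Spark's connector packages.

```java
// Minimal, self-contained sketch of the hook described above. The names
// mirror the PR description; this is an illustration, not Spark's source.
interface PhysicalWriteInfo {
    // Number of input partitions the write will receive.
    int numPartitions();
}

interface DataWriterFactory {
    // In the real API this would create a DataWriter per partition.
}

class DemoBatchWrite {
    // The write implementation can now inspect numPartitions() and
    // provision external resources before any task starts writing.
    DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) {
        System.out.println("Preparing sink for " + info.numPartitions() + " partitions");
        return new DataWriterFactory() {};
    }
}

public class PhysicalWriteInfoSketch {
    public static void main(String[] args) {
        DemoBatchWrite write = new DemoBatchWrite();
        // PhysicalWriteInfo has a single abstract method, so a lambda works here.
        write.createBatchWriterFactory(() -> 8);
    }
}
```

The key point is that the partition count is available when the factory is created, i.e. before any writer task runs.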
Why are the changes needed?
Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example:
- it could be used to provision a Kafka topic with a specific number of partitions
- it could be used to scale a microservice prior to sending the data to it
- it could be used to create a DSv2 source that sends the data to another Spark cluster (currently not possible, since the receiving reader wouldn't be able to know the number of partitions)
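As a sketch of the first use case above, a connector could size the target topic from the reported partition count before returning its writer factory. The TopicRegistry below is a hypothetical in-memory stand-in for a Kafka admin client, used only to keep the snippet self-contained; a real connector would call Kafka's admin API instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Kafka admin client; a real connector
// would talk to the broker's admin API instead.
class TopicRegistry {
    private final Map<String, Integer> topics = new HashMap<>();

    void createTopic(String name, int numPartitions) {
        topics.put(name, numPartitions);
    }

    Integer partitionsOf(String name) {
        return topics.get(name); // null if the topic does not exist yet
    }
}

public class ProvisionOnWrite {
    // Before any data is written, size the target topic to match the
    // number of write partitions reported to the factory-creation hook.
    static void provision(TopicRegistry admin, String topic, int numPartitions) {
        if (admin.partitionsOf(topic) == null) {
            admin.createTopic(topic, numPartitions);
        }
    }

    public static void main(String[] args) {
        TopicRegistry admin = new TopicRegistry();
        // 8 would come from the partition count passed to the writer factory.
        provision(admin, "events", 8);
        System.out.println("events -> " + admin.partitionsOf("events") + " partitions");
    }
}
```

Without the partition count at factory-creation time, this kind of up-front provisioning would have to guess or defer until the first writer task runs.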
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tests passed