
[SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory #26591

Closed
wants to merge 1 commit

Conversation

edrevo
Contributor

@edrevo edrevo commented Nov 19, 2019

What changes were proposed in this pull request?

When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions.

However, when someone is implementing WriteBuilder, we only pass them the schema, but not the number of partitions. This is an asymmetrical developer experience.

This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory and exposes the number of partitions of the data that is going to be written.
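
Roughly, the connector-side shape looks like the sketch below. PhysicalWriteInfo, BatchWrite, and the factory/writer interfaces are the v2 connector API named above; the Example* classes and their no-op bodies are purely illustrative.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write._

// Hypothetical connector: the partition count from PhysicalWriteInfo is
// available before any writer task is created, so the sink can be
// provisioned up front.
class ExampleBatchWrite extends BatchWrite {

  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory = {
    val numPartitions = info.numPartitions()
    // e.g. create numPartitions shards in the target system here, before writing starts
    new ExampleWriterFactory(numPartitions)
  }

  override def commit(messages: Array[WriterCommitMessage]): Unit = ()
  override def abort(messages: Array[WriterCommitMessage]): Unit = ()
}

// Minimal factory; a real connector would return a DataWriter that writes
// rows to the pre-provisioned target instead of these no-ops.
class ExampleWriterFactory(numPartitions: Int) extends DataWriterFactory {
  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new DataWriter[InternalRow] {
      override def write(record: InternalRow): Unit = ()
      override def commit(): WriterCommitMessage = new WriterCommitMessage {}
      override def abort(): Unit = ()
      override def close(): Unit = ()
    }
}
```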

Why are the changes needed?

Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example:

  • it could be used to provision a Kafka topic with a specific number of partitions (see the sketch after this list)
  • it could be used to scale a microservice prior to sending the data to it
  • it could be used to create a DSv2 source that sends the data to another Spark cluster (currently not possible since the reader wouldn't be able to know the number of partitions)
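
As a concrete illustration of the Kafka case, a hypothetical sink could provision its topic from createBatchWriterFactory before any writer task runs. The object name, parameters, and replication factor below are made up for the sketch; AdminClient and NewTopic are Kafka's standard admin API.

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
import org.apache.spark.sql.connector.write.PhysicalWriteInfo

object KafkaTopicProvisioner {

  /** Create the output topic with one Kafka partition per Spark write partition. */
  def provisionTopic(info: PhysicalWriteInfo, topic: String, bootstrapServers: String): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)

    val admin = AdminClient.create(props)
    try {
      // Replication factor 1 keeps the sketch simple; a real sink would make it configurable.
      val newTopic = new NewTopic(topic, info.numPartitions(), 1.toShort)
      admin.createTopics(Collections.singleton(newTopic)).all().get() // block until the topic exists
    } finally {
      admin.close()
    }
  }
}
```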

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tests passed

@edrevo
Contributor Author

edrevo commented Nov 19, 2019

@rdblue @cloud-fan, this PR contains the latest feedback from #25990. Sorry for closing the other PR and opening a new one! 😓

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Nov 19, 2019

Test build #114068 has finished for PR 26591 at commit 6d9e427.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114139 has finished for PR 26591 at commit 6d9e427.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114153 has finished for PR 26591 at commit 6d9e427.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I keep hitting errors when merging this PR: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 71: ordinal not in range(128)

@HyukjinKwon do you have any clue?

@cloud-fan cloud-fan changed the title [SPARK-29248][SQL] Add PhysicalWriteInfo with number of partitions [SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory Nov 21, 2019
@cloud-fan
Contributor

I tried to update the PR title and description, but had no luck. I'm going to merge it using GitHub directly.

@cloud-fan
Contributor

Oh, seems like the email address is invalid:

Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 577, in <module>
    main()
  File "./dev/merge_spark_pr.py", line 552, in main
    merge_hash = merge_pr(pr_num, target_ref, title, body, pr_repo_desc)
  File "./dev/merge_spark_pr.py", line 147, in merge_pr
    distinct_authors[0])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 71: ordinal not in range(128)

@edrevo can you set up a different email for your git and rebase your commit?

@edrevo
Contributor Author

edrevo commented Nov 21, 2019

I think it is the á in my name. I'll change it and rebase, no problem. The weird thing is, I have previously contributed to Spark and the Unicode character wasn't a problem back then, so something must have changed in the CI that now breaks with Unicode.

@edrevo
Contributor Author

edrevo commented Nov 21, 2019

rebased

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114239 has finished for PR 26591 at commit 21bbd6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 54c5087 Nov 21, 2019
@cloud-fan
Contributor

@edrevo please open a new PR to add logical write info, thanks!

@edrevo
Contributor Author

edrevo commented Nov 21, 2019

Will do! Many thanks for the patience you've had with me and for steering me in the right direction with the changes.

@rdblue
Contributor

rdblue commented Nov 21, 2019

@edrevo, @cloud-fan, what is the intended purpose of LogicalWriteInfo?

@cloud-fan
Contributor

To make the API more type-safe. I've seen implementations checking that withQueryId is called, and called only once. With the LogicalWriteInfo interface, people can just get the info from it and don't need to worry about potential mistakes on the Spark side.

We can discuss more after @edrevo opens the PR.

@cloud-fan
Contributor

A real example: when @edrevo added the numPartitions info using withNumPartitions, none of us realized that it wasn't set on the streaming side. When @edrevo refactored the code using LogicalWriteInfo, we immediately noticed that it was missing on the streaming side, and then we proposed PhysicalWriteInfo.
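
To make the contrast concrete, here is a rough sketch of the idea being discussed. This is not the merged API; the trait names and members below are illustrative of what such an info object could carry.

```scala
import org.apache.spark.sql.connector.write.WriteBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Illustrative only: all of the logical write inputs arrive in one object,
// so an implementation never has to verify that withQueryId (and friends)
// was called, and called exactly once, before building the write.
trait ExampleLogicalWriteInfo {
  def queryId: String
  def schema: StructType
  def options: CaseInsensitiveStringMap
}

trait ExampleWritableTable {
  // Compare with a builder configured through separate withX(...) calls:
  // here the info is guaranteed to be present when the builder is created.
  def newWriteBuilder(info: ExampleLogicalWriteInfo): WriteBuilder
}
```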
