
[SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory #26591

Closed
wants to merge 1 commit

Conversation

edrevo
Contributor

@edrevo edrevo commented Nov 19, 2019

What changes were proposed in this pull request?

When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions.

However, when someone is implementing WriteBuilder, we only pass them the schema, but not the number of partitions. This is an asymmetrical developer experience.

This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory and exposes the number of partitions of the data that is going to be written.
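
Roughly, the connector-side shape looks like the sketch below. PhysicalWriteInfo, BatchWrite, and the factory/writer interfaces are the v2 connector API named above; the Example* classes and their no-op bodies are purely illustrative.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write._

// Hypothetical connector: the partition count from PhysicalWriteInfo is
// available before any writer task is created, so the sink can be
// provisioned up front.
class ExampleBatchWrite extends BatchWrite {

  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory = {
    val numPartitions = info.numPartitions()
    // e.g. create numPartitions shards in the target system here, before writing starts
    new ExampleWriterFactory(numPartitions)
  }

  override def commit(messages: Array[WriterCommitMessage]): Unit = ()
  override def abort(messages: Array[WriterCommitMessage]): Unit = ()
}

// Minimal factory; a real connector would return a DataWriter that writes
// rows to the pre-provisioned target instead of these no-ops.
class ExampleWriterFactory(numPartitions: Int) extends DataWriterFactory {
  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new DataWriter[InternalRow] {
      override def write(record: InternalRow): Unit = ()
      override def commit(): WriterCommitMessage = new WriterCommitMessage {}
      override def abort(): Unit = ()
      override def close(): Unit = ()
    }
}
```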

Why are the changes needed?

Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example:

  • it could be used to provision a Kafka topic with a specific number of partitions (see the sketch after this list)
  • it could be used to scale a microservice prior to sending the data to it
  • it could be used to create a DSv2 source that sends the data to another Spark cluster (currently not possible since the reader wouldn't be able to know the number of partitions)
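
As a concrete illustration of the Kafka case, a hypothetical sink could provision its topic from createBatchWriterFactory before any writer task runs. The object name, parameters, and replication factor below are made up for the sketch; AdminClient and NewTopic are Kafka's standard admin API.

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
import org.apache.spark.sql.connector.write.PhysicalWriteInfo

object KafkaTopicProvisioner {

  /** Create the output topic with one Kafka partition per Spark write partition. */
  def provisionTopic(info: PhysicalWriteInfo, topic: String, bootstrapServers: String): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)

    val admin = AdminClient.create(props)
    try {
      // Replication factor 1 keeps the sketch simple; a real sink would make it configurable.
      val newTopic = new NewTopic(topic, info.numPartitions(), 1.toShort)
      admin.createTopics(Collections.singleton(newTopic)).all().get() // block until the topic exists
    } finally {
      admin.close()
    }
  }
}
```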

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tests passed

@edrevo
Contributor Author

edrevo commented Nov 19, 2019

@rdblue @cloud-fan, this PR contains the latest feedback from #25990. Sorry for closing the other PR and opening a new one! 😓

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Nov 19, 2019

Test build #114068 has finished for PR 26591 at commit 6d9e427.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114139 has finished for PR 26591 at commit 6d9e427.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114153 has finished for PR 26591 at commit 6d9e427.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I keep hitting errors when merging this PR: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 71: ordinal not in range(128)

@HyukjinKwon do you have any clue?

@cloud-fan cloud-fan changed the title [SPARK-29248][SQL] Add PhysicalWriteInfo with number of partitions [SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory Nov 21, 2019
@cloud-fan
Contributor

I tried to update the PR title and description, but had no luck. I'm going to merge it using GitHub directly.

@cloud-fan
Contributor

Oh, seems like the email address is invalid:

Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 577, in <module>
    main()
  File "./dev/merge_spark_pr.py", line 552, in main
    merge_hash = merge_pr(pr_num, target_ref, title, body, pr_repo_desc)
  File "./dev/merge_spark_pr.py", line 147, in merge_pr
    distinct_authors[0])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 71: ordinal not in range(128)

@edrevo can you set up a different email for your git and rebase your commit?

@edrevo
Contributor Author

edrevo commented Nov 21, 2019

I think it is the á in my name. I'll change it and rebase, no problem. The weird thing is, I have previously contributed to Spark and the Unicode character wasn't a problem back then, so something must have changed in the CI that now breaks with Unicode.

@edrevo
Contributor Author

edrevo commented Nov 21, 2019

rebased

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114239 has finished for PR 26591 at commit 21bbd6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 54c5087 Nov 21, 2019
@cloud-fan
Contributor

@edrevo please open a new PR to add logical write info, thanks!

@edrevo
Contributor Author

edrevo commented Nov 21, 2019

Will do! Many thanks for the patience you've had with me and for steering me in the right direction with the changes.

@rdblue
Contributor

rdblue commented Nov 21, 2019

@edrevo, @cloud-fan, what is the intended purpose of LogicalWriteInfo?

@cloud-fan
Contributor

To make the API more type-safe. I've seen implementations checking that withQueryId is called, and called only once. With the LogicalWriteInfo interface, people can just get the info from it and don't need to worry about potential mistakes on the Spark side.

We can discuss more after @edrevo opens the PR.

@cloud-fan
Contributor

A real example: when @edrevo added the numPartitions info using withNumPartitions, none of us realized that it wasn't set on the streaming side. When @edrevo refactored the code using LogicalWriteInfo, we immediately noticed that it was missing on the streaming side, and then we proposed PhysicalWriteInfo.
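
To make the contrast concrete, here is a rough sketch of the idea being discussed. This is not the merged API; the trait names and members below are illustrative of what such an info object could carry.

```scala
import org.apache.spark.sql.connector.write.WriteBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Illustrative only: all of the logical write inputs arrive in one object,
// so an implementation never has to verify that withQueryId (and friends)
// was called, and called exactly once, before building the write.
trait ExampleLogicalWriteInfo {
  def queryId: String
  def schema: StructType
  def options: CaseInsensitiveStringMap
}

trait ExampleWritableTable {
  // Compare with a builder configured through separate withX(...) calls:
  // here the info is guaranteed to be present when the builder is created.
  def newWriteBuilder(info: ExampleLogicalWriteInfo): WriteBuilder
}
```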
