[WIP]Introducing a separation of concerns between data sources that are gi… by RogerDunn · Pull Request #30106 · apache/spark

RogerDunn · 2020-10-20T15:47:42Z

…ven a dataframe's schema upon write, from FileDataSourceV2, in part to allow data sources that require this notion to be developed in Java (FileDataSourceV2 is not compatible with Java because of its private Table member)

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?


AmplabJenkins · 2020-10-20T15:50:33Z

Can one of the admins verify this patch?

RogerDunn · 2020-10-20T15:53:36Z

@AmplabJenkins, thank you for the speedy note on this patch. There is no urgency on it at the moment. I marked it as a [WIP] PR because I'm currently reviewing this change (weighed against other alternatives) with another Spark developer. I'll keep the notes on this PR up to date. Thank you!

cloud-fan · 2020-10-21T15:02:07Z

...e/src/main/scala/org/apache/spark/sql/execution/datasources/v2/SchemaOnWriteDataSource.scala

+/**
+ * A data source that is given the DataFrame's schema upon write operations.
+ */
+trait SchemaOnWriteDataSource extends TableProvider with DataSourceRegister {


One option is to add this trait, and another option is to officially make FileDataSourceV2 a public developer API. Does this "schema-on-write" behavior make sense for non-file data sources?

cc @gengliangwang

gengliangwang · 2020-10-21

+1, making FileDataSourceV2 a public developer API seems simpler.

cloud-fan · 2020-11-04T16:09:33Z

Schema inference is already part of the DS v2 API TableProvider, which is not bound to file sources. I think this "skip schema inference on write" feature should not be bound to file sources either.

Instead of adding a marker trait, I think we can just reuse TableProvider.supportsExternalMetadata, which indicates that schema inference is expensive. If supportsExternalMetadata == true, DataFrameWriter.save should not infer the schema and should just pass the input schema to the TableProvider.
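The rule proposed in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was eventually merged: `Field`, `Schema`, `ProviderSketch`, and `resolveWriteSchema` are hypothetical stand-ins so the snippet runs without Spark on the classpath. Only the names `supportsExternalMetadata` and `inferSchema` are borrowed from the real `TableProvider` interface.

```scala
// Simplified stand-ins; these are NOT the real Spark classes (hypothetical names).
final case class Field(name: String, dataType: String)
final case class Schema(fields: Seq[Field])

trait ProviderSketch {
  // Plays the role of TableProvider.supportsExternalMetadata (false by default).
  def supportsExternalMetadata: Boolean = false

  // Plays the role of TableProvider.inferSchema; assumed to be potentially expensive.
  def inferSchema(options: Map[String, String]): Schema
}

object SchemaOnWriteSketch {
  // Hypothetical helper: picks the schema a write path would hand to the provider.
  // If the provider accepts external metadata, inference is skipped and the
  // DataFrame's own schema is passed through; otherwise the provider infers it.
  def resolveWriteSchema(
      provider: ProviderSketch,
      inputSchema: Schema,
      options: Map[String, String]): Schema =
    if (provider.supportsExternalMetadata) inputSchema
    else provider.inferSchema(options)
}
```

With this shape, a source opts in by overriding a single method rather than by mixing in a separate marker trait, which is the simplification the comment argues for.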

RogerDunn · 2020-11-04T19:16:21Z

@cloud-fan Your idea sounds just right. Are you proposing to make that change in the work you're already doing (in which case I'll remove this PR)?

cloud-fan · 2020-11-05T04:45:55Z

@gengliangwang is creating a PR to implement this idea.

gengliangwang · 2020-11-06T05:23:24Z

I have created #30273 for this.

RogerDunn · 2020-11-06T14:23:20Z

@gengliangwang your PR is better, thank you!

cloud-fan reviewed Oct 21, 2020


RogerDunn closed this Nov 6, 2020

RogerDunn wants to merge 1 commit into apache:master from RogerDunn:develop


4 participants
