
[WIP] Introducing a separation of concerns between data sources that are gi… #30106

Closed
RogerDunn wants to merge 1 commit into apache:master from RogerDunn:develop


@RogerDunn

…ven a dataframe's schema upon write, from FileDataSourceV2, in part to allow data sources that require this notion to be developed in Java (FileDataSourceV2 is not compatible with Java because of its private Table member)

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

@AmplabJenkins

Can one of the admins verify this patch?

@RogerDunn
Author

@AmplabJenkins, thank you for the speedy note on this patch. There is no urgency here: I marked it as a [WIP] PR because I'm currently reviewing this change (weighed against other alternatives) with another Spark developer. I'll keep the notes on this PR up to date. Thank you!

/**
 * A data source that is given the DataFrame's schema upon write operations.
 */
trait SchemaOnWriteDataSource extends TableProvider with DataSourceRegister {
Contributor

One option is to add this trait, and another option is to officially make FileDataSourceV2 a public developer API. Does this "schema-on-write" behavior make sense for non-file data sources?

cc @gengliangwang

Member

+1, making FileDataSourceV2 a public developer API seems simpler

@cloud-fan
Contributor

Schema inference is already part of the DS v2 API TableProvider, which is not bound to file sources. I think this feature, "skip schema inference on write", should not be bound to file sources either.

Instead of adding a marker trait, I think we can just reuse TableProvider.supportsExternalMetadata, which indicates that schema inference is expensive. If supportsExternalMetadata == true, DataFrameWriter.save should not infer the schema and should just pass the input schema to the TableProvider.
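
To make the proposed rule concrete, here is a minimal, self-contained sketch of the decision only. It does not use Spark's real classes: `Field`, `Schema`, `SimpleTableProvider`, `CountingProvider`, and `resolveWriteSchema` are hypothetical stand-ins invented for illustration, and the actual change in DataFrameWriter.save may look quite different.

```scala
// Stand-in types for illustration only; these are NOT Spark's real classes.
final case class Field(name: String, dataType: String)
final case class Schema(fields: Seq[Field])

// Hypothetical, simplified model of a table provider.
trait SimpleTableProvider {
  // Potentially expensive, e.g. scanning existing files to discover a schema.
  def inferSchema(options: Map[String, String]): Schema

  // True if the source accepts a schema supplied by the caller.
  def supportsExternalMetadata: Boolean = false
}

// Test double that records how often inference was requested.
final class CountingProvider(external: Boolean) extends SimpleTableProvider {
  var inferCalls: Int = 0

  override def inferSchema(options: Map[String, String]): Schema = {
    inferCalls += 1
    Schema(Seq(Field("value", "string")))
  }

  override def supportsExternalMetadata: Boolean = external
}

object WriteSchemaSketch {
  // Picks the schema handed to the provider on write: the input schema
  // when the provider accepts external metadata, otherwise the inferred one.
  def resolveWriteSchema(
      provider: SimpleTableProvider,
      inputSchema: Schema,
      options: Map[String, String]): Schema =
    if (provider.supportsExternalMetadata) inputSchema
    else provider.inferSchema(options)

  def main(args: Array[String]): Unit = {
    val input = Schema(Seq(Field("id", "long"), Field("name", "string")))
    val provider = new CountingProvider(external = true)
    val chosen = resolveWriteSchema(provider, input, Map.empty)
    println(s"schema=$chosen, inferCalls=${provider.inferCalls}")
  }
}
```

In this model the provider opts in through a single boolean, so no extra marker trait is needed, and a provider that leaves the default `false` keeps the inference path.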

@RogerDunn
Author

@cloud-fan Your idea sounds just right. Are you proposing to make that change in the work you're already doing (in which case I'll remove this PR)?

@cloud-fan
Contributor

@gengliangwang is creating a PR to implement this idea.

@gengliangwang
Member

I have created #30273 for this.

@RogerDunn
Author

@gengliangwang your PR is better, thank you!

RogerDunn closed this on Nov 6, 2020