
[SPARK-41970][SQL][FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation #39808

Conversation

databricks-david-lewis
Contributor

What changes were proposed in this pull request?

This PR reverts the `SparkPath` changes to `FileIndex` and `FileRelation` because they provided little benefit to open-source Spark, while those interfaces are widely used extension points for other open-source projects. For the 3.4.0 release we want to preserve this kind of binary compatibility.

That said, we reserve the right to make this change for Spark 4.0.

Why are the changes needed?

Revert `inputFiles: Array[SparkPath]` back to `inputFiles: Array[String]`, with an explicit comment that the strings are expected to be URL-encoded.
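For context, a minimal sketch of the reverted shape. This is not the actual Spark source: the trait name and the `allFiles()` signature are illustrative stand-ins, only the `inputFiles: Array[String]` member and the URL-encoding note reflect what this PR reverts to.

```scala
// Minimal sketch, NOT the actual Spark trait: FileIndexSketch and allFiles()
// are stand-ins used only to show the reverted inputFiles signature.
import org.apache.hadoop.fs.FileStatus

trait FileIndexSketch {
  /** All file statuses this index covers. */
  def allFiles(): Seq[FileStatus]

  /**
   * Input file paths as plain strings.
   * Note: the returned strings are expected to be URL-encoded.
   */
  def inputFiles: Array[String] =
    allFiles().map(_.getPath.toUri.toString).toArray
}
```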

Does this PR introduce any user-facing change?

This is to revert an internal interface change.

How was this patch tested?

Existing unit tests.

@databricks-david-lewis databricks-david-lewis changed the title [SC-][FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation [SPARK-41970][FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation Jan 30, 2023
Member

@dongjoon-hyun dongjoon-hyun left a comment


Is this a partial revert, @databricks-david-lewis?

cc @cloud-fan and @HyukjinKwon and @xinrong-meng

@databricks-david-lewis
Contributor Author

Yes it is, @dongjoon-hyun.

Before (SparkPath-typed):

    override def inputFiles: Array[SparkPath] =
      allFiles().map(SparkPath.fromFileStatus).toArray

After (reverted to URL-encoded strings):

    override def inputFiles: Array[String] =
      allFiles().map(fs => SparkPath.fromFileStatus(fs).urlEncoded).toArray
Member


So, this part is not reverting to the original code, `allFiles().map(_.getPath.toUri.toString).toArray`, right?

Contributor Author


Ah, good point. It is not exactly a revert; it reuses the SparkPath code to translate from a Hadoop FileStatus to a URL-encoded string.
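For reference, a minimal sketch of the two expressions being compared. It assumes, per the comment above, that `SparkPath.fromFileStatus(fs).urlEncoded` yields the same URL-encoded string as `fs.getPath.toUri.toString`; the `statusOf` helper is hypothetical and exists only for this example, and the `org.apache.spark.paths` package name is an assumption.

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.paths.SparkPath // package name assumed

// Hypothetical helper: wrap a path string in a FileStatus for illustration only.
def statusOf(p: String): FileStatus = {
  val status = new FileStatus()
  status.setPath(new Path(p))
  status
}

val fs = statusOf("hdfs://nn/data/my file.csv")

// Original (pre-SPARK-41970) expression: Hadoop's Path/URI does the encoding,
// e.g. "hdfs://nn/data/my%20file.csv".
val original: String = fs.getPath.toUri.toString

// Expression after this follow-up: the result type is still String,
// but the FileStatus-to-string translation goes through SparkPath.
val reverted: String = SparkPath.fromFileStatus(fs).urlEncoded
```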

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (Pending CIs). Thank you for reducing the breaking changes.

@dongjoon-hyun
Member

Also, cc @sunchao

@HyukjinKwon HyukjinKwon changed the title [SPARK-41970][FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation [SPARK-41970][SQL[[FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation Jan 31, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-41970][SQL[[FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation [SPARK-41970][SQL][FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation Jan 31, 2023
@HyukjinKwon
Member

Merged to master and branch-3.4.

HyukjinKwon pushed a commit that referenced this pull request Jan 31, 2023
…d FileRelation

### What changes were proposed in this pull request?
This PR reverts the `SparkPath` changes to `FileIndex` and `FileRelation` because they provided little benefit to Open Source Spark, but are widely used extension points for other open source projects. For the 3.4.0 release we want to preserve this type of binary compatibility.

That said, we reserve the right to make this change for Spark 4.0.

### Why are the changes needed?
Revert `inputFiles: Array[SparkPath]` back to `inputFiles: Array[String]`, with an explicit comment that the strings are expected to be url-encoded.

### Does this PR introduce _any_ user-facing change?
This is to revert an internal interface change.

### How was this patch tested?
Existing unit tests.

Closes #39808 from databricks-david-lewis/SPARK_PATH_FOLLOWUP.

Authored-by: David Lewis <david.lewis@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3887e71)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@cloud-fan
Contributor

late LGTM

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023