Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-38236][SQL] Check if table location is absolute by "new Path(locationUri).isAbsolute" in create/alter table #35462

Closed
wants to merge 7 commits into from

Conversation

bozhang2820
Copy link
Contributor

@bozhang2820 bozhang2820 commented Feb 9, 2022

What changes were proposed in this pull request?

After #28527, we change to create table under the database location when the table location is relative. However the criteria to determine if a table location is relative/absolute is URI.isAbsolute, which basically checks if the table location URI has a scheme defined. So table URIs like /table/path are treated as relative and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is s3a://bucket/db, the table will be created at s3a://bucket/table/path, while it should be created under the file system defined in SessionCatalog.hadoopConf instead.

This change fixes that by treating table location as absolute when the first letter of its path is slash.

This also applies to alter table.

Why are the changes needed?

This is to fix the behavior described above.

Does this PR introduce any user-facing change?

Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in SessionCatalog.hadoopConf, instead of the one defined in the database location URI.

How was this patch tested?

Updated unit tests.

@github-actions github-actions bot added the SQL label Feb 9, 2022
@bozhang2820 bozhang2820 changed the title [SPARK-31709][SQL][FOLLOWUP] Treat table location as absolute when the first letter of its path is slash in create/alter table [WIP][SPARK-31709][SQL][FOLLOWUP] Treat table location as absolute when the first letter of its path is slash in create/alter table Feb 9, 2022
@@ -388,6 +388,8 @@ class SessionCatalog(
private def makeQualifiedTablePath(locationUri: URI, database: String): URI = {
if (locationUri.isAbsolute) {
locationUri
} else if (locationUri.getPath.startsWith("/")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use new Path(locationUri).isAbsolute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Will change to follow your suggestion.

@bozhang2820 bozhang2820 changed the title [WIP][SPARK-31709][SQL][FOLLOWUP] Treat table location as absolute when the first letter of its path is slash in create/alter table [SPARK-31709][SQL][FOLLOWUP] Treat table location as absolute when the first letter of its path is slash in create/alter table Feb 16, 2022
@bozhang2820
Copy link
Contributor Author

CC @yaooqinn @cloud-fan for a review.

@@ -53,15 +53,11 @@ import org.apache.spark.sql.types._
import org.apache.spark.sql.v2.avro.AvroScan
import org.apache.spark.util.Utils

abstract class AvroSuite
abstract class AvroSuiteBase
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to touch this suite? The fix is for catalog stuff and is not related to the file source.

Copy link
Contributor Author

@bozhang2820 bozhang2820 Feb 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After running the test in CommonFileDataSourceSuite, org.apache.hadoop.fs.Path.getFileSystem() in other tests in the same suite will fail the check in FakeFileSystemRequiringDSOption, since they share the same SparkSession.

That's why I split the suites that extends CommonFileDataSourceSuite into separate ones in this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it because org.apache.hadoop.fs.Path.getFileSystem() has cache?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because SparkContext.hadoopConfiguration is passed to org.apache.hadoop.fs.Path.getFileSystem() and that is shared across tests in the same suite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but certain test case can reset the hadoop configs it sets to avoid polluting the shared environment, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test in CommonFileDataSourceSuite actually uses withSQLConf which does reset the config. However SessionCatalog.hadoopConf are set and could not be changed unless we create a new SessionCatalog

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix LGTM, just one question for the test changes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @bozhang2820 .
In general, FOLLOWUP-style PR are more proper when the original PR and the follow-up PR is in the same release. SPARK-31709 was released long time ago at 3.1.0 (Actually, on March 2, 2021, as 3.1.1). In addition, your contribution is worth of tracking by a separate JIRA issue instead of a follow-up style.

@bozhang2820 bozhang2820 changed the title [SPARK-31709][SQL][FOLLOWUP] Treat table location as absolute when the first letter of its path is slash in create/alter table [SPARK-38236][SQL] Treat table location as absolute when the first letter of its path is slash in create/alter table Feb 17, 2022
@bozhang2820
Copy link
Contributor Author

Thanks @dongjoon-hyun for the guidance. Created https://issues.apache.org/jira/browse/SPARK-38236 and updated the title.

@@ -68,6 +68,8 @@ abstract class AvroSuite

override protected def beforeAll(): Unit = {
super.beforeAll()
// initialize SessionCatalog here so it has a clean hadoopConf
spark.sessionState.catalog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we do it in SharedSparkSessionBase.initializeSession?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it but figured that might need a separate PR due to the impact?

@cloud-fan
Copy link
Contributor

cloud-fan commented Feb 21, 2022

thanks, merging to master!

@cloud-fan cloud-fan closed this in b71b917 Feb 21, 2022
@cloud-fan
Copy link
Contributor

@bozhang2820 can you help to create backport PRs for 3.2 and 3.1?

bozhang2820 added a commit to bozhang2820/spark that referenced this pull request Feb 21, 2022
…tter of its path is slash in create/alter table

After apache#28527, we change to create table under the database location when the table location is relative. However the criteria to determine if a table location is relative/absolute is `URI.isAbsolute`, which basically checks if the table location URI has a scheme defined. So table URIs like `/table/path` are treated as relative and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is `s3a://bucket/db`, the table will be created at `s3a://bucket/table/path`, while it should be created under the file system defined in `SessionCatalog.hadoopConf` instead.

This change fixes that by treating table location as absolute when the first letter of its path is slash.

This also applies to alter table.

This is to fix the behavior described above.

Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in `SessionCatalog.hadoopConf`, instead of the one defined in the database location URI.

Updated unit tests.

Closes apache#35462 from bozhang2820/spark-31709.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@bozhang2820
Copy link
Contributor Author

bozhang2820 commented Feb 21, 2022

Created #35591 for 3.1 and 3.2.

@dongjoon-hyun
Copy link
Member

+1, late LGTM.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2

@bozhang2820 bozhang2820 deleted the spark-31709 branch February 22, 2022 03:14
@bozhang2820 bozhang2820 changed the title [SPARK-38236][SQL] Treat table location as absolute when the first letter of its path is slash in create/alter table [SPARK-38236][SQL] Check if table location is absolute by "new Path(locationUri).isAbsolute" in create/alter table Feb 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants