[SPARK-38236][SQL] Check if table location is absolute by "new Path(locationUri).isAbsolute" in create/alter table #35462
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Force-pushed from c514c47 to 67c2318.
@@ -388,6 +388,8 @@ class SessionCatalog(
   private def makeQualifiedTablePath(locationUri: URI, database: String): URI = {
     if (locationUri.isAbsolute) {
       locationUri
+    } else if (locationUri.getPath.startsWith("/")) {
Use new Path(locationUri).isAbsolute?
Thanks! Will change to follow your suggestion.
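The distinction the reviewer is pointing at can be illustrated with plain java.net.URI. This is a simplified, JDK-only sketch: the actual fix uses Hadoop's new Path(locationUri).isAbsolute, and isAbsoluteLocation below is a made-up stand-in, not Spark code.

```scala
import java.net.URI

// java.net.URI.isAbsolute only checks for a scheme component, so a
// rooted path without a scheme is reported as "relative".
val withScheme = new URI("s3a://bucket/db")
val rootedNoScheme = new URI("/table/path")

assert(withScheme.isAbsolute)    // has the "s3a" scheme
assert(!rootedNoScheme.isAbsolute) // no scheme, even though the path is rooted

// JDK-only approximation of what new Path(uri).isAbsolute additionally
// treats as absolute: a URI whose path component starts with "/".
def isAbsoluteLocation(u: URI): Boolean =
  u.isAbsolute || Option(u.getPath).exists(_.startsWith("/"))

assert(isAbsoluteLocation(rootedNoScheme))
assert(!isAbsoluteLocation(new URI("rel/path")))
```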
CC @yaooqinn @cloud-fan for a review.
@@ -53,15 +53,11 @@ import org.apache.spark.sql.types._
 import org.apache.spark.sql.v2.avro.AvroScan
 import org.apache.spark.util.Utils

-abstract class AvroSuite
+abstract class AvroSuiteBase
why do we need to touch this suite? The fix is for catalog stuff and is not related to the file source.
After running the test in CommonFileDataSourceSuite, org.apache.hadoop.fs.Path.getFileSystem() in other tests in the same suite will fail the check in FakeFileSystemRequiringDSOption, since they share the same SparkSession. That's why I split the suites that extend CommonFileDataSourceSuite into separate ones in this change.
Is it because org.apache.hadoop.fs.Path.getFileSystem() has a cache?
It's because SparkContext.hadoopConfiguration is passed to org.apache.hadoop.fs.Path.getFileSystem(), and that is shared across tests in the same suite.
But a test case can reset the Hadoop configs it sets to avoid polluting the shared environment, right?
The test in CommonFileDataSourceSuite actually uses withSQLConf, which does reset the config. However, SessionCatalog.hadoopConf is set once and cannot be changed unless we create a new SessionCatalog.
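The sharing problem described in this thread can be modeled with a toy sketch. This is not Spark internals — SharedConf and ToyCatalog are made-up names — it only shows why a configuration reference captured at construction lets one test's settings remain visible to later tests that reuse the same catalog.

```scala
import scala.collection.mutable

// A mutable config shared by every test in a suite,
// playing the role of SparkContext.hadoopConfiguration.
class SharedConf { val settings = mutable.Map.empty[String, String] }

// A catalog-like class that keeps a reference to the shared config,
// captured once at construction (like SessionCatalog.hadoopConf).
class ToyCatalog(conf: SharedConf) {
  def get(key: String): Option[String] = conf.settings.get(key)
}

val conf = new SharedConf
val catalog = new ToyCatalog(conf) // created before any test runs

// One test mutates the shared config...
conf.settings("fs.defaultFS") = "fake://fs"

// ...and the long-lived catalog sees the polluted value; there is no
// way to "reset" what the catalog reads without building a new one.
assert(catalog.get("fs.defaultFS").contains("fake://fs"))
```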
The fix LGTM, just one question for the test changes.
Hi, @bozhang2820.
In general, FOLLOWUP-style PRs are more proper when the original PR and the follow-up PR are in the same release. SPARK-31709 was released a long time ago in 3.1.0 (actually on March 2, 2021, as 3.1.1). In addition, your contribution is worth tracking with a separate JIRA issue instead of a follow-up.
Thanks @dongjoon-hyun for the guidance. Created https://issues.apache.org/jira/browse/SPARK-38236 and updated the title.
@@ -68,6 +68,8 @@ abstract class AvroSuite

   override protected def beforeAll(): Unit = {
     super.beforeAll()
+    // initialize SessionCatalog here so it has a clean hadoopConf
+    spark.sessionState.catalog
Shall we do it in SharedSparkSessionBase.initializeSession?
I thought about it but figured that might need a separate PR due to the impact?
Thanks, merging to master!
@bozhang2820 can you help to create backport PRs for 3.2 and 3.1?
…tter of its path is slash in create/alter table

After apache#28527, we change to create table under the database location when the table location is relative. However the criteria to determine if a table location is relative/absolute is `URI.isAbsolute`, which basically checks if the table location URI has a scheme defined. So table URIs like `/table/path` are treated as relative and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is `s3a://bucket/db`, the table will be created at `s3a://bucket/table/path`, while it should be created under the file system defined in `SessionCatalog.hadoopConf` instead.

This change fixes that by treating table location as absolute when the first letter of its path is slash. This also applies to alter table.

This is to fix the behavior described above.

Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in `SessionCatalog.hadoopConf`, instead of the one defined in the database location URI.

Updated unit tests.

Closes apache#35462 from bozhang2820/spark-31709.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Created #35591 for 3.1 and 3.2.
+1, late LGTM.
LGTM2
What changes were proposed in this pull request?
After #28527, we changed to create tables under the database location when the table location is relative. However, the criterion used to determine whether a table location is relative or absolute is URI.isAbsolute, which basically checks whether the table location URI has a scheme defined. So table URIs like /table/path are treated as relative, and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is s3a://bucket/db, the table will be created at s3a://bucket/table/path, while it should be created under the file system defined in SessionCatalog.hadoopConf instead.

This change fixes that by treating a table location as absolute when the first letter of its path is a slash. This also applies to alter table.

Why are the changes needed?
This is to fix the behavior described above.

Does this PR introduce any user-facing change?
Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in SessionCatalog.hadoopConf, instead of the one defined in the database location URI.

How was this patch tested?
Updated unit tests.
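The intended behavior can be summarized with a JDK-only sketch. This is a hedged simplification, not the real code: SessionCatalog.makeQualifiedTablePath uses org.apache.hadoop.fs.Path and the Hadoop configuration, whereas qualifyTablePath and the defaultFs parameter below are made up for illustration.

```scala
import java.net.URI

// Sketch of the qualification rules this PR argues for (assumed names):
def qualifyTablePath(location: URI, dbLocation: URI, defaultFs: String): URI =
  if (location.isAbsolute) {
    // already has a scheme: keep as-is
    location
  } else if (Option(location.getPath).exists(_.startsWith("/"))) {
    // rooted but scheme-less: qualify with the default file system,
    // NOT with the database location's scheme and authority
    new URI(defaultFs + location.getPath)
  } else {
    // truly relative: resolve under the database location
    dbLocation.resolve(location)
  }

val db = new URI("s3a://bucket/db/")
// rooted path goes to the default FS, not s3a://bucket/table/path
assert(qualifyTablePath(new URI("/table/path"), db, "hdfs://nn") ==
  new URI("hdfs://nn/table/path"))
// a genuinely relative path still lands under the database location
assert(qualifyTablePath(new URI("rel/path"), db, "hdfs://nn") ==
  new URI("s3a://bucket/db/rel/path"))
```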