
[SPARK-31709][SQL] Proper base path for database/table location when it is a relative path #28527

Closed · wants to merge 4 commits

Conversation

@yaooqinn (Member) commented May 14, 2020

What changes were proposed in this pull request?

Currently, the user home directory is used as the base path for database and table locations when their locations are specified as relative paths, e.g.

> set spark.sql.warehouse.dir;
spark.sql.warehouse.dir	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/
spark-sql> create database loctest location 'loctestdbdir';

spark-sql> desc database loctest;
Database Name	loctest
Comment
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Owner	kentyao

spark-sql> create table loctest(id int) location 'loctestdbdir';
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	default
Table	loctest
Owner	kentyao
Created Time	Thu May 14 16:29:05 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

The user home directory is not always warehouse-related, cannot be changed at runtime, and is shared by both databases and tables as the parent directory. Meanwhile, we already use the table path as the parent directory for relative partition locations.
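For context, a hedged illustration of that existing partition behavior (the table name, columns, and paths below are made up for the example):

```sql
-- Hypothetical example: a relative partition LOCATION already resolves
-- under the table's own path rather than the user home directory.
CREATE TABLE events (id INT, dt STRING) USING parquet
  PARTITIONED BY (dt) LOCATION '/tmp/events';
ALTER TABLE events ADD PARTITION (dt = '2020-05-14') LOCATION 'dt=2020-05-14';
-- expected partition location: file:/tmp/events/dt=2020-05-14
```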

The config spark.sql.warehouse.dir represents the default location for managed databases and tables.
For databases, the case above does not follow these semantics, because spark.sql.warehouse.dir should be used as the base path instead.

For tables, the current behavior seems right, but I suggest enriching the meaning so that it also serves as the base path for external tables whose locations are relative.

With the changes in this PR:

  • The location of a database will be warehouseDir/dbPath when dbPath is relative.
  • The location of a table will be dbPath/tblPath when tblPath is relative.
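A minimal sketch of this resolution rule, assuming Hadoop's Path/FileSystem APIs; the object and method names below are illustrative only, not the code added by this PR:

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative sketch only; names are made up for the example.
object RelativeLocationSketch {
  private val hadoopConf = new Configuration()

  // Fully qualify a URI against the file system it belongs to.
  private def qualify(uri: URI): URI = {
    val path = new Path(uri)
    path.getFileSystem(hadoopConf).makeQualified(path).toUri
  }

  // A relative database location resolves against the warehouse directory.
  def databaseLocation(warehouseDir: String, dbPath: URI): URI =
    if (dbPath.isAbsolute) dbPath
    else qualify(new Path(warehouseDir, dbPath.toString).toUri)

  // A relative table location resolves against its database's location.
  def tableLocation(dbLocation: URI, tblPath: URI): URI =
    if (tblPath.isAbsolute) tblPath
    else qualify(new Path(new Path(dbLocation), tblPath.toString).toUri)
}
```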

Why are the changes needed?

Bugfix and improvement.

Firstly, the databases with relative locations should be created under the default location specified by spark.sql.warehouse.dir.

Secondly, the external tables with relative paths may also follow this behavior for consistency.

Finally, the behavior of choosing base paths for databases, tables, and partitions with relative locations should be consistent.

Does this PR introduce any user-facing change?

Yes. This PR changes the createDatabase, alterDatabase, createTable and alterTable APIs and the related DDL commands. If the LOCATION clause is given a relative path, the base path will be spark.sql.warehouse.dir for databases, and spark.sql.warehouse.dir/dbPath for tables.

For example, after this change:

spark-sql> desc database loctest;
Database Name	loctest
Comment
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest
Owner	kentyao
spark-sql> use loctest;
spark-sql> create table loctest(id int) location 'loctest';
20/05/14 18:18:02 WARN InMemoryFileIndex: The directory file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/loctest was not found. Was it deleted very recently?
20/05/14 18:18:02 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	loctest
Table	loctest
Owner	kentyao
Created Time	Thu May 14 18:18:03 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
spark-sql> alter table loctest set location 'loctest2';
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	loctest
Table	loctest
Owner	kentyao
Created Time	Thu May 14 18:18:03 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest2
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

How was this patch tested?

Added unit tests.

@yaooqinn changed the title from "[SPARK-31709][SQL] Proper base path for location when it is a relative path" to "[SPARK-31709][SQL] Proper base path for database/table location when it is a relative path" on May 14, 2020
@SparkQA commented May 14, 2020

Test build #122612 has finished for PR 28527 at commit 93c19e4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 14, 2020

Test build #122614 has started for PR 28527 at commit 7d50c17.

@yaooqinn (Member Author): retest this please

@SparkQA commented May 14, 2020

Test build #122622 has finished for PR 28527 at commit 7d50c17.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author): retest this please

@SparkQA commented May 15, 2020

Test build #122639 has finished for PR 28527 at commit 7d50c17.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author): retest this please

@yaooqinn (Member Author): retest this please

@SparkQA commented May 16, 2020

Test build #122690 has finished for PR 28527 at commit 7d50c17.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author): retest this please

@@ -231,7 +239,8 @@ class SessionCatalog(
   def alterDatabase(dbDefinition: CatalogDatabase): Unit = {
     val dbName = formatDatabaseName(dbDefinition.name)
     requireDbExists(dbName)
-    externalCatalog.alterDatabase(dbDefinition.copy(name = dbName))
+    externalCatalog.alterDatabase(dbDefinition.copy(
+      name = dbName, locationUri = makeQualifiedDBPath(dbDefinition.locationUri)))
Member: We need to update the location uri here?

@yaooqinn (Member Author): We have ALTER (DATABASE|SCHEMA) database_name SET LOCATION path syntax that works for hive 3.x
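For reference, a hedged example of that syntax (the database name comes from the PR description; the path is illustrative):

```sql
-- Illustrative only: repoint the loctest database at a new location.
ALTER DATABASE loctest SET LOCATION '/tmp/loctest_new';
```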

Member: Ah, I see.

@maropu (Member) commented May 17, 2020: retest this please

@SparkQA commented May 18, 2020

Test build #122764 has finished for PR 28527 at commit 7d50c17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author): cc @cloud-fan @dongjoon-hyun too, thanks

@maropu (Member) left a comment: Looks fine, thanks, @yaooqinn

@SparkQA commented May 19, 2020

Test build #122827 has finished for PR 28527 at commit 3fbe65a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented May 19, 2020: retest this please

@SparkQA commented May 20, 2020

Test build #122859 has finished for PR 28527 at commit 3fbe65a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented May 29, 2020: retest this please

@maropu (Member) commented May 29, 2020: cc: @cloud-fan

@SparkQA commented May 29, 2020

Test build #123270 has finished for PR 28527 at commit 3fbe65a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 31, 2020

Test build #126862 has finished for PR 28527 at commit 3fbe65a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor): retest this please

@cloud-fan (Contributor): can you rebase/merge with master, to trigger github actions?

@SparkQA commented Jul 31, 2020

Test build #126876 has finished for PR 28527 at commit 3fbe65a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 31, 2020

Test build #126881 has finished for PR 28527 at commit b471a0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author) commented Aug 3, 2020: gentle ping @cloud-fan

      locationUri
    } else {
      val dbName = formatDatabaseName(database)
      val dbLocation = makeQualifiedDBPath(getDatabaseMetadata(dbName).locationUri)
Contributor: I'm a bit concerned about it as it adds an extra database lookup. Is it better to push this work to the underlying external catalog?

@yaooqinn (Member Author): Do you mean that we are calling requireDbExists(dbName) repeatedly?

Contributor: Ideally, createTable should only make one RPC to the Hive metastore. requireDbExists is one problem, but we can simply remove it. However, the new database lookup does not seem easy to remove.

@yaooqinn (Member Author):

> However, the new database lookup does not seem easy to remove.

Yes, it seems to make no difference even if we move this into the external catalogs; we still have to call the getDatabase API to get the actual path of the database for this particular case.

Contributor: What happens if we don't qualify the path here and leave it to the hive metastore? Will it still be a relative path in the hive metastore?

@yaooqinn (Member Author): FYI, #17254

Contributor: ok we did the same thing for partition. LGTM then

@cloud-fan (Contributor): thanks, merging to master!

@cloud-fan closed this in 3deb59d Aug 3, 2020
cloud-fan pushed a commit that referenced this pull request Feb 21, 2022
…tter of its path is slash in create/alter table

### What changes were proposed in this pull request?
After #28527, tables are created under the database location when the table location is relative. However, the criterion used to decide whether a table location is relative or absolute is `URI.isAbsolute`, which basically checks whether the table location URI has a scheme defined. So table URIs like `/table/path` are treated as relative, and the scheme and authority of the database location URI are used to create the table. For example, when the database location URI is `s3a://bucket/db`, the table will be created at `s3a://bucket/table/path`, while it should be created under the file system defined in `SessionCatalog.hadoopConf` instead.

This change fixes that by treating a table location as absolute when the first character of its path is a slash.

This also applies to alter table.
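To illustrate the distinction the fix relies on, here is a minimal runnable sketch (not part of the commit; the example paths are made up):

```scala
import java.net.URI

import org.apache.hadoop.fs.Path

object AbsoluteCheck extends App {
  // java.net.URI.isAbsolute only checks whether a scheme is present,
  // so a scheme-less location such as "/table/path" looks "relative".
  println(new URI("/table/path").isAbsolute)              // false
  println(new URI("s3a://bucket/table/path").isAbsolute)  // true

  // Hadoop's Path.isAbsolute inspects the path component instead,
  // so a leading slash is enough to count as absolute.
  println(new Path("/table/path").isAbsolute)             // true
  println(new Path("table/path").isAbsolute)              // false
}
```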

### Why are the changes needed?
This is to fix the behavior described above.

### Does this PR introduce _any_ user-facing change?
Yes. When users try to create/alter a table with a location that starts with a slash but without a scheme defined, the table will be created under/altered to the file system defined in `SessionCatalog.hadoopConf`, instead of the one defined in the database location URI.

### How was this patch tested?
Updated unit tests.

Closes #35462 from bozhang2820/spark-31709.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
bozhang2820 added a commit to bozhang2820/spark that referenced this pull request Feb 21, 2022
…tter of its path is slash in create/alter table
cloud-fan pushed a commit that referenced this pull request Feb 24, 2022
…new Path(locationUri).isAbsolute" in create/alter table


Closes #35591 from bozhang2820/spark-31709-3.2.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Feb 24, 2022
…new Path(locationUri).isAbsolute" in create/alter table

(cherry picked from commit 915f0cc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…new Path(locationUri).isAbsolute" in create/alter table
