
[SPARK-19667][SQL]create table with hiveenabled in default database use warehouse path instead of the location of default database #17001

Closed
wants to merge 31 commits

Conversation

windpiger
Contributor

What changes were proposed in this pull request?

Currently, when we create a managed table with Hive support enabled in the default database, Spark uses the location of the default database as the table's location. This is fine with a non-shared metastore.

But consider a metastore shared between different clusters, for example:

  1. There is a Hive metastore in Cluster-A, and the metastore uses a remote MySQL database as its backing store. The default database is created in this metastore, so its location is a path on Cluster-A.

  2. We then create another cluster, Cluster-B, which uses the same remote MySQL database for its metastore, so the default database configuration in Cluster-B is loaded from MySQL, and its location is still a path on Cluster-A.

  3. When we then create a table in the default database on Cluster-B, it throws an exception: UnknownHost Cluster-A.

In Hive 2.0.0, it is allowed to create a table in a default database that is shared between clusters; this is not allowed in other databases, only in default.

As Spark users, we want the same behavior as Hive, so that we can create tables in the default database while using a shared MySQL database for the metastore.
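
For illustration, a minimal Scala sketch of the proposed rule (defaultTablePath is a hypothetical helper sketched against the Spark 2.x internals of the time, not the exact code changed by this PR):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.catalog.CatalogDatabase

object DefaultDbLocationSketch {
  // Resolve the location of a managed table: tables in the default database
  // resolve against the warehouse path of the current job, while tables in
  // any other database keep resolving against the stored database location.
  def defaultTablePath(db: CatalogDatabase, table: String, warehousePath: String): String = {
    val base = if (db.name == "default") warehousePath else db.locationUri
    new Path(new Path(base), table).toString
  }
}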

How was this patch tested?

A unit test was added.

@windpiger windpiger changed the title from [SPARK-19667][SQL][WIP]create table with hiveenabled in default database se warehouse path instead of the location of default database to [SPARK-19667][SQL][WIP]create table with hiveenabled in default database use warehouse path instead of the location of default database Feb 20, 2017
@SparkQA

SparkQA commented Feb 20, 2017

Test build #73171 has finished for PR 17001 at commit 825c0ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73201 has finished for PR 17001 at commit a2c9168.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger windpiger changed the title from [SPARK-19667][SQL][WIP]create table with hiveenabled in default database use warehouse path instead of the location of default database to [SPARK-19667][SQL]create table with hiveenabled in default database use warehouse path instead of the location of default database Feb 21, 2017
@windpiger
Contributor Author

cc @gatorsmile @cloud-fan

@cloud-fan
Contributor

I'd like to treat this as a workaround; the location of the default database is still invalid in Cluster-B.

We can make this logic clearer and more consistent: the default database should not have a location. When we try to get the location of the default DB, we should use the warehouse path.
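
A hedged sketch of that idea (getDatabaseInternal and warehousePath stand in for catalog internals; this mirrors the shape of a diff that shows up later in this thread, not the merged code):

import org.apache.spark.sql.catalyst.catalog.CatalogDatabase

trait DefaultDbLookupSketch {
  protected def warehousePath: String
  protected def getDatabaseInternal(db: String): CatalogDatabase

  // On every lookup, ignore whatever location is stored for the default
  // database and substitute the warehouse path of the current cluster/job.
  final def getDatabase(db: String): CatalogDatabase = {
    val database = getDatabaseInternal(db)
    if (db == "default") database.copy(locationUri = warehousePath) else database
  }
}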

@windpiger
Contributor Author

Agreed. I handle the logic in createDatabase/getDatabase in HiveClientImpl.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73260 has finished for PR 17001 at commit bacd528.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73261 has finished for PR 17001 at commit 3f6e061.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

client.createDatabase(
  new HiveDatabase(
    database.name,
    database.description,
-   database.locationUri,
+   if (database.name == SessionCatalog.DEFAULT_DATABASE) "" else database.locationUri,
Member

If it is empty, the metastore will set it for us, right?

Contributor Author

Sorry, actually it will throw an exception. My local default database had already been created, so it did not hit the exception. I will just replace the default database location when reloading from the metastore, and drop the logic that sets the location to an empty string when creating the database.

@gatorsmile
Member

In Hive 2.0.0, it is allowed to create a table in a default database that is shared between clusters

Are you able to find the specific Hive JIRA for this?

@windpiger
Contributor Author

HIVE-1537 is the related JIRA (with its PR), but it does not seem to have related comments.

@gatorsmile
Member

gatorsmile commented Feb 22, 2017

Personally, I think we should improve the test case. Instead of doing it in HiveDDLSuite, we can do it in HiveSparkSubmitSuite.scala. Basically, when using the same metastore, you just need to verify whether the table location depends on spark.sql.warehouse.dir. Then you do not need to introduce the extra flag TEST_HIVE_CREATETABLE_DEFAULTDB_USEWAREHOUSE_PATH. Here is a test case you can follow: #16388
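
For context, the shape of such a spark-submit-based check (a sketch assembled from snippets quoted later in this thread; the object name SPARK_19667_CREATE_TABLE does appear below, but the body here is illustrative, not the test that was actually added):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

object SPARK_19667_CREATE_TABLE {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    try {
      val warehousePath = s"file:${spark.sharedState.warehousePath.stripSuffix("/")}"
      spark.sql("CREATE TABLE t(a string)")
      val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
      // A managed table in the default database should land under the
      // warehouse path configured for this job, not under the location
      // stored for the default database in the shared metastore.
      assert(table.location.toString.stripSuffix("/") == s"$warehousePath/t")
    } finally {
      spark.stop()
    }
  }
}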

@windpiger
Contributor Author

great, I will take a look at it~

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73263 has finished for PR 17001 at commit 96dcc7d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73270 has finished for PR 17001 at commit 58a0020.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

windpiger commented Feb 22, 2017

BucketedWriteWithoutHiveSupportSuite passes locally; let me find out why it failed on Jenkins.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73273 has finished for PR 17001 at commit 1dce2d7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73274 has finished for PR 17001 at commit 12f81d3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73656 has finished for PR 17001 at commit 096ae63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -30,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions.Expression
 *
 * Implementations should throw [[NoSuchDatabaseException]] when databases don't exist.
 */
-abstract class ExternalCatalog {
+abstract class ExternalCatalog(conf: SparkConf, hadoopConf: Configuration) {
Contributor

How about we just pass in a defaultDB: CatalogDatabase? Then we don't need to add the protected def warehousePath: String.

Contributor Author

I think conf/hadoopConf is more useful; later logic can use them, and the subclasses also have these two confs.

Contributor

We still have conf/hadoopConf in InMemoryCatalog and HiveExternalCatalog; we can just add one more parameter.

Contributor Author

If we pass a defaultDB, it seems like we introduce an instance of defaultDB, as we discussed above.

Contributor

But it will only be used in getDatabase, and we can save a metastore call to get the default database.

Contributor Author

ok~ let me fix it~

Contributor Author

@windpiger windpiger Mar 2, 2017

@cloud-fan I found that if we add a defaultDB parameter to ExternalCatalog and its subclasses InMemoryCatalog and HiveExternalCatalog, the change will require a lot of related code to be modified, such as test cases and other places that create InMemoryCatalog and HiveExternalCatalog.

For example:

Currently, all the parameters of InMemoryCatalog have default values:

class InMemoryCatalog(conf: SparkConf = new SparkConf, hadoopConfig: Configuration = new Configuration)

so we can create it without any parameters. But if we add a defaultDB, we would have to construct a defaultDB as its default value, and we cannot construct a legal defaultDB because we cannot get the warehouse path for it, e.g.:

class InMemoryCatalog(
    conf: SparkConf = new SparkConf,
    hadoopConfig: Configuration = new Configuration,
    defaultDB: CatalogDatabase =
      CatalogDatabase("default", "", "${can not get the warehouse path}", Map.empty))

If we don't provide a default value for defaultDB, it will cause even more code changes, which I don't think is proper.

What about keeping the protected def warehousePath in ExternalCatalog, and adding:

lazy val defaultDB = {
  val qualifiedWarehousePath =
    SessionCatalog.makeQualifiedPath(warehousePath, hadoopConf).toString
  CatalogDatabase("default", "", qualifiedWarehousePath, Map.empty)
}

This can also avoid the getDatabase call to the metastore.

Contributor Author

I have modified the code by adding

lazy val defaultDB = {
  val qualifiedWarehousePath =
    SessionCatalog.makeQualifiedPath(warehousePath, hadoopConf).toString
  CatalogDatabase("default", "", qualifiedWarehousePath, Map.empty)
}

in ExternalCatalog

If it is not OK, I will revert it, thanks~

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73755 has finished for PR 17001 at commit badd61b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73761 has finished for PR 17001 at commit 35d2b59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
  try {
    val warehousePath = s"file:${spark.sharedState.warehousePath.stripSuffix("/")}"
Contributor Author

I am making this modification.

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73827 has started for PR 17001 at commit e3a467e.

@@ -74,7 +88,17 @@ abstract class ExternalCatalog {
 */
def alterDatabase(dbDefinition: CatalogDatabase): Unit

-def getDatabase(db: String): CatalogDatabase
+final def getDatabase(db: String): CatalogDatabase = {
+  val database = getDatabaseInternal(db)
Contributor

put this in the else branch
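
An illustrative sketch of the suggestion, as a fragment inside ExternalCatalog (it assumes the defaultDB lazy val and the getDatabaseInternal method discussed above; not necessarily the exact merged code):

final def getDatabase(db: String): CatalogDatabase = {
  if (db == SessionCatalog.DEFAULT_DATABASE) {
    // Serve the default database from the precomputed value; no metastore call.
    defaultDB
  } else {
    // Only hit the underlying catalog for non-default databases.
    getDatabaseInternal(db)
  }
}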

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73834 has finished for PR 17001 at commit ae9938a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73838 has finished for PR 17001 at commit 7739ccd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

retest this please

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74170 has finished for PR 17001 at commit f93f5d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

spark.sql("CREATE TABLE t4(e string)")
val table4 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t4"))
// the table created in the database which created in this job, it will use the location
// of the database.
Member

@gatorsmile gatorsmile Mar 12, 2017

->

The table created in the non-default database (created in this job) is under the database location.

assert(new Path(table2.location) != fs.makeQualified(
new Path(warehousePath, "not_default.db/t2")))

spark.sql("CREATE DATABASE not_default_1")
Member

-> non_default_db1

spark.sql("CREATE TABLE t2(c string)")
val table2 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t2"))
// the table in not default database created here in this job, it will use the location
// of the database as its location, not the warehouse path in this job
Member

@gatorsmile gatorsmile Mar 12, 2017

->

The table created in the non-default database (created in another job) is under the database location.

// the location when it's created.
assert(new Path(table1.location) != fs.makeQualified(
new Path(warehousePath, "not_default.db/t1")))
assert(!new File(warehousePath.toString, "not_default.db/t1").exists())
Member

This scenario (line 993-1000) does not need to be tested, IMO. Most of the test cases already cover it.

spark.sql("CREATE TABLE t3(d string)")
val table3 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t3"))
// the table in default database created here in this job, it will use the warehouse path
// of this job as its location
Member

->

When a job creates a table in the default database, the table location is under the warehouse path
that is configured for the local job.

val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
// the table in default database created in job(SPARK_19667_CREATE_TABLE) above,
// which has different warehouse path from this job, its location still equals to
// the location when it's created.
Member

@gatorsmile gatorsmile Mar 12, 2017

How about?

For the table created by another job in the default database, the location of this table is not changed, even if the current job has a different warehouse path.

val warehousePath = new Path(spark.sharedState.warehousePath)
val fs = warehousePath.getFileSystem(spark.sessionState.newHadoopConf())
val defaultDB = spark.sessionState.catalog.getDatabaseMetadata("default")
// default database use warehouse path as its location
Member

default -> The default
use warehouse path -> uses the warehouse path

@gatorsmile
Member

A general suggestion on the table names in the test case: we can name the databases based on the database type (e.g., default, non_default_db1), and name the tables based on local/remote and the database name, for example tab1_local_default_db1 and tab1_remote_non_default_db.

@windpiger
Contributor Author

@gatorsmile thanks for your suggestion~

@gatorsmile
Member

any update? ping @windpiger

1 similar comment
@HyukjinKwon
Member

any update? ping @windpiger
