Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19329][SQL]Reading from or writing to a datasource table with a non pre-existing location should succeed #16672

Closed
wants to merge 13 commits into from

Conversation

windpiger
Copy link
Contributor

@windpiger windpiger commented Jan 22, 2017

What changes were proposed in this pull request?

when we insert data into a datasource table use sqlText, and the table has an not exists location,
this will throw an Exception.

example:

spark.sql("create table t(a string, b int) using parquet")
spark.sql("alter table t set location '/xx'")
spark.sql("insert into table t select 'c', 1")

Exception:

com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: /xx;
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)

As discussed following comments, we should unify the action when we reading from or writing to a datasource table with a non pre-existing locaiton:

  1. reading from a datasource table: return 0 rows
  2. writing to a datasource table: write data successfully

How was this patch tested?

unit test added

@SparkQA
Copy link

SparkQA commented Jan 22, 2017

Test build #71803 has finished for PR 16672 at commit 5ec7dd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 22, 2017

Test build #71808 has finished for PR 16672 at commit 2196648.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -240,7 +240,7 @@ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
// TODO: improve `InMemoryCatalog` and remove this limitation.
catalogTable = if (withHiveSupport) Some(table) else None)

LogicalRelation(dataSource.resolveRelation(), catalogTable = Some(table))
LogicalRelation(dataSource.resolveRelation(false), catalogTable = Some(table))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: dataSource.resolveRelation(false) -> dataSource.resolveRelation(checkFilesExist = false)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks~


test("insert data to a table which has altered the table location " +
"to an not exist location should success") {
withTable("t", "t1") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t1?

@gatorsmile
Copy link
Member

We also need to add a test case for in-memory catalog. Maybe we can wait until #16592 is resolved?

Actually, this fix is not right. We should create the directory when we set the location.

@windpiger
Copy link
Contributor Author

I test it in hive,alter table set location does not create the dir ,when we insert data on,the dir createed

@gatorsmile
Copy link
Member

I am not very sure whether we should follow Hive in this case. The path might be wrong or no permission to create such a directory. Thus, it might be more user friendly if they can get the error of creating the directory when changing the location. cc @cloud-fan @yhuai @hvanhovell

This PR focues on the write path. How about the read path? What does Hive behave when try to select a table whose location/directory is not created? What is the behavior of our Spark SQL?

@cloud-fan
Copy link
Contributor

I'm wondering how hive read a table with non-existing path.

@windpiger
Copy link
Contributor Author

In hive:

  1. read a table with non-existing path, no exception and return 0 rows
  2. read a table with non-permission path, throw runtime exception
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if hdfs:/tmp/noownerpermission is encrypted: org.apache.hadoop.security.AccessControlException: Permission denied: user=test, access=READ, inode="/tmp/noownerpermission":hadoop:hadoop:drwxr-x--x
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)

  1. write to a non-exist path ,it will create it and insert data to it, everything is ok

  2. write to a non-permission path, it will throw an exception

  3. alter table set location to a non-permission path, it is ok

@windpiger
Copy link
Contributor Author

If we create the location path when alter table by user A, maybe we use user B to run the job which have no permission to write data to the location, is't it also not friendly? Maybe throw an runtime Exception is properly, and don't create path when alter table.
@gatorsmile @cloud-fan @yhuai @hvanhovell

@SparkQA
Copy link

SparkQA commented Jan 24, 2017

Test build #71898 has finished for PR 16672 at commit abc57dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

It seems to me following hive is safer, any other ideas?

@windpiger
Copy link
Contributor Author

@gatorsmile could you give some suggestion? thanks very much!

@cloud-fan
Copy link
Contributor

ping @gatorsmile

@gatorsmile
Copy link
Member

retest this please

@gatorsmile
Copy link
Member

Could you please add the test cases for the scenarios (of non pre existing location) you explained above? Thanks!

@gatorsmile
Copy link
Member

The changes in this PR affects both read and write paths. Please update the PR description and title. Thanks!

@SparkQA
Copy link

SparkQA commented Feb 12, 2017

Test build #72746 has finished for PR 16672 at commit abc57dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger windpiger changed the title [SPARK-19329][SQL]insert data to a not exist location datasource table should success [SPARK-19329][SQL]insert/read data to a not exist location datasource table should success Feb 12, 2017
@gatorsmile
Copy link
Member

: ) success is a noun and exist is a verb.

insert/read data to a not exist location datasource table should success -> Reading from or writing to a data-source table with a non pre-existing location should succeed

@@ -1431,4 +1431,30 @@ class HiveDDLSuite
}
}
}

test("insert data to a table which has altered the table location " +
"to an not exist location should success") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test case names are not accurate after you add new test cases. Actually, could you split the test cases?

@gatorsmile
Copy link
Member

More test cases for non pre-existing locations? For example, INESRT without an ALTER LOCATION? You can simply drop the directory. This scenario is reasonable when the table is external.

@gatorsmile
Copy link
Member

The test case also can check another INSERT mode. INSERT OVERWRITE? Also verifying the behaviors for Hive Serde tables?

@windpiger
Copy link
Contributor Author

windpiger commented Feb 12, 2017

yes, let me add these tests

withTable("t") {
withTempDir { dir =>
spark.sql(
s"""create table t(a string, b int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General style suggestions. Please use upper case for SQL keywords. For example, in this SQL statement can be improved to

CREATE TABLE t(a STRING, b INT)
USING parquet
OPTIONS(path "xyz")

@SparkQA
Copy link

SparkQA commented Feb 12, 2017

Test build #72759 has finished for PR 16672 at commit c3439ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 13, 2017

Test build #72804 has started for PR 16672 at commit 334e89f.

spark.sql("INSERT INTO TABLE t PARTITION(a=1, b=2) SELECT 3, 4")
checkAnswer(spark.table("t"), Row(3, 4, 1, 2) :: Nil)

val partLoc = new File(s"${dir.getAbsolutePath}/a=1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general comment about the test cases. Can you please check whether the directory exists after the insert? It can help others confirm the path is correct

s"""CREATE TABLE t(a string, b int)
|USING parquet
|OPTIONS(path "file:${dir.getCanonicalPath}")
""".stripMargin)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general comment about the style. We prefer to the following indentation styles.

        sql(
          """
            |SELECT '1' AS part, key, value FROM VALUES
            |(1, "one"), (2, "two"), (3, null) AS data(key, value)
          """.stripMargin)

|USING parquet
|OPTIONS(path "file:${dir.getCanonicalPath}")
""".stripMargin)
var table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another general comment. Please avoid using var, if possible.

@gatorsmile
Copy link
Member

Could you move the test cases to DDLSuite.scala? This is not for Hive specific. Thanks!

@SparkQA
Copy link

SparkQA commented Feb 13, 2017

Test build #72813 has finished for PR 16672 at commit b238e8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 13, 2017

Test build #72814 has finished for PR 16672 at commit dee844c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(partLoc.exists())
checkAnswer(spark.table("t"), Row(7, 8, 1, 2) :: Nil)

// TODO:insert into a partition after alter the partition location by alter command
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To other reviewers: ALTER TABLE SET LOCATION for partition is not allowed for tables defined using the datasource API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found there is a bug in this situation. and I create a jira
https://issues.apache.org/jira/browse/SPARK-19577

shall we just forbid this situation or fix it?

s"""
|CREATE TABLE t(a string, b int)
|USING parquet
|OPTIONS(path "file:${dir.getCanonicalPath}")
Copy link
Member

@gatorsmile gatorsmile Feb 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • First, you do not need to add the file:
  • Second, you still need to adjust the indent.
        spark.sql(
          s"""
            |CREATE TABLE t(a int, b int, c int, d int)
            |USING parquet
            |PARTITIONED BY(a, b)
            |LOCATION '$dir'
           """.stripMargin)
        val expectedPath = dir.getAbsolutePath.stripSuffix("/")

|PARTITIONED BY(a, b)
|LOCATION "file:${dir.getCanonicalPath}"
""".stripMargin)
val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not being used.

@gatorsmile
Copy link
Member

Actually, I found another issue in CTAS with pre-existing location. Maybe you can take that too? https://issues.apache.org/jira/browse/SPARK-19583

@windpiger
Copy link
Contributor Author

@gatorsmile I'd like to take that. https://issues.apache.org/jira/browse/SPARK-19583 Thanks~

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72855 has started for PR 16672 at commit 0d947a5.

@windpiger
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72869 has finished for PR 16672 at commit 0d947a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Copy link
Contributor Author

@gatorsmile I have fixed some review issues. Could you help to continue to review this ?

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 6a9a85b Feb 15, 2017
s"""
|CREATE TABLE t(a string, b int)
|USING parquet
|OPTIONS(path "$dir")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the path doesn't exist when create? will we succeed or fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently, it will throw an exception that the path does not existed. maybe we can check if the path is a dir or not, dir can not exist and file must be exist?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the behavior of hive?

assert(tableLocFile.exists)
checkAnswer(spark.table("t"), Row("c", 1) :: Nil)

val newDir = dir.getAbsolutePath.stripSuffix("/") + "/x"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new File(dir, "x")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will fix this when I do another pr, thanks~

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 16, 2017
… a non pre-existing location should succeed

## What changes were proposed in this pull request?

when we insert data into a datasource table use `sqlText`, and the table has an not exists location,
this will throw an Exception.

example:

```
spark.sql("create table t(a string, b int) using parquet")
spark.sql("alter table t set location '/xx'")
spark.sql("insert into table t select 'c', 1")
```

Exception:
```
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: /xx;
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
```

As discussed following comments, we should unify the action when we reading from or writing to a datasource table with a non pre-existing locaiton:

1. reading from a datasource table: return 0 rows
2. writing to a datasource table:  write data successfully

## How was this patch tested?
unit test added

Author: windpiger <songjun@outlook.com>

Closes apache#16672 from windpiger/insertNotExistLocation.
@gatorsmile
Copy link
Member

To backport #17097, we need to backport multiple PRs. This is one of it.

@windpiger Could you please submit a PR to backport it to Spark 2.1?

@windpiger
Copy link
Contributor Author

ok thanks,its my pleasure~

asfgit pushed a commit that referenced this pull request Mar 16, 2017
…e table with a non pre-existing location should succeed

## What changes were proposed in this pull request?

This is a backport pr of #16672 into branch-2.1.

## How was this patch tested?
Existing tests.

Author: windpiger <songjun@outlook.com>

Closes #17317 from windpiger/backport-insertnotexists.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants