[SPARK-19329][SQL]Reading from or writing to a datasource table with a non pre-existing location should succeed #16672

windpiger · 2017-01-22T09:30:43Z

What changes were proposed in this pull request?

When we insert data into a datasource table using `sqlText` and the table's location does not exist,
an exception is thrown.

example:

spark.sql("create table t(a string, b int) using parquet")
spark.sql("alter table t set location '/xx'")
spark.sql("insert into table t select 'c', 1")

Exception:

com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: /xx;
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)

As discussed in the comments below, we should unify the behavior when reading from or writing to a datasource table with a non pre-existing location:

1. reading from a datasource table: return 0 rows
2. writing to a datasource table: write data successfully
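
The two rules above can be modeled without Spark on the local filesystem. This is only a toy sketch: `readRows` and `writeRows` are hypothetical helpers invented for illustration, not Spark APIs, and plain text files stand in for parquet data.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object MissingLocationModel {
  // Hypothetical helper: reading a location that does not exist yields no rows.
  def readRows(location: File): Seq[String] = {
    // listFiles() returns null when the path is not an existing directory
    val files = Option(location.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    files.filter(_.isFile).sortBy(_.getName).flatMap { f =>
      val src = Source.fromFile(f, "UTF-8")
      try src.getLines().toList finally src.close()
    }
  }

  // Hypothetical helper: writing creates the directory on demand.
  def writeRows(location: File, rows: Seq[String]): Unit = {
    location.mkdirs()
    val out = new PrintWriter(new File(location, s"part-${System.nanoTime()}.txt"), "UTF-8")
    try rows.foreach(r => out.println(r)) finally out.close()
  }
}
```

In this model, a read before the first write returns an empty sequence, and the first write creates the directory, which mirrors the behavior the PR aims for.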

How was this patch tested?

unit test added

SparkQA · 2017-01-22T11:54:23Z

Test build #71803 has finished for PR 16672 at commit 5ec7dd6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-22T16:24:56Z

Test build #71808 has finished for PR 16672 at commit 2196648.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-22T18:49:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala

@@ -240,7 +240,7 @@ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
            // TODO: improve `InMemoryCatalog` and remove this limitation.
            catalogTable = if (withHiveSupport) Some(table) else None)

-        LogicalRelation(dataSource.resolveRelation(), catalogTable = Some(table))
+        LogicalRelation(dataSource.resolveRelation(false), catalogTable = Some(table))


Nit: dataSource.resolveRelation(false) -> dataSource.resolveRelation(checkFilesExist = false)
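
Naming a boolean argument at the call site makes its meaning visible without looking up the signature. A minimal sketch, assuming a hypothetical `resolve` function that only borrows the `checkFilesExist` parameter name from the nit above; it is not Spark's `DataSource`:

```scala
object NamedArgExample {
  // Hypothetical stand-in for a method with a boolean flag.
  def resolve(path: String, checkFilesExist: Boolean = true): String =
    if (checkFilesExist && !new java.io.File(path).exists())
      throw new IllegalArgumentException(s"Path does not exist: $path")
    else s"relation($path)"

  // Opaque at the call site: the reader cannot tell what `false` controls.
  val positional: String = resolve("/xx-not-there", false)

  // Self-documenting: the flag is named.
  val named: String = resolve("/xx-not-there", checkFilesExist = false)
}
```

Both calls behave identically; only the second one tells the reader what is being switched off.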

gatorsmile · 2017-01-22T18:50:22Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

+
+  test("insert data to a table which has altered the table location " +
+    "to an not exist location should success") {
+    withTable("t", "t1") {


gatorsmile · 2017-01-22T18:56:37Z

We also need to add a test case for in-memory catalog. Maybe we can wait until #16592 is resolved?

Actually, this fix is not right. We should create the directory when we set the location.

windpiger · 2017-01-22T23:41:26Z

I tested it in Hive: ALTER TABLE SET LOCATION does not create the directory. The directory is created when we insert data.

gatorsmile · 2017-01-23T00:01:15Z

I am not very sure whether we should follow Hive in this case. The path might be wrong, or the user might have no permission to create such a directory. Thus, it might be more user-friendly if they get the directory-creation error when changing the location. cc @cloud-fan @yhuai @hvanhovell

This PR focuses on the write path. How about the read path? How does Hive behave when we try to select from a table whose location/directory has not been created? What is the behavior of our Spark SQL?

cloud-fan · 2017-01-23T02:44:33Z

I'm wondering how Hive reads a table with a non-existing path.

windpiger · 2017-01-23T04:53:35Z

In hive:

read a table with a non-existing path: no exception, returns 0 rows
read a table whose path we have no permission on: throws a runtime exception

FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if hdfs:/tmp/noownerpermission is encrypted: org.apache.hadoop.security.AccessControlException: Permission denied: user=test, access=READ, inode="/tmp/noownerpermission":hadoop:hadoop:drwxr-x--x
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)

write to a non-existing path: Hive creates it and inserts the data; everything is OK
write to a path we have no permission on: throws an exception
alter table set location to a path we have no permission on: it is OK

windpiger · 2017-01-24T01:31:33Z

If user A creates the location path at ALTER TABLE time, the job may later be run by user B, who has no permission to write data to that location. Isn't that also unfriendly? Maybe it is more appropriate to throw a runtime exception, and not create the path at ALTER TABLE time.
@gatorsmile @cloud-fan @yhuai @hvanhovell

SparkQA · 2017-01-24T04:08:48Z

Test build #71898 has finished for PR 16672 at commit abc57dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-01-24T06:32:00Z

It seems to me that following Hive is safer. Any other ideas?

windpiger · 2017-01-26T07:53:15Z

@gatorsmile could you give some suggestions? Thanks very much!

cloud-fan · 2017-02-09T06:34:43Z

ping @gatorsmile

gatorsmile · 2017-02-12T02:05:05Z

retest this please

gatorsmile · 2017-02-12T03:05:07Z

Could you please add the test cases for the scenarios (of non pre existing location) you explained above? Thanks!

gatorsmile · 2017-02-12T03:12:46Z

The changes in this PR affect both read and write paths. Please update the PR description and title. Thanks!

SparkQA · 2017-02-12T04:36:29Z

Test build #72746 has finished for PR 16672 at commit abc57dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-02-12T05:51:38Z

: ) success is a noun and exist is a verb.

insert/read data to a not exist location datasource table should success -> Reading from or writing to a data-source table with a non pre-existing location should succeed

gatorsmile · 2017-02-12T05:59:54Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

@@ -1431,4 +1431,30 @@ class HiveDDLSuite
      }
    }
  }
+
+  test("insert data to a table which has altered the table location " +
+    "to an not exist location should success") {


Test case names are not accurate after you add new test cases. Actually, could you split the test cases?

gatorsmile · 2017-02-12T06:04:03Z

More test cases for non pre-existing locations? For example, INSERT without an ALTER LOCATION? You can simply drop the directory. This scenario is reasonable when the table is external.

gatorsmile · 2017-02-12T06:05:56Z

The test cases can also check another INSERT mode, e.g. INSERT OVERWRITE. Could you also verify the behavior for Hive serde tables?

windpiger · 2017-02-12T06:10:13Z

yes, let me add these tests

gatorsmile · 2017-02-12T06:10:13Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

+    withTable("t") {
+      withTempDir { dir =>
+        spark.sql(
+          s"""create table t(a string, b int)


A general style suggestion: please use upper case for SQL keywords. For example, this SQL statement can be improved to

CREATE TABLE t(a STRING, b INT) USING parquet OPTIONS(path "xyz")

SparkQA · 2017-02-12T07:48:03Z

Test build #72759 has finished for PR 16672 at commit c3439ff.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-13T06:38:38Z

Test build #72804 has started for PR 16672 at commit 334e89f.

gatorsmile · 2017-02-13T07:12:16Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

+        spark.sql("INSERT INTO TABLE t PARTITION(a=1, b=2) SELECT 3, 4")
+        checkAnswer(spark.table("t"), Row(3, 4, 1, 2) :: Nil)
+
+        val partLoc = new File(s"${dir.getAbsolutePath}/a=1")


A general comment about the test cases: can you please check whether the directory exists after the insert? It helps others confirm the path is correct.

gatorsmile · 2017-02-13T07:15:28Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

+          s"""CREATE TABLE t(a string, b int)
+              |USING parquet
+              |OPTIONS(path "file:${dir.getCanonicalPath}")
+           """.stripMargin)


A general comment about the style. We prefer the following indentation style.

sql(
  """
    |SELECT '1' AS part, key, value FROM VALUES
    |(1, "one"), (2, "two"), (3, null) AS data(key, value)
  """.stripMargin)
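
For reference, `stripMargin` removes the leading whitespace up to and including the `|` on each line, so the indentation chosen in the Scala source does not leak into the SQL text. A small stdlib-only sketch (the object and value names are illustrative):

```scala
object StripMarginExample {
  val query: String =
    """
      |SELECT '1' AS part, key, value FROM VALUES
      |(1, "one"), (2, "two"), (3, null) AS data(key, value)
    """.stripMargin

  // Keep only the non-blank lines of the resulting SQL text
  val lines: List[String] =
    query.split("\n").toList.filter(_.trim.nonEmpty)
}
```

Each remaining line starts directly with the SQL text, regardless of how deeply the string literal is nested in the test code.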

gatorsmile · 2017-02-13T07:16:33Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

+              |USING parquet
+              |OPTIONS(path "file:${dir.getCanonicalPath}")
+           """.stripMargin)
+        var table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))


Another general comment. Please avoid using var, if possible.

gatorsmile · 2017-02-13T07:18:31Z

Could you move the test cases to DDLSuite.scala? They are not Hive-specific. Thanks!

SparkQA · 2017-02-13T12:18:16Z

Test build #72813 has finished for PR 16672 at commit b238e8d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-13T14:58:25Z

Test build #72814 has finished for PR 16672 at commit dee844c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-02-13T18:26:53Z

sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

+        assert(partLoc.exists())
+        checkAnswer(spark.table("t"), Row(7, 8, 1, 2) :: Nil)
+
+        // TODO:insert into a partition after alter the partition location by alter command


To other reviewers: ALTER TABLE SET LOCATION for partition is not allowed for tables defined using the datasource API

I found there is a bug in this situation, and I created a JIRA:
https://issues.apache.org/jira/browse/SPARK-19577

Shall we just forbid this situation or fix it?

gatorsmile · 2017-02-13T19:52:52Z

sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

+          s"""
+              |CREATE TABLE t(a string, b int)
+              |USING parquet
+              |OPTIONS(path "file:${dir.getCanonicalPath}")


First, you do not need to add the `file:` prefix.

Second, you still need to adjust the indent.

spark.sql(
  s"""
     |CREATE TABLE t(a int, b int, c int, d int)
     |USING parquet
     |PARTITIONED BY(a, b)
     |LOCATION '$dir'
   """.stripMargin)
val expectedPath = dir.getAbsolutePath.stripSuffix("/")

gatorsmile · 2017-02-13T19:55:17Z

sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

+              |PARTITIONED BY(a, b)
+              |LOCATION "file:${dir.getCanonicalPath}"
+           """.stripMargin)
+        val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))


This is not being used.

gatorsmile · 2017-02-13T20:30:16Z

Actually, I found another issue in CTAS with pre-existing location. Maybe you can take that too? https://issues.apache.org/jira/browse/SPARK-19583

windpiger · 2017-02-14T06:12:30Z

@gatorsmile I'd like to take that. https://issues.apache.org/jira/browse/SPARK-19583 Thanks~

SparkQA · 2017-02-14T06:12:41Z

Test build #72855 has started for PR 16672 at commit 0d947a5.

windpiger · 2017-02-14T09:36:04Z

retest this please

SparkQA · 2017-02-14T12:00:48Z

Test build #72869 has finished for PR 16672 at commit 0d947a5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

windpiger · 2017-02-15T04:45:38Z

@gatorsmile I have fixed some review issues. Could you continue reviewing this?

gatorsmile · 2017-02-15T21:20:15Z

LGTM

gatorsmile · 2017-02-15T21:22:27Z

Thanks! Merging to master.

cloud-fan · 2017-02-15T21:32:05Z

sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

+          s"""
+             |CREATE TABLE t(a string, b int)
+             |USING parquet
+             |OPTIONS(path "$dir")


What if the path doesn't exist at creation time? Will we succeed or fail?

Currently, it throws an exception saying that the path does not exist. Maybe we can check whether the path is a directory or a file: a directory may be missing, but a file must exist?

What's the behavior of Hive?

cloud-fan · 2017-02-15T21:34:05Z

sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

+        assert(tableLocFile.exists)
+        checkAnswer(spark.table("t"), Row("c", 1) :: Nil)
+
+        val newDir = dir.getAbsolutePath.stripSuffix("/") + "/x"


nit: new File(dir, "x")

OK, I will fix this in another PR, thanks~
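
The nit above can be illustrated with plain `java.io.File`. A minimal sketch; the object and method names are hypothetical and only contrast the two ways of building the child path:

```scala
import java.io.File

object ChildPathExample {
  // String concatenation, as in the diff: a trailing separator must be stripped by hand
  def byConcat(dir: File): File =
    new File(dir.getAbsolutePath.stripSuffix("/") + "/x")

  // Two-argument constructor: java.io.File joins parent and child itself
  def byConstructor(dir: File): File = new File(dir, "x")
}
```

For an absolute `dir`, both helpers resolve to the same path; the constructor form simply avoids manual separator handling.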

… a non pre-existing location should succeed

Author: windpiger <songjun@outlook.com>

Closes apache#16672 from windpiger/insertNotExistLocation.

gatorsmile · 2017-03-16T07:35:14Z

To backport #17097, we need to backport multiple PRs. This is one of them.

@windpiger Could you please submit a PR to backport it to Spark 2.1?

windpiger · 2017-03-16T12:32:11Z

OK, thanks, it's my pleasure~

…e table with a non pre-existing location should succeed

## What changes were proposed in this pull request?

This is a backport PR of #16672 into branch-2.1.

## How was this patch tested?

Existing tests.

Author: windpiger <songjun@outlook.com>

Closes #17317 from windpiger/backport-insertnotexists.

[SPARK-19329][SQL]insert data to a not exist location datasource tabl…

5ec7dd6

…e should success

modify a test name desc

2196648

gatorsmile reviewed Jan 22, 2017

add a param name

abc57dd

add read from an non exist path

c3439ff

windpiger changed the title ~~[SPARK-19329][SQL]insert data to a not exist location datasource table should success~~ [SPARK-19329][SQL]insert/read data to a not exist location datasource table should success Feb 12, 2017

gatorsmile reviewed Feb 12, 2017

gatorsmile reviewed Feb 13, 2017

windpiger mentioned this pull request Feb 13, 2017

[SPARK-19575][SQL]Reading from or writing to a hive serde table with a non pre-existing location should succeed #16910

Closed

windpiger added 2 commits February 13, 2017 17:53

move test case to DDLSuit

b238e8d

remove an redundant import

dee844c

gatorsmile reviewed Feb 13, 2017

fix some code style

0d947a5

asfgit closed this in 6a9a85b Feb 15, 2017

cloud-fan reviewed Feb 15, 2017

windpiger mentioned this pull request Mar 16, 2017

[SPARK-19329][SQL][BRANCH-2.1]Reading from or writing to a datasource table with a non pre-existing location should succeed #17317

Closed
