[SPARK-19329][SQL]Reading from or writing to a datasource table with a non pre-existing location should succeed #16672
Conversation
Test build #71803 has finished for PR 16672 at commit
Test build #71808 has finished for PR 16672 at commit
@@ -240,7 +240,7 @@ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
       // TODO: improve `InMemoryCatalog` and remove this limitation.
       catalogTable = if (withHiveSupport) Some(table) else None)

-      LogicalRelation(dataSource.resolveRelation(), catalogTable = Some(table))
+      LogicalRelation(dataSource.resolveRelation(false), catalogTable = Some(table))
Nit: dataSource.resolveRelation(false)
-> dataSource.resolveRelation(checkFilesExist = false)
thanks~
test("insert data to a table which has altered the table location " +
  "to an not exist location should success") {
  withTable("t", "t1") {
t1?
We also need to add a test case for the in-memory catalog. Maybe we can wait until #16592 is resolved? Actually, this fix is not right. We should create the directory when we set the location.
I tested it in Hive: ALTER TABLE SET LOCATION does not create the directory; the directory is created when we insert data.
I am not very sure whether we should follow Hive in this case. The path might be wrong, or there might be no permission to create such a directory. Thus, it might be more user-friendly if users get the error of creating the directory when changing the location. cc @cloud-fan @yhuai @hvanhovell This PR focuses on the write path. How about the read path? How does Hive behave when trying to select from a table whose location/directory has not been created? What is the behavior of our Spark SQL?
I'm wondering how Hive reads a table with a non-existing path.
In Hive:
If we create the location path when altering the table as user A, we may then run the job as user B, who has no permission to write data to the location; isn't that also unfriendly? Maybe throwing a runtime exception is proper, and we should not create the path when altering the table.
Test build #71898 has finished for PR 16672 at commit
It seems to me that following Hive is safer; any other ideas?
@gatorsmile could you give some suggestions? Thanks very much!
ping @gatorsmile
retest this please
Could you please add test cases for the scenarios (of non pre-existing locations) you explained above? Thanks!
The changes in this PR affect both read and write paths. Please update the PR description and title. Thanks!
Test build #72746 has finished for PR 16672 at commit
: )
@@ -1431,4 +1431,30 @@ class HiveDDLSuite
       }
     }
   }

+  test("insert data to a table which has altered the table location " +
+    "to an not exist location should success") {
Test case names are not accurate after you add new test cases. Actually, could you split the test cases?
More test cases for non pre-existing locations? For example, INSERT without an ALTER LOCATION? You can simply drop the directory. This scenario is reasonable when the table is external.
The test case can also check another INSERT mode, INSERT OVERWRITE. Also, can you verify the behaviors for Hive serde tables?
yes, let me add these tests |
withTable("t") {
  withTempDir { dir =>
    spark.sql(
      s"""create table t(a string, b int)
General style suggestion: please use upper case for SQL keywords. For example, this SQL statement can be improved to
CREATE TABLE t(a STRING, b INT)
USING parquet
OPTIONS(path "xyz")
Test build #72759 has finished for PR 16672 at commit
Test build #72804 has started for PR 16672 at commit |
spark.sql("INSERT INTO TABLE t PARTITION(a=1, b=2) SELECT 3, 4")
checkAnswer(spark.table("t"), Row(3, 4, 1, 2) :: Nil)

val partLoc = new File(s"${dir.getAbsolutePath}/a=1")
A general comment about the test cases: can you please check whether the directory exists after the insert? It can help others confirm the path is correct.
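For illustration, such a check could look like the following sketch (assuming `dir` is the table's location directory, as in the surrounding test; the exact assertion wording is not from the PR):

```scala
// Sketch only: after the INSERT above wrote partition a=1, assert that
// the partition directory was actually created under the table location.
val partDir = new java.io.File(dir, "a=1")
assert(partDir.exists(), s"expected partition directory at $partDir")
```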
s"""CREATE TABLE t(a string, b int)
   |USING parquet
   |OPTIONS(path "file:${dir.getCanonicalPath}")
 """.stripMargin)
A general comment about the style: we prefer the following indentation style.
sql(
"""
|SELECT '1' AS part, key, value FROM VALUES
|(1, "one"), (2, "two"), (3, null) AS data(key, value)
""".stripMargin)
|USING parquet
|OPTIONS(path "file:${dir.getCanonicalPath}")
""".stripMargin)
var table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
Another general comment: please avoid using var, if possible.
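For illustration, one way to avoid the `var` is to bind each metadata read to a fresh `val` instead of reassigning (a sketch only; `newDir` and the surrounding setup are assumed from the test context, not prescribed here):

```scala
// Instead of: var table = ...; table = ... after the ALTER,
// read the metadata into a new immutable binding each time.
val tableBefore = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
spark.sql(s"ALTER TABLE t SET LOCATION '$newDir'")
val tableAfter = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
```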
Could you move the test cases to
Test build #72813 has finished for PR 16672 at commit
Test build #72814 has finished for PR 16672 at commit
assert(partLoc.exists())
checkAnswer(spark.table("t"), Row(7, 8, 1, 2) :: Nil)

// TODO: insert into a partition after alter the partition location by alter command
To other reviewers: ALTER TABLE SET LOCATION for partition is not allowed for tables defined using the datasource API
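For context, the disallowed case looks roughly like this (a sketch; the path is illustrative, and per the comment above this is expected to be rejected for tables created with the datasource API, while Hive serde tables allow it):

```scala
// Partition-level SET LOCATION on a table created with `USING parquet`:
// this is the statement the comment says is not allowed for datasource tables.
spark.sql("ALTER TABLE t PARTITION (a=1, b=2) SET LOCATION '/illustrative/new/path'")
```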
I found there is a bug in this situation, and I created a JIRA: https://issues.apache.org/jira/browse/SPARK-19577. Shall we just forbid this situation or fix it?
s"""
  |CREATE TABLE t(a string, b int)
  |USING parquet
  |OPTIONS(path "file:${dir.getCanonicalPath}")
- First, you do not need to add the file: prefix.
- Second, you still need to adjust the indent.
spark.sql(
s"""
|CREATE TABLE t(a int, b int, c int, d int)
|USING parquet
|PARTITIONED BY(a, b)
|LOCATION '$dir'
""".stripMargin)
val expectedPath = dir.getAbsolutePath.stripSuffix("/")
|PARTITIONED BY(a, b)
|LOCATION "file:${dir.getCanonicalPath}"
""".stripMargin)
val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
This is not being used.
Actually, I found another issue in CTAS with a pre-existing location. Maybe you can take that too? https://issues.apache.org/jira/browse/SPARK-19583
@gatorsmile I'd like to take that. https://issues.apache.org/jira/browse/SPARK-19583 Thanks~
Test build #72855 has started for PR 16672 at commit |
retest this please |
Test build #72869 has finished for PR 16672 at commit
@gatorsmile I have fixed some review issues. Could you continue to review this?
LGTM
Thanks! Merging to master.
s"""
  |CREATE TABLE t(a string, b int)
  |USING parquet
  |OPTIONS(path "$dir")
What if the path doesn't exist at create time? Will we succeed or fail?
Currently, it will throw an exception that the path does not exist. Maybe we can check whether the path is a directory or a file: a directory need not exist, but a file must exist?
What's the behavior of Hive?
assert(tableLocFile.exists)
checkAnswer(spark.table("t"), Row("c", 1) :: Nil)

val newDir = dir.getAbsolutePath.stripSuffix("/") + "/x"
nit: new File(dir, "x")
OK, I will fix this in another PR, thanks~
… a non pre-existing location should succeed

## What changes were proposed in this pull request?

When we insert data into a datasource table using `sqlText`, and the table has a non-existing location, this will throw an exception.

example:

```
spark.sql("create table t(a string, b int) using parquet")
spark.sql("alter table t set location '/xx'")
spark.sql("insert into table t select 'c', 1")
```

Exception:

```
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: /xx;
	at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
	at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
	at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
	at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
```

As discussed in the following comments, we should unify the behavior when reading from or writing to a datasource table with a non pre-existing location:
1. reading from a datasource table: return 0 rows
2. writing to a datasource table: write data successfully

## How was this patch tested?

unit test added

Author: windpiger <songjun@outlook.com>

Closes apache#16672 from windpiger/insertNotExistLocation.
To backport #17097, we need to backport multiple PRs. This is one of them. @windpiger Could you please submit a PR to backport it to Spark 2.1?
OK thanks, it's my pleasure~
What changes were proposed in this pull request?
When we insert data into a datasource table using sqlText, and the table has a non-existing location, this will throw an exception.
example:
Exception:
As discussed in the following comments, we should unify the behavior when reading from or writing to a datasource table with a non pre-existing location:
1. reading from a datasource table: return 0 rows
2. writing to a datasource table: write data successfully
How was this patch tested?
unit test added
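The unified behavior described in the squash message above (reads return 0 rows, writes succeed) can be sketched as follows (a sketch only, assuming a Spark session with Hive support; the path is illustrative):

```scala
spark.sql("CREATE TABLE t(a STRING, b INT) USING parquet")
spark.sql("ALTER TABLE t SET LOCATION '/path/that/does/not/exist'")

// Reading from the table with a non pre-existing location: no exception, 0 rows.
assert(spark.table("t").count() == 0)

// Writing to it: the directory is created and the insert succeeds.
spark.sql("INSERT INTO TABLE t SELECT 'c', 1")
assert(spark.table("t").count() == 1)
```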