New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34528][SQL] Named explicitly field in struct of a catalog view #31639
Conversation
ok to test |
Thanks for your contribution, @tprelle ! Btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE |
Hi @maropu, sorry i just update the PR with the template |
Really? It seems the operation cannot be allowd;
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala Lines 351 to 357 in b17754a
|
Yes it's not allow by spark, but it's allow by Hive Tez or MR. And when you have a commun environnement where Tez and spark access to the same metastore and the same data it's can append (it's append to us hopefully we add a cast exception and not the silent fail) |
Ah, I see. Please describe more in the PR description to make the usecase clearer? The current one looks ambiguous a bit. |
cc: @cloud-fan @viirya |
Test build #135441 has finished for PR 31639 at commit
|
thanks @maropu i try to explain more in the PR. |
@@ -205,7 +205,7 @@ class HiveCatalogedDDLSuite extends DDLSuite with TestHiveSingleton with BeforeA | |||
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q") | |||
checkAnswer(spark.table("v"), Row(Row("a", 1)) :: Nil) | |||
|
|||
spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `b`, 1 AS b) q1") | |||
spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `c`, 1 AS b) q1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Since we don't allow duplicate names in a top-level, it seems we need to follow it in this case, too. cc: @cloud-fan
scala> spark.sql("CREATE VIEW v1 AS SELECT 'a' AS `a`, 1 AS b")
scala> spark.sql("ALTER VIEW v1 AS SELECT 'a' AS `b`, 1 AS b")
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the view definition: `b`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you want me to add this test ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is out-of-scope in this PR. Do you wanna work on it? If so, please feel free to file jira for it.
} | ||
} else { | ||
viewColumnNames.zip(metadata.schema). | ||
map { case (name, field) => innerStruct(Seq(), name, field)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit format:
viewColumnNames.zip(metadata.schema).map { case (name, field) =>
innerStruct(Seq(), name, field)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, i will make some test and fix also for array and map
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a regression? Does the example work before SPARK-34269?
private def innerStruct(parent : Seq[String], name : String, | ||
field : StructField) : NamedExpression = { | ||
field.dataType match { | ||
case structType : StructType => Alias(CreateStruct.create(structType.map { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For most cases that the references table is not altered, doesn't this produce unnecessary CreateStruct
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe but i dit not know if this making a huge difference, and i do not found a way to check at this moment of the analysis of the code if the references table is altered or not.
I just deploy and check on a 3.1 branch and 3.0 branch (i can not test before that spark) and I have the bug. I just update the jira with it. |
10e9383
to
3034cee
Compare
Since the top-level table columns can be re-ordered when resolving the view, I don't have a problem with doing this on nested fields. Do we have more places that re-order top-level columns but not nested fields? I vaguely remember that there are a lot of places. |
I know this part of the code spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala Line 794 in c1beb16
|
3034cee
to
f79f0da
Compare
Test build #135567 has finished for PR 31639 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #135605 has finished for PR 31639 at commit
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
In a shared environnement where Hive Tez and Spark shared the same metastore and data and where Spark is connecting to the metastore of Hive Tez. I found a bug because Hive Tez allow you to change the order inside a struct that spark do not allow you.
In Hive Tez :
CREATE table test_struct (id int, sub STRUCT <a :INT, b:STRING>);
INSERT INTO TABLE test_struct select 1, named_struct("a",1,"b","v1");
CREATE view test_view_struct as select id, sub from test_view_struct
You try to access it in spark, you can access both.
In Hive Tez :
4) Change the table struct reodoring the struct
ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;
Hive Tez can query the table and the view.
Spark can query the table but when spark query the view spark have a cast issue because spark will try to cast a STRUCT < b:STRING,a :INT> in a struct<a :INT, b:STRING>
And if the modification it's castable you can even have a silent failed the data of column a are in column b and vice versa.
So I proposed to instead of resolving the struct directly during the resolution of the view, i use explicit named during the select.
Why are the changes needed?
It safer to resolve by name the view and it's affected the ability to spark to be used in a shared environnement and to be integrated in a eco system where Hive it's predominant.
The silent failed it's also really dangerous because the dev can miss it so you will have at the end the wrong information in the column.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New UT added.
I also have to change a test because duplicate named in a struct are not allowed in hive.
Test on a shared Hive Tez and Spark environnement with an external metastore for spark