[SPARK-34528][SQL] Named explicitly field in struct of a catalog view #31639

tprelle · 2021-02-24T21:20:39Z

What changes were proposed in this pull request?

In a shared environment where Hive Tez and Spark share the same metastore and data, and where Spark connects to the Hive Tez metastore, I found a bug: Hive Tez allows you to change the order of the fields inside a struct, which Spark does not allow.

In Hive Tez:

1) Create a table with a struct:
CREATE table test_struct (id int, sub STRUCT <a :INT, b:STRING>);
2) Insert data into it:
INSERT INTO TABLE test_struct select 1, named_struct("a",1,"b","v1");
3) Create a view on top of it:
CREATE view test_view_struct as select id, sub from test_struct;

From Spark, you can query both the table and the view.

In Hive Tez:
4) Change the table's struct by reordering its fields:
ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;
Hive Tez can still query the table and the view.

Spark can query the table, but when Spark queries the view it hits a cast issue, because it tries to cast a STRUCT<b:STRING, a:INT> to a STRUCT<a:INT, b:STRING>.
And if the reordered types happen to be castable, you can even get a silent failure: the data of column a ends up in column b and vice versa.

So instead of resolving the struct directly during the resolution of the view, I propose to name the struct fields explicitly in the select.
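
To illustrate the difference, here is a minimal, self-contained Scala sketch. It does not use Spark's API: `StructResolution`, `byPosition` and `byName` are hypothetical names, and a struct value is simply modeled as an ordered list of (field name, value) pairs. It only shows why relabeling values by position swaps `a` and `b` once the stored struct has been reordered, while a lookup by name does not.

```scala
// Hypothetical model for illustration only; this is not Spark code.
object StructResolution {
  // A struct value: field names paired with values, in stored order.
  type StructValue = Seq[(String, Any)]

  // Positional resolution: keep the stored values in their stored order and
  // relabel them with the field names recorded in the view schema.
  def byPosition(viewFields: Seq[String], stored: StructValue): StructValue =
    viewFields.zip(stored.map(_._2))

  // By-name resolution: for each field of the view schema, pick the stored
  // value that carries the same field name, whatever its position.
  def byName(viewFields: Seq[String], stored: StructValue): StructValue = {
    val lookup = stored.toMap
    viewFields.map(name => name -> lookup(name))
  }
}
```

With the view recorded as struct<a, b> and the stored value reordered to (b = "v1", a = 1), `byPosition` returns a = "v1", b = 1 (the silent swap described above), whereas `byName` returns a = 1, b = "v1".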

Why are the changes needed?

It is safer to resolve the view by name. The current behavior limits Spark's ability to be used in a shared environment and to be integrated into an ecosystem where Hive is predominant.
The silent failure is also really dangerous, because a developer can miss it and end up with wrong data in the column.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT added.
I also had to change a test because duplicate names in a struct are not allowed in Hive.
Tested on a shared Hive Tez and Spark environment with an external metastore for Spark.

maropu · 2021-02-24T23:01:15Z

ok to test

maropu · 2021-02-24T23:02:46Z

Thanks for your contribution, @tprelle ! Btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

tprelle · 2021-02-24T23:43:23Z

Hi @maropu, sorry, I just updated the PR to follow the template.

maropu · 2021-02-25T00:13:33Z

Change the table struct reordering the struct
ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;

Really? It seems this operation is not allowed:

scala> sql("""ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;""")
org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'sub' with type 'StructType(StructField(a,IntegerType,true), StructField(b,StringType,true))' to 'sub' with type 'StructType(StructField(b,StringType,true), StructField(a,IntegerType,true))'

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala

Lines 351 to 357 in b17754a

    
    // Throw an AnalysisException if the column name/dataType is changed.
    if (!columnEqual(originColumn, newColumn, resolver)) {
      throw new AnalysisException(
        "ALTER TABLE CHANGE COLUMN is not supported for changing column " +
          s"'${originColumn.name}' with type '${originColumn.dataType}' to " +
          s"'${newColumn.name}' with type '${newColumn.dataType}'")
    }

tprelle · 2021-02-25T01:03:10Z

Yes, it's not allowed by Spark, but it is allowed by Hive on Tez or MR. And when you have a common environment where Tez and Spark access the same metastore and the same data, it can happen (it happened to us; fortunately we got a cast exception and not the silent failure).

maropu · 2021-02-25T01:16:49Z

Ah, I see. Could you describe more in the PR description to make the use case clearer? The current one looks a bit ambiguous.

maropu · 2021-02-25T01:16:52Z

cc: @cloud-fan @viirya

SparkQA · 2021-02-25T01:17:20Z

Test build #135441 has finished for PR 31639 at commit aaebcd7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tprelle · 2021-02-25T01:41:16Z

Thanks @maropu, I tried to explain more in the PR description.

maropu · 2021-02-25T02:10:56Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala

@@ -205,7 +205,7 @@ class HiveCatalogedDDLSuite extends DDLSuite with TestHiveSingleton with BeforeA
      spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
      checkAnswer(spark.table("v"), Row(Row("a", 1)) :: Nil)

-      spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `b`, 1 AS b) q1")
+      spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `c`, 1 AS b) q1")


Ah, I see. Since we don't allow duplicate names at the top level, it seems we need to follow that in this case, too. cc: @cloud-fan

scala> spark.sql("CREATE VIEW v1 AS SELECT 'a' AS `a`, 1 AS b")
scala> spark.sql("ALTER VIEW v1 AS SELECT 'a' AS `b`, 1 AS b")
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the view definition: `b`

tprelle · 2021-02-25

So you want me to add this test?

maropu · 2021-02-26

This is out of scope for this PR. Do you want to work on it? If so, please feel free to file a JIRA for it.

maropu · 2021-02-25T02:35:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

+      }
+    } else {
+      viewColumnNames.zip(metadata.schema).
+        map { case (name, field) => innerStruct(Seq(), name, field)}


nit format:

viewColumnNames.zip(metadata.schema).map { case (name, field) => innerStruct(Seq(), name, field) }

tprelle · 2021-02-25

You are right, I will run some tests and also fix this for array and map.

viirya

Is this a regression? Does the example work before SPARK-34269?

viirya · 2021-02-25T08:48:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

+  private def innerStruct(parent : Seq[String], name : String,
+                          field : StructField) : NamedExpression = {
+    field.dataType match {
+      case structType : StructType => Alias(CreateStruct.create(structType.map {


In most cases the referenced table is not altered; doesn't this produce an unnecessary CreateStruct?

tprelle · 2021-02-25

Maybe, but I do not know whether this makes a huge difference, and I did not find a way to check, at this point of the analysis, whether the referenced table has been altered.

tprelle · 2021-02-25T10:44:43Z

Is this a regression? Does the example work before SPARK-34269?

I just deployed and checked on a 3.1 branch and a 3.0 branch (I cannot test Spark versions before that), and the bug is present in both. I just updated the JIRA with this.
#31368 seems easily backportable to 3.1, so that version can be fixed.
It's more complex for 3.0.

cloud-fan · 2021-02-26T07:47:56Z

Since the top-level table columns can be re-ordered when resolving the view, I don't have a problem with doing this on nested fields. Do we have more places that re-order top-level columns but not nested fields? I vaguely remember that there are a lot of places.

tprelle · 2021-02-26T12:55:07Z

I know of this part of the code:

spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

Line 794 in c1beb16

if (DataType.equalsIgnoreCaseAndNullability(reorderedSchema, table.schema) ||

but it already handles the issue.

SparkQA · 2021-03-01T02:24:20Z

Test build #135567 has finished for PR 31639 at commit f79f0da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-01T19:55:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40186/

SparkQA · 2021-03-01T20:28:39Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40186/

SparkQA · 2021-03-01T23:42:16Z

Test build #135605 has finished for PR 31639 at commit a458128.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tprelle · 2021-03-08T11:10:36Z

@maropu @viirya I also added the cases for complex types like array and map.

github-actions · 2021-06-17T00:07:38Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label Feb 24, 2021

maropu changed the title ~~[SPARK-34528][CORE] named explicitly field in struct of a view~~ [SPARK-34528][SQL] Named explicitly field in struct of a view Feb 24, 2021

maropu reviewed Feb 25, 2021

maropu changed the title ~~[SPARK-34528][SQL] Named explicitly field in struct of a view~~ [SPARK-34528][SQL] Named explicitly field in struct of a catalog view Feb 25, 2021

maropu reviewed Feb 25, 2021

viirya reviewed Feb 25, 2021

tprelle force-pushed the fixStructView branch 2 times, most recently from 10e9383 to 3034cee Compare February 25, 2021 20:58

tprelle force-pushed the fixStructView branch from 3034cee to f79f0da Compare February 28, 2021 21:14

[SPARK-34528][CORE] named explicitly field in struct of a view

a458128

tprelle force-pushed the fixStructView branch from f79f0da to a458128 Compare March 1, 2021 18:08

github-actions bot added the Stale label Jun 17, 2021

github-actions bot closed this Jun 18, 2021
