Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34528][SQL] Named explicitly field in struct of a catalog view #31639

Closed
wants to merge 1 commit into from

Conversation

tprelle
Copy link

@tprelle tprelle commented Feb 24, 2021

What changes were proposed in this pull request?

In a shared environnement where Hive Tez and Spark shared the same metastore and data and where Spark is connecting to the metastore of Hive Tez. I found a bug because Hive Tez allow you to change the order inside a struct that spark do not allow you.

In Hive Tez :

  1. You create a table with a struct:
    CREATE table test_struct (id int, sub STRUCT <a :INT, b:STRING>);
  2. You insert data into it :
    INSERT INTO TABLE test_struct select 1, named_struct("a",1,"b","v1");
  3. Create a view on top of it :
    CREATE view test_view_struct as select id, sub from test_view_struct

You try to access it in spark, you can access both.

In Hive Tez :
4) Change the table struct reodoring the struct
ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;
Hive Tez can query the table and the view.

Spark can query the table but when spark query the view spark have a cast issue because spark will try to cast a STRUCT < b:STRING,a :INT> in a struct<a :INT, b:STRING>
And if the modification it's castable you can even have a silent failed the data of column a are in column b and vice versa.

So I proposed to instead of resolving the struct directly during the resolution of the view, i use explicit named during the select.

Why are the changes needed?

It safer to resolve by name the view and it's affected the ability to spark to be used in a shared environnement and to be integrated in a eco system where Hive it's predominant.
The silent failed it's also really dangerous because the dev can miss it so you will have at the end the wrong information in the column.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT added.
I also have to change a test because duplicate named in a struct are not allowed in hive.
Test on a shared Hive Tez and Spark environnement with an external metastore for spark

@github-actions github-actions bot added the SQL label Feb 24, 2021
@maropu
Copy link
Member

maropu commented Feb 24, 2021

ok to test

@maropu maropu changed the title [SPARK-34528][CORE] named explicitly field in struct of a view [SPARK-34528][SQL] Named explicitly field in struct of a view Feb 24, 2021
@maropu
Copy link
Member

maropu commented Feb 24, 2021

Thanks for your contribution, @tprelle ! Btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

@tprelle
Copy link
Author

tprelle commented Feb 24, 2021

Hi @maropu, sorry i just update the PR with the template

@maropu
Copy link
Member

maropu commented Feb 25, 2021

Change the table struct reodoring the struct
ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;

Really? It seems the operation cannot be allowd;

scala> sql("""ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;""")
org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'sub' with type 'StructType(StructField(a,IntegerType,true), StructField(b,StringType,true))' to 'sub' with type 'StructType(StructField(b,StringType,true), StructField(a,IntegerType,true))'

// Throw an AnalysisException if the column name/dataType is changed.
if (!columnEqual(originColumn, newColumn, resolver)) {
throw new AnalysisException(
"ALTER TABLE CHANGE COLUMN is not supported for changing column " +
s"'${originColumn.name}' with type '${originColumn.dataType}' to " +
s"'${newColumn.name}' with type '${newColumn.dataType}'")
}

@tprelle
Copy link
Author

tprelle commented Feb 25, 2021

Change the table struct reodoring the struct
ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;

Really? It seems the operation cannot be allowd;

scala> sql("""ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT < b:STRING,a :INT>;""")
org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'sub' with type 'StructType(StructField(a,IntegerType,true), StructField(b,StringType,true))' to 'sub' with type 'StructType(StructField(b,StringType,true), StructField(a,IntegerType,true))'

// Throw an AnalysisException if the column name/dataType is changed.
if (!columnEqual(originColumn, newColumn, resolver)) {
throw new AnalysisException(
"ALTER TABLE CHANGE COLUMN is not supported for changing column " +
s"'${originColumn.name}' with type '${originColumn.dataType}' to " +
s"'${newColumn.name}' with type '${newColumn.dataType}'")
}

Yes it's not allow by spark, but it's allow by Hive Tez or MR. And when you have a commun environnement where Tez and spark access to the same metastore and the same data it's can append (it's append to us hopefully we add a cast exception and not the silent fail)

@maropu
Copy link
Member

maropu commented Feb 25, 2021

Ah, I see. Please describe more in the PR description to make the usecase clearer? The current one looks ambiguous a bit.

@maropu
Copy link
Member

maropu commented Feb 25, 2021

cc: @cloud-fan @viirya

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Test build #135441 has finished for PR 31639 at commit aaebcd7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tprelle
Copy link
Author

tprelle commented Feb 25, 2021

thanks @maropu i try to explain more in the PR.

@@ -205,7 +205,7 @@ class HiveCatalogedDDLSuite extends DDLSuite with TestHiveSingleton with BeforeA
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
checkAnswer(spark.table("v"), Row(Row("a", 1)) :: Nil)

spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `b`, 1 AS b) q1")
spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `c`, 1 AS b) q1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Since we don't allow duplicate names in a top-level, it seems we need to follow it in this case, too. cc: @cloud-fan

scala> spark.sql("CREATE VIEW v1 AS SELECT 'a' AS `a`, 1 AS b")
scala> spark.sql("ALTER VIEW v1 AS SELECT 'a' AS `b`, 1 AS b")
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the view definition: `b`

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you want me to add this test ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out-of-scope in this PR. Do you wanna work on it? If so, please feel free to file jira for it.

}
} else {
viewColumnNames.zip(metadata.schema).
map { case (name, field) => innerStruct(Seq(), name, field)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit format:

      viewColumnNames.zip(metadata.schema).map { case (name, field) =>
        innerStruct(Seq(), name, field)
      }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, i will make some test and fix also for array and map

@maropu maropu changed the title [SPARK-34528][SQL] Named explicitly field in struct of a view [SPARK-34528][SQL] Named explicitly field in struct of a catalog view Feb 25, 2021
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a regression? Does the example work before SPARK-34269?

private def innerStruct(parent : Seq[String], name : String,
field : StructField) : NamedExpression = {
field.dataType match {
case structType : StructType => Alias(CreateStruct.create(structType.map {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most cases that the references table is not altered, doesn't this produce unnecessary CreateStruct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe but i dit not know if this making a huge difference, and i do not found a way to check at this moment of the analysis of the code if the references table is altered or not.

@tprelle
Copy link
Author

tprelle commented Feb 25, 2021

Is this a regression? Does the example work before SPARK-34269?

I just deploy and check on a 3.1 branch and 3.0 branch (i can not test before that spark) and I have the bug. I just update the jira with it.
#31368 seems backportable to 3.1 easily so this version can be fix.
It's more complex for 3.0

@tprelle tprelle force-pushed the fixStructView branch 2 times, most recently from 10e9383 to 3034cee Compare February 25, 2021 20:58
@cloud-fan
Copy link
Contributor

Since the top-level table columns can be re-ordered when resolving the view, I don't have a problem with doing this on nested fields. Do we have more places that re-order top-level columns but not nested fields? I vaguely remember that there are a lot of places.

@tprelle
Copy link
Author

tprelle commented Feb 26, 2021

Since the top-level table columns can be re-ordered when resolving the view, I don't have a problem with doing this on nested fields. Do we have more places that re-order top-level columns but not nested fields? I vaguely remember that there are a lot of places.

I know this part of the code

if (DataType.equalsIgnoreCaseAndNullability(reorderedSchema, table.schema) ||
but it's already handling the issue.

@SparkQA
Copy link

SparkQA commented Mar 1, 2021

Test build #135567 has finished for PR 31639 at commit f79f0da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40186/

@SparkQA
Copy link

SparkQA commented Mar 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40186/

@SparkQA
Copy link

SparkQA commented Mar 1, 2021

Test build #135605 has finished for PR 31639 at commit a458128.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tprelle
Copy link
Author

tprelle commented Mar 8, 2021

@maropu @viirya I add also the cases for complexe type like array and map

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 17, 2021
@github-actions github-actions bot closed this Jun 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants