Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37569][SQL] Don't mark nested view fields as nullable #34839

Closed
wants to merge 2 commits into from

Conversation

shardulm94
Copy link
Contributor

What changes were proposed in this pull request?

When analyzing a view, we should not unnecessarily mark nested fields as nullable. If the columns projected by the view define themselves as non-nullable, their nullability should be preserved.

Why are the changes needed?

Consider a view as follows with all fields non-nullable (required)

spark.sql("""
    CREATE OR REPLACE VIEW v AS 
    SELECT id, named_struct('a', id) AS nested
    FROM RANGE(10)
""")

When trying to read this view, it incorrectly marks nested column a as nullable

scala> spark.table("v2").printSchema
root
 |-- id: long (nullable = false)
 |-- nested: struct (nullable = false)
 |    |-- a: long (nullable = true)

However, we can see that the view schema has been correctly stored as non-nullable

scala> System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default", "v2"))
CatalogTable(
Database: default
Table: v2
.
.
.
Schema: root
 |-- id: long (nullable = false)
 |-- nested: struct (nullable = false)
 |    |-- a: long (nullable = false)
)

This is caused by this line in Analyzer.scala. Going through the history of changes for this block of code, it seems like asNullable is a remnant of a time before we added checks to ensure that the from and to types of the cast were compatible. As nullability is already checked, it should be safe to add a cast without converting the target datatype to nullable.

Does this PR introduce any user-facing change?

Yes. View analysis will preserve nullability of nested fields instead of marking all nested fields as nullable.

How was this patch tested?

Added unit test

@github-actions github-actions bot added the SQL label Dec 8, 2021
@shardulm94
Copy link
Contributor Author

@xkrogen
Copy link
Contributor

xkrogen commented Dec 8, 2021

Great find @shardulm94 ! Nice to see that it's a simple fix. Appreciate the short history lesson as well.

@cloud-fan
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Dec 9, 2021

Test build #146028 has finished for PR 34839 at commit 4b8e2ca.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50504/

@SparkQA
Copy link

SparkQA commented Dec 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50504/

@SparkQA
Copy link

SparkQA commented Dec 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50520/

@SparkQA
Copy link

SparkQA commented Dec 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50520/

@SparkQA
Copy link

SparkQA commented Dec 9, 2021

Test build #146045 has finished for PR 34839 at commit e34dbd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50527/

@SparkQA
Copy link

SparkQA commented Dec 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50527/

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems okay. This change doesn't break other tests too.

@SparkQA
Copy link

SparkQA commented Dec 10, 2021

Test build #146052 has finished for PR 34839 at commit be8b9df.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

val sql = "SELECT id, named_struct('a', id) AS nested FROM RANGE(10)"
val df = spark.sql(sql)

spark.sql(s"CREATE VIEW v AS $sql")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should call the createView method to create view, otherwise we just test the permanent view 3 times...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it now

@SparkQA
Copy link

SparkQA commented Dec 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50549/

@SparkQA
Copy link

SparkQA commented Dec 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50549/

@SparkQA
Copy link

SparkQA commented Dec 10, 2021

Test build #146075 has finished for PR 34839 at commit 16b44a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 7d8a721 Dec 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants