Skip to content

[SPARK-35967][SQL] Update nullability based on column statistics#33170

Closed
wangyum wants to merge 4 commits intoapache:masterfrom
wangyum:SPARK-35967
Closed

[SPARK-35967][SQL] Update nullability based on column statistics#33170
wangyum wants to merge 4 commits intoapache:masterfrom
wangyum:SPARK-35967

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Jul 1, 2021

What changes were proposed in this pull request?

Update column nullability based on column statistics if it exists.

Why are the changes needed?

Reduce useless IsNotNull filter condition to improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Jul 1, 2021
@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45014/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45014/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Test build #140502 has finished for PR 33170 at commit 0a72486.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45030/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45030/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45033/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Test build #140517 has finished for PR 33170 at commit 17dc4b1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45033/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Test build #140520 has finished for PR 33170 at commit 134308a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum requested review from cloud-fan and maropu July 2, 2021 00:54
val output = table.stats.map(_.colStats) match {
case Some(colStats) =>
attributes
.map(a => a.withNullability(colStats.get(a.name).forall(_.nullCount.forall(_ > 0L))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm .. does this mean that if a table is saved with nullable schema but the table doesn't have nulls, here the schema becomes non-nullable when we read it back?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Based on column statistics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on column statistics.

Stats doesn't have to be 100% accurate?

  • it could be an estimate based on sampled data;
  • it could be outdated with the ground truth data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same concern. AFAIK the baseline is Spark runs slower if stats are inaccurate. But wrong nullability can lead to wrong result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same feeling with @HyukjinKwon @cloud-fan ...

@wangyum wangyum closed this Sep 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants