[SPARK-35967][SQL] Update nullability based on column statistics #33170
wangyum wants to merge 4 commits into apache:master from wangyum:SPARK-35967
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140502 has finished for PR 33170 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Test build #140517 has finished for PR 33170 at commit
Kubernetes integration test status success
Test build #140520 has finished for PR 33170 at commit
```scala
val output = table.stats.map(_.colStats) match {
  case Some(colStats) =>
    // An attribute stays nullable only when its column stats are missing or
    // report an unknown or positive null count; nullCount == 0 makes it non-nullable.
    attributes
      .map(a => a.withNullability(colStats.get(a.name).forall(_.nullCount.forall(_ > 0L))))
```
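For context, here is a minimal standalone sketch of the decision that line encodes. The types below are simplified, hypothetical stand-ins, not Spark's actual attribute and `CatalogColumnStat` classes:

```scala
// Simplified stand-ins (hypothetical) for Spark's attribute and column-stat types.
case class ColStat(nullCount: Option[BigInt])
case class Attr(name: String, nullable: Boolean) {
  def withNullability(nullable: Boolean): Attr = copy(nullable = nullable)
}

def updateNullability(attrs: Seq[Attr], colStats: Map[String, ColStat]): Seq[Attr] =
  attrs.map { a =>
    // Option.forall is true for None, so an attribute with no stats entry (or an
    // unknown null count) stays nullable; only nullCount == 0 flips it to non-nullable.
    a.withNullability(colStats.get(a.name).forall(_.nullCount.forall(_ > 0)))
  }
```

With this logic, `updateNullability(Seq(Attr("c", nullable = true)), Map("c" -> ColStat(Some(0))))` returns a non-nullable `c`, which is exactly the case the reviewers discuss below.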
Hmmm .. does this mean that if a table is saved with a nullable schema but the table doesn't have nulls, the schema becomes non-nullable when we read it back?
Yes. Based on column statistics.
Based on column statistics? Stats don't have to be 100% accurate:
- they could be an estimate based on sampled data;
- they could be outdated relative to the ground-truth data (see the sketch below).
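To make the staleness concern concrete, here is a hypothetical sequence (table and column names are made up; `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` is the standard Spark SQL way to collect column stats):

```scala
// Hypothetical repro of stale column stats; assumes an existing SparkSession `spark`.
spark.sql("CREATE TABLE t (c INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1), (2)")

// Column stats now record nullCount = 0 for c.
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c")

// Plain inserts do not refresh column stats, so nullCount stays 0 even though the
// data now contains a NULL -- deriving non-nullability from these stats is unsafe.
spark.sql("INSERT INTO t VALUES (NULL)")
```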
I have the same concern. AFAIK the baseline is that Spark runs slower if stats are inaccurate, but wrong nullability can lead to wrong results.
I have the same feeling as @HyukjinKwon and @cloud-fan ...
What changes were proposed in this pull request?
Update column nullability based on column statistics, if they exist.
Why are the changes needed?
Reduce useless `IsNotNull` filter conditions to improve query performance, as illustrated in the sketch below.
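As an illustration of the intended win (a sketch, not the PR's actual test; the local data and session setup are made up): once an attribute is known to be non-nullable, the optimizer can fold `IsNotNull` to true and prune the filter:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; any existing SparkSession would do.
val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// A Seq[Int] produces a non-nullable "value" column.
val df = Seq(1, 2, 3).toDF("value")

// Because "value" is non-nullable, IsNotNull(value) is always true, so the
// optimized plan shown by explain contains no Filter node at all.
df.filter($"value".isNotNull).explain(true)
```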
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test.