[SPARK-35967][SQL] Update nullability based on column statistics #33170
wangyum wants to merge 4 commits into apache:master from wangyum:SPARK-35967
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140502 has finished for PR 33170 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Test build #140517 has finished for PR 33170 at commit
Kubernetes integration test status success
Test build #140520 has finished for PR 33170 at commit
```scala
val output = table.stats.map(_.colStats) match {
  case Some(colStats) =>
    // An attribute stays nullable only when its column stats are missing or
    // report an unknown or positive null count; nullCount == 0 makes it non-nullable.
    attributes
      .map(a => a.withNullability(colStats.get(a.name).forall(_.nullCount.forall(_ > 0L))))
```
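For context, here is a minimal standalone sketch of the decision that line encodes. The types below are simplified, hypothetical stand-ins, not Spark's actual attribute and `CatalogColumnStat` classes:

```scala
// Simplified stand-ins (hypothetical) for Spark's attribute and column-stat types.
case class ColStat(nullCount: Option[BigInt])
case class Attr(name: String, nullable: Boolean) {
  def withNullability(nullable: Boolean): Attr = copy(nullable = nullable)
}

def updateNullability(attrs: Seq[Attr], colStats: Map[String, ColStat]): Seq[Attr] =
  attrs.map { a =>
    // Option.forall is true for None, so an attribute with no stats entry (or an
    // unknown null count) stays nullable; only nullCount == 0 flips it to non-nullable.
    a.withNullability(colStats.get(a.name).forall(_.nullCount.forall(_ > 0)))
  }
```

With this logic, `updateNullability(Seq(Attr("c", nullable = true)), Map("c" -> ColStat(Some(0))))` returns a non-nullable `c`, which is exactly the case the reviewers discuss below.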
Hmmm .. does this mean that if a table is saved with a nullable schema but the table doesn't have nulls, the schema becomes non-nullable when we read it back?
Yes. Based on column statistics.
Based on column statistics? Stats don't have to be 100% accurate:
- they could be an estimate based on sampled data;
- they could be outdated relative to the ground-truth data (see the sketch below).
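To make the staleness concern concrete, here is a hypothetical sequence (table and column names are made up; `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` is the standard Spark SQL way to collect column stats):

```scala
// Hypothetical repro of stale column stats; assumes an existing SparkSession `spark`.
spark.sql("CREATE TABLE t (c INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1), (2)")

// Column stats now record nullCount = 0 for c.
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c")

// Plain inserts do not refresh column stats, so nullCount stays 0 even though the
// data now contains a NULL -- deriving non-nullability from these stats is unsafe.
spark.sql("INSERT INTO t VALUES (NULL)")
```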
I have the same concern. AFAIK the baseline is that Spark runs slower if stats are inaccurate, but wrong nullability can lead to wrong results.
I have the same feeling as @HyukjinKwon and @cloud-fan ...
What changes were proposed in this pull request?
Update column nullability based on column statistics, if they exist.
Why are the changes needed?
Reduce useless `IsNotNull` filter conditions to improve query performance, as illustrated in the sketch below.
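As an illustration of the intended win (a sketch, not the PR's actual test; the local data and session setup are made up): once an attribute is known to be non-nullable, the optimizer can fold `IsNotNull` to true and prune the filter:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; any existing SparkSession would do.
val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// A Seq[Int] produces a non-nullable "value" column.
val df = Seq(1, 2, 3).toDF("value")

// Because "value" is non-nullable, IsNotNull(value) is always true, so the
// optimized plan shown by explain contains no Filter node at all.
df.filter($"value".isNotNull).explain(true)
```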
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test.