HIVE-27327 : Iceberg basic stats: Incorrect row count in snapshot sum…#4301
HIVE-27327 : Iceberg basic stats: Incorrect row count in snapshot sum…#4301deniskuzZ merged 5 commits intoapache:masterfrom
Conversation
…mary leading to unoptimized plans
InvisibleProgrammer
left a comment
There was a problem hiding this comment.
If I understood it correctly, it runs when we query the statistics, not when we write them. I wonder, it is possible to fix it on writes?
iceberg/iceberg-handler/src/test/results/positive/row_count.q.out
Outdated
Show resolved
Hide resolved
| if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) { | ||
| stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP)); | ||
| long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP)); | ||
| if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) && |
There was a problem hiding this comment.
What if onlye one of TOTAL_EQ_DELETES_PROP and TOTAL_POS_DELETES_PROP persists?
InvisibleProgrammer
left a comment
There was a problem hiding this comment.
If I understood it correctly, it runs when we query the statistics, not when we write them. I wonder, it is possible to fix it on writes?
aturoczy
left a comment
There was a problem hiding this comment.
Please reject if you not agree with my findings
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
Outdated
Show resolved
Hide resolved
| long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP)); | ||
| if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) && | ||
| summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) { | ||
| Long actualRecords = |
There was a problem hiding this comment.
Just share some my thought.
Not sure if i am understand correctly, the delete file in iceberg is also a special data file, and table scan in actual execution stage also should read all related delete files.
That is to say, the actual execution still requires scanning more data than the explain shows.
So, i am not sure if this PR can be give a optimized plans when iceberg table has both data files and delete files.
There was a problem hiding this comment.
count(*) should use stats instead of scanning the whole dataset.
To be honest I don't really understand the purpose of 'total-records' if it doesn't reflect an accurate row count. insert 100 rows, delete all, 'total-records'=100 ???
looks like if there are only positional deletes we could get the accurate count by subtrracting “total-position-deletes” from “total-records”
@zhangbutao do you see any issues with that?
Another alternative would be to push row count stats from Hive to Iceberg summary.
There was a problem hiding this comment.
count(*) should use stats instead of scanning the whole dataset.
@deniskuzZ Do you mean we can push down min/max to iceberg? I think it's beyond this PR scope and it's not so easy. I find some info: HIVE-27099 select count(*) from table queries all data, and Spark has push down min/max to iceberg apache/iceberg#6622, but Spark will skip pushdown if including delete files https://github.com/apache/iceberg/pull/6622/files#diff-66bfda4bda6d505fe3de7db3b4d6b7923b3711b00e2801846dd7325edcdbf65eR224
@zhangbutao do you see any issues with that?
To be honest, i don't know too much about iceberg statistics at the moment so I can't share any more context.
Maybe after some time I can tell more or Implement some stats pushdown in Hive. :)
There was a problem hiding this comment.
@zhangbutao, we are already pushing down column stats in puffin format HIVE-27158, however, we are still relying on Iceberg for basic stats.
In case of deletes, it becomes invalid. In this PR we are doing a workaround just for positional deletes use-case until it's fixed on the iceberg side.
There was a problem hiding this comment.
I see. Just say something else here. I think TrinoDB has a good implementation about puffin stats, maybe we can refer to some designs from TrinoDB. But, I also think lots of stuff need to be done on the iceberg side.
We can keep looking at the evolution about iceberg stats, as this can give a better cbo, pushdown, etc
| 0 Map 1 | ||
| Statistics: Num rows: 5 Data size: 4320 Basic stats: COMPLETE Column stats: COMPLETE | ||
| 1 Map 6 | ||
| Statistics: Num rows: 7 Data size: 6656 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
was the row count inaccurate before?
|
Kudos, SonarCloud Quality Gate passed! |
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
Show resolved
Hide resolved
deniskuzZ
left a comment
There was a problem hiding this comment.
LGTM, minor change requested
|
Thanks @deniskuzZ , @zhangbutao , @aturoczy , @InvisibleProgrammer for the review ! |
…ptimal plans (Simhadri Govindappa, reviewed by Attila Turoczy, Butao Zhang, Denys Kuzmenko, Zsolt Miskolczi) Closes apache#4301
…ptimal plans (Simhadri Govindappa, reviewed by Attila Turoczy, Butao Zhang, Denys Kuzmenko, Zsolt Miskolczi) Closes apache#4301








…mary leading to unoptimized plans
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?