HIVE-27327 : Iceberg basic stats: Incorrect row count in snapshot sum… by simhadri-g · Pull Request #4301 · apache/hive

simhadri-g · 2023-05-08T17:02:42Z

…mary leading to unoptimized plans

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?


InvisibleProgrammer

If I understood it correctly, it runs when we query the statistics, not when we write them. I wonder, it is possible to fix it on writes?

iceberg/iceberg-handler/src/test/queries/positive/row_count.q

iceberg/iceberg-handler/src/test/results/positive/row_count.q.out

InvisibleProgrammer · 2023-05-08T17:21:06Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

            if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
-              stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+              long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+              if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&


What if only one of TOTAL_EQ_DELETES_PROP and TOTAL_POS_DELETES_PROP is present?
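One way to make the check robust to a missing key is to read each counter with a default of zero. The following is only a sketch of that idea, not the code in this PR: `DeleteCounters` and `readCounter` are hypothetical names, the summary is modeled as a plain `Map<String, String>`, and the string keys are the literal values behind Iceberg's `SnapshotSummary.TOTAL_POS_DELETES_PROP` and `SnapshotSummary.TOTAL_EQ_DELETES_PROP`.

```java
import java.util.Map;

public class DeleteCounters {
  // Literal values of SnapshotSummary.TOTAL_POS_DELETES_PROP / TOTAL_EQ_DELETES_PROP.
  static final String TOTAL_POS_DELETES = "total-position-deletes";
  static final String TOTAL_EQ_DELETES = "total-equality-deletes";

  // Hypothetical helper: reads a numeric summary entry and treats a missing key as zero,
  // so the caller does not need a containsKey() check per counter.
  static long readCounter(Map<String, String> summary, String key) {
    String value = summary.get(key);
    return value == null ? 0L : Long.parseLong(value);
  }

  public static void main(String[] args) {
    // Assumed scenario: only the positional-delete counter is present in the summary.
    Map<String, String> onlyPositional = Map.of(TOTAL_POS_DELETES, "40");
    System.out.println(readCounter(onlyPositional, TOTAL_POS_DELETES));
    System.out.println(readCounter(onlyPositional, TOTAL_EQ_DELETES));
  }
}
```

With this shape, a summary that carries only one of the two properties still yields a well-defined pair of counters instead of skipping the adjustment entirely.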


aturoczy

Please reject this if you do not agree with my findings.

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

zhangbutao · 2023-05-09T08:45:47Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

+              long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+              if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&
+                  summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) {
+                Long actualRecords =


Just sharing some thoughts.
I am not sure I understand this correctly, but a delete file in Iceberg is also a special kind of data file, and a table scan in the actual execution stage must also read all related delete files.

That is to say, the actual execution still requires scanning more data than the explain output shows.
So I am not sure this PR can produce an optimized plan when an Iceberg table has both data files and delete files.

count(*) should use stats instead of scanning the whole dataset.
To be honest, I don't really understand the purpose of 'total-records' if it doesn't reflect an accurate row count. Insert 100 rows, delete them all, and 'total-records' is still 100?

It looks like, if there are only positional deletes, we could get the accurate count by subtracting “total-position-deletes” from “total-records”.
@zhangbutao do you see any issues with that?

Another alternative would be to push row count stats from Hive to Iceberg summary.
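The subtraction idea above can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: `RowCountSketch` and `accurateRowCount` are hypothetical names, the summary is a plain `Map<String, String>` keyed by the literal property names, and it assumes every positional delete record removes exactly one distinct live row. When equality deletes are present the live row count cannot be derived from the summary alone, so the sketch returns an empty result.

```java
import java.util.Map;
import java.util.OptionalLong;

public class RowCountSketch {
  // Hypothetical helper: returns the live row count only when the summary is sufficient
  // to derive it, i.e. when there are no equality deletes.
  static OptionalLong accurateRowCount(Map<String, String> summary) {
    String total = summary.get("total-records");
    if (total == null) {
      return OptionalLong.empty();
    }
    long equalityDeletes = Long.parseLong(summary.getOrDefault("total-equality-deletes", "0"));
    if (equalityDeletes > 0) {
      // An equality delete can match any number of rows, so no exact count is available.
      return OptionalLong.empty();
    }
    long positionDeletes = Long.parseLong(summary.getOrDefault("total-position-deletes", "0"));
    long liveRows = Long.parseLong(total) - positionDeletes;
    // A negative result would mean the counters are inconsistent; report "unknown".
    return liveRows >= 0 ? OptionalLong.of(liveRows) : OptionalLong.empty();
  }

  public static void main(String[] args) {
    // 100 rows inserted, all 100 removed by positional deletes: 100 - 100 = 0 live rows.
    Map<String, String> allDeleted =
        Map.of("total-records", "100", "total-position-deletes", "100");
    System.out.println(accurateRowCount(allDeleted).getAsLong());

    // With equality deletes in the snapshot, the count is reported as unknown.
    Map<String, String> withEqualityDeletes =
        Map.of("total-records", "100", "total-equality-deletes", "3");
    System.out.println(accurateRowCount(withEqualityDeletes).isPresent());
  }
}
```

Returning `OptionalLong` rather than falling back to `total-records` keeps the "exact" and "unknown" cases separate, which is one possible design choice; a caller could still fall back to the raw counter as an estimate.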

count(*) should use stats instead of scanning the whole dataset.

@deniskuzZ Do you mean we can push down min/max to Iceberg? I think that is beyond the scope of this PR, and it is not so easy. I found some related information: HIVE-27099 (select count(*) from table queries all data), and Spark already pushes min/max down to Iceberg in apache/iceberg#6622, but Spark skips the pushdown when delete files are involved: https://github.com/apache/iceberg/pull/6622/files#diff-66bfda4bda6d505fe3de7db3b4d6b7923b3711b00e2801846dd7325edcdbf65eR224

@zhangbutao do you see any issues with that?

To be honest, I don't know much about Iceberg statistics at the moment, so I can't share any more context.
Maybe after some time I can say more, or implement some stats pushdown in Hive. :)

@zhangbutao, we are already pushing down column stats in puffin format HIVE-27158, however, we are still relying on Iceberg for basic stats.
In case of deletes, it becomes invalid. In this PR we are doing a workaround just for positional deletes use-case until it's fixed on the iceberg side.

I see. One more remark: I think TrinoDB has a good implementation of Puffin stats, so maybe we can borrow some of its design. But I also think a lot of work needs to be done on the Iceberg side.
We can keep following the evolution of Iceberg stats, as this can give us better CBO, pushdown, etc.

deniskuzZ · 2023-05-09T14:27:07Z

iceberg/iceberg-handler/src/test/results/positive/vectorized_iceberg_merge_mixed.q.out

-                        0 Map 1
-                      Statistics: Num rows: 5 Data size: 4320 Basic stats: COMPLETE Column stats: COMPLETE
+                        1 Map 6
+                      Statistics: Num rows: 7 Data size: 6656 Basic stats: COMPLETE Column stats: COMPLETE


Was the row count inaccurate before?

sonarqubecloud · 2023-05-13T00:21:34Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
29 Code Smells

No Coverage information
No Duplication information

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

deniskuzZ

LGTM, minor change requested

simhadri-g · 2023-05-15T10:44:58Z

Thanks @deniskuzZ, @zhangbutao, @aturoczy, @InvisibleProgrammer for the review!

…ptimal plans (Simhadri Govindappa, reviewed by Attila Turoczy, Butao Zhang, Denys Kuzmenko, Zsolt Miskolczi) Closes apache#4301

HIVE-27327 : Iceberg basic stats: Incorrect row count in snapshot sum…

8fcd6a4

…mary leading to unoptimized plans

kgyrtkirk added the tests pending label May 8, 2023

InvisibleProgrammer reviewed May 8, 2023


aturoczy suggested changes May 8, 2023


iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java (outdated, resolved)

kgyrtkirk added tests unstable and removed tests pending labels May 8, 2023

zhangbutao reviewed May 9, 2023


Addressed review comments

7509816

kgyrtkirk added tests pending and removed tests unstable labels May 9, 2023

deniskuzZ reviewed May 9, 2023


kgyrtkirk added tests unstable and removed tests pending labels May 9, 2023

SimhadriG added 2 commits May 10, 2023 01:15

Fix failing tests

a9950f8

Fix failing tests

8c79b58

kgyrtkirk added tests pending tests unstable and removed tests unstable tests pending labels May 11, 2023

Fix failing tests

97d195c

kgyrtkirk added tests pending tests failed and removed tests unstable tests pending tests failed labels May 11, 2023

kgyrtkirk added tests pending tests failed tests unstable and removed tests pending tests failed tests unstable labels May 12, 2023

kgyrtkirk added tests passed and removed tests pending labels May 13, 2023

deniskuzZ reviewed May 15, 2023


iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java (resolved)

deniskuzZ reviewed May 15, 2023


deniskuzZ approved these changes May 15, 2023


deniskuzZ merged commit add340d into apache:master May 15, 2023
