Skip to content

HIVE-27327 : Iceberg basic stats: Incorrect row count in snapshot sum…#4301

Merged
deniskuzZ merged 5 commits intoapache:masterfrom
simhadri-g:HIVE-27327
May 15, 2023
Merged

HIVE-27327 : Iceberg basic stats: Incorrect row count in snapshot sum…#4301
deniskuzZ merged 5 commits intoapache:masterfrom
simhadri-g:HIVE-27327

Conversation

@simhadri-g
Copy link
Member

…mary leading to unoptimized plans

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Copy link
Contributor

@InvisibleProgrammer InvisibleProgrammer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood it correctly, it runs when we query the statistics, not when we write them. I wonder, it is possible to fix it on writes?

if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if onlye one of TOTAL_EQ_DELETES_PROP and TOTAL_POS_DELETES_PROP persists?

Copy link
Contributor

@InvisibleProgrammer InvisibleProgrammer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood it correctly, it runs when we query the statistics, not when we write them. I wonder, it is possible to fix it on writes?

Copy link

@aturoczy aturoczy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reject if you not agree with my findings

long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&
summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) {
Long actualRecords =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just share some my thought.
Not sure if i am understand correctly, the delete file in iceberg is also a special data file, and table scan in actual execution stage also should read all related delete files.

That is to say, the actual execution still requires scanning more data than the explain shows.
So, i am not sure if this PR can be give a optimized plans when iceberg table has both data files and delete files.

Copy link
Member

@deniskuzZ deniskuzZ May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(*) should use stats instead of scanning the whole dataset.
To be honest I don't really understand the purpose of 'total-records' if it doesn't reflect an accurate row count. insert 100 rows, delete all, 'total-records'=100 ???

looks like if there are only positional deletes we could get the accurate count by subtrracting “total-position-deletes” from “total-records”
@zhangbutao do you see any issues with that?

Another alternative would be to push row count stats from Hive to Iceberg summary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(*) should use stats instead of scanning the whole dataset.

@deniskuzZ Do you mean we can push down min/max to iceberg? I think it's beyond this PR scope and it's not so easy. I find some info: HIVE-27099 select count(*) from table queries all data, and Spark has push down min/max to iceberg apache/iceberg#6622, but Spark will skip pushdown if including delete files https://github.com/apache/iceberg/pull/6622/files#diff-66bfda4bda6d505fe3de7db3b4d6b7923b3711b00e2801846dd7325edcdbf65eR224

@zhangbutao do you see any issues with that?

To be honest, i don't know too much about iceberg statistics at the moment so I can't share any more context.
Maybe after some time I can tell more or Implement some stats pushdown in Hive. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao, we are already pushing down column stats in puffin format HIVE-27158, however, we are still relying on Iceberg for basic stats.
In case of deletes, it becomes invalid. In this PR we are doing a workaround just for positional deletes use-case until it's fixed on the iceberg side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Just say something else here. I think TrinoDB has a good implementation about puffin stats, maybe we can refer to some designs from TrinoDB. But, I also think lots of stuff need to be done on the iceberg side.
We can keep looking at the evolution about iceberg stats, as this can give a better cbo, pushdown, etc

0 Map 1
Statistics: Num rows: 5 Data size: 4320 Basic stats: COMPLETE Column stats: COMPLETE
1 Map 6
Statistics: Num rows: 7 Data size: 6656 Basic stats: COMPLETE Column stats: COMPLETE
Copy link
Member

@deniskuzZ deniskuzZ May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was the row count inaccurate before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 29 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

Copy link
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor change requested

@deniskuzZ deniskuzZ merged commit add340d into apache:master May 15, 2023
@simhadri-g
Copy link
Member Author

Thanks @deniskuzZ , @zhangbutao , @aturoczy , @InvisibleProgrammer for the review !

yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023
…ptimal plans (Simhadri Govindappa, reviewed by Attila Turoczy, Butao Zhang, Denys Kuzmenko, Zsolt Miskolczi)

Closes apache#4301
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…ptimal plans (Simhadri Govindappa, reviewed by Attila Turoczy, Butao Zhang, Denys Kuzmenko, Zsolt Miskolczi)

Closes apache#4301
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants