Core: Add dv_count column to PartitionsTable metadata table#16125
Core: Add dv_count column to PartitionsTable metadata table#16125hemanthboyina wants to merge 1 commit into
Conversation
|
pushed test fixes |
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for the PR @hemanthboyina !
This test suite runs the metadata table queries from Spark: TestIcebergSourceTablesBase. Would you mind adding test coverage there and make sure the breaking tests are also fixed.
I haven't thought it through, just thinking loud: Would it cause any incompatibility issues if we introduced the new field as required as in this PR? Maybe not, asking to be on the safe side.
|
thanks for the review @gaborkaszab , added the test changes to TestIcebergSourceTablesBase. I think this should not cause any incompatibility issues, as the partitions metadata table is runtime computed table. I assume the same pattern was used when last_updated_snapshot_id was added to partitions table previously, it was added as new column to the runtime schema without any compatibility issues. |
|
test failures are from different versions of spark, fixed it across |
Those fields were added as optional. That's why I'm asking |
|
Good catch, thanks for pointing that out. I'll change dv_count to optional to be consistent with how last_updated_at and last_updated_snapshot_id were added. Will push the fix |
| this.posDeleteFileCount = 0; | ||
| this.eqDeleteRecordCount = 0L; | ||
| this.eqDeleteFileCount = 0; | ||
| this.dvCount = 0; |
There was a problem hiding this comment.
Other optional fields, aren't initialized here
There was a problem hiding this comment.
Changed to Integer as suggested. Kept initialization to 0 because without it, dvCount += 1 would NPE on first DV encounter.
There was a problem hiding this comment.
Well, the point of changing it to Integer was to have null as default value. But this might be obsolete now, see my general comment on the PR
|
|
||
| @Test | ||
| public void testPartitionsTableDvCountColumn() { | ||
| TableIdentifier tableIdentifier = TableIdentifier.of("db", "partitions_dv_count_test"); |
There was a problem hiding this comment.
Isn't this covered with the other partition table tests?
|
will resolve conflicts shortly |
fb5b433 to
24af896
Compare
|
few changes were missed while rebase, pushed those |
gaborkaszab
left a comment
There was a problem hiding this comment.
Hey @hemanthboyina ,
Thanks again for taking care of these changes. I gave this another thought, and I think I made you doing extra work, I apologize. I checked the history for PartitionsTable and apparently there is no issue with adding new fields as required, as this was the case with total_data_file_size_in_bytes here. I believe the reason for last_updated_at and last_updated_snapshot_id being optional is because they don't have an "initial value" as counters do.
Would you mind taking another look and change the new dv_count field to required and also the representation to primitive int? I'll take another look once that's done.
| this.posDeleteFileCount = 0; | ||
| this.eqDeleteRecordCount = 0L; | ||
| this.eqDeleteFileCount = 0; | ||
| this.dvCount = 0; |
There was a problem hiding this comment.
Well, the point of changing it to Integer was to have null as default value. But this might be obsolete now, see my general comment on the PR
|
thanks for the review @gaborkaszab , pushed the changes, can you please check |
524dc25 to
94492bb
Compare
94492bb to
2a3c7dc
Compare
Add a dv_count column to the partitions metadata table to expose deletion vector counts per partition. Users querying from table.partitions can now see DV density per partition, which is useful for understanding which partitions have accumulated row-level deletes.