Core: Add total data size to Partitions table #7920

hsiang-c · 2023-06-27T03:53:16Z

Closes #7896

This PR adds total_data_file_size_in_bytes to the Partitions table.

docs/flink-queries.md

docs/spark-queries.md

core/src/main/java/org/apache/iceberg/PartitionsTable.java

ajantha-bhat · 2023-06-27T13:26:42Z

cc: @szehon-ho, as you mostly work on and review this area.

szehon-ho

Thanks, yeah, I was chatting with @hsiang-c about this earlier :)

core/src/main/java/org/apache/iceberg/PartitionsTable.java

szehon-ho · 2023-06-27T17:13:34Z

core/src/main/java/org/apache/iceberg/PartitionsTable.java

@@ -82,7 +82,12 @@ public class PartitionsTable extends BaseMetadataTable {
                10,
                "last_updated_snapshot_id",
                Types.LongType.get(),
-                "Id of snapshot that last updated this partition"));
+                "Id of snapshot that last updated this partition"),
+            Types.NestedField.required(


This topic always comes up, but what do you think of the position? @ajantha-bhat @dramaticlly. Maybe it's better after file_count? (That way we have 3 columns for data, pos_delete, and eq_delete.)

Yeah, I think what Szehon said makes sense, given that the last 2 columns are optional and the new column is required.

Agreed. I have already kept it beside file_count for partition stats.

Note: when reordering here, we should not modify the field id, to maintain compatibility.
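
A minimal sketch of the compatibility note above, assuming readers resolve columns by field id rather than by position. It does not use Iceberg's classes: `FieldIdDemo` and its nested `Field` are hypothetical stand-ins, and only the ids and names mentioned in this thread (3, 10, 11) are used.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class FieldIdDemo {
  // Hypothetical stand-in for a schema column; this is not Iceberg's Types.NestedField.
  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  // Resolve columns by field id, as an id-based reader would.
  static Map<Integer, String> namesById(List<Field> schema) {
    return schema.stream().collect(Collectors.toMap(f -> f.id, f -> f.name));
  }

  // New column appended at the end of the schema.
  static List<Field> appended() {
    return List.of(
        new Field(3, "file_count"),
        new Field(10, "last_updated_snapshot_id"),
        new Field(11, "total_data_file_size_in_bytes"));
  }

  // New column moved next to file_count; every field keeps its id.
  static List<Field> reordered() {
    return List.of(
        new Field(3, "file_count"),
        new Field(11, "total_data_file_size_in_bytes"),
        new Field(10, "last_updated_snapshot_id"));
  }

  public static void main(String[] args) {
    // Positions differ between the two layouts, but id-based lookups are identical.
    System.out.println(namesById(appended()).equals(namesById(reordered())));
    System.out.println(namesById(reordered()).get(11));
  }
}
```

Only the position of the new column changes between the two layouts. Anything that refers to a column by position would still see a difference, which is the concern raised later in this review.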

dramaticlly · 2023-06-27T22:38:36Z

spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

+                "total_data_size_in_bytes",
+                StreamSupport.stream(
+                        table.currentSnapshot().addedDataFiles(table.io()).spliterator(), false)
+                    .mapToLong(DataFile::fileSizeInBytes)
+                    .sum())


It's probably worth extracting a variable instead of inlining the computation.

Also, I saw you added coverage for the unpartitioned table only. Shall we also add a test for a partitioned table to make sure its data size in bytes matches for each partition?
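
A rough sketch of both suggestions, assuming nothing about the actual test code: the summing is extracted into a helper, and a second helper computes the expected total per partition. `PartitionSizeDemo` and `FileInfo` are hypothetical stand-ins. The only detail taken from the diff is that a data file exposes `fileSizeInBytes()`.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PartitionSizeDemo {
  // Hypothetical stand-in for a data file; not Iceberg's DataFile interface.
  static final class FileInfo {
    private final String partition;
    private final long fileSizeInBytes;

    FileInfo(String partition, long fileSizeInBytes) {
      this.partition = partition;
      this.fileSizeInBytes = fileSizeInBytes;
    }

    String partition() {
      return partition;
    }

    long fileSizeInBytes() {
      return fileSizeInBytes;
    }
  }

  // Extracted helper: total size of a group of files (the unpartitioned case).
  static long totalSizeInBytes(List<FileInfo> files) {
    return files.stream().mapToLong(FileInfo::fileSizeInBytes).sum();
  }

  // Expected size per partition, to compare against each row of a partitions table.
  static Map<String, Long> totalSizeByPartition(List<FileInfo> files) {
    return files.stream()
        .collect(
            Collectors.groupingBy(
                FileInfo::partition, Collectors.summingLong(FileInfo::fileSizeInBytes)));
  }

  public static void main(String[] args) {
    List<FileInfo> files =
        List.of(new FileInfo("p=1", 100L), new FileInfo("p=1", 250L), new FileInfo("p=2", 50L));
    System.out.println(totalSizeInBytes(files));
    System.out.println(totalSizeByPartition(files));
  }
}
```

A partitioned-table test could then compare each row's size column against the matching entry of the per-partition map, instead of against a single table-wide sum.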

@dramaticlly Thank you for your feedback.

Yes, we should add tests for partitioned tables. I was able to do it for testPartitionsTable and testPartitionsTableDeleteStats, but not for testPartitionsTableLastUpdatedSnapshot.

Will dig into it more today.

@dramaticlly I think I fixed testPartitionsTableLastUpdatedSnapshot, please take a look, thanks!

szehon-ho

Looks good, just some style nits

szehon-ho · 2023-06-29T18:14:55Z

core/src/main/java/org/apache/iceberg/PartitionsTable.java

+                11,
+                "total_data_file_size_in_bytes",
+                Types.LongType.get(),
+                "Total bytes of data files in a partition"),


nit: 'total size in bytes'

szehon-ho · 2023-06-29T18:24:20Z

spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

@@ -2028,4 +2042,10 @@ public static Dataset<Row> selectNonDerived(Dataset<Row> metadataTable) {
  public static Types.StructType nonDerivedSchema(Dataset<Row> metadataTable) {
    return SparkSchemaUtil.convert(selectNonDerived(metadataTable).schema()).asStruct();
  }
+
+  private long getDataFileSizeInBytes(Iterable<DataFile> dataFiles) {


nit: we can remove 'get' (Iceberg's code style guidelines are a bit different: https://iceberg.apache.org/contribute/#method-naming)

szehon-ho · 2023-06-29T18:25:33Z

core/src/main/java/org/apache/iceberg/PartitionsTable.java

@@ -275,6 +283,7 @@ static class Partition {
    private int eqDeleteFileCount;
    private Long lastUpdatedMs;
    private Long lastUpdatedSnapshotId;
+    private long dataFileSizeInBytes;


nit: can we move this after dataFileCount? (as it's part of the 'dataFile' group)

Speaking of which, @szehon-ho, do you think we should do the same in the Schema and move this new field with id 11 to be right after file_count (field id 3)? It seems to fit into the same dataFile group, but there might be some concern that references by position would break.

Oh yeah, I think that was the consensus from the other comment: #7920 (comment). @hsiang-c, do you think we can move it?

@szehon-ho Sure thing!

szehon-ho · 2023-06-30T07:56:09Z

core/src/main/java/org/apache/iceberg/PartitionsTable.java

+                11,
+                "total_data_file_size_in_bytes",
+                Types.LongType.get(),
+                "Total size in bytes"),


Ah sorry, in my previous comment I meant just change 'total bytes' => 'total size in bytes', but the rest was ok.

So can we restore the original end of the sentence, where you talked about data files?

'Total size in bytes of data files' (maybe 'in a partition' was redundant there)

szehon-ho · 2023-06-30T18:47:27Z

docs/flink-queries.md

@@ -436,7 +436,7 @@ SELECT * FROM prod.db.table$partitions;
 | {20211002, 10} | 1            | 1          | 0       |

 Note:
-For unpartitioned tables, the partitions table will contain only the record_count and file_count columns.
+For unpartitioned tables, the partitions table will contain only the record_count, file_count, position_delete_record_count, position_delete_file_count, equality_delete_record_count, equality_delete_file_count, last_updated_ms, last_updated_snapshot_id and total_data_file_size_in_bytes columns.


Should we do this in another PR? I feel we need to edit the table above as well.

Also, I think we can just say 'For unpartitioned tables, the partitions table will not contain the partition and spec_id fields', as the list of fields we do support is becoming too big.

Agreed, we can follow up with a doc PR after this is merged.

szehon-ho · 2023-06-30T18:48:28Z

core/src/main/java/org/apache/iceberg/PartitionsTable.java

@@ -73,6 +73,8 @@ public class PartitionsTable extends BaseMetadataTable {
                "equality_delete_file_count",
                Types.IntegerType.get(),
                "Count of equality delete files"),
+            Types.NestedField.required(
+                11, "total_data_file_size_in_bytes", Types.LongType.get(), "Total size in bytes"),


This is still not changed back? It should be "Total size in bytes of data files". Sorry if it's still pending.

Also, let's move it up to between 3 and 5, since it belongs to the data file group.

spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

szehon-ho · 2023-07-06T18:04:10Z

spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

@@ -1469,12 +1483,20 @@ public void testPartitionsTableLastUpdatedSnapshot() {
        new GenericRecordBuilder(
            AvroSchemaUtil.convert(
                partitionsTable.schema().findType("partition").asStructType(), "partition"));
+
+    List<DataFile> dataFilesFromFirstCommit = listDataFilesFromCommitId(table, firstCommitId);


Would it work to add a method List<DataFile> dataFiles(table) that returns all the data files, so we don't have to add data files from both commits?

I did this before here: https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewritePositionDeleteFilesAction.java#L682

(maybe we can do it without column stats here, to be shorter).

If we do this, we can even extract to TestHelpers in a later PR.

@szehon-ho Thanks for pointing that out! Adopted it.

> If we do this, we can even extract to TestHelpers in a later PR.

+1, let's do the extraction in a later PR.

- https://github.com/apache/iceberg/pull/7105/files

szehon-ho · 2023-07-07T06:26:33Z

spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

+    return Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+  }
+
+  private void assertDataFilePartitions(List<DataFile> dataFiles, int[] expectedPartitionIds) {


Nit: we can put back the size check.

spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

This reverts commit e3dbd94.

szehon-ho · 2023-07-07T17:35:27Z

Merged, thanks a lot @hsiang-c for the first contribution, and thanks @ajantha-bhat and @dramaticlly for the additional reviews!

szehon-ho · 2023-07-07T17:36:08Z

(Feel free to make follow-up PRs to update the docs)

github-actions bot added core docs spark labels Jun 27, 2023

hsiang-c added 17 commits July 7, 2023 11:10

Add total_data_size_in_bytes to Partitions table

b818218

Return total_data_size_in_bytes for unpartitioned partitions table

155f6fb

Style fix

943b52d

Required columns first

de3948e

Align column name w/ partition stats spec

2328796

- https://github.com/apache/iceberg/pull/7105/files

Extract datafile size summing to a method

647b2a2

Test 'total_data_file_size_in_bytes' for partitioned table

630fc6c

Deleted files are excluded from size stat

7289c46

Renamed doc/method based on review comments

e559b2b

Sum data file size after rewriting manifest

77001d6

Fix style

28d9d58

Sync implementation and flink/spark docs

1cd14a9

Fixed field doc according to comments

7e35379

Group data file stats

fdf508d

Suppress MethodLength warnings

2f52907

Make long explicit

b340540

Revert doc change. Will fix it in later PR.

61de57d

hsiang-c added 6 commits July 7, 2023 11:55

Extract assertions to a helper method

92c66af

Parameterize helper method

e5fb59a

Fix statement

6067cfd

Rename methods based on feedbacks

4641684

Switch to Guava's Lists

32074d6

Check partition id for all data files

f29a865

hsiang-c force-pushed the partitions_data_size branch from 0373042 to f29a865 Compare July 7, 2023 05:22

Switch to array impl

e3dbd94

szehon-ho reviewed Jul 7, 2023

szehon-ho approved these changes Jul 7, 2023

hsiang-c added 3 commits July 7, 2023 14:37

Revert "Switch to array impl"

88b8504

This reverts commit e3dbd94.

Move size assertion into test helper

78a4c65

Removed redundant size

e0b7f87

szehon-ho merged commit 025cdf0 into apache:master Jul 7, 2023
41 checks passed

hsiang-c deleted the partitions_data_size branch July 9, 2023 02:31

This was referenced Jul 9, 2023

Docs: Update Partitions table in Flink/Spark doc #8021

Merged

Spark: Consolidate duplicated test methods to TestHelpers #8024

Merged
