AddPartitionSpec: A new way to set new partition specs #10737

shanielh · 2024-07-21T11:47:39Z

Most query engines use SQL to update a table's partition specification.

The SQL usually looks like: ALTER TABLE table_name PARTITION BY a, b, c, ...

The UpdatePartitionSpec interface gets in the way of this flow when the user specifies a full partition definition, because the command implementation has to remove or rename the old fields and add the new fields that didn't exist before.

Instead, we can introduce a new command to Iceberg tables that adds a new partition spec from scratch.
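
The bookkeeping described above, where a command must work out which fields to drop and which to add when the user supplies a complete partition definition, can be sketched without any Iceberg dependency. This is a minimal illustration, not code from the PR: the class `SpecDiffSketch`, its methods, and the string form of the transforms are all hypothetical, and renames are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only: transforms are modeled as plain strings.
final class SpecDiffSketch {

  // Fields that exist in the current spec but are absent from the requested definition.
  static List<String> toRemove(List<String> current, List<String> requested) {
    Set<String> result = new LinkedHashSet<>(current);
    result.removeAll(requested);
    return new ArrayList<>(result);
  }

  // Fields that the requested definition contains but the current spec does not.
  static List<String> toAdd(List<String> current, List<String> requested) {
    Set<String> result = new LinkedHashSet<>(requested);
    result.removeAll(current);
    return new ArrayList<>(result);
  }

  public static void main(String[] args) {
    List<String> current = List.of("bucket(8, id)", "identity(region)");
    List<String> requested = List.of("identity(region)", "day(ts)");
    System.out.println("remove: " + toRemove(current, requested));
    System.out.println("add: " + toAdd(current, requested));
  }
}
```

An engine would then call removeField for each entry in the first list and addField for each entry in the second. That diffing step is what an API that starts from a given spec is meant to make unnecessary.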

RussellSpitzer · 2024-07-23T20:57:51Z

api/src/main/java/org/apache/iceberg/Transaction.java

@@ -44,6 +44,13 @@ public interface Transaction {
   */
  UpdatePartitionSpec updateSpec();

+  /**
+   * Create a new {@link AddPartitionSpec} to alter the partition spec of this table.


This would only be able to add a new partition spec to the table, correct? We can't actually alter any existing specs.

RussellSpitzer · 2024-07-23T20:58:41Z

api/src/main/java/org/apache/iceberg/Table.java

@@ -183,6 +183,14 @@ default IncrementalChangelogScan newIncrementalChangelogScan() {
   */
  UpdatePartitionSpec updateSpec();

+  /**
+   * Create a new {@link AddPartitionSpec} to alter the partition spec of this table and commit the


Same comment as below on the Javadoc.

RussellSpitzer · 2024-07-23T21:00:59Z

api/src/main/java/org/apache/iceberg/AddPartitionSpec.java

+ * <p>When committing, these changes will be applied to the current table metadata. Commit conflicts
+ * will not be resolved and will result in a {@link CommitFailedException}.
+ */
+public interface AddPartitionSpec extends PendingUpdate<PartitionSpec> {


Instead of adding a completely new API, couldn't we add a method to UpdatePartitionSpec that sets the starting point of the spec to unpartitioned?

Something like

updatePartitionSpec() .fromUnpartitioned() .add... .add..

?

Yeah I think I can do that. Maybe from(spec)?

I reverted the old commit and added a new one with an implementation of fromSpec(..), plus a test. Please let me know if there's anything else I need to do for this to get merged 😄

Tests are still failing at the moment; please check the failed runs below.

~~OK, I think it should pass now. I've added an ignore to revapi (after rebasing on apache/iceberg); I hope that's what I should have done 😄~~

Moving the new method in the interface to the bottom worked without modifying revapi

RussellSpitzer · 2024-07-29T20:00:58Z

core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java

+
+  @Override
+  public UpdatePartitionSpec fromSpec(PartitionSpec partitionSpec) {
+    // Clear all changes


Implementation-wise, I would rather we return a new object here. Couldn't we add a constructor

BaseUpdatePartitionSpec(TableOperations ops, PartitionSpec spec)

That way we could keep all of the variables above final?

I'm also wondering if we may have issues here with re-adding a transform that already exists in another spec. I think currently that's handled by re-using the existing transform.

Example

Add Identity(x)
Remove Identity(x)
Add Identity(x)

In the current code there would only ever be one Identity(x) transform. Would we be able to maintain that if we did

Add Identity(x)
From(Unpartitioned)
Add Identity(x)

The above code should be a noop, right?

> Implementation-wise, I would rather we return a new object here. Couldn't we add a constructor
>
> BaseUpdatePartitionSpec(TableOperations ops, PartitionSpec spec)
>
> That way we could keep all of the variables above final?

Modified and fixed revApi

> I'm also wondering if we may have issues here with re-adding a transform that already exists in another spec. I think currently that's handled by re-using the existing transform.
>
> Example
>
> Add Identity(x)
> Remove Identity(x)
> Add Identity(x)
>
> In the current code there would only ever be one Identity(x) transform. Would we be able to maintain that if we did
>
> Add Identity(x)
> From(Unpartitioned)
> Add Identity(x)
>
> The above code should be a noop, right?

I've added more specs, I hope that would answer your question.

Calling Add Identity(X), and then Remove Identity(X) within the same UpdatePartitionSpec throws an exception:

Cannot delete newly added field: 1001: id: identity(1)
java.lang.IllegalArgumentException: Cannot delete newly added field: 1001: id: identity(1)
    at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:220)
    at org.apache.iceberg.BaseUpdatePartitionSpec.removeField(BaseUpdatePartitionSpec.java:247)

> Calling Add Identity(X), and then Remove Identity(X) within the same UpdatePartitionSpec throws an exception:
>
> Cannot delete newly added field: 1001: id: identity(1)
> java.lang.IllegalArgumentException: Cannot delete newly added field: 1001: id: identity(1)
>     at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:220)
>     at org.apache.iceberg.BaseUpdatePartitionSpec.removeField(BaseUpdatePartitionSpec.java:247)

Not in the same command.
Two commands

Command 1
Add Identity(id)
commit()

Command 2
From unpartitioned
Add Identity(id)
commit() // This should fail I think? or be a noop?

Those seem to be the specs that @shanielh added, resulting in a noop.

@RussellSpitzer any chance to look at this? I've added the specs you wanted in the last iteration. Thanks 🙏

RussellSpitzer · 2024-08-05T03:45:39Z

api/src/main/java/org/apache/iceberg/UpdatePartitionSpec.java

@@ -29,6 +29,7 @@
 * will not be resolved and will result in a {@link CommitFailedException}.
 */
 public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> {
+


unrelated white-space change

RussellSpitzer · 2024-08-05T03:49:54Z

core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java

    this.schema = spec.schema();
    this.nameToField = indexSpecByName(spec);
    this.transformToField = indexSpecByTransform(spec);
-    this.lastAssignedPartitionId = base.lastAssignedPartitionId();
+    this.lastAssignedPartitionId =


I'm not sure I understand the change here. Why would lastAssignedPartitionId be drawn from the "spec"? Shouldn't this always be the lastAssignedPartitionId?

As the spec requires:

> In v2, partition field IDs must be explicitly tracked for each partition field. New IDs are assigned based on the last assigned partition ID in table metadata. In v1, partition field IDs were not tracked, but were assigned sequentially starting at 1000 in the reference implementation.

As long as we always evolved the latest spec, base.lastAssignedPartitionId worked for both v1 and v2. Now that we can evolve from any spec (which might not be the latest), it no longer works and requires a branch in the code.

Reverting this would fail some of the new tests:

testReAddFieldUsingFromUnpartitionedSpec: 🔴 formatVersion = 1, ✅ formatVersion = 2, ✅ formatVersion = 3

testCommitFromSpec: 🔴 formatVersion = 1, ✅ formatVersion = 2, ✅ formatVersion = 3

I did modify the branch to condition on formatVersion == 1 instead of formatVersion == 2, for better forward compatibility.
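
The branch mentioned above can be pictured with a small stand-alone sketch. It is not the PR's code: the diff line that assigns lastAssignedPartitionId is truncated in this thread, so the v1 rule below (take the highest field ID in the spec being evolved) is an assumption, as are the class and method names. The only parts taken from the thread are the starting ID of 1000 and the choice to branch on formatVersion == 1.

```java
import java.util.List;

// Hypothetical sketch of a format-version branch for choosing the partition ID counter.
final class PartitionIdSketch {

  // First partition field ID in the v1 reference implementation, per the quoted spec text.
  static final int FIRST_PARTITION_FIELD_ID = 1000;

  // Returns the counter from which the next partition field ID is derived.
  static int lastAssignedId(
      int formatVersion, List<Integer> specFieldIds, int tableLastAssignedId) {
    if (formatVersion == 1) {
      // Assumption: in v1 the counter follows the spec being evolved,
      // so an empty spec starts again just below 1000.
      int max = FIRST_PARTITION_FIELD_ID - 1;
      for (int id : specFieldIds) {
        max = Math.max(max, id);
      }
      return max;
    }
    // v2 and later: continue from the counter tracked in table metadata.
    return tableLastAssignedId;
  }

  static int nextId(int formatVersion, List<Integer> specFieldIds, int tableLastAssignedId) {
    return lastAssignedId(formatVersion, specFieldIds, tableLastAssignedId) + 1;
  }

  public static void main(String[] args) {
    // Evolving from an empty spec on a table whose metadata counter is 1001.
    System.out.println("v1 next id: " + nextId(1, List.of(), 1001));
    System.out.println("v2 next id: " + nextId(2, List.of(), 1001));
  }
}
```

Checking for formatVersion == 1 rather than formatVersion == 2 means any later format version falls through to the metadata-tracked path.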

I'm not sure I follow, and I don't think this is correct. If base.lastMetadataAssigned doesn't work, then the logic in the code for using this value and re-using existing fields is incorrect.

In V1 and V2 it should be the same.

V1:

Add a transform identity(x) and get 1000 as the id. Spec 1 = (1000: Identity(x))
Remove identity(x); the transform is changed to void(x) in the new spec. Spec 2 = (1000: Void())
Add identity(x). This should reset the current spec to Spec 1.
Add identity(y). Spec 3 = (1000: Identity(x), 1001: Identity(y))

V2:

Add a transform identity(x) and get 1000 as the id. Spec 1 = (1000: Identity(x))
Remove identity(x); the transform is changed to void(x) in the new spec. Spec 2 = ()
Add identity(x). This should reset the current spec to Spec 1.
Add identity(y). Spec 3 = (1000: Identity(x), 1001: Identity(y))
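
The outcome expected in both walkthroughs above, where re-adding identity(x) gets ID 1000 back and identity(y) then gets 1001, amounts to looking up a transform before allocating a new ID. The sketch below models only that lookup. The class, its map, and the string keys are hypothetical, and it ignores void transforms and spec IDs entirely, so it should not be read as how BaseUpdatePartitionSpec is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: one table-wide index from transform to partition field ID.
final class FieldIdReuseSketch {
  private final Map<String, Integer> idsByTransform = new HashMap<>();
  private int lastAssignedId = 999; // so the first allocated ID is 1000

  // Returns the ID already associated with the transform, or allocates the next one.
  int idFor(String transform) {
    Integer existing = idsByTransform.get(transform);
    if (existing != null) {
      return existing;
    }
    lastAssignedId += 1;
    idsByTransform.put(transform, lastAssignedId);
    return lastAssignedId;
  }

  public static void main(String[] args) {
    FieldIdReuseSketch ids = new FieldIdReuseSketch();
    System.out.println("add identity(x): " + ids.idFor("identity(x)"));
    // Removing the field from the current spec does not erase the index entry,
    // so adding it again finds the same ID.
    System.out.println("re-add identity(x): " + ids.idFor("identity(x)"));
    System.out.println("add identity(y): " + ids.idFor("identity(y)"));
  }
}
```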

OK, got it, I think. Can you check whether the test is testing things as expected?

RussellSpitzer · 2024-08-05T03:51:15Z

core/src/test/java/org/apache/iceberg/TestTableUpdatePartitionSpec.java

+  public void testCommitFromSpec() {
+    table.updateSpec().addField(bucket("id", 8)).commit();
+
+    // Evolve the spec


I would avoid generic comments like this if possible

RussellSpitzer · 2024-08-05T03:53:02Z

core/src/test/java/org/apache/iceberg/TestTableUpdatePartitionSpec.java

+    // Restart the spec
+    table
+        .updateSpec()
+        .fromSpec(PartitionSpec.builderFor(table.schema()).build())


Shouldn't this just be PartitionSpec.unpartitioned()?

This won't work as PartitionSpec.unpartitioned() has an empty schema, and the line after (.addField(bucket("data", 16))) would throw: Cannot find field 'data' in struct: struct<>

RussellSpitzer · 2024-08-05T03:55:29Z

core/src/test/java/org/apache/iceberg/TestTableUpdatePartitionSpec.java

+        .commit();
+
+    V1Assert.assertEquals(
+        "Should soft delete id and data buckets",


Isn't this incorrect? Don't we keep the id transform?

RussellSpitzer · 2024-08-05T03:56:02Z

core/src/test/java/org/apache/iceberg/TestTableUpdatePartitionSpec.java

+        table.spec());
+
+    V2Assert.assertEquals(
+        "Should hard delete id and data buckets",


Same comment as above, we should not delete the id bucket

This method makes it possible to evolve a partition spec that isn't the latest table spec. A good use for it is implementing DDL such as ALTER TABLE table_name PARTITION BY a,b,c. Until now, such an implementation had to remove the old partition fields that don't exist in the new partition spec and add the partition fields that don't exist in the old one. Now you can use fromSpec(PartitionSpec.builderFor(table.schema()).build()) and add the partition fields the user requested without referring to the latest table partition spec.

Changed the way fromSpec works, to use the latest spec but remove / add fields from the given partitionSpec

github-actions · 2024-11-10T00:16:09Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions bot added API core labels Jul 21, 2024

RussellSpitzer reviewed Jul 23, 2024

shanielh force-pushed the feature/add-partition-spec branch 6 times, most recently from 9e537f2 to 2e9d60f Compare July 28, 2024 06:24

RussellSpitzer reviewed Jul 29, 2024

shanielh force-pushed the feature/add-partition-spec branch 2 times, most recently from ae8032a to b8b117c Compare July 31, 2024 14:07

RussellSpitzer reviewed Aug 5, 2024

shanielh force-pushed the feature/add-partition-spec branch from b8b117c to 1065bf3 Compare August 5, 2024 05:32

fixup! UpdatePartitionSpec: Added fromSpec method

25fa405

Changed the way fromSpec works, to use the latest spec but remove / add fields from the given partitionSpec

shanielh force-pushed the feature/add-partition-spec branch from aa2b8be to 25fa405 Compare August 5, 2024 06:58

github-actions bot added the stale label Nov 10, 2024
