-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AddPartitionSpec: A new way to set new partition specs #10737
base: main
Are you sure you want to change the base?
Conversation
@@ -44,6 +44,13 @@ public interface Transaction { | |||
*/ | |||
UpdatePartitionSpec updateSpec(); | |||
|
|||
/** | |||
* Create a new {@link AddPartitionSpec} to alter the partition spec of this table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would only be able to add a new Partition Spec to the table correct? We can't actually alter any existing specs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
@@ -183,6 +183,14 @@ default IncrementalChangelogScan newIncrementalChangelogScan() { | |||
*/ | |||
UpdatePartitionSpec updateSpec(); | |||
|
|||
/** | |||
* Create a new {@link AddPartitionSpec} to alter the partition spec of this table and commit the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as below on Java Doc
* <p>When committing, these changes will be applied to the current table metadata. Commit conflicts | ||
* will not be resolved and will result in a {@link CommitFailedException}. | ||
*/ | ||
public interface AddPartitionSpec extends PendingUpdate<PartitionSpec> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding a fully new API couldn't we just add an API to UpdatePartitionSpec that just sets the starting point of the spec as unpartitioned?
Something like
updatePartitionSpec()
.fromUnpartitioned()
.add...
.add..
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think I can do that. Maybe from(spec)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the old commit and added a new one with an implementation of fromSpec(..)
, and also added a test. PLMK if there's anything else I need to do for this to get merged 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are still failing at the moment, please check out the failed runs below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think it should pass now, I've an ignore to the revapi (after rebasing on apache/iceberg), I hope that what I should have done 😄
Moving the new method in the interface to the bottom worked without modifying revapi
9e537f2
to
2e9d60f
Compare
|
||
@Override | ||
public UpdatePartitionSpec fromSpec(PartitionSpec partitionSpec) { | ||
// Clear all changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation wise, I would rather we return a new object here? Couldn't we add a constructor
BaseUpdatePartitionSpec(TableOperations ops, PartitionSpec spec)
That way we could keep all of the variables above final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering if we may have issues here with re-adding a transform that already exists in another spec. I think currently that's handled by re-using the existing transform.
Example
Add Identity (x)
Remove Identity (x)
Add Identity (x)
In the current code there would only ever be one Identity(x) transform. Would we be able to maintain that if we did
Add Identity(x)
From(Unpartitioned)
Add Identity(x)
The above code should be a noop right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation wise, I would rather we return a new object here? Couldn't we add a constructor
BaseUpdatePartitionSpec(TableOperations ops, PartitionSpec spec)
That way we could keep all of the variables above final?
Modified and fixed revApi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering if we may have issues here with re-adding a transform that already exists in another spec. I think currently that's handled by re-using the existing transform.
Example
Add Identity (x) Remove Identity (x) Add Identity (x)
In the current code there would only ever be one Identity(x) transform. Would we be able to maintain that if we did
Add Identity(x) From(Unpartitioned) Add Identity(x)
The above code should be a noop right?
I've added more specs, I hope that would answer your question.
Calling Add Identity(X), and then Remove Identity(X) within the same UpdatePartitionSpec throws an exception:
Cannot delete newly added field: 1001: id: identity(1)
java.lang.IllegalArgumentException: Cannot delete newly added field: 1001: id: identity(1)
at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:220)
at org.apache.iceberg.BaseUpdatePartitionSpec.removeField(BaseUpdatePartitionSpec.java:247)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling Add Identity(X), and then Remove Identity(X) within the same UpdatePartitionSpec throws an exception:
Cannot delete newly added field: 1001: id: identity(1) java.lang.IllegalArgumentException: Cannot delete newly added field: 1001: id: identity(1) at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:220) at org.apache.iceberg.BaseUpdatePartitionSpec.removeField(BaseUpdatePartitionSpec.java:247)
Not in the same command.
Two commands
Command 1
Add Identity(id)
commit()
Command 2
From unpartitioned
Add Identity(id)
commit() // This should fail I think? or be a noop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those seem to be the specs that @shanielh added, resulting in a noop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer any chance to look at this? I've added the specs you wanted in the last iteration. Thanks 🙏
ae8032a
to
b8b117c
Compare
@@ -29,6 +29,7 @@ | |||
* will not be resolved and will result in a {@link CommitFailedException}. | |||
*/ | |||
public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated white-space change
this.schema = spec.schema(); | ||
this.nameToField = indexSpecByName(spec); | ||
this.transformToField = indexSpecByTransform(spec); | ||
this.lastAssignedPartitionId = base.lastAssignedPartitionId(); | ||
this.lastAssignedPartitionId = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the change here, why would lastAssignedPartitionId be drawn from the "spec"? Shouldn't this always be the lastAssignedPartitionId?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the spec requires:
In v2, partition field IDs must be explicitly tracked for each partition field. New IDs are assigned based on the last assigned partition ID in table metadata.
In v1, partition field IDs were not tracked, but were assigned sequentially starting at 1000 in the reference implementation.
Since we always evolve the latest spec, base.lastAssignedPartitionId
would work for both v1 and v2, but now when we're able to evolve from any spec (which might be non-latest), it doesn't work and requires a branch in the code.
Reverting this would fail some of the new tests:
testReAddFieldUsingFromUnpartitionedSpec
- 🔴
formatVersion = 1
- ✅
formatVersion = 2
- ✅
formatVersion = 3
- 🔴
testCommitFromSpec
- 🔴
formatVersion = 1
- ✅
formatVersion = 2
- ✅
formatVersion = 3
- 🔴
I did modified the branch to condition on formatVersion == 1
instead of formatVersion == 2
for better forward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow and I don't think this is correct. I think if base.lastMetadataAssigned doesn't work then the logic in the code is incorrect for using this value/ re-using existing fields.
In V1 and V2 it should be the same.
V1:
Add a transform identity(x) and get 1000 as the id
Spec 1 = (1000: Identity(x))
Remove identity(x), the transform is changed to void(x) in the new spec.
Spec 2 = (1000: Void())
Add identity(x)
This should reset current spec to Spec 1
Add identity (y)
Spec 3 = (1000: Identity (x), 1001: Identity(y))
V2
Add a transform identity(x) and get 1000 as the id
Spec 1 = (1000: Identity(x))
Remove identity(x), the transform is changed to void(x) in the new spec.
Spec 2 = ()
Add identity(x)
This should reset current spec to Spec 1
Add identity(y)
Spec 3 = (1000: Identity(x), 1001: identity (y))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it, I think, can you check if the test is testing things as expected?
public void testCommitFromSpec() { | ||
table.updateSpec().addField(bucket("id", 8)).commit(); | ||
|
||
// Evolve the spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid generic comments like this if possible
// Restart the spec | ||
table | ||
.updateSpec() | ||
.fromSpec(PartitionSpec.builderFor(table.schema()).build()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this just be PartitionSpec.unpartitioned()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work as PartitionSpec.unpartitioned()
has an empty schema, and the line after (.addField(bucket("data", 16))
) would throw: Cannot find field 'data' in struct: struct<>
.commit(); | ||
|
||
V1Assert.assertEquals( | ||
"Should soft delete id and data buckets", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this incorrect? Don't we keep the id transform?
table.spec()); | ||
|
||
V2Assert.assertEquals( | ||
"Should hard delete id and data buckets", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above, we should not delete the id bucket
This method allows to evolve a partition spec which isn't the latest table spec. A good usage for this would be for implementations of DDL like ALTER TABLE table_name PARTITION BY a,b,c The implementation up to now would have to remove old partition fields that aren't exist in the new partition spec and to add partition fields that doesn't exist in the old partition fields. Now you can use fromSpec(PartitionSpec.builderFor(table.schema()).build()) and add partition fields as requested from the user without refering to the latest table partition spec.
b8b117c
to
1065bf3
Compare
Changed the way fromSpec works, to use the latest spec but remove / add fields from the given partitionSpec
aa2b8be
to
25fa405
Compare
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
Most of the query engines use SQL to update a partition specification of a table.
The SQL usually look like: alter table partition by a, b, c, ...
The interface of UpdatePartitionSpec is interrupting a clear flow of setting a new partition spec when the user specifies a full partition definition, because it requires the implementation of the command to remove / rename old fields and add new fields that didn't exist.
Instead of that, we can introduce a new command to Iceberg Tables that adds a new partition spec from scratch.