Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core, API: View Version implementation #6861

Merged
merged 1 commit into from
Mar 22, 2023

Conversation

amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Feb 16, 2023

View Version parser implementation

Co-authored-by: John Zhuge jzhuge@apache.org

CC @jzhuge @rdblue @nastra @danielcweeks @jackye1995 Let me know your thoughts!

@jackye1995 jackye1995 self-requested a review February 16, 2023 18:17
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few small things to address, but other than that it LGTM

@amogh-jahagirdar amogh-jahagirdar force-pushed the view-version branch 2 times, most recently from 03783c8 to baf9be2 Compare February 22, 2023 00:49
@@ -43,8 +47,7 @@ public interface ViewVersion {
long timestampMillis();

/**
* Return the version summary such as the name of the operation that created that version of the
* view
* Return a string map of summary data for the operation that produced this view version.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the data model itself, I think we should follow what Snapshot does where summary returns all the fields other than the operation itself. We have a separate operation method below for the operation. When we serialize the version, the operation will get put in the summary of the version. Let me know if this makes sense @nastra @jackye1995 @rdblue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid having a change of spec here, the information in summary are just FYI purpose, I don't think we need to assign a semantic meaning to operations now, we can always add later if it becomes necessary. But for views I don't see anything other than CREATE and REPLACE, but it can also be identified directly through the version ID, ID=1 means CREATE, everything else are REPLACE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion here, there's no change in spec here, it's a change in the data model layer but since there was no implementation until now we have some flexibility to change if desired.

We'll still serialize the operation in the summary map when writing the metadata file, so the spec is maintained.

When we deserialize into the in memory representation, the summary map will not contain operation, and instead a user would use the operation() API. It's a similar model to Snapshot and SnapshotParser

Let me know if that makes sense, or if there's doubts on the API itself after this change! cc: @rdblue @nastra @danielcweeks @jzhuge

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the argument also applies to the data model, because for Snapshot I am interested in what is the operation that produces the snapshot, because there are different types for data write. But for view, I am not sure how useful that information is, it's always just create or replace of the metadata definition. So I don't know if we should add this layer of complication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline with @jackye1995 instead of restricting the type of operation up front, we can keep it open. It looks like Snapshot summary also does has the same behavior. There's a set of recognized operations by the Iceberg library like append, replace etc but those aren't strictly validated when reading the metadata. So that's a similar model we can follow here. We also don't need an enum following this logic, it can just be defined Strings (which we can define separately when we get to the catalog implementation for createView replaceView etc. Also we can always add an operation API if it becomes to awkward to get the operation from the summary but it's not needed at the moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline with @jackye1995 we were talking on two different issues here.

1.) The first is on strict validation of the operation, we concluded that's probably not a desired behavior, and that's the behavior of snapshot currently.

2.) The second is on the API and serialization of operation itself. Operation will be serialized in the summary but when it's read we deserialize it into a separate field in memory. This is the same as what happens with Snapshot and I think it makes sense to have that same behavior and have a separate API for operation. Updated the PR to reflect this.

api/src/main/java/org/apache/iceberg/view/ViewVersion.java Outdated Show resolved Hide resolved
@amogh-jahagirdar amogh-jahagirdar force-pushed the view-version branch 3 times, most recently from 2170c9f to c6dc13c Compare March 10, 2023 02:42
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still LGTM, thanks @amogh-jahagirdar

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me!

generator.writeNumberField(VERSION_ID, version.versionId());
generator.writeNumberField(TIMESTAMP_MS, version.timestampMillis());
generator.writeObjectFieldStart(SUMMARY);
for (Map.Entry<String, String> summaryEntry : version.summary().entrySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also write OPERATION here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The summary should be guaranteed to have operation at the time of writing the metadata (should already be an entry in the map).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, see #6861 (comment) for the discussion on this

@amogh-jahagirdar amogh-jahagirdar force-pushed the view-version branch 3 times, most recently from e4d60b6 to 35562fc Compare March 21, 2023 20:44
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me!

@amogh-jahagirdar amogh-jahagirdar force-pushed the view-version branch 2 times, most recently from 3b43bf8 to d6968e7 Compare March 21, 2023 21:11
Co-authored-by: John Zhuge <jzhuge@apache.org>
@jackye1995
Copy link
Contributor

Thanks for all the changes, let me know when you have the metadata parser PR!

@jackye1995 jackye1995 merged commit b099a33 into apache:master Mar 22, 2023
@nastra nastra added this to Done in View support Jun 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants