Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add snapshot summary and operation. #74

Merged
merged 3 commits into from
Mar 6, 2019

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Jan 11, 2019

This adds a "summary" field to each snapshot stored in the metadata file.

The purpose of this extra information is to speed up operations by ignoring some snapshots. For example, when finding data to process incrementally, snapshots produced by a delete or replace operation can be ignored because no new data is added.

The summary is stored in the metadata file's list of snapshot objects as "summary". Each summary is an object with at least one key, "operation" that describes the operation that produced the snapshot. Possible values are "append", "replace", "overwrite" and "delete". These are defined in DataOperations.

@rdblue
Copy link
Contributor Author

rdblue commented Jan 11, 2019

@aokolnychyi, @mccheah, can you review this PR?

}
} else {
// if there was no previous snapshot, default the summary to start totals at 0
previousSummary = ImmutableMap.of(
Copy link
Contributor

@fbocse fbocse Feb 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: would you consider not having to do the initialization of these expected properties ahead of time? It feels like we need to consider which properties we're going to update so we can carefully craft their initialization in this else block first.
I'd consider increasing the arity of updateTotal and support a default value for each of the properties we'd invoke the method for and that default value gets used in case that the property is missing from the previous snapshot summary or maybe if there is an exception trying to load it (parse it) from previous snapshot summary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could make this a constant, but I like that this is more readable having the initialization done in the same place as the updates.

We can't add the default to updateTotal because these totals can't be initialized unless there is no previous snapshot. That's the case that indicates that the total number of files and records is 0. If there a previous snapshot, then we have to use it's totals or not produce totals.

return builder.build();
}

private static void setIf(boolean expression, ImmutableMap.Builder<String, String> builder,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 👍on readability for coming up with this method

summaryBuilder.put(totalProperty, String.valueOf(newTotal));

} catch (NumberFormatException e) {
// ignore and do not add total
Copy link
Contributor

@fbocse fbocse Feb 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you consider logging a warning message for this exception though? User should not expect faster operations given that summary metadata is broken, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a log message would be useful here.

Map<String, String> currentSummary,
String addedProperty, String deletedProperty) {
String totalStr = previousSummary.get(totalProperty);
if (totalStr != null) {
Copy link
Contributor

@fbocse fbocse Feb 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question on this particular branch and I don't know if it's valid or worth looking into so I'm looking for your opinion on this.
Say we get a NumberFormatException when attempting to parse a long out of TOTAL_RECORDS_PROP and that basically has the code failing to set this particular total property in the current snapshot summary. But other properties, such as TOTAL_FILES_PROP, do get included, so the current snapshot has a summary, but it's missing the TOTAL_RECORDS_PROP property. That's possible right, an incomplete summary?
Now, following snapshot comes along, load current snapshot's summary and looks up TOTAL_RECORDS_PROP so it gets a null and skips the branch altogether and I think that this effect gets propagated indefinitely, right?

So I guess the question is whether the code should attempt to stop erroneous total properties (or their effects) from getting propagated onto following snapshot summaries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. A bad total value will cause totals to be dropped from all future summaries. There's not much we can do about that besides writing correct total values.

@aokolnychyi
Copy link
Contributor

This change LGTM. The logic doesn't seem to be computationally expensive even though we might need to iterate through all added/removed files. So, I do not think it will significantly impact the write performance but will definitely give benefits later on.

* snapshots that are not needed for some tasks. For example, snapshot expiration does not need to
* clean up deleted files for appends, which have no deleted files.
*/
public class DataOperations {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you chose to not make this an Enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to be forward-compatible. Iceberg should recognize known data operations constants, but not reject unknown ones.

* <p>
* This operation is implemented by {@link AppendFiles}.
*/
public static final String APPEND = "append";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add Create as an operation given it will add a snapshot but no datafiles will be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean create table? That doesn't produce a snapshot. The snapshot list is empty and the current snapshot ID is -1.

@rdblue rdblue merged commit e8c5e18 into apache:master Mar 6, 2019
@rdblue
Copy link
Contributor Author

rdblue commented Mar 6, 2019

Merged. Thanks for the reviews, @fbocse, @aokolnychyi, and @Parth-Brahmbhatt!

jun-ma-0 pushed a commit to jun-ma-0/incubator-iceberg that referenced this pull request May 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants