Refactor RemoveSnapshots for easier extension by RussellSpitzer · Pull Request #1245 · apache/iceberg

RussellSpitzer · 2020-07-24T22:33:22Z

Previously all of the code for looking up which manifests need to be scanned
and the code for scanning them was deeply tied to the RemoveSnapshots object. This made
it difficult to write any new implementations which had similar functionality. In
this patch we extract out many of the subfunctions and more complicated pieces of code
into a new utility class to allow from access from other modules.

The code for actually scanning manifests was extracted into ManifestExpirationManager
to allow access to Manifest information without exposing Manifest internals.

Previously all of the code for looking up which manifests need to be scanned and the code for scanning them was deeply tied to the RemoveSnapshots object. This made it difficult to write any new implementations which had similar functionality. In this patch we extract out many of the subfunctions and more complicated pieces of code into a new utility class to allow from access from other modules. The code for actually scanning manifests was extracted into ManifestExpirationManager to allow access to Manifest information without exposing Manifest internals.

RussellSpitzer · 2020-07-24T22:34:18Z

@aokolnychyi just the Refactor

rdblue · 2020-07-27T16:45:04Z

core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java

+  //No public constructor for utility class
+  private ManifestExpirationManager(){}
+
+  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);


Minor: I'd normally put the static final fields above the constructor and other instance methods.

rdblue · 2020-07-27T16:59:39Z

core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java

+   */
+  public static Set<String> scanManifestsForAbandonedDeletedFiles(
+      Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup,
+      FileIO io) {


Since we are introducing a class for this, I think it makes sense to simplify the methods. Before, we kept breaking out new methods to keep complexity down, which kept requiring passing all of the information as arguments. Now, we can share many of these arguments, like validSnapshotIds in the manager instance and not need to pass them around as args.

We can also initialize the manager using the base snapshot and current metadata, so that the class handles its own configuration for things like specLookup. We'd probably need to make changes to how this is configured to be able to parallelize it, but I think we could make configuration a bit cleaner.

What do you think?

As long as I can keep all the fields final and serializable I'm ok with that. I have too much Scala in my brain and I generally hate having any methods on a class if I can help it and prefer to just have data classes and then static methods which operate on them.

I think in this case all the inputs to this class are also serializable so it's probably safe ...

This ends up being a bit more complicated than I anticipated if we want to make sure that "ManfiestExpirationChanges" and the Manager are both not only serializable but also splittable ... I think if I do a bit more refactoring I can make this happen but I believe this will make it harder/ more confusing in the future.

For example I'll need this class to not only generate the full list of manifests which need to be scanned but also have a method which only operates on some of the generated list.

Also we end up loosing a bit of control about how we serialize the component parts, we would have always broadcast everything if we wrap the entire module up.

I can go ahead and finish up the refactor but I would like to check with you before I devote more time to this.

My code at the moment is a bit like

//These are all serializable private final SnapshotExpirationChanges snapshotChanges; private final ManifestExpirationChanges manifestExpirationChanges; private final Map<Integer, PartitionSpec> specLookup; private final FileIO io; public ManifestExpirationManager(TableMetadata original, TableMetadata current, FileIO io){ this.io = io; this.specLookup = current.specsById(); this.snapshotChanges = ExpireSnapshotUtil.getExpiredSnapshots(current, original); this.manifestExpirationChanges = ExpireSnapshotUtil.determineManifestChangesFromSnapshotExpiration( snapshotChanges.getValidSnapshotIds(), snapshotChanges.getExpiredSnapshotIds(), current, original, io) ) //We can simplify the functions above and move them into this class // But then we need to be able to get manifestExpirationChanges because we want to parallelize information from them getManifestExpirationChanges() // And now our scan method has to take a list of manifests still or we have to pass some kind of partition/split info public Set<String> scanManifestsForAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) { }

Let me know if this is kind of what you were thinking about

I pushed a commit with this as a work in progress. I tried to keep the class as stateless as possible but it will have some null handling behaviors to maintain old effects.

See #1264 For a demonstration of how we would use this in Spark

rdblue · 2020-07-28T20:12:43Z

core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java

+public class ExpireSnapshotUtil {
+  private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);
+
+  //Utility Class No Instantiation Allowed


Nit: missing space between // and Utility.

aokolnychyi · 2020-07-28T20:19:32Z

It does not seem like we reduced complexity at this point. In some cases, we actually added it. Quick question: do we want to keep ExpireSnapshotUtil and ManifestExpirationManager separate? Would it simplify our life if it was one class?

rdblue · 2020-07-28T20:21:43Z

core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java

+   * @return Wrapper around which manifests contain references to possibly abandoned files
+   */
+  public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(Set<Long> validIds,
+      Set<Long> expiredIds, TableMetadata current, TableMetadata original, FileIO io) {


Why pass both valid/expired IDs as well as the current/original table metadata? The valid and expired IDs can come from the table metadata.

Yes we could recompute that here, I think @aokolnychyi and I were discussing this before and we were trying to reduce the duplication of some calculations. But we could easily just have it recompute changes here.

The difficulty here, and in most of the code, is that the previous behavior was

Find Snapshot Differences
Log Expired Snapshots
If expiredSnapshots is not empty & deleteFiles:
Figure out Ancestors // Only requires the current Snapshot
Figure out CherryPicked Ancestors // requires Metadata

In the old code each step of the way we build up some of the information used further down the pathway, sometimes
the references are serializable and sometimes they aren't. Now we want something we can serialize and split, so because of that we have to be very careful about what things we keep in the class, and what things are only used during Init.

Ideally I wouldn't want to pass through the tablemetdata at all since it's a very heavy class and not serializable but there are a few points where we can't avoid it. For example in computing cherry picked ancestors we need to use a map that exists within the metadata to lookup snapshots from their Id's.

Let me see if I can refactor this function so that it doesn't use the metadata at all and only takes in the information from the "SnapshotChanges" discovered previously. I think the cherry picked ancestors is the only one that ends up being a bit hairy.

aokolnychyi · 2020-07-28T20:47:28Z

I think I get the idea of why we want to have both ExpireSnapshotUtil and ManifestExpirationManager as separate classes after looking into this more. However, do we need to know ManifestExpirationChanges in ManifestExpirationManager? Shouldn't this class just accept a list of manifests to scan as method args?

aokolnychyi · 2020-07-28T21:07:02Z

For example, I mean something like this:

public class ManifestExpirationManager {
  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);

  private final Set<Long> validSnapshotIds;
  private final Map<Integer, PartitionSpec> specs;
  private final FileIO io;

  public ManifestExpirationManager(Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specs, FileIO io) {
    this.validSnapshotIds = validSnapshotIds;
    this.specs = specs;
    this.io = io;
  }

  public Set<String> findAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) {
    Set<String> filesToDelete = new HashSet<>();
    Tasks.foreach(manifestsWithDeletes)
        .retry(3).suppressFailureWhenFinished()
        .onFailure((item, exc) -> LOG.warn("Failed to get deleted files: this may cause orphaned data files", exc))
        .run(manifest -> {
          try {
            filesToDelete.addAll(findAbandonedDeletedFiles(manifest));
          } catch (IOException e) {
            throw new UncheckedIOException(String.format("Failed to read manifest file: %s", manifest), e);
          }
        });
    return filesToDelete;
  }

  private Set<String> findAbandonedDeletedFiles(ManifestFile manifest) throws IOException {
    Set<String> filesToDelete = new HashSet<>();
    try (ManifestReader<?> reader = ManifestFiles.open(manifest, io, specs)) {
      for (ManifestEntry<?> entry : reader.entries()) {
        if (entry.status() == ManifestEntry.Status.DELETED && !validSnapshotIds.contains(entry.snapshotId())) {
          // use toString to ensure the path will not change (Utf8 is reused)
          filesToDelete.add(entry.file().path().toString());
        }
      }
    }
    return filesToDelete;
  }

...

}

This way, we only encapsulate the logic for scanning manifests in ManifestExpirationManager. We can either make this class serializable and broadcast or instantiate directly in executors.

rdblue · 2020-07-28T21:23:33Z

core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java

+
+    //Snapshots which are not expired but refer to manifests from expired snapshots
+    Set<ManifestFile> validManifests = getValidManifests(currentSnapshots, io);
+    Set<ManifestFile> manifestsToScan = findValidManifestsInExpiredSnapshots(validManifests, ancestorIds,


It would be good to have a comment here to explain why the valid manifests are in expired snapshots. Maybe the name should be more clear, too: validManifestsCreatedByExpiredSnapshots?

RussellSpitzer · 2020-07-28T21:34:09Z

@rdblue I think I moved a bit in the wrong direction here and I want to make sure we are on the same page as to the end goal of the refactor. I'll be on the call tomorrow if you want to discuss in person or if you want to keep discussing here that is fine as well.

I think I do want to revert the last commit and maybe concentrate on still isolating the Manifest Scanning code from the Determining which Manifests to Scan code.

core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java

RussellSpitzer · 2020-07-31T20:23:07Z

@aokolnychyi + @rdblue I worked on cleaning this up, I simplified the public methods so they should hopefully be easier to use. As well I tried to reduce the input to many of the internal private functions to just the inputs they required, rather than the parent classes which contain everything. I also moved out the ancestorInfo into a separate wrapper class so we don't recalculate that either.

rdsr · 2020-08-02T08:09:21Z

core/src/main/java/org/apache/iceberg/util/ExpireSnapshotUtil.java

+
+    Set<Snapshot> expiredSnapshots = Sets.newHashSet();
+    for (Snapshot snapshot : original.snapshots()) {
+      if (!validSnapshots.contains(snapshot)) {


Seems like we don't implement hashcode and equals for BaseSnapshot. Won't that be a problem here?

We are comparing the same objects in memory here, but we could do a more complicated hash and equals if we want. Originally this was just a match made on the SnapshotId but since everything is dealing with the same references I thought this would be a bit clearer.

rdsr · 2020-08-02T22:36:51Z

core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java

+   * Creates a Manifest Scanner for analyzing manifests for files which are no longer needed
+   * after expiration.
+   * @param current metadata for the table being expired
+   * @param io FileIO for the table, for Serializable usecases make sure this is also serializable for the framework


I don't follow the javadoc on the param completely.

for Serializable usecases make sure this is also serializable for the framework

Is that meant as a TODO?

Unfortunately we can't control the underlying implementation of IO here and some instances will require a specfic manipulation to make sure they serialize correctly.

I added this specifically (instead of just allowing passing in ops) because for Spark we use SparkUtil.serializableIO to check whether the IO is of the HadoopIO class and make it serializable if so. We can't do that here because it would require dependencies from other modules and also be more or less specific to Spark. I think our only real option is to leave it up to the caller to make sure that portion of the class is correct for their application. :/

RussellSpitzer · 2020-08-04T15:25:59Z

Brief Discussion with @rdblue + @aokolnychyi lead to another approach we are going to try out for finding expired files. We will instead of doing the old logic, read a table from a specific point in history and get all datafilee from that, then do an anti join with the current state to determine the set of no longer used files.

To do this there will be need to be another PR focused on being able to read a table from a specific manifest

RussellSpitzer · 2020-08-05T16:33:53Z

#1264 - Expire Snapshots Action without using RemoveSnapshots Code Path

rdblue · 2020-08-19T23:55:16Z

@RussellSpitzer, are you still interested in this or should we close it now that #1264 is in?

RussellSpitzer · 2020-08-20T00:34:46Z

I think we should close it out since I'm not sure there is much benefit without the other usecase

…

On Wed, Aug 19, 2020, 6:55 PM Ryan Blue ***@***.***> wrote: @RussellSpitzer <https://github.com/RussellSpitzer>, are you still interested in this or should we close it now that #1264 <#1264> is in? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1245 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADE2YJYDXSQVGF6S5Q5UMTSBRQ7DANCNFSM4PHD5P4Q> .

rdblue · 2020-08-20T17:50:46Z

Sounds good to me. I'll close it. Thanks!

RussellSpitzer mentioned this pull request Jul 24, 2020

Use Spark for Scanning of Manifest Files during Snapshot Expiration #1211

Closed

rdblue reviewed Jul 27, 2020

View reviewed changes

rdblue reviewed Jul 28, 2020

View reviewed changes

aokolnychyi reviewed Jul 28, 2020

View reviewed changes

core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java Outdated Show resolved Hide resolved

RussellSpitzer force-pushed the RefactorRemoveSnapshot branch from 074af61 to 1225779 Compare July 31, 2020 16:49

Simplification of some methods in Refactor and Reviewer Comments

886c927

rdsr reviewed Aug 2, 2020

View reviewed changes

rdblue closed this Aug 20, 2020

Conversation

RussellSpitzer commented Jul 24, 2020

Uh oh!

RussellSpitzer commented Jul 24, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue Jul 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aokolnychyi commented Jul 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aokolnychyi commented Jul 28, 2020

Uh oh!

aokolnychyi commented Jul 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RussellSpitzer commented Jul 28, 2020

Uh oh!

Uh oh!

RussellSpitzer commented Jul 31, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RussellSpitzer commented Aug 4, 2020

Uh oh!

RussellSpitzer commented Aug 5, 2020

Uh oh!

rdblue commented Aug 19, 2020

Uh oh!

RussellSpitzer commented Aug 20, 2020 via email

Uh oh!

rdblue commented Aug 20, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rdblue Jul 27, 2020 •

edited

Loading

aokolnychyi commented Jul 28, 2020 •

edited

Loading

aokolnychyi commented Jul 28, 2020 •

edited

Loading