Skip to content

Refactor RemoveSnapshots for easier extension#1245

Closed
RussellSpitzer wants to merge 2 commits intoapache:masterfrom
RussellSpitzer:RefactorRemoveSnapshot
Closed

Refactor RemoveSnapshots for easier extension#1245
RussellSpitzer wants to merge 2 commits intoapache:masterfrom
RussellSpitzer:RefactorRemoveSnapshot

Conversation

@RussellSpitzer
Copy link
Member

Previously all of the code for looking up which manifests need to be scanned
and the code for scanning them was deeply tied to the RemoveSnapshots object. This made
it difficult to write any new implementations which had similar functionality. In
this patch we extract out many of the subfunctions and more complicated pieces of code
into a new utility class to allow from access from other modules.

The code for actually scanning manifests was extracted into ManifestExpirationManager
to allow access to Manifest information without exposing Manifest internals.

Previously all of the code for looking up which manifests need to be scanned
and the code for scanning them was deeply tied to the RemoveSnapshots object. This made
it difficult to write any new implementations which had similar functionality. In
this patch we extract out many of the subfunctions and more complicated pieces of code
into a new utility class to allow from access from other modules.

The code for actually scanning manifests was extracted into ManifestExpirationManager
to allow access to Manifest information without exposing Manifest internals.
@RussellSpitzer
Copy link
Member Author

@aokolnychyi just the Refactor

//No public constructor for utility class
private ManifestExpirationManager(){}

private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I'd normally put the static final fields above the constructor and other instance methods.

*/
public static Set<String> scanManifestsForAbandonedDeletedFiles(
Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup,
FileIO io) {
Copy link
Contributor

@rdblue rdblue Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are introducing a class for this, I think it makes sense to simplify the methods. Before, we kept breaking out new methods to keep complexity down, which kept requiring passing all of the information as arguments. Now, we can share many of these arguments, like validSnapshotIds in the manager instance and not need to pass them around as args.

We can also initialize the manager using the base snapshot and current metadata, so that the class handles its own configuration for things like specLookup. We'd probably need to make changes to how this is configured to be able to parallelize it, but I think we could make configuration a bit cleaner.

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as I can keep all the fields final and serializable I'm ok with that. I have too much Scala in my brain and I generally hate having any methods on a class if I can help it and prefer to just have data classes and then static methods which operate on them.

I think in this case all the inputs to this class are also serializable so it's probably safe ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ends up being a bit more complicated than I anticipated if we want to make sure that "ManfiestExpirationChanges" and the Manager are both not only serializable but also splittable ... I think if I do a bit more refactoring I can make this happen but I believe this will make it harder/ more confusing in the future.

For example I'll need this class to not only generate the full list of manifests which need to be scanned but also have a method which only operates on some of the generated list.

Also we end up loosing a bit of control about how we serialize the component parts, we would have always broadcast everything if we wrap the entire module up.

I can go ahead and finish up the refactor but I would like to check with you before I devote more time to this.

My code at the moment is a bit like

//These are all serializable
  private final SnapshotExpirationChanges snapshotChanges; 
  private final ManifestExpirationChanges manifestExpirationChanges;
  private final Map<Integer, PartitionSpec> specLookup;
  private final FileIO io;

  public ManifestExpirationManager(TableMetadata original, TableMetadata current, FileIO io){
    this.io = io;
    this.specLookup = current.specsById();
    this.snapshotChanges = ExpireSnapshotUtil.getExpiredSnapshots(current, original);
    this.manifestExpirationChanges = ExpireSnapshotUtil.determineManifestChangesFromSnapshotExpiration(
        snapshotChanges.getValidSnapshotIds(), snapshotChanges.getExpiredSnapshotIds(), current, original, io)
    )
  //We can simplify the functions above and move them into this class

// But then we need to be able to get manifestExpirationChanges because we want to parallelize information from them
  getManifestExpirationChanges()

 // And now our scan method has to take a list of manifests still or we have to pass some kind of partition/split info
  public Set<String> scanManifestsForAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) {

  }

Let me know if this is kind of what you were thinking about

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a commit with this as a work in progress. I tried to keep the class as stateless as possible but it will have some null handling behaviors to maintain old effects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1264 For a demonstration of how we would use this in Spark

public class ExpireSnapshotUtil {
private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class);

//Utility Class No Instantiation Allowed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing space between // and Utility.

@aokolnychyi
Copy link
Contributor

aokolnychyi commented Jul 28, 2020

It does not seem like we reduced complexity at this point. In some cases, we actually added it. Quick question: do we want to keep ExpireSnapshotUtil and ManifestExpirationManager separate? Would it simplify our life if it was one class?

* @return Wrapper around which manifests contain references to possibly abandoned files
*/
public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(Set<Long> validIds,
Set<Long> expiredIds, TableMetadata current, TableMetadata original, FileIO io) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pass both valid/expired IDs as well as the current/original table metadata? The valid and expired IDs can come from the table metadata.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we could recompute that here, I think @aokolnychyi and I were discussing this before and we were trying to reduce the duplication of some calculations. But we could easily just have it recompute changes here.

The difficulty here, and in most of the code, is that the previous behavior was

Find Snapshot Differences
Log Expired Snapshots
If expiredSnapshots is not empty & deleteFiles:
Figure out Ancestors // Only requires the current Snapshot
Figure out CherryPicked Ancestors // requires Metadata

In the old code each step of the way we build up some of the information used further down the pathway, sometimes
the references are serializable and sometimes they aren't. Now we want something we can serialize and split, so because of that we have to be very careful about what things we keep in the class, and what things are only used during Init.

Ideally I wouldn't want to pass through the tablemetdata at all since it's a very heavy class and not serializable but there are a few points where we can't avoid it. For example in computing cherry picked ancestors we need to use a map that exists within the metadata to lookup snapshots from their Id's.

Let me see if I can refactor this function so that it doesn't use the metadata at all and only takes in the information from the "SnapshotChanges" discovered previously. I think the cherry picked ancestors is the only one that ends up being a bit hairy.

@aokolnychyi
Copy link
Contributor

I think I get the idea of why we want to have both ExpireSnapshotUtil and ManifestExpirationManager as separate classes after looking into this more. However, do we need to know ManifestExpirationChanges in ManifestExpirationManager? Shouldn't this class just accept a list of manifests to scan as method args?

@aokolnychyi
Copy link
Contributor

aokolnychyi commented Jul 28, 2020

For example, I mean something like this:

public class ManifestExpirationManager {
  private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class);

  private final Set<Long> validSnapshotIds;
  private final Map<Integer, PartitionSpec> specs;
  private final FileIO io;

  public ManifestExpirationManager(Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specs, FileIO io) {
    this.validSnapshotIds = validSnapshotIds;
    this.specs = specs;
    this.io = io;
  }

  public Set<String> findAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) {
    Set<String> filesToDelete = new HashSet<>();
    Tasks.foreach(manifestsWithDeletes)
        .retry(3).suppressFailureWhenFinished()
        .onFailure((item, exc) -> LOG.warn("Failed to get deleted files: this may cause orphaned data files", exc))
        .run(manifest -> {
          try {
            filesToDelete.addAll(findAbandonedDeletedFiles(manifest));
          } catch (IOException e) {
            throw new UncheckedIOException(String.format("Failed to read manifest file: %s", manifest), e);
          }
        });
    return filesToDelete;
  }

  private Set<String> findAbandonedDeletedFiles(ManifestFile manifest) throws IOException {
    Set<String> filesToDelete = new HashSet<>();
    try (ManifestReader<?> reader = ManifestFiles.open(manifest, io, specs)) {
      for (ManifestEntry<?> entry : reader.entries()) {
        if (entry.status() == ManifestEntry.Status.DELETED && !validSnapshotIds.contains(entry.snapshotId())) {
          // use toString to ensure the path will not change (Utf8 is reused)
          filesToDelete.add(entry.file().path().toString());
        }
      }
    }
    return filesToDelete;
  }

...

}

This way, we only encapsulate the logic for scanning manifests in ManifestExpirationManager. We can either make this class serializable and broadcast or instantiate directly in executors.


//Snapshots which are not expired but refer to manifests from expired snapshots
Set<ManifestFile> validManifests = getValidManifests(currentSnapshots, io);
Set<ManifestFile> manifestsToScan = findValidManifestsInExpiredSnapshots(validManifests, ancestorIds,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have a comment here to explain why the valid manifests are in expired snapshots. Maybe the name should be more clear, too: validManifestsCreatedByExpiredSnapshots?

@RussellSpitzer
Copy link
Member Author

@rdblue I think I moved a bit in the wrong direction here and I want to make sure we are on the same page as to the end goal of the refactor. I'll be on the call tomorrow if you want to discuss in person or if you want to keep discussing here that is fine as well.

I think I do want to revert the last commit and maybe concentrate on still isolating the Manifest Scanning code from the Determining which Manifests to Scan code.

@RussellSpitzer RussellSpitzer force-pushed the RefactorRemoveSnapshot branch from 074af61 to 1225779 Compare July 31, 2020 16:49
@RussellSpitzer
Copy link
Member Author

@aokolnychyi + @rdblue I worked on cleaning this up, I simplified the public methods so they should hopefully be easier to use. As well I tried to reduce the input to many of the internal private functions to just the inputs they required, rather than the parent classes which contain everything. I also moved out the ancestorInfo into a separate wrapper class so we don't recalculate that either.


Set<Snapshot> expiredSnapshots = Sets.newHashSet();
for (Snapshot snapshot : original.snapshots()) {
if (!validSnapshots.contains(snapshot)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we don't implement hashcode and equals for BaseSnapshot. Won't that be a problem here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are comparing the same objects in memory here, but we could do a more complicated hash and equals if we want. Originally this was just a match made on the SnapshotId but since everything is dealing with the same references I thought this would be a bit clearer.

* Creates a Manifest Scanner for analyzing manifests for files which are no longer needed
* after expiration.
* @param current metadata for the table being expired
* @param io FileIO for the table, for Serializable usecases make sure this is also serializable for the framework
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow the javadoc on the param completely.

for Serializable usecases make sure this is also serializable for the framework

Is that meant as a TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we can't control the underlying implementation of IO here and some instances will require a specfic manipulation to make sure they serialize correctly.

I added this specifically (instead of just allowing passing in ops) because for Spark we use SparkUtil.serializableIO to check whether the IO is of the HadoopIO class and make it serializable if so. We can't do that here because it would require dependencies from other modules and also be more or less specific to Spark. I think our only real option is to leave it up to the caller to make sure that portion of the class is correct for their application. :/

@RussellSpitzer
Copy link
Member Author

Brief Discussion with @rdblue + @aokolnychyi lead to another approach we are going to try out for finding expired files. We will instead of doing the old logic, read a table from a specific point in history and get all datafilee from that, then do an anti join with the current state to determine the set of no longer used files.

To do this there will be need to be another PR focused on being able to read a table from a specific manifest

@RussellSpitzer
Copy link
Member Author

#1264 - Expire Snapshots Action without using RemoveSnapshots Code Path

@rdblue
Copy link
Contributor

rdblue commented Aug 19, 2020

@RussellSpitzer, are you still interested in this or should we close it now that #1264 is in?

@RussellSpitzer
Copy link
Member Author

RussellSpitzer commented Aug 20, 2020 via email

@rdblue
Copy link
Contributor

rdblue commented Aug 20, 2020

Sounds good to me. I'll close it. Thanks!

@rdblue rdblue closed this Aug 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants