Refactor RemoveSnapshots for easier extension#1245
Refactor RemoveSnapshots for easier extension#1245RussellSpitzer wants to merge 2 commits intoapache:masterfrom
Conversation
Previously all of the code for looking up which manifests need to be scanned and the code for scanning them was deeply tied to the RemoveSnapshots object. This made it difficult to write any new implementations which had similar functionality. In this patch we extract out many of the subfunctions and more complicated pieces of code into a new utility class to allow from access from other modules. The code for actually scanning manifests was extracted into ManifestExpirationManager to allow access to Manifest information without exposing Manifest internals.
|
@aokolnychyi just the Refactor |
| //No public constructor for utility class | ||
| private ManifestExpirationManager(){} | ||
|
|
||
| private static final Logger LOG = LoggerFactory.getLogger(ManifestExpirationManager.class); |
There was a problem hiding this comment.
Minor: I'd normally put the static final fields above the constructor and other instance methods.
| */ | ||
| public static Set<String> scanManifestsForAbandonedDeletedFiles( | ||
| Set<ManifestFile> manifests, Set<Long> validSnapshotIds, Map<Integer, PartitionSpec> specLookup, | ||
| FileIO io) { |
There was a problem hiding this comment.
Since we are introducing a class for this, I think it makes sense to simplify the methods. Before, we kept breaking out new methods to keep complexity down, which kept requiring passing all of the information as arguments. Now, we can share many of these arguments, like validSnapshotIds in the manager instance and not need to pass them around as args.
We can also initialize the manager using the base snapshot and current metadata, so that the class handles its own configuration for things like specLookup. We'd probably need to make changes to how this is configured to be able to parallelize it, but I think we could make configuration a bit cleaner.
What do you think?
There was a problem hiding this comment.
As long as I can keep all the fields final and serializable I'm ok with that. I have too much Scala in my brain and I generally hate having any methods on a class if I can help it and prefer to just have data classes and then static methods which operate on them.
I think in this case all the inputs to this class are also serializable so it's probably safe ...
There was a problem hiding this comment.
This ends up being a bit more complicated than I anticipated if we want to make sure that "ManfiestExpirationChanges" and the Manager are both not only serializable but also splittable ... I think if I do a bit more refactoring I can make this happen but I believe this will make it harder/ more confusing in the future.
For example I'll need this class to not only generate the full list of manifests which need to be scanned but also have a method which only operates on some of the generated list.
Also we end up loosing a bit of control about how we serialize the component parts, we would have always broadcast everything if we wrap the entire module up.
I can go ahead and finish up the refactor but I would like to check with you before I devote more time to this.
My code at the moment is a bit like
//These are all serializable
private final SnapshotExpirationChanges snapshotChanges;
private final ManifestExpirationChanges manifestExpirationChanges;
private final Map<Integer, PartitionSpec> specLookup;
private final FileIO io;
public ManifestExpirationManager(TableMetadata original, TableMetadata current, FileIO io){
this.io = io;
this.specLookup = current.specsById();
this.snapshotChanges = ExpireSnapshotUtil.getExpiredSnapshots(current, original);
this.manifestExpirationChanges = ExpireSnapshotUtil.determineManifestChangesFromSnapshotExpiration(
snapshotChanges.getValidSnapshotIds(), snapshotChanges.getExpiredSnapshotIds(), current, original, io)
)
//We can simplify the functions above and move them into this class
// But then we need to be able to get manifestExpirationChanges because we want to parallelize information from them
getManifestExpirationChanges()
// And now our scan method has to take a list of manifests still or we have to pass some kind of partition/split info
public Set<String> scanManifestsForAbandonedDeletedFiles(Set<ManifestFile> manifestsWithDeletes) {
}
Let me know if this is kind of what you were thinking about
There was a problem hiding this comment.
I pushed a commit with this as a work in progress. I tried to keep the class as stateless as possible but it will have some null handling behaviors to maintain old effects.
There was a problem hiding this comment.
See #1264 For a demonstration of how we would use this in Spark
| public class ExpireSnapshotUtil { | ||
| private static final Logger LOG = LoggerFactory.getLogger(ExpireSnapshotUtil.class); | ||
|
|
||
| //Utility Class No Instantiation Allowed |
There was a problem hiding this comment.
Nit: missing space between // and Utility.
|
It does not seem like we reduced complexity at this point. In some cases, we actually added it. Quick question: do we want to keep |
| * @return Wrapper around which manifests contain references to possibly abandoned files | ||
| */ | ||
| public static ManifestExpirationChanges determineManifestChangesFromSnapshotExpiration(Set<Long> validIds, | ||
| Set<Long> expiredIds, TableMetadata current, TableMetadata original, FileIO io) { |
There was a problem hiding this comment.
Why pass both valid/expired IDs as well as the current/original table metadata? The valid and expired IDs can come from the table metadata.
There was a problem hiding this comment.
Yes we could recompute that here, I think @aokolnychyi and I were discussing this before and we were trying to reduce the duplication of some calculations. But we could easily just have it recompute changes here.
The difficulty here, and in most of the code, is that the previous behavior was
Find Snapshot Differences
Log Expired Snapshots
If expiredSnapshots is not empty & deleteFiles:
Figure out Ancestors // Only requires the current Snapshot
Figure out CherryPicked Ancestors // requires Metadata
In the old code each step of the way we build up some of the information used further down the pathway, sometimes
the references are serializable and sometimes they aren't. Now we want something we can serialize and split, so because of that we have to be very careful about what things we keep in the class, and what things are only used during Init.
Ideally I wouldn't want to pass through the tablemetdata at all since it's a very heavy class and not serializable but there are a few points where we can't avoid it. For example in computing cherry picked ancestors we need to use a map that exists within the metadata to lookup snapshots from their Id's.
Let me see if I can refactor this function so that it doesn't use the metadata at all and only takes in the information from the "SnapshotChanges" discovered previously. I think the cherry picked ancestors is the only one that ends up being a bit hairy.
|
I think I get the idea of why we want to have both |
|
For example, I mean something like this: This way, we only encapsulate the logic for scanning manifests in |
|
|
||
| //Snapshots which are not expired but refer to manifests from expired snapshots | ||
| Set<ManifestFile> validManifests = getValidManifests(currentSnapshots, io); | ||
| Set<ManifestFile> manifestsToScan = findValidManifestsInExpiredSnapshots(validManifests, ancestorIds, |
There was a problem hiding this comment.
It would be good to have a comment here to explain why the valid manifests are in expired snapshots. Maybe the name should be more clear, too: validManifestsCreatedByExpiredSnapshots?
|
@rdblue I think I moved a bit in the wrong direction here and I want to make sure we are on the same page as to the end goal of the refactor. I'll be on the call tomorrow if you want to discuss in person or if you want to keep discussing here that is fine as well. I think I do want to revert the last commit and maybe concentrate on still isolating the Manifest Scanning code from the Determining which Manifests to Scan code. |
core/src/main/java/org/apache/iceberg/ManifestExpirationManager.java
Outdated
Show resolved
Hide resolved
074af61 to
1225779
Compare
|
@aokolnychyi + @rdblue I worked on cleaning this up, I simplified the public methods so they should hopefully be easier to use. As well I tried to reduce the input to many of the internal private functions to just the inputs they required, rather than the parent classes which contain everything. I also moved out the ancestorInfo into a separate wrapper class so we don't recalculate that either. |
|
|
||
| Set<Snapshot> expiredSnapshots = Sets.newHashSet(); | ||
| for (Snapshot snapshot : original.snapshots()) { | ||
| if (!validSnapshots.contains(snapshot)) { |
There was a problem hiding this comment.
Seems like we don't implement hashcode and equals for BaseSnapshot. Won't that be a problem here?
There was a problem hiding this comment.
We are comparing the same objects in memory here, but we could do a more complicated hash and equals if we want. Originally this was just a match made on the SnapshotId but since everything is dealing with the same references I thought this would be a bit clearer.
| * Creates a Manifest Scanner for analyzing manifests for files which are no longer needed | ||
| * after expiration. | ||
| * @param current metadata for the table being expired | ||
| * @param io FileIO for the table, for Serializable usecases make sure this is also serializable for the framework |
There was a problem hiding this comment.
I don't follow the javadoc on the param completely.
for Serializable usecases make sure this is also serializable for the framework
Is that meant as a TODO?
There was a problem hiding this comment.
Unfortunately we can't control the underlying implementation of IO here and some instances will require a specfic manipulation to make sure they serialize correctly.
I added this specifically (instead of just allowing passing in ops) because for Spark we use SparkUtil.serializableIO to check whether the IO is of the HadoopIO class and make it serializable if so. We can't do that here because it would require dependencies from other modules and also be more or less specific to Spark. I think our only real option is to leave it up to the caller to make sure that portion of the class is correct for their application. :/
|
Brief Discussion with @rdblue + @aokolnychyi lead to another approach we are going to try out for finding expired files. We will instead of doing the old logic, read a table from a specific point in history and get all datafilee from that, then do an anti join with the current state to determine the set of no longer used files. To do this there will be need to be another PR focused on being able to read a table from a specific manifest |
|
#1264 - Expire Snapshots Action without using RemoveSnapshots Code Path |
|
@RussellSpitzer, are you still interested in this or should we close it now that #1264 is in? |
|
I think we should close it out since I'm not sure there is much benefit
without the other usecase
…On Wed, Aug 19, 2020, 6:55 PM Ryan Blue ***@***.***> wrote:
@RussellSpitzer <https://github.com/RussellSpitzer>, are you still
interested in this or should we close it now that #1264
<#1264> is in?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1245 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADE2YJYDXSQVGF6S5Q5UMTSBRQ7DANCNFSM4PHD5P4Q>
.
|
|
Sounds good to me. I'll close it. Thanks! |
Previously all of the code for looking up which manifests need to be scanned
and the code for scanning them was deeply tied to the RemoveSnapshots object. This made
it difficult to write any new implementations which had similar functionality. In
this patch we extract out many of the subfunctions and more complicated pieces of code
into a new utility class to allow from access from other modules.
The code for actually scanning manifests was extracted into ManifestExpirationManager
to allow access to Manifest information without exposing Manifest internals.