Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HEAD only migration option #185

Merged
merged 18 commits into from Jun 9, 2022
Merged

Conversation

mikejritter
Copy link
Contributor

JIRA Ticket: https://fedora-repository.atlassian.net/browse/FCREPO-3674

What does this Pull Request do?

Adds cli option for specifying HEAD only migrations and an option for passing in a list of datastream ids which should only have their HEAD migrated.

What's new?

  • --head-only option for initiating HEAD only migrations
  • --head-only-ids option for passing in a list of datastream ids
  • HeadOnlyDatastreamManager for processing the datastream id list and testing if an id matches
  • Additional tracking of migration state in ArchiveGroupHandler for when datastreams are skipped
  • Added Integration Test for HEAD only migrations
  • Updated README describing usage of new options

How should this be tested?

Depending on the dataset available, run a migration with the head only option and with the ``--head-only-ids` option.

E.g. with the brown dataset:

java -jar target/migration-utils-6.2.0-SNAPSHOT-driver.jar --debug -t legacy -o /data/migration/brown/repostore/data_2019/objects -d /data/migration/brown/repoarchive/datastreams_2019 -a /data/migration/brown-head-only -i work --head-only

Then navigate to the ocfl-root of the migrated dataset and search for objects with multiple datastream versions. This can be done manually or by checking using a mix of rocfl/cli tools:

for id p in `rocfl ls -p`; do dupes=$(find ${p} -mindepth 3 -maxdepth 3 -type f ! -name "*.nt" | cut -c 88- | sort | uniq -d | wc -l); echo "${id} has ${dupes} non-head datastreams" ; done

As a side note, this works for me using zsh but I haven't tested in other shells. The find command lists all files under the content directory and filters out n-triples. Then cut/sort/uniq to trim out excess paths (e.g. to make $uuid/v1/content/ds-id and $uuid/v2/content/ds-id to both be /ds-id) and search for duplicates.

Additional Notes:

I chose the Brown dataset for testing as it has RELS-INT entries which adds datastreams versions during migration if there are modifications which need to occur. My initial changes were failing here but by adding some basic state tracking, similar to the datastreamStates, it is easy enough to check if a datastream should have said changes or not.

Also since deletions occur at the end of a migration, the --head-only option will still migrate the last existing version of a datastream. This feels ok, but thought I would mention it in case there are any other thoughts on how deleted states should be handled.

Interested parties

@fcrepo/committers

Copy link
Contributor

@pwinckles pwinckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only looked at the code. Haven't run it yet

README.md Outdated
Comment on lines 102 to 103
A list of datastreams to migrate only the HEAD of.
Only used if --head-only is specified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear from this description that this option expects a file that contains one datastream id per line. Why not have this just be a comma separated list? It's unlikely to be very long.

@@ -297,12 +306,14 @@ public Integer call() throws Exception {
ocflStagingDir.toPath(), migrationType, user, userUri, algorithm, disableChecksumValidation)
.getObject();

final HeadOnlyDatastreamManager headOnlyManager = new HeadOnlyDatastreamManager(headOnly, headOnlyList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should enforce that headOnly is not false when heaedOnlyList is set

Comment on lines 120 to 121
private static final String HEAD_ONLY_OK = "OK";
private static final String HEAD_ONLY_SKIP = "SKIP";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use an enum for this rather than a string

@@ -246,6 +256,15 @@ public void processObjectVersions(final Iterable<ObjectVersionReference> version
dsCreateDates.put(dsId, dv.getCreated());
datastreamStates.put(f6DsId, dv.getDatastreamInfo().getState());
}

headOnlyStates.put(f6DsId, HEAD_ONLY_OK);
final var skip = headOnlyDatastreamManager.accept(dsId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this method should be called something more descriptive like shouldMigrateOnlyHead. Personally, I find the generic accept makes it hard to interpret the meaning of the return value.

@pwinckles
Copy link
Contributor

Actually, thinking about it some more, won't this code still produce an OCFL version per F3 object version?

@mikejritter
Copy link
Contributor Author

Actually, thinking about it some more, won't this code still produce an OCFL version per F3 object version?

@pwinckles It can create additional OCFL versions from filename changes and deletions (potentially inactive too?). The version created from RELS-INT seemed ok, as it doesn't involve the datastream itself changing. The deletion was something I was thinking about, but it didn't really feel right to omit the datastream entirely.

I was hoping to get to your other comments and push changes up today but likely won't until tomorrow.

@pwinckles
Copy link
Contributor

@mikejritter But, I thought the goal of this task was to squash all F3 versions into a single OCFL version? This PR doesn't do that because it loops over the F3 object versions and creates a new OCFL version for each. I haven't thought about it in depth, but I think what you'd need to do is something like instead of iterating the versions, grab the ObjectReference, a new method to it that only returns the head versions of its datastreams, and iterate those to create the migrated object. I guess the problem with that approach is that you might have trouble getting the created date?

@pwinckles
Copy link
Contributor

pwinckles commented Apr 29, 2022

Oh, the other problem is that you only want to do this for a subset of datastreams. That's tricky. Maybe your approach is right and there just needs to be some additional logic around whether or not the session is committed? Or maybe I'm wrong and people don't really care about the additional versions and they just want to cleanup their binary history -- like this PR is currently doing.

@mikejritter
Copy link
Contributor Author

@pwinckles I've only just been able to get back to this. I hadn't really considered a squash -- I had thought about pruning the extra versions from the session but that was before I understood how the migrator was working.

I think the option of having a separate method handle the head only datastreams is interesting, it's just tricky since the RELS-INT can create versions for other datastreams, so either way there needs to be some way to resolve all the changes. Although I suppose if we know only the RELS-INT creates these changes, we could preprocess it and refer to that when handling head only datastreams (maybe handling the delete as well). I'm just thinking out loud though.

Also wrt to what people are actually expecting from this, maybe we should get some feedback from the people who commented on the ticket to see what their expectations are. It wouldn't make sense to merge this and then have the same issue crop up later because people expected x and got y.

@mikejritter
Copy link
Contributor Author

After discussion on the call last Thursday, I've updated this to create a single version for Fedora objects being migrated. I've removed the --head-only-ids option for now as there is quite a bit of complexity in having migrations allow certain datastreams to migrate with their versions while flattening others.

I also realized while testing that this wouldn't be compatible with atomic migrations, so I'm having the migrator throw an IllegalArgumentException if both are specified. It likely doesn't need too much logic to be supported, but I wanted to get something out which can be tested.

@dbernstein dbernstein merged commit 217a309 into fcrepo-exts:main Jun 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants