Add HEAD only migration option #185

mikejritter · 2022-04-25T20:05:21Z

JIRA Ticket: https://fedora-repository.atlassian.net/browse/FCREPO-3674

What does this Pull Request do?

Adds a CLI option for specifying HEAD only migrations and an option for passing in a list of datastream ids which should have only their HEAD migrated.

What's new?

--head-only option for initiating HEAD only migrations
--head-only-ids option for passing in a list of datastream ids
HeadOnlyDatastreamManager for processing the datastream id list and testing if an id matches
Additional tracking of migration state in ArchiveGroupHandler for when datastreams are skipped
Added Integration Test for HEAD only migrations
Updated README describing usage of new options
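
To illustrate the matching step, here is a minimal sketch of a manager that parses an id list and tests whether a datastream id matches. It is not the class from this PR: the name `HeadOnlyMatcherSketch`, the comma-separated input, and the rule that an empty list means "every datastream is HEAD only" are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch; names and behavior are assumptions, not the PR's code.
class HeadOnlyMatcherSketch {
    private final boolean headOnly;
    private final Set<String> headOnlyIds;

    HeadOnlyMatcherSketch(final boolean headOnly, final String commaSeparatedIds) {
        this.headOnly = headOnly;
        // Split on commas, trim whitespace, and drop empty entries.
        this.headOnlyIds = commaSeparatedIds == null
            ? Set.of()
            : Arrays.stream(commaSeparatedIds.split(","))
                    .map(String::trim)
                    .filter(s -> !s.isEmpty())
                    .collect(Collectors.toSet());
    }

    // True when only the HEAD version of this datastream should be migrated.
    boolean isHeadOnly(final String dsId) {
        if (!headOnly) {
            return false;
        }
        // Assumption: no explicit ids means HEAD only applies to all datastreams.
        return headOnlyIds.isEmpty() || headOnlyIds.contains(dsId);
    }
}
```

With `new HeadOnlyMatcherSketch(true, "DC, RELS-EXT")`, `isHeadOnly("DC")` is true and `isHeadOnly("OBJ")` is false.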

How should this be tested?

Depending on the dataset available, run a migration with the `--head-only` option and with the `--head-only-ids` option.

E.g. with the brown dataset:

java -jar target/migration-utils-6.2.0-SNAPSHOT-driver.jar --debug -t legacy -o /data/migration/brown/repostore/data_2019/objects -d /data/migration/brown/repoarchive/datastreams_2019 -a /data/migration/brown-head-only -i work --head-only

Then navigate to the ocfl-root of the migrated dataset and search for objects with multiple datastream versions. This can be done manually or by checking using a mix of rocfl/cli tools:

for id p in `rocfl ls -p`; do dupes=$(find ${p} -mindepth 3 -maxdepth 3 -type f ! -name "*.nt" | cut -c 88- | sort | uniq -d | wc -l); echo "${id} has ${dupes} non-head datastreams" ; done

As a side note, this works for me using zsh but I haven't tested it in other shells. The find command lists all files under the content directory and filters out n-triples. Then cut/sort/uniq trim the leading path (e.g. so that $uuid/v1/content/ds-id and $uuid/v2/content/ds-id both become /ds-id) and search for duplicates.
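
For anyone not on zsh, the same duplicate check can be sketched in Java. This is only an illustration of the idea, assuming paths of the form `$uuid/vN/content/ds-id`; the class and method names are made up, and it keys on the file name rather than cutting at a fixed character offset as the shell version does.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper mirroring find | cut | sort | uniq -d | wc -l.
class DuplicateDatastreamCheck {
    // Counts datastream file names that appear under more than one version.
    static long countDuplicated(final List<String> contentPaths) {
        final Map<String, Long> counts = contentPaths.stream()
            // Filter out n-triples, like `! -name "*.nt"`.
            .filter(p -> !p.endsWith(".nt"))
            // Keep only the trailing file name, e.g. "ds-id".
            .map(p -> p.substring(p.lastIndexOf('/') + 1))
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        // A name seen more than once has a non-head version.
        return counts.values().stream().filter(c -> c > 1).count();
    }
}
```

After a successful HEAD only migration, this count would be expected to be zero for every object.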

Additional Notes:

I chose the Brown dataset for testing as it has RELS-INT entries, which add datastream versions during migration if modifications need to occur. My initial changes were failing here, but by adding some basic state tracking, similar to the datastreamStates, it is easy enough to check whether a datastream should have said changes or not.

Also since deletions occur at the end of a migration, the --head-only option will still migrate the last existing version of a datastream. This feels ok, but thought I would mention it in case there are any other thoughts on how deleted states should be handled.

Interested parties

@fcrepo/committers


pwinckles

Only looked at the code. Haven't run it yet

pwinckles · 2022-04-28T12:57:46Z

README.md

+                             A list of datastreams to migrate only the HEAD of.
+                               Only used if --head-only is specified.


It's unclear from this description that this option expects a file that contains one datastream id per line. Why not have this just be a comma separated list? It's unlikely to be very long.

pwinckles · 2022-04-28T13:01:11Z

src/main/java/org/fcrepo/migration/PicocliMigrator.java

@@ -297,12 +306,14 @@ public Integer call() throws Exception {
                ocflStagingDir.toPath(), migrationType, user, userUri, algorithm, disableChecksumValidation)
                .getObject();

+        final HeadOnlyDatastreamManager headOnlyManager = new HeadOnlyDatastreamManager(headOnly, headOnlyList);


You should enforce that headOnly is not false when headOnlyList is set
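
A check along these lines could look like the following sketch. The class name, method name, and error message are hypothetical and not taken from the PR.

```java
// Hypothetical validation sketch; not the PR's implementation.
class HeadOnlyOptionCheck {
    // Rejects an id list that was supplied without enabling HEAD only mode.
    static void validate(final boolean headOnly, final String headOnlyIds) {
        final boolean idsGiven = headOnlyIds != null && !headOnlyIds.isBlank();
        if (idsGiven && !headOnly) {
            throw new IllegalArgumentException(
                "--head-only-ids can only be used together with --head-only");
        }
    }
}
```

Failing fast before the migration starts avoids silently ignoring the id list.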

pwinckles · 2022-04-28T13:02:47Z

src/main/java/org/fcrepo/migration/handlers/ocfl/ArchiveGroupHandler.java

+    private static final String HEAD_ONLY_OK = "OK";
+    private static final String HEAD_ONLY_SKIP = "SKIP";


Use an enum for this rather than a string
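
Something like the sketch below, where the enum and class names are only illustrative and the map stands in for the handler's per-datastream state tracking:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of replacing the "OK"/"SKIP" strings with an enum.
class HeadOnlyStateSketch {
    enum HeadOnlyState { OK, SKIP }

    private final Map<String, HeadOnlyState> headOnlyStates = new HashMap<>();

    // Records whether a datastream version was migrated or skipped.
    void record(final String dsId, final boolean skipped) {
        headOnlyStates.put(dsId, skipped ? HeadOnlyState.SKIP : HeadOnlyState.OK);
    }

    // True when the last recorded state for this datastream is SKIP.
    boolean wasSkipped(final String dsId) {
        return headOnlyStates.get(dsId) == HeadOnlyState.SKIP;
    }
}
```

The compiler then catches misspelled states, and comparisons can use `==` instead of string equality.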

pwinckles · 2022-04-28T13:04:18Z

src/main/java/org/fcrepo/migration/handlers/ocfl/ArchiveGroupHandler.java

@@ -246,6 +256,15 @@ public void processObjectVersions(final Iterable<ObjectVersionReference> version
                    dsCreateDates.put(dsId, dv.getCreated());
                    datastreamStates.put(f6DsId, dv.getDatastreamInfo().getState());
                }
+
+                headOnlyStates.put(f6DsId, HEAD_ONLY_OK);
+                final var skip = headOnlyDatastreamManager.accept(dsId);


Perhaps this method should be called something more descriptive like shouldMigrateOnlyHead. Personally, I find the generic accept makes it hard to interpret the meaning of the return value.

pwinckles · 2022-04-28T20:06:11Z

Actually, thinking about it some more, won't this code still produce an OCFL version per F3 object version?

mikejritter · 2022-04-28T22:44:51Z

> Actually, thinking about it some more, won't this code still produce an OCFL version per F3 object version?

@pwinckles It can create additional OCFL versions from filename changes and deletions (potentially inactive too?). The version created from RELS-INT seemed ok, as it doesn't involve the datastream itself changing. The deletion was something I was thinking about, but it didn't really feel right to omit the datastream entirely.

I was hoping to get to your other comments and push changes up today but likely won't until tomorrow.

pwinckles · 2022-04-29T12:23:23Z

@mikejritter But I thought the goal of this task was to squash all F3 versions into a single OCFL version? This PR doesn't do that because it loops over the F3 object versions and creates a new OCFL version for each. I haven't thought about it in depth, but I think what you'd need to do is something like this: instead of iterating the versions, grab the ObjectReference, add a new method to it that only returns the head versions of its datastreams, and iterate those to create the migrated object. I guess the problem with that approach is that you might have trouble getting the created date?
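
As a rough sketch of the "head versions only" idea: the `DsVersion` type and `headVersions` method below are hypothetical stand-ins, not the migration-utils API, and the input is assumed to be ordered oldest to newest.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of collapsing datastream versions down to their heads.
class HeadVersionSketch {
    // Minimal stand-in for a datastream version.
    static final class DsVersion {
        final String dsId;
        final String versionId;

        DsVersion(final String dsId, final String versionId) {
            this.dsId = dsId;
            this.versionId = versionId;
        }
    }

    // Returns one entry per datastream id: the newest version seen for it.
    static List<DsVersion> headVersions(final List<DsVersion> oldestToNewest) {
        final Map<String, DsVersion> heads = new LinkedHashMap<>();
        for (final DsVersion v : oldestToNewest) {
            // Later versions overwrite earlier ones for the same datastream.
            heads.put(v.dsId, v);
        }
        return new ArrayList<>(heads.values());
    }
}
```

Iterating the returned list once would then yield a single migrated version per object.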

pwinckles · 2022-04-29T12:26:37Z

Oh, the other problem is that you only want to do this for a subset of datastreams. That's tricky. Maybe your approach is right and there just needs to be some additional logic around whether or not the session is committed? Or maybe I'm wrong and people don't really care about the additional versions and they just want to clean up their binary history, like this PR is currently doing.

mikejritter · 2022-05-03T19:11:54Z

@pwinckles I've only just been able to get back to this. I hadn't really considered a squash -- I had thought about pruning the extra versions from the session but that was before I understood how the migrator was working.

I think the option of having a separate method handle the head only datastreams is interesting, it's just tricky since the RELS-INT can create versions for other datastreams, so either way there needs to be some way to resolve all the changes. Although I suppose if we know only the RELS-INT creates these changes, we could preprocess it and refer to that when handling head only datastreams (maybe handling the delete as well). I'm just thinking out loud though.

Also, with respect to what people are actually expecting from this, maybe we should get some feedback from the people who commented on the ticket to see what their expectations are. It wouldn't make sense to merge this and then have the same issue crop up later because people expected x and got y.

mikejritter · 2022-05-11T19:22:56Z

After discussion on the call last Thursday, I've updated this to create a single version for Fedora objects being migrated. I've removed the --head-only-ids option for now as there is quite a bit of complexity in having migrations allow certain datastreams to migrate with their versions while flattening others.

I also realized while testing that this wouldn't be compatible with atomic migrations, so I'm having the migrator throw an IllegalArgumentException if both are specified. It likely doesn't need too much logic to be supported, but I wanted to get something out which can be tested.

mikejritter added 11 commits April 12, 2022 13:24

Class to track pids/datastreams for head only migrations

3c415f0

Add head only migration options

4010023

Skip processing datastreams when head only is requested

f0e8477

Add integration tests for HeadOnlyPidListManager

f383868

Update tests to include HeadOnlyPidListManager

4e9fe02

Add head only options and usage

b50731b

Update to process only datastream ids

f89a550

Track if datastreams are processed or not for head only

1bf7352

This is done so that when head only migrations are requested, operations can be skipped if an old datastream version was not migrated.

Update head only list description

093388a

Use a temp index dir to test windows IT failure

c71e27b

Remove unused xml

4dd3d74

pwinckles reviewed Apr 28, 2022


mikejritter added 4 commits May 2, 2022 15:57

Create enum for head only processing options

8de6a64

Rename accept to isHeadOnly

55a5fec

Switch from file input to string

c1bc0d6

Add check if --head-only-ids is set without --head-only

799a1f0

mikejritter added 3 commits May 9, 2022 15:46

Update to flatten versions if head only is set

be97410

Remove --head-only-ids for now

d293501

Update readme

5895ea9

dbernstein approved these changes Jun 9, 2022


dbernstein merged commit 217a309 into fcrepo-exts:main Jun 9, 2022
