New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HEAD only migration option #185
Conversation
This is done so that when head only migrations are requested, operations can be skipped if an old datastream version was not migrated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only looked at the code. Haven't run it yet
README.md
Outdated
A list of datastreams to migrate only the HEAD of. | ||
Only used if --head-only is specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear from this description that this option expects a file that contains one datastream id per line. Why not have this just be a comma separated list? It's unlikely to be very long.
@@ -297,12 +306,14 @@ public Integer call() throws Exception { | |||
ocflStagingDir.toPath(), migrationType, user, userUri, algorithm, disableChecksumValidation) | |||
.getObject(); | |||
|
|||
final HeadOnlyDatastreamManager headOnlyManager = new HeadOnlyDatastreamManager(headOnly, headOnlyList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should enforce that headOnly
is not false
when heaedOnlyList
is set
private static final String HEAD_ONLY_OK = "OK"; | ||
private static final String HEAD_ONLY_SKIP = "SKIP"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use an enum for this rather than a string
@@ -246,6 +256,15 @@ public void processObjectVersions(final Iterable<ObjectVersionReference> version | |||
dsCreateDates.put(dsId, dv.getCreated()); | |||
datastreamStates.put(f6DsId, dv.getDatastreamInfo().getState()); | |||
} | |||
|
|||
headOnlyStates.put(f6DsId, HEAD_ONLY_OK); | |||
final var skip = headOnlyDatastreamManager.accept(dsId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this method should be called something more descriptive like shouldMigrateOnlyHead
. Personally, I find the generic accept
makes it hard to interpret the meaning of the return value.
Actually, thinking about it some more, won't this code still produce an OCFL version per F3 object version? |
@pwinckles It can create additional OCFL versions from filename changes and deletions (potentially inactive too?). The version created from RELS-INT seemed ok, as it doesn't involve the datastream itself changing. The deletion was something I was thinking about, but it didn't really feel right to omit the datastream entirely. I was hoping to get to your other comments and push changes up today but likely won't until tomorrow. |
@mikejritter But, I thought the goal of this task was to squash all F3 versions into a single OCFL version? This PR doesn't do that because it loops over the F3 object versions and creates a new OCFL version for each. I haven't thought about it in depth, but I think what you'd need to do is something like instead of iterating the versions, grab the |
Oh, the other problem is that you only want to do this for a subset of datastreams. That's tricky. Maybe your approach is right and there just needs to be some additional logic around whether or not the session is committed? Or maybe I'm wrong and people don't really care about the additional versions and they just want to cleanup their binary history -- like this PR is currently doing. |
@pwinckles I've only just been able to get back to this. I hadn't really considered a squash -- I had thought about pruning the extra versions from the session but that was before I understood how the migrator was working. I think the option of having a separate method handle the head only datastreams is interesting, it's just tricky since the RELS-INT can create versions for other datastreams, so either way there needs to be some way to resolve all the changes. Although I suppose if we know only the RELS-INT creates these changes, we could preprocess it and refer to that when handling head only datastreams (maybe handling the delete as well). I'm just thinking out loud though. Also wrt to what people are actually expecting from this, maybe we should get some feedback from the people who commented on the ticket to see what their expectations are. It wouldn't make sense to merge this and then have the same issue crop up later because people expected x and got y. |
After discussion on the call last Thursday, I've updated this to create a single version for Fedora objects being migrated. I've removed the I also realized while testing that this wouldn't be compatible with atomic migrations, so I'm having the migrator throw an |
JIRA Ticket: https://fedora-repository.atlassian.net/browse/FCREPO-3674
What does this Pull Request do?
Adds cli option for specifying HEAD only migrations and an option for passing in a list of datastream ids which should only have their HEAD migrated.
What's new?
--head-only
option for initiating HEAD only migrations--head-only-ids
option for passing in a list of datastream idsHeadOnlyDatastreamManager
for processing the datastream id list and testing if an id matchesHow should this be tested?
Depending on the dataset available, run a migration with the head only option and with the ``--head-only-ids` option.
E.g. with the brown dataset:
Then navigate to the ocfl-root of the migrated dataset and search for objects with multiple datastream versions. This can be done manually or by checking using a mix of rocfl/cli tools:
As a side note, this works for me using zsh but I haven't tested in other shells. The
find
command lists all files under thecontent
directory and filters out n-triples. Then cut/sort/uniq to trim out excess paths (e.g. to make $uuid/v1/content/ds-id and $uuid/v2/content/ds-id to both be /ds-id) and search for duplicates.Additional Notes:
I chose the Brown dataset for testing as it has RELS-INT entries which adds datastreams versions during migration if there are modifications which need to occur. My initial changes were failing here but by adding some basic state tracking, similar to the
datastreamStates
, it is easy enough to check if a datastream should have said changes or not.Also since deletions occur at the end of a migration, the
--head-only
option will still migrate the last existing version of a datastream. This feels ok, but thought I would mention it in case there are any other thoughts on how deleted states should be handled.Interested parties
@fcrepo/committers