Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to use fcrepo-ocfl-storage #110

Merged
merged 4 commits into from Sep 16, 2020

Conversation

pwinckles
Copy link
Contributor

@pwinckles pwinckles commented Aug 24, 2020

Jira: https://jira.lyrasis.org/browse/FCREPO-3403

What does this Pull Request do?

Updates the migration-utils to use the new fcrepo-storage-ocfl library. I had to add another partial implementation to handle the vanilla OCFL output. There should be no difference in the output, just a change in implementation.

How should this be tested?

Run the migration utils on a a F3 export and start F6 on the output.

Interested parties

@fcrepo4/committers

@pwinckles pwinckles changed the title Change ocfl storage Update to use fcrepo-ocfl-storage Aug 24, 2020
@dbernstein dbernstein self-requested a review September 10, 2020 22:06
Copy link
Contributor

@dbernstein dbernstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @pwinckles as always. Just a few cosmetic issues.

return new DefaultOcflObjectSessionFactory(ocflRepo, stagingDir, objectMapper, CommitType.NEW_VERSION,
"Generated by Fedora 3 to Fedora 6 migration", user, userUri);
} else {
return new VanillaOcflObjectSessionFactory(ocflRepo, stagingDir,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of VanillaOcflObjectSessionFactory, may I suggest BasicOcflObjectSessionFactory or SimpleOcflObjectSessionFactory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


if ("RE".contains(dv.getDatastreamInfo().getControlGroup())) {
InputStream content = null;
// Write a file for external content only for vanilla OCFL migration
if (migrationType == MigrationType.VANILLA_OCFL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment about use of Vanilla name.


if (dv.isFirstVersionIn(ov.getObject())) {
dsCreateDates.put(dsId, dv.getCreated());
}
final var createDate = dsCreateDates.get(dsId);

final var datastreamHeaders = createDatastreamHeaders(dv, f6DsId, f6ObjectId,
datastreamFilename, mimeType, createDate, contentPath);
datastreamFilename, mimeType, createDate);

if ("RE".contains(dv.getDatastreamInfo().getControlGroup())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment explaining the purpose of this line? Also perhaps the above "R" and "E" map keys should be made into constants and this line should reference them? Just a suggestion for clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not write that code nor interact with it in this PR, but, yes, I can clean it up.

.workDir(stagingDir)
.buildMutable();

if (migrationType == MigrationType.F6_OCFL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to call the migration type "F6_OCFL" thus tying in the version to the migration type? Or should it be something like "FEDORA_OCFL"?

Copy link
Contributor Author

@pwinckles pwinckles Sep 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my objection about the datastream ids, I personally don't think this name is inappropriate. If there is an F7 and it uses OCFL, what's to say that it uses the same format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I have less of a problem with changing this name than I do with the variable names. The problem with the variable names is that the class is working with two different versions of Fedora, so it makes sense to distinguish between versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the enum

@@ -148,68 +137,57 @@ public void processObjectVersions(final Iterable<ObjectVersionReference> version
final String dsId = dv.getDatastreamInfo().getDatastreamId();
final String f6DsId = resolveF6DatastreamId(dsId, f6ObjectId, mimeType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to use fedoraDsId rather than embedding the version in the variable name unless you have an objection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with the resolveF6... method name.

@@ -315,57 +357,63 @@ public void processObjectSingleVersionVanillaFormatWithExternalBinary() {
final var ds2 = datastreamVersion(dsId2, true, "E", "text/plain", "", "https://external");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above - also method name change.

final int index) {
final var captor = ArgumentCaptor.forClass(InputStream.class);
verify(session, atLeastOnce()).put(eq(path), captor.capture());
private void verifyBinary(final String content, final DatastreamVersion datastreamVersion) {
try {
if ("RE".contains(datastreamVersion.getDatastreamInfo().getControlGroup())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"RE" to constant

}
}

private void verifyVanillaDescRdf(final String content,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method name change.

/**
* @author pwinckles
*/
public class VanillaOcflObjectSessionTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class name change

private Path staging;

private MutableOcflRepository ocflRepo;
private OcflObjectSessionFactory vanillaSessionFactory;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change variable name.

@pwinckles
Copy link
Contributor Author

Hmm... I'm not sure if I agree about the variable names. All this class can claim to do is migrate from F3 to F6. What were the datastream ids in F3 are no longer the ids in F6. Calling them simply fedora does not clarify what they are.

@dbernstein dbernstein changed the base branch from master to main September 11, 2020 16:47
@dbernstein
Copy link
Contributor

I guess my concern is that we'll be able to use this tool to migrate to F7, F8, etc so we don't want the code to indicate a specific version. But I see your point as well. I'll leave it up to your professional judgement.

@dbernstein
Copy link
Contributor

@awoods : would you be willing to take a look at my suggestions here re Fedora 6 method and variable names in the code? I don't think it's a big deal if we leave as is since this tool will likely serve as a migration path from F3->F6. If the on disc format changes from F6->F7 or Fn we will need a separate migration/upgrade tool to accomplish that task. Perhaps it is in fact better to indicate Fedora6 to make it clear that this is migrating to F6 only. I only brought this up because I thought our goal with the OCFL setup was to have a file layout less subject to major changes between versions. So in theory, users of migration-utils could migrate from F3-> any Fedora readable OCFL regardless of the version.

@dbernstein
Copy link
Contributor

I tried running the following per the instructions on the README:

Daniels-MBP$ java -jar target/migration-utils-4.4.1-SNAPSHOT-driver.jar --source-type=legacy --limit=100 --target-dir=target/test/ocfl --objects-dir=src/test/resources/legacyFS/objects --datastreams-dir=src/test/resources/legacyFS/datastreams
INFO 15:05:57.804 (InternalIDResolver) No index directory specified.  Creating temporary index at "/var/folders/pt/cd6hk_2x7314mj5h1_f43v7h0000gn/T/tempfile12253528601569998631basedir/index".
INFO 15:05:57.896 (InternalIDResolver) Building an index of all the datastreams in "src/test/resources/legacyFS/datastreams"...
INFO 15:05:58.533 (Migrator) Processing "example:1"...
Resource info:fedora/example:1/AUDIT's file size does not match expectation. Expected: -1; Actual: 3481
Usage: migration-utils [-hrVx] [--debug] -a=<targetDir> [-d=<f3DatastreamsDir>]
                       [-e=<f3ExportedDir>] [-f=<f3hostname>] [-i=<indexDir>]
                       [-l=<objectLimit>] [-m=<migrationType>]
                       [-o=<f3ObjectsDir>] [-p=<pidFile>] -t=<f3SourceType>
                       [-u=<user>] [-U=<userUri>]
...
INFO 15:05:58.879 (InternalIDResolver) Deleting index directory at "/var/folders/pt/cd6hk_2x7314mj5h1_f43v7h0000gn/T/tempfile12253528601569998631basedir/index"...

Trying it with the uw-madison tar gz file @awoods sent me I got this error:

Daniels-MBP$ java -jar target/migration-utils-4.4.1-SNAPSHOT-driver.jar --source-type=legacy --limit=100 --target-dir=target/test/ocfl --objects-dir=/Users/danielbernstein/Downloads/uw-madison-fedora-3-sample/objects --datastreams-dir=/Users/danielbernstein/Downloads/uw-madison-fedora-3-sample/datastreams
INFO 15:08:36.291 (InternalIDResolver) No index directory specified.  Creating temporary index at "/var/folders/pt/cd6hk_2x7314mj5h1_f43v7h0000gn/T/tempfile17955120112426763856basedir/index".
INFO 15:08:36.437 (InternalIDResolver) Building an index of all the datastreams in "/Users/danielbernstein/Downloads/uw-madison-fedora-3-sample/datastreams"...
INFO 15:08:37.483 (Migrator) Processing "1711.dl:XMLSchema-UWDC-MODS-3.6-PEERS"...
Unable to resolve internal ID "1711.dl:XMLSchema-UWDC-MODS-3.6-PEERS+SCHEMA+SCHEMA.0"!
Usage: migration-utils [-hrVx] [--debug] -a=<targetDir> [-d=<f3DatastreamsDir>]
                       [-e=<f3ExportedDir>] [-f=<f3hostname>] [-i=<indexDir>]
                       [-l=<objectLimit>] [-m=<migrationType>]
                       [-o=<f3ObjectsDir>] [-p=<pidFile>] -t=<f3SourceType>
                       [-u=<user>] [-U=<userUri>]

@pwinckles or @awoods : do you see anything that I'm doing wrong in my command line?

@pwinckles
Copy link
Contributor Author

@dbernstein Try it with --source-type=akubra

@dbernstein
Copy link
Contributor

@pwinckles : thanks for the suggestion. I tried that:

Daniels-MBP$ java -jar target/migration-utils-4.4.1-SNAPSHOT-driver.jar --source-type=akubra --limit=100 --target-dir=target/test/ocfl --objects-dir=/Users/danielbernstein/Downloads/uw-madison-fedora-3-sample/objects --datastreams-dir=/Users/danielbernstein/Downloads/uw-madison-fedora-3-sample/datastreams
INFO 15:48:03.998 (InternalIDResolver) No index directory specified.  Creating temporary index at "/var/folders/pt/cd6hk_2x7314mj5h1_f43v7h0000gn/T/tempfile274630286553936104basedir/index".
INFO 15:48:04.131 (InternalIDResolver) Building an index of all the datastreams in "/Users/danielbernstein/Downloads/uw-madison-fedora-3-sample/datastreams"...
INFO 15:48:05.368 (Migrator) Processing "1711.dl:XMLSchema-UWDC-MODS-3.6-PEERS"...
Resource info:fedora/1711.dl:XMLSchema-UWDC-MODS-3.6-PEERS/RELS-INT's file size does not match expectation. Expected: 618; Actual: 604

I wondering if this may be a cross platform issue?

@pwinckles
Copy link
Contributor Author

Interesting... That could very well be

@dbernstein
Copy link
Contributor

What's strange is that both the uwm sample set as well as the test dataset at src/test/resource/legacyFS are running into problems, albeit they appear to be different issues.

@pwinckles
Copy link
Contributor Author

@dbernstein I'm seeing those errors now when I'm running the tests. The cause is that the storage layer was updated to verify content size after creating this PR. Some of the F3 resources are reporting incorrect content sizes, which is causing them to be rejected. Will fix

@pwinckles
Copy link
Contributor Author

@dbernstein Okay, I just pushed the fix for the content size as well as most of your pr comments. I did not change the "f6" variable names, because I think it makes the code in that class substantively harder to understand because it is simultaneously dealing with F3 and F6 identifiers.

@dbernstein
Copy link
Contributor

@pwinckles : thank you sir. I will test and assuming all goes well will merge.

@dbernstein dbernstein merged commit ebf4e5c into fcrepo-exts:main Sep 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants