Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow passing a local profile path for profile validation #54

Closed
wants to merge 1 commit into from

Conversation

prettybits
Copy link

This introduces the option to specify a locally stored BagIt profile for profile validation instead of always trying to fetch the profile from the network. I was missing this when developing a new profile and testing validation, where the profile doesn't have a reachable URL yet. Will also be good to have when used in a fixed workflow where the profile JSON could just be stored locally and save the network request when validating incoming packages.

I chose --profile-path as the name for the new option.

Is this something you would be OK with introducing? Happy to discuss this further if needed.

@mikedarcy
Copy link
Collaborator

mikedarcy commented Mar 29, 2023

There is indeed no way to specify/override the bag profile in the CLI and it certainly seems reasonable to want parity between the API and CLI for this function. However, after looking at this with fresh eyes (it's been quite a while since it was first implemented), there are some additional points I'd like to raise.

The bagit-profiles-validator code (which is what is actually being used for validation) uses urllib.urlopen under the hood, which supports using local files via the file: handler. So in your bag-info.txt you can specify a local file. We use this in the unit tests, e.g.:

BagIt-Profile-Identifier: file:./profiles/bdbag-profile.json

So while I would say this is currently supported, it is not clearly stated anywhere in the documentation. In fact the current documentation for validate_bag_profile looks flat-out wrong stating that it should be "A normalized, absolute path to a BagIt-Profile file" when in reality it's a URL.

However, as I review some of this logic, it now occurs to me that providing a way to override the profile path via validate_bag_profile really shouldn't exist either, since part of the bag validation process is matching the URL used to retrieve the profile to what is specified in the bag's bag-info.txt. This is the relevant logic from bagit-profiles-validator:

            if bag_info[profile_id_tag] != self.url:
                self._fail(
                    "%s: '%s' tag does not contain this profile's URI: <%s> != <%s>"
                    % (bag, profile_id_tag, bag_info[profile_id_tag], self.url)
                )

This has got me thinking that implementing bag_profile as a overridable profile path parameter in validate_bag_profile was a mistake in the first place. It just gives you some extra rope to hang yourself since it is quite possible you specify something different than what is in the bag's bag-info.txt, and if you don't and it is actually the same value, then what's the point?

I am curious how you tested this with a local path. Did you include the local path as a string in your test bag's BagIt-Profile-Identifier? I assume so since otherwise you should have run into a ProfileValidationError with the mismatched URI.

My current thinking is now that this parameter should be deprecated, and the documentation updated to reflect that it is possible to specify a local path via file: for the BagIt-Profile-Identifier in bag-info.txt. What are your thoughts?

@prettybits
Copy link
Author

prettybits commented Mar 30, 2023

Hi @mikedarcy, thanks for taking a look so promptly and sharing your concerns!

I was aware of the possibility to pass file: URLs to urllib.urlopen() but this would only help for the local (and automated) testing scenario as you state and not for validating Bags that we receive from donors - which should of course specify the published URL in their bag-info.txt and not a local file path - against locally provided/cached profiles. Saving the network request is my main goal here as I ideally don't want to have to rely on caching being set up properly on the servers involved, and similar mechanisms are relatively common like for example XML catalogs for XML schema validation.

So while I would say this is currently supported, it is not clearly stated anywhere in the documentation. In fact the current documentation for validate_bag_profile looks flat-out wrong stating that it should be "A normalized, absolute path to a BagIt-Profile file" when in reality it's a URL.

That definitely sounds worth fixing, true.

However, as I review some of this logic, it now occurs to me that providing a way to override the profile path via validate_bag_profile really shouldn't exist either, since part of the bag validation process is matching the URL used to retrieve the profile to what is specified in the bag's bag-info.txt.

This is not currently a problem in this PR, as the URL is explicitly passed to the BDBProfile constructor and taken from another source when a local profile is provided:

# Instantiate a profile, supplying its URI.
if not profile_url:
    profile_url = bag.info.get(BAG_PROFILE_TAG, None)
    if not profile_url:
        raise bdbp.ProfileValidationError("Bag does not contain a BagIt-Profile-Identifier")

This logic was there before my changes as well but using profile_path instead of profile_url as the variable name. That's also why I didn't have to change the BagIt-Profile-Identifier of my test bags to a local path.

But it's good that you bring this up as it made me notice that the profile URL the BDBProfile is instantiated with should be conditionally taken from different sources now that local profiles could be passed:

  1. If a profile path has been given, take the URL from the profile's Bagit-Profile-Info->BagIt-Profile-Identifier
  2. If not, take the URL from the bag as before and bagit-profiles-validator will fetch the profile as the profile parameter remains None

It just gives you some extra rope to hang yourself since it is quite possible you specify something different than what is in the bag's bag-info.txt, and if you don't and it is actually the same value, then what's the point?

This way you would then indeed get an error if the specified profile identifies itself differently than the profile targeted by the bag which is as the spec intends it I believe? Leaving it as I have it currently would never fail as the source of the URL is the bag which will of course always match itself further down - arguably that's more in line with what I'd expect, having a --profile-path act as an override to validate bags against any profile I want to check it against.

Not sure which is the better way to go, deprecating the option entirely would not be a good outcome in my view. But I don't know if I made where I'm coming from understandable or if you still disagree with the goal?

@mikedarcy
Copy link
Collaborator

OK, this makes sense to me now that I've looked at it a bit more. I failed to notice that the Profile constructor takes an already instantiated dict or JSON string representing the profile and that is what effectively overrides the URL and bypasses the BagIt-Profile-Identifier URL match check. So, in principle I think the feature request is fine, but the PR as-is has some problems.

I can understand why you made some stylistic changes in places with simplifying some boolean truthiness and removing some line breaks but that stuff really doesn't belong to the feature request and makes for a more confusing PR and merge history. I take note of those and would be happy to consider them in a separate PR.

Also, you relocated the conditionalized processing of the profile validation step in bdbag_cli.py and I don't understand the point of that change. Can you provide some justification or reasoning for this modification? I realize that the order of processing the various arguments is a little odd but there is some reasoning behind the current order, and the motivation for having the profile validation near the end is so you could essentially create, validate, archive, and validate the profile (which might include requiring serialization) of a bag all in one command.

Additionally, I think renaming the existing profile_path argument to profile_url is an API breaking change and is going to potentially cause confusion when it starts semantically meaning something else, even if it was semantically broken in the first place. There really isn't a clean way to fix this, so what I would rather do is just leave the argument as-is and use it for the more appropriate local profile read/load/override, and remove the notion of being able to supply an overridable profile_url at all, which we both agree doesn't make a lot of sense. I don't suspect this will affect a lot of users but it needs to be noted that this is a behavioral change from what was currently implemented. I don't see deprecation helping here as it was basically broken to begin with. So the best way to frame this is that a long-standing implementation bug is being fixed. The good news is that the existing documentation will now actually make sense and reflect what should have been happening all along!

Lastly, this is going to need a unit test. It's a pretty simple one so I don't see that being too cumbersome.

All that being said, I would ask that you close this PR and submit a new one with just the minimum necessary changes and a test case. Other changes are welcome in other PRs. Alternatively, since I now get the gist of this feature request, you could just file an issue outlining the use-case and I can make the changes necessary, including adding a test case. I'm amenable to either path, so I will leave it up to you.

In any case thanks for bringing this up! I do agree it is a useful feature and also illuminates some mistakes in the original implementation. Cheers 👍

mikedarcy added a commit that referenced this pull request Jul 21, 2023
@mikedarcy
Copy link
Collaborator

@prettybits I've added the functionality discussed above in 3961d9e for the 1.7.0 release, along with a couple of unit tests. Please have a look (and test, if you have time) and let me know if this meets your requirements. Thanks again for pointing out the issue.

@prettybits
Copy link
Author

Thanks for taking this up again @mikedarcy, I got sidetracked a bit too much and lost sight of this for a while, sorry.

I think the implementation of validate_bag_profile is mostly as it should be now, I agree that adding an extra profile_url parameter as I initially did doesn't really make a whole lot of sense. I see that in your commit you left out the CLI part of my changes, which for my use case was also important as I wasn't intending to integrate this via Python but call it via the CLI. Is this still something you would accept changes for as well or want to adapt from this PR yourself? I would have just force-pushed my branch to reflect the minimal necessary changes but could open a new PR as well if you prefer.

I failed to notice that the Profile constructor takes an already instantiated dict or JSON string representing the profile and that is what effectively overrides the URL and bypasses the BagIt-Profile-Identifier URL match check.

The check for matching profile URLs is not overridden by the profile parameter, matches are still checked against the profile URL from the bag's bag_info.txt using the url that's passed in the first, i.e. currently that will just always match as it is the profile_url also taken from the bag and not the (local) profile. Arguably that is also a shortcoming of the backing bagit-profile library which just assumes the profile fetched from the URL (if no profile dict was passed) will have a matching BagIt-Profile-Identifier AFAICS.

I can understand why you made some stylistic changes in places with simplifying some boolean truthiness and removing some line breaks but that stuff really doesn't belong to the feature request and makes for a more confusing PR and merge history. I take note of those and would be happy to consider them in a separate PR.

You're right, sorry. It's a not ideal tendency of mine to make stylistic changes / linter fixes when I'm working on an area of code while I'm there, but I fully understand this can make seeing the essential changes harder.

Also, you relocated the conditionalized processing of the profile validation step in bdbag_cli.py and I don't understand the point of that change. Can you provide some justification or reasoning for this modification? I realize that the order of processing the various arguments is a little odd but there is some reasoning behind the current order, and the motivation for having the profile validation near the end is so you could essentially create, validate, archive, and validate the profile (which might include requiring serialization) of a bag all in one command.

That was partially motivated by the intended workflow as stated in the BagIt Profiles Specification ("specifically, it must complete the Bag if fetch.txt is present, validate the complete Bag against the Profile, and then [emphasis mine] validate the Bag against the cannonical BagIt spec.") and it also making more sense to me to fail faster in the comparatively cheap profile validation step before checksums are computed during full bag validation. I'm not sure that switching the order would affect your stated motivation but I'd have to think this through again. In any case you're right that I should at least have provided motivation for that change in my initial description.

mikedarcy added a commit that referenced this pull request Jul 24, 2023
… before bag validation in CLI execution order. See #54.

Add unit test and doc entry for above.
Fix a typo in a related logging statement.
@mikedarcy
Copy link
Collaborator

Thanks for taking this up again @mikedarcy, I got sidetracked a bit too much and lost sight of this for a while, sorry.

I think the implementation of validate_bag_profile is mostly as it should be now, I agree that adding an extra profile_url parameter as I initially did doesn't really make a whole lot of sense. I see that in your commit you left out the CLI part of my changes, which for my use case was also important as I wasn't intending to integrate this via Python but call it via the CLI. Is this still something you would accept changes for as well or want to adapt from this PR yourself? I would have just force-pushed my branch to reflect the minimal necessary changes but could open a new PR as well if you prefer.

Sorry, I just plum forgot that the CLI modification was still necessary. Since I've already integrated part of your PR, I will just integrate the CLI part as well, and add another unit test for it.

Also, you relocated the conditionalized processing of the profile validation step in bdbag_cli.py and I don't understand the point of that change. Can you provide some justification or reasoning for this modification? I realize that the order of processing the various arguments is a little odd but there is some reasoning behind the current order, and the motivation for having the profile validation near the end is so you could essentially create, validate, archive, and validate the profile (which might include requiring serialization) of a bag all in one command.

That was partially motivated by the intended workflow as stated in the BagIt Profiles Specification ("specifically, it must complete the Bag if fetch.txt is present, validate the complete Bag against the Profile, and then [emphasis mine] validate the Bag against the cannonical BagIt spec.") and it also making more sense to me to fail faster in the comparatively cheap profile validation step before checksums are computed during full bag validation. I'm not sure that switching the order would affect your stated motivation but I'd have to think this through again. In any case you're right that I should at least have provided motivation for that change in my initial description.

OK, that makes sense. Unfortunately there is a small conflict with moving this because of how the existing order allows for profile validation upon bag creation as well. Specifically, the profile serialization can be validated after a bag serialization step during the creation process. Admittedly, this logic is a bit convoluted, but I think I have a solution for this where both behaviors can be supported. See 98b716e.

@prettybits
Copy link
Author

Thanks for the prompt commit, I left a comment there as well just now.

The logic is a bit hard to follow, yes, if I'm reading this right I don't think it's currently even possible to create and validate a bag in the same invocation going by this conditional but it works with updating? Besides that, couldn't the creation of the archive file (via the --archiver argument, i.e. bag serialization) be done before any of the validation steps given that it doesn't seem to delete the bag folder and creates a new file tracked by a separate archive variable anyway? Then the profile validation could always happen first including extraction to temp_path when the bag to validate is already serialized.

@mikedarcy
Copy link
Collaborator

You are right about the conditional. The funny part is that this used to work in earlier version when doing the bag create, but the introduction of extra logic into that conditional some versions back actually broke it. I am going to try to restore that part of the conditional to allow for this sequence of events in the bag creation case. Now it will be even more convoluted ;).

I'd rather not move archiving ahead of validation since having fail-fast behavior on profile validation would abort the archiving if validation failed, saving a potentially lengthy archive and cleanup operation.

@mikedarcy
Copy link
Collaborator

Support for this has been added to the 1.7.1 release. Closing PR.

@mikedarcy mikedarcy closed this Aug 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants