When making a new dataset, assert keywords exist #167
Conversation
```python
if not metadata.keywords:
    raise AssertionError(
        "New dataset is missing keywords and cannot be archived."
    )
```
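For context, a minimal sketch of where such a guard might sit. The names here (`DatasetMetadata`, `create_archive`) are hypothetical stand-ins, not the project's actual API; the point is just that the check fails fast, before any download or upload work begins:

```python
# Hypothetical sketch: fail fast at archive creation, before any
# download/upload work. `DatasetMetadata` and `create_archive` are
# illustrative names, not the project's real interfaces.
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    keywords: list[str] = field(default_factory=list)


def create_archive(metadata: DatasetMetadata) -> str:
    # Guard from the diff above: new archives must define keywords.
    if not metadata.keywords:
        raise AssertionError(
            "New dataset is missing keywords and cannot be archived."
        )
    # ...download, package, and upload would follow here...
    return f"archived {metadata.name}"
```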
All of our datasets should really have keywords, but also this seems like a funny failure. What is it about the way the Datastore is constructed that makes this a breaking failure?
Once Upon A Time we relied on a UUID in the keywords rather than the DOIs to identify related archive lineages. I wonder if this is a holdover from that era somehow.
The archiver and the datastore currently have different validation conditions. On the archiver end an archive without keywords is valid. On the datastore end it is not ingestible. I'm trying to fix this by intervening at the point of creating a totally new archive, where keywords get defined for the first time, though they can later be updated. One other option here is to address the failure on the datastore end - i.e., to not have it fail if the dataset has no keywords.
The datastore relies on the datapackage library and its `.valid` function, which by default asserts that all fields in the datapackage have values. I'd prefer not to override that behavior, since I think it generally makes sense; instead, it's better to bake the error into the process of creating the archive. I suggest doing it here because this check runs before the entire download and upload process, so the failure would be quite fast. Using a datapackage validation method would mean catching this at the end of the archiving process, when the datapackage is uploaded to Zenodo.
Sorry if this was unclear! I agree that conforming to the datapackage validation expectations is the right way to go. I was worried it was some bespoke thing that we had imposed because of the UUID thing.
Current archiving infrastructure allows keywords for Zenodo datasets to be null. This poses a problem when using the datastore. This PR adds an assertion that mandates keywords for new archives to prevent this issue. You can test it by running it on MSHA, PHMSA, or EIA Water data with the `--initialize` flag.