When making a new dataset, assert keywords exist #167

Merged
merged 2 commits into from Sep 13, 2023

@e-belfer e-belfer commented Sep 11, 2023

The current archiving infrastructure allows keywords for Zenodo datasets to be null. This poses a problem when using the datastore. This PR adds an assertion that mandates keywords for new archives to prevent this issue. You can test it by running the archiver on MSHA, PHMSA, or EIA Water data with the --initialize flag.

Comment on lines +193 to +196
if not metadata.keywords:
    raise AssertionError(
        "New dataset is missing keywords and cannot be archived."
    )
Member

All of our datasets should really have keywords, but also this seems like a funny failure. What is it about the way the Datastore is constructed that makes this a breaking failure?

Once Upon A Time we relied on a UUID in the keywords rather than the DOIs to identify related archive lineages. I wonder if this is a holdover from that era somehow.

Member Author

The archiver and the datastore currently have different validation conditions. On the archiver end an archive without keywords is valid. On the datastore end it is not ingestible. I'm trying to fix this by intervening at the point of creating a totally new archive, where keywords get defined for the first time, though they can later be updated. One other option here is to address the failure on the datastore end - i.e., to not have it fail if the dataset has no keywords.
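
To make the mismatch concrete, here is a minimal sketch of two validators that disagree about the same metadata. The `Metadata` class and both `*_accepts` functions are hypothetical stand-ins written for illustration; they are not the archiver's or the datastore's actual code, and the rules they encode are assumptions based on the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical stand-in for a deposition's metadata."""

    title: str
    keywords: list[str] = field(default_factory=list)


def archiver_accepts(metadata: Metadata) -> bool:
    """Assumed archiver-side rule: keywords are optional."""
    return bool(metadata.title)


def datastore_accepts(metadata: Metadata) -> bool:
    """Assumed datastore-side rule: every field must have a value."""
    return bool(metadata.title) and bool(metadata.keywords)


# An archive created without keywords passes one check and fails the other.
no_keywords = Metadata(title="Example dataset")
print(archiver_accepts(no_keywords), datastore_accepts(no_keywords))
```

Under these assumptions the archive is created successfully and only fails later, at ingestion time, which is the gap this PR closes.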

Member Author

The datastore relies on the datapackage library and its .valid check, which by default asserts that all fields in the datapackage have values. I'd prefer not to override that behavior, since it generally makes sense, so I think it is better to bake the error into the process of creating the archive. I suggest doing it here because the check runs before the download and upload process starts, so the failure is fast. Using a datapackage validation method would instead mean failing at the end of the archiving process, when the datapackage is uploaded to Zenodo.
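
The fail-fast ordering argued for here can be sketched as follows. This is an illustration only: `create_new_archive`, `Metadata`, and the fake download and upload callables are hypothetical names, not the archiver's real API. Only the keyword check itself mirrors the lines added in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical stand-in for a new archive's metadata."""

    title: str
    keywords: list[str] = field(default_factory=list)


def create_new_archive(metadata, download, upload):
    """Check keywords first, then run the (assumed) expensive steps."""
    if not metadata.keywords:
        raise AssertionError(
            "New dataset is missing keywords and cannot be archived."
        )
    files = download()
    upload(files)
    return files


calls = []  # records which expensive steps actually ran


def fake_download():
    calls.append("download")
    return ["data.csv"]


def fake_upload(files):
    calls.append("upload")


try:
    create_new_archive(Metadata(title="Example dataset"), fake_download, fake_upload)
except AssertionError as err:
    print(f"Failed before any download: {err}")

print(calls)  # stays empty because the check ran first
```

Placing the check at the end instead (for example, validating the finished datapackage) would catch the same problem, but only after the download and upload work had already been done.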

Member

Sorry if this was unclear! I agree that conforming to the datapackage validation expectations is the right way to go. I was worried it was some bespoke thing that we had imposed because of the UUID thing.
