When making a new dataset, assert keywords exist #167
Conversation
```python
if not metadata.keywords:
    raise AssertionError(
        "New dataset is missing keywords and cannot be archived."
    )
```
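For context, a minimal sketch of where such a guard might sit. The names here (`DatasetMetadata`, `create_archive`) are hypothetical stand-ins, not the project's actual API; the point is just that the check fails fast, before any download or upload work begins:

```python
# Hypothetical sketch: fail fast at archive creation, before any
# download/upload work. `DatasetMetadata` and `create_archive` are
# illustrative names, not the project's real interfaces.
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    keywords: list[str] = field(default_factory=list)


def create_archive(metadata: DatasetMetadata) -> str:
    # Guard from the diff above: new archives must define keywords.
    if not metadata.keywords:
        raise AssertionError(
            "New dataset is missing keywords and cannot be archived."
        )
    # ...download, package, and upload would follow here...
    return f"archived {metadata.name}"
```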
All of our datasets should really have keywords, but also this seems like a funny failure. What is it about the way the Datastore is constructed that makes this a breaking failure?
Once Upon A Time we relied on a UUID in the keywords rather than the DOIs to identify related archive lineages. I wonder if this is a holdover from that era somehow.
The archiver and the datastore currently have different validation conditions. On the archiver end an archive without keywords is valid. On the datastore end it is not ingestible. I'm trying to fix this by intervening at the point of creating a totally new archive, where keywords get defined for the first time, though they can later be updated. One other option here is to address the failure on the datastore end - i.e., to not have it fail if the dataset has no keywords.
The datastore relies on the datapackage library and its `.valid` function, which by default asserts that all fields in the datapackage have values. I'd prefer not to override that behavior, since I think it generally makes sense; instead, it's better to bake the error into the process of creating the archive. I suggest doing it here because this check runs before the entire download and upload process, so the failure would be quite fast. Using a datapackage validation method would mean catching this at the end of the archiving process, when the datapackage is uploaded to Zenodo.
Sorry if this was unclear! I agree that conforming to the datapackage validation expectations is the right way to go. I was worried it was some bespoke thing that we had imposed because of the UUID thing.
Current archiving infrastructure allows keywords for Zenodo datasets to be null. This poses a problem when using the datastore. This PR adds an assertion that mandates keywords for new archives to prevent this issue. You can test it by running it on MSHA, PHMSA, or EIA Water data with the `--initialize` flag.