Icebox: Migrate to new Zenodo InvenioRDM API #184

e-belfer · 2023-10-19T14:12:56Z

See #183 for detailed issue description. Zenodo's endpoints for their API have changed, breaking our existing archivers. Let's get them migrated.

Out of scope:

add community to post https://inveniordm.docs.cern.ch/reference/metadata/#future-directions (maybe not yet possible?)

Check that the code works for:

Updates about remaining kinks in the transition:

When testing --auto-publish on the sandbox server, I get an "Attempt to decode JSON with unexpected mimetype: text/html" 404 error. Underlying this is a 504 Gateway Timeout from the sandbox, which also means the tests fail.
The API documentation says to use the /content file link to GET a file. There are two options here: "https://zenodo.org/api/records/10064652/files/eia923-2023.zip/content" and "https://zenodo.org/records/10064652/files/eia923-2023.zip" - presuming in this PR that we prefer the latter?
Sandbox DOIs are unstable and likely to change again. It's still unclear whether the transition from 10.5072 to 10.5281 for sandbox DOIs is permanent. Based on email communication with Zenodo technical support, sandbox DOIs are not actually being registered right now.
The creators field in the response doesn't match the format expected for the creators field on input, so you can't GET a deposition and PUT its metadata back. Updating a single field of a draft record (e.g. the order field for the files dictionary) also seems to wipe all the other metadata values that were entered. This is very annoying behavior, and it means we should avoid updating records by any means other than linking once they're created.
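For reference, the two link shapes from the file-link question above can be sketched as follows. `file_urls` is a hypothetical helper written only for illustration; it is not part of the archiver, and it assumes the URL layout stays as shown in the two example links.

```python
def file_urls(record_id: int, filename: str, base: str = "https://zenodo.org") -> dict[str, str]:
    """Return both candidate download URLs for one file in a published record.

    Hypothetical helper: the URL layout is copied from the two example links
    above and is not guaranteed by Zenodo.
    """
    return {
        # API form: goes through /api and ends in /content.
        "api_content": f"{base}/api/records/{record_id}/files/{filename}/content",
        # Plain form: no /api prefix and no /content suffix.
        "plain": f"{base}/records/{record_id}/files/{filename}",
    }
```

Calling `file_urls(10064652, "eia923-2023.zip")` reproduces the two links quoted above; this PR presumes the `"plain"` one is preferred.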

src/pudl_archiver/depositors/zenodo.py

This reverts commit eb7e52c.

src/pudl_archiver/zenodo/entities.py

jdangerx

So you've successfully made archives in prod. Which is good, and means we have probably one of the few functioning Zenodo API clients in the world.

However, tests are failing, and there are some weird data modelling issues (e.g. DepositionVersion not having a submitted field) that don't feel ready for merge into main.

We're also relying on a lot of undocumented behavior that I'm not sure is necessary, and we can't really tell without documentation or lots of trial and error.

I think we have a few options here:

merge this into main, and make a big cleanup pass later when Zenodo is stabilized
keep this un-merged, wait for Zenodo to stabilize, and then make a cleanup pass before merging; in the meantime, do all manual archive creation on this branch
keep this un-merged, but investigate whether the legacy API integration that's currently on main works again - they had mentioned trying to get it back online and it appears to be at least somewhat back.
refactor this to make a "new zenodo API depositor" that shares the same public API as the legacy zenodo API depositor and lives alongside it (instead of replacing the existing zenodo depositor), so we can switch between the two as Zenodo figures out what's happening

I think that if we can't make prod archives using the legacy API integration on main and we can make prod archives with this version, we should just merge into main and fix things once zenodo calms down a bit.

It would maybe make sense to wait a few more days and see if we can get sandbox tests passing. It looks like the records are now being successfully published on sandbox, even though the publish request returns a 500. So maybe in a few days the sandbox will be less flaky and we can actually try to run these tests.

jdangerx · 2023-11-02T14:55:28Z

src/pudl_archiver/archivers/eia860.py

@@ -20,9 +20,9 @@ async def get_resources(self) -> ArchiveAwaitable:
        """Download EIA-860 resources."""
        link_pattern = re.compile(r"eia860(\d{4})(ER)*.zip")
        for link in await self.get_hyperlinks(BASE_URL, link_pattern):
-            year = link_pattern.search(link).group(1)
+            year = int(link_pattern.search(link).group(1))


Out of scope: we should stick mypy in the pre-commit config so these sorts of type errors don't slip through in the future. This will probably mean fixing a ton of type errors in a separate PR before we can do that.
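As a rough illustration of the year fix in the diff above, here is a minimal typed sketch. `year_from_link` is a hypothetical helper, not something in the archiver, and the file names in the usage note are made up. With an `int` return annotation, a type checker such as mypy would be able to flag a caller that treats the year as a string.

```python
import re

# Same pattern as in the diff above.
link_pattern = re.compile(r"eia860(\d{4})(ER)*.zip")


def year_from_link(link: str) -> int:
    """Extract the four-digit year from an EIA-860 link as an int."""
    match = link_pattern.search(link)
    if match is None:
        # Guard against links that slipped past the hyperlink filter.
        raise ValueError(f"No EIA-860 year found in {link!r}")
    return int(match.group(1))
```

For example, `year_from_link("eia8602022.zip")` and `year_from_link("eia8602022ER.zip")` both give the integer 2022.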

jdangerx · 2023-11-02T14:56:30Z

src/pudl_archiver/cli.py

+    parser.add_argument(
+        "--refresh-metadata",
+        action="store_true",
+        help="Regenerate metadata from PUDL data source rather than existing archived metadata.",


This is the Zenodo metadata, like "creators", "communities", "DOI" etc?

Yes, creators, title, keywords, basically all metadata. The outcome is the same as if we'd run initialize except for version and link to the pre-existing depositions. We need this because there's no other way to migrate old-type depositions, but it's also nice to have in case we ever want to update metadata for an existing archive.

jdangerx · 2023-11-02T14:58:38Z

src/pudl_archiver/frictionless.py

        mt = MEDIA_TYPES[filename.suffix[1:]]
+        # Remove /api, /draft, /content from link to get stable path
+        if "/api" or "/draft" or "/content" in file.links.self:


This seems like a good case for regex...

stable_path = re.sub(r"/(api|draft|content)", "", file.links.self)
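To spell out why the `if` in the diff doesn't do what it looks like, and what a slightly stricter version of the regex could look like, here is a small sketch. The lookahead is my own addition (an assumption about what we want, so that only whole path segments are removed), and the example link shapes in the usage note are illustrative.

```python
import re

# The condition in the diff is always truthy: `or` returns its first truthy
# operand, the non-empty string "/api", before `in` is ever evaluated.
always_true = bool("/api" or "/draft" or "/content" in "no/match/here")


def stable_path(link: str) -> str:
    """Strip the /api, /draft, and /content path segments from a file link."""
    # The lookahead only matches whole segments (followed by "/" or the end
    # of the string), so a segment like "/apiary" is left alone.
    return re.sub(r"/(?:api|draft|content)(?=/|$)", "", link)
```

With this, `stable_path("https://zenodo.org/api/records/10064652/files/eia923-2023.zip/content")` comes out as `https://zenodo.org/records/10064652/files/eia923-2023.zip`, and since `re.sub` is a no-op when nothing matches, the `if` isn't needed at all.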

jdangerx · 2023-11-02T15:11:31Z

src/pudl_archiver/orchestrator.py

+            r"\d{6,7}$",
+            published_deposition.conceptrecid,
+            published_deposition.doi,
+        )
        if self.sandbox:


non-blocking: Our sandbox/prod DOI setup is a little awkward, I wonder if we can stop having to check sandboxiness in so many places.

jdangerx · 2023-11-02T15:12:54Z

src/pudl_archiver/zenodo/entities.py

+                "sponsor",
+                "supervisor",
+                "work package leader",
+            ]:  # Unclear to me what roles are allowed here.


Ugh, that sounds frustrating. How did you figure out this list in the first place?

By manually copying over the list of options on the Zenodo site, essentially.

jdangerx · 2023-11-02T15:13:51Z

src/pudl_archiver/zenodo/entities.py

Wow, our original data model here matched up much better with Zenodo's API! 🪦

jdangerx · 2023-11-02T19:03:11Z

tests/integration/zenodo_depositor_test.py

    at the original."""
-    draft = await depositor.get_new_version(initial_deposition)
-
+    draft = await depositor.get_new_version(initial_deposition, data_source_id="test")


Here, the initial_deposition is a DepositionVersion which doesn't have a submitted field - which then breaks when we check it in get_new_version.

The submitted field isn't anywhere in the docs for the GET /api/records/{id} endpoint, but it does get returned from the API in both the records/{id} endpoint as well as /records/{id}/versions/latest, so we could just add that to DepositionVersion.

But, submitted is not an expected response field, according to the docs.

But, the corresponding response field in the docs, is_published, is actually not being returned from the API.

And there's a third status field which seems to correspond to submitted-ness that doesn't show up in the docs either.

In any case - we should do something so that get_new_version can handle the DepositionVersion response from publish_deposition and the Deposition response. I suppose we can just work off of the actual shape of the data instead of the docs.
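One way to work off the actual shape of the data is sketched below. This is only a sketch under assumptions: `is_published` is a hypothetical helper operating on the raw response dict, and the value `"published"` for the `status` field is a guess, since I haven't confirmed what values that field takes.

```python
from typing import Any, Mapping


def is_published(record: Mapping[str, Any]) -> bool:
    """Best-effort check of whether a Zenodo record response is published."""
    # `submitted` is what the API actually returns today, per the notes above.
    if "submitted" in record:
        return bool(record["submitted"])
    # `is_published` is what the docs describe, in case it starts showing up.
    if "is_published" in record:
        return bool(record["is_published"])
    # ASSUMPTION: "published" as a `status` value is a guess, not from the docs.
    return record.get("status") == "published"
```

Something like this would let get_new_version accept either response shape without DepositionVersion needing a required submitted field.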

jdangerx · 2023-11-02T19:06:43Z

src/pudl_archiver/depositors/zenodo.py

+                    "record": "public",
+                    "files": "public",
+                },
+                "files": {"enabled": True},  # Must always be True for Zenodo


Do we just want this to be True all the time because we want to enable file attachments? Or does it have to be True because Zenodo throws a cryptic error if it's not set to True?

It throws an error if it's not set to True, and this isn't documented anywhere. As far as I can tell it's technically a setting that the network configurator can change to allow metadata-only records, but this doesn't seem to be enabled in Zenodo. This is OK since we only ever want to upload records with files. It also matches the behavior of the old Zenodo API (see the expected failure for the empty_deposition test, e.g.

jdangerx · 2023-11-02T19:10:42Z

src/pudl_archiver/depositors/zenodo.py

        )
+        # Reserve DOI for deposition
+        doi_response = await self.reserve_doi(response)


I can't find anything about this reserve_doi thing in the docs/support emails/github issues - what is this doing and how did you figure it out?

I followed the "reserve_doi" link returned in the list of links for a file. It's not listed anywhere other than being described as an option for developers using InvenioRDM to enable. Made some informed guesses about what it would take based on this publish method that wound up being right, if I remember right.

jdangerx · 2023-11-02T20:11:46Z

src/pudl_archiver/orchestrator.py

@@ -239,7 +244,9 @@ async def run(self) -> RunSummary | Unchanged:
            for name in files_to_delete
        ]

-        self.new_deposition = await self.depositor.get_record(self.new_deposition.id_)
+        self.new_deposition = await self.depositor.get_draft_record(


I'm pleased that the changes to orchestrator.py are so light / the interface to the depositor didn't change much!

Once Zenodo stabilizes and the docs exist, we can maybe think about what the depositor interface should look like going forward... something along the lines of:

create new deposition

update the deposition file contents (all the draft state management might want to be hidden from the outside)

publish the deposition

But - since we're most likely going to be tied to Zenodo for the foreseeable future, and I don't expect them to be changing the API willy-nilly anytime soon, maybe that's not a super useful refactor.
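For what it's worth, the three-step interface listed above could look roughly like the sketch below. Every name here (`Depositor`, `create_deposition`, `update_files`, `publish`, `InMemoryDepositor`, `archive`) is hypothetical and only meant to show the shape; it is not the current depositor API, and the in-memory class exists just so the call sequence can be run.

```python
import asyncio
from pathlib import Path
from typing import Protocol


class Depositor(Protocol):
    """Hypothetical three-step depositor interface; all names are illustrative."""

    async def create_deposition(self, data_source_id: str) -> str:
        """Start a new deposition and return an opaque handle for it."""
        ...

    async def update_files(self, handle: str, files: dict[str, Path]) -> None:
        """Make the deposition contain exactly these files (draft state hidden)."""
        ...

    async def publish(self, handle: str) -> str:
        """Publish the deposition and return its identifier."""
        ...


class InMemoryDepositor:
    """Toy implementation used only to show the call sequence."""

    def __init__(self) -> None:
        self.files: dict[str, dict[str, Path]] = {}
        self.published: set[str] = set()

    async def create_deposition(self, data_source_id: str) -> str:
        handle = f"{data_source_id}-{len(self.files) + 1}"
        self.files[handle] = {}
        return handle

    async def update_files(self, handle: str, files: dict[str, Path]) -> None:
        self.files[handle] = dict(files)

    async def publish(self, handle: str) -> str:
        self.published.add(handle)
        return handle


async def archive(depositor: Depositor, data_source_id: str, files: dict[str, Path]) -> str:
    """Run the create -> update -> publish sequence against any depositor."""
    handle = await depositor.create_deposition(data_source_id)
    await depositor.update_files(handle, files)
    return await depositor.publish(handle)


if __name__ == "__main__":
    print(asyncio.run(archive(InMemoryDepositor(), "test", {"a.zip": Path("a.zip")})))
```

The point of the `Protocol` is that the orchestrator would only depend on these three calls, so a legacy and a new Zenodo depositor could be swapped without touching it.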

jdangerx · 2023-12-19T19:46:14Z

We'll close this for now since we're not actively working on it for a minute and will probably have to redo a bunch of stuff anyways. But, this will be super useful documentation for when they finally do release the new API.

e-belfer added 2 commits October 18, 2023 17:21

WIP broken

39d27c8

more notes on what's not working

939d6ef

e-belfer self-assigned this Oct 19, 2023

e-belfer added inframundo zenodo labels Oct 19, 2023

e-belfer linked an issue Oct 19, 2023 that may be closed by this pull request

Icebox: Move to new Zenodo API #183

Open

e-belfer changed the title ~~Migrate to new Zenodo InvenioRDM API~~ WIP: Migrate to new Zenodo InvenioRDM API Oct 19, 2023

e-belfer added 2 commits October 19, 2023 11:27

Add back version increment, break some other stuff

afbda4b

Fix upload method

8488c66

zaneselvans mentioned this pull request Oct 19, 2023

Second half of 2022 missing from fuel_receipts_costs_aggs_eia table catalyst-cooperative/pudl#2956

Open

e-belfer and others added 13 commits October 24, 2023 09:48

Merge branch 'main' into zenodo-migration

eb284a6

add contributor and license, fix pydantic errors

10a0e09

Fix contributors and work on create_deposition

78016bc

Clean up

0f20caf

Fix Creators call in tests, sandbox API still not working

e9e9f18

Update sandbox dois, add temp dois for wiped archives

3ac4191

Switch to concept dois

eb7e52c

Revert "Switch to concept dois"

d44fc8f

This reverts commit eb7e52c.

Fix concept dois

e268345

Fix depositionmetadata in tests

42750f1

Update archivers to work, tests still failing

f4bcda9

Update method description of ID

a1aab47

Fix concept record id

1775721

e-belfer added 3 commits November 1, 2023 13:22

Add refresh metadata flag, clean up

3a1a268

Fix file uploads

f3e5edb

Fix publisher

be9db31

e-belfer requested review from jdangerx and zschira November 1, 2023 19:18

Fix URL link and 860 archiver

f4105ba

This was referenced Nov 2, 2023

Update sources, DOI and copyright link in PUDL catalyst-cooperative/pudl#3004

Merged

Reorder files in Zenodo archiver #201

Closed

jdangerx reviewed Nov 2, 2023

Add indentation to datapackage.json

86bf3d0

e-belfer mentioned this pull request Nov 8, 2023

Minor fixes to work with new Zenodo backend #192

Merged

e-belfer changed the title ~~WIP: Migrate to new Zenodo InvenioRDM API~~ Icebox: Migrate to new Zenodo InvenioRDM API Nov 22, 2023

e-belfer added the wontfix This will not be worked on label Nov 22, 2023

jdangerx closed this Dec 19, 2023
