Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icebox: Migrate to new Zenodo InvenioRDM API #184

Closed
wants to merge 22 commits into from
Closed

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Oct 19, 2023

See #183 for detailed issue description. Zenodo's endpoints for their API have changed, breaking our existing archivers. Let's get them migrated.

  • update create_deposition
  • update get_deposition
  • update get_record
  • update get_new_version
  • update publish_deposition
  • update delete_deposition
  • update create_file
  • update delete_file
  • update update_file
  • update datapackage.json to use new path
  • Update creators metadata to match new format https://inveniordm.docs.cern.ch/reference/metadata/#creators-1-n
  • fix license https://inveniordm.docs.cern.ch/reference/metadata/#rights-licenses-0-n
  • deal with key names that mask python methods
  • debug delete_deposition
  • update Zenodo sandbox APIs to .5281 prefix and new values
  • check if old API still works - it does not
  • check to make sure new datapackage format for newly created archives is the same as the old version
  • fix tests to expect different response than what is posted for metadata
  • Drop /content extension from datapackage
  • Figure out how to deal with metadata updates from existing production archives - add refresh_metadata flag
  • fix datapackage.json to remove content headers
  • once reviewed: make new sandbox archives for: census_dp1tract, eia176, eia860, eiawater, epacems, ferc60 and update sandbox dois for all data sources

Out of scope:

Check that the code works for:

  • non-flagged update run
  • dataset with partitions
  • --sandbox
  • --dry-run
  • --initialize

Updates about remaining kinks in the transition:

  • When testing --auto-publish on the sandbox server - I get a "Attempt to decode JSON with unexpected mimetype: text/html" 404 error. Underlying this is a 504 Gateway Timeout for the sandbox. This also means the tests fail.
  • The API documentation says to use the /content file link to GET a file. There are two options here: "https://zenodo.org/api/records/10064652/files/eia923-2023.zip/content" and "https://zenodo.org/records/10064652/files/eia923-2023.zip" - presuming in this PR that we prefer the latter?
  • Sandbox DOIs are unstable and likely to change again. It's still unclear whether the transition from 10.5072 to 10.5281 for sandbox DOIs is permanent. Sandbox DOIs are not actually being really registered right now, based on email communication with Zenodo technical support.
  • The creators field response != the creators field expected format. This means you can't GET a deposition and PUT its metadata back. Updating a draft record with one field (e.g. the order field for the files dictionary) also seems to wipe all the rest of the inputted values for the metadata. Very annoying behavior that means we should avoid updating records by any means other than linking once they're created.

@e-belfer e-belfer self-assigned this Oct 19, 2023
@e-belfer e-belfer linked an issue Oct 19, 2023 that may be closed by this pull request
@e-belfer e-belfer changed the title Migrate to new Zenodo InvenioRDM API WIP: Migrate to new Zenodo InvenioRDM API Oct 19, 2023
Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you've successfully made archives in prod. Which is good, and means we have probably one of the few functioning Zenodo API clients in the world.

However, tests are failing, and there are some weird data modelling issues e.g. DepositionVersion not having a submitted field, which don't feel ready for merge into main.

There's also just lots of undocumented behavior that we're using, but I'm not sure is necessary, but we can't really tell without the documentation or lots of trial and error.

I think we have a few options here:

  • merge this into main, and make a big cleanup pass later when Zenodo is stabilized
  • keep this un-merged, wait for Zenodo to stabilize, and then make a cleanup pass before merging; in the meantime, do all manual archive creation on this branch
  • keep this un-merged, but investigate whether the legacy API integration that's currently on main works again - they had mentioned trying to get it back online and it appears to be at least somewhat back.
  • refactor this to make a "new zenodo API depositor" that shares the same public API as the legacy zenodo API depositor and lives alongside it (instead of replacing the existing zenodo depositor), so we can switch between the two as Zenodo figures out what's happening

I think that if we can't make prod archives using the legacy API integration on main and we can make prod archives with this version, we should just merge into main and fix things once zenodo calms down a bit.

It would maybe make sense to wait a few more days and see if we can get sandbox tests passing. It looks like the records are now being successfully published on sandbox, even though the publish request returns a 500. So maybe in a few days the sandbox will be less flaky and we can actually try to run these tests.

@@ -20,9 +20,9 @@ async def get_resources(self) -> ArchiveAwaitable:
"""Download EIA-860 resources."""
link_pattern = re.compile(r"eia860(\d{4})(ER)*.zip")
for link in await self.get_hyperlinks(BASE_URL, link_pattern):
year = link_pattern.search(link).group(1)
year = int(link_pattern.search(link).group(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope: we should stick mypy in the pre-commit config so these sorts of type errors don't slip through in the future. This will probably mean fixing a ton of type errors in a separate PR before we can do that.

parser.add_argument(
"--refresh-metadata",
action="store_true",
help="Regenerate metadata from PUDL data source rather than existing archived metadata.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the Zenodo metadata, like "creators", "communities", "DOI" etc?

Copy link
Member Author

@e-belfer e-belfer Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, creators, title, keywords, basically all metadata. The outcome is the same as if we'd run initialize except for version and link to the pre-existing depositions. We need this because there's no other way to migrate old-type depositions, but it's also nice to have in case we ever want to update metadata for an existing archive.

mt = MEDIA_TYPES[filename.suffix[1:]]
# Remove /api, /draft, /content from link to get stable path
if "/api" or "/draft" or "/content" in file.links.self:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good case for regex...

stable_path = re.sub(r"/(api|draft|content)", "", file.links.self)

r"\d{6,7}$",
published_deposition.conceptrecid,
published_deposition.doi,
)
if self.sandbox:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: Our sandbox/prod DOI setup is a little awkward, I wonder if we can stop having to check sandboxiness in so many places.

"sponsor",
"supervisor",
"work package leader",
]: # Unclear to me what roles are allowed here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, that sounds frustrating. How did you figure out this list in the first place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By manually copying over the list of options on the Zenodo site, essentially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, our original data model here matched up much better with Zenodo's API! 🪦

at the original."""
draft = await depositor.get_new_version(initial_deposition)

draft = await depositor.get_new_version(initial_deposition, data_source_id="test")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the initial_deposition is a DepositionVersion which doesn't have a submitted field - which then breaks when we check it in get_new_version.

The submitted field isn't anywhere in the docs for the GET /api/records/{id} endpoint, but it does get returned from the API in both the records/{id} endpoint as well as /records/{id}/versions/latest, so we could just add that to DepositionVersion.

But, submitted is not an expected response field, according to the docs.

But, the corresponding response field in the docs, is_published, is actually not being returned from the API.

And there's a third status field which seems to correspond to submitted-ness that doesn't show up in the docs either.

In any case - we should do something so that get_new_version can handle the DepositionVersion response from publish_deposition and the Deposition response. I suppose we can just work off of the actual shape of the data instead of the docs.

"record": "public",
"files": "public",
},
"files": {"enabled": True}, # Must always be True for Zenodo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we just want this to be True all the time because we want to enable file attachments? Or does it have to be True because Zenodo throws a cryptic error if it's not set to True?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It throws an error if True, not documented anywhere. As far as I can tell it's technically a setting that the network configurator can change to allow metadata only records but this doesn't seem to be enabled in Zenodo. This is ok since we only ever want to upload records with files. It also matches the behavior of old Zenodo API (see the expected failure for the empty_deposition test, e.g.

)
# Reserve DOI for deposition
doi_response = await self.reserve_doi(response)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find anything about this reserve_doi thing in the docs/support emails/github issues - what is this doing and how did you figure it out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the "reserve_doi" link returned in the list of links for a file. It's not listed anywhere other than being described as an option for developers using InvenioRDM to enable. Made some informed guesses about what it would take based on this publish method that wound up being right, if I remember right.

@@ -239,7 +244,9 @@ async def run(self) -> RunSummary | Unchanged:
for name in files_to_delete
]

self.new_deposition = await self.depositor.get_record(self.new_deposition.id_)
self.new_deposition = await self.depositor.get_draft_record(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pleased that the changes to orchestrator.py are so light / the interface to the depositor didn't change much!

Once Zenodo stabilizes and the docs exist, we can maybe think about what the depositor interface should look like going forward... something along the lines of:

  • create new deposition
  • update the deposition file contents (all the draft state management might want to be hidden from the outside)
  • publish the deposition

But - since we're most likely going to be tied to Zenodo for the forseeable future, and I don't expect them to be changing the API willy-nilly anytime soon, maybe that's not a super useful refactor.

@e-belfer e-belfer changed the title WIP: Migrate to new Zenodo InvenioRDM API Icebox: Migrate to new Zenodo InvenioRDM API Nov 22, 2023
@e-belfer e-belfer added the wontfix This will not be worked on label Nov 22, 2023
@jdangerx
Copy link
Member

We'll close this for now since we're not actively working on it for a minute and will probably have to redo a bunch of stuff anyways. But, this will be super useful documentation for when they finally do release the new API.

@jdangerx jdangerx closed this Dec 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inframundo wontfix This will not be worked on zenodo
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Icebox: Move to new Zenodo API
2 participants