WIP - Fix ORCID ingestion #754
To report progress and speed, I have made a generator wrapper that you could use for that: Line 120 in 96b83f8
The idea is that you can use it with any generator: it returns the same generator, with the additional reporting logic.
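For reference, a minimal sketch of what such a wrapper can look like (the real helper lives at the line referenced above; the name and logging details here are illustrative):

```python
import logging
import time

logger = logging.getLogger(__name__)

def with_speed_report(generator, name='items', log_every=1000):
    """Yield every item of `generator` unchanged, logging throughput
    along the way. Illustrative sketch, not the actual helper."""
    start = time.time()
    count = 0
    for item in generator:
        yield item
        count += 1
        if count % log_every == 0:
            elapsed = max(time.time() - start, 1e-6)
            logger.info('%s: %d processed (%.1f/s)', name, count, count / elapsed)
```

Used as `with_speed_report(os.listdir(summaries_path), name='ORCID profiles')`, it behaves exactly like the wrapped iterable, so it can be dropped into any existing loop.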
The latest commit uses the ORCID API dump to fill in the profile summaries. The remaining issue is the works themselves. We don't use DOIs and Crossref when importing the ORCID dump, so we make a bunch of API calls to ORCID to fetch the works for each profile; these are not included in the JSON conversion of the API dump. They are, however, available in a separate dump, in XML format. Maybe our best bet would be to preprocess this XML dump, convert it to JSON, and do the whole import process offline. I don't have numbers for querying the Crossref API (it is currently disabled for ORCID ingestion), but with all the queries to ORCID to fetch the works, it takes about 1500s to ingest 1k profiles, which is way too long.
Is it useful/expected to create an ORCID profile without any content other than the ORCID ID? https://dissem.in/r/187946/ursula-ellinghaus Excluding the "empty" profiles would cut the number of profile creations in half, if I read the statistics right.
Not sure about the usefulness of having empty profiles. That said, the very costly part here is the requests made to the ORCID API (to fetch works). This does not happen for empty profiles, which are already created without making any HTTP request with this PR, so they basically take milliseconds to create. Given the number of ORCID profiles to process, these milliseconds might matter. But anyway, at the current rate (about 1.5s per profile), 25M ORCID profiles (half of them) would take more than 400 days to ingest :/
Maybe I am too naive here. Importing from this dump once, and then querying the ORCID API on demand if necessary? Maybe with a "last updated" field in our DB?
@beckstefan the ORCID API used to return an entire profile in one HTTP request. That is no longer the case: you need to retrieve publications from them in batches. Yes, it would make sense to import from the dump; it's just annoying that only the XML format (and not JSON) is available now, which means duplicating our parsing code for XML.
In addition to @wetneb's answer: the ORCID API dump provides everything in XML (whereas the live API we use returns JSON). Only part of it (the profile summaries) is usable with the conversion lib https://github.com/ORCID/orcid-conversion-lib/ to convert it to JSON. Therefore, we use the local dump converted to JSON for the first steps of the import (base profile info, work summaries) and then have to query the live API for batches of works. This is the slow part (and not easily runnable in parallel because of API rate limits anyway), so we should use the dump for everything.
Or preprocess the dump. Given its size and the amount of stuff that is useless for us so far (education info, employment info, etc.), I'd rather have a preprocessing chain for extracting/filtering/JSONizing the XML dump.
Yeah, and the easiest way to do this might actually be to extend their utility… :-/
Note to self: it is not really practical to expand the full ORCID activities dump. There are 1100 folders in the summaries dump.
I started expanding the ORCID API dump (filtering on the works only and skipping education, employment, etc., with one log line every 10k files extracted; stopped before finishing). The whole ORCID API dump is then expected to be about 720GB once expanded. The best approach would be to process it folder by folder, I think.
I'll start by writing some quick and dirty script to convert the XML files to JSON.
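A quick-and-dirty converter along those lines could look like this (a sketch assuming the third-party `xmltodict` package; paths and folder layout are placeholders, not the actual dump structure):

```python
import json
import os

import xmltodict  # third-party: pip install xmltodict

def convert_work_to_json(xml_path, json_path):
    """Parse one ORCID work XML file and dump the same tree as JSON."""
    with open(xml_path, 'rb') as f:
        doc = xmltodict.parse(f)
    with open(json_path, 'w') as f:
        json.dump(doc, f)

def convert_folder(src_dir, dst_dir):
    """Convert every .xml file of one extracted dump folder."""
    os.makedirs(dst_dir, exist_ok=True)
    for fname in os.listdir(src_dir):
        if fname.endswith('.xml'):
            convert_work_to_json(os.path.join(src_dir, fname),
                                 os.path.join(dst_dir, fname[:-4] + '.json'))
```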
Fix 'OrcidWork' object has no attribute 'reason' in ORCID dump ingestion, fix dissemin#723. Rework the whole ingestion process to:
- [x] Provide a `manage.py` command
- [x] Fix a bug at ORCID profile creation: it was lacking an ORCID id, therefore sending invalid queries to ORCID, with `None` instead of the ORCID id.
- [x] Use activities-summary from the ORCID API dump instead of making another ORCID API query
- [x] Rework the ingestion code to process the ORCID XML dump of all works.
So, I found a trick to overcome the tar.gz limitations :) One of the main issues was with the top-level directories (both dumps are of the form …). Current run time is about 10s to parse a top-level directory (get a list of members) in Python, and then about 30s to extract the whole top-level directory; we cannot reduce this much (see the sketch after this comment). Once this step is done, all the import is done offline, without any query to ORCID, which should be doable at maximum speed. Looking at a trace of the import running, where "records" stand for ORCID profiles: the numbers are not very high, but this is an issue downstream, in the ES / psql querying, and one that cha and @wetneb are working on at the moment; it affects the other import sources as well. https://sandbox.dissem.in/r/1636/teresa-donateo?page=1 was imported with this code, for instance. In summary, I think this one should be mergeable now. It fixes #723. I'm not sure, though, that importing all of the ORCID dump is yet doable. Rough estimate of the time needed:
Total: 32 days.
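A sketch of the folder-by-folder trick described above, using only the standard library (directory names are placeholders, and the assumption that members of one top-level directory are stored contiguously in the archive is mine, not confirmed by this PR):

```python
import tarfile

def extract_top_level_dir(dump_path, top_dir, dest):
    """Extract one top-level directory without expanding the whole archive.
    Assumes the archive lists each top-level directory contiguously, so we
    can stop scanning once we have passed it (a sketch, not this PR's code)."""
    members = []
    seen = False
    with tarfile.open(dump_path, 'r:gz') as tar:
        for member in tar:
            if member.name.startswith(top_dir + '/'):
                seen = True
                members.append(member)
            elif seen:
                break  # past the directory: stop scanning the archive
        tar.extractall(path=dest, members=members)

# e.g. extract_top_level_dir('/home/orcid/activities.tar.gz', '000', '/tmp/orcid')
```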
How about adding a quick test for the `bulk_import` method? How hard would it be to make a small demo ORCID dump for that? I'm trying to do it for BASE in #760.
Sure, will have a look.
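For what it's worth, a rough sketch of what such a test could look like (module paths, the `bulk_import` signature, and the demo-dump layout are all assumptions, not the actual dissemin API):

```python
import tarfile

import pytest

from backend.orcid import OrcidPaperSource  # assumed import path
from papers.models import Paper

@pytest.mark.django_db
def test_bulk_import(tmp_path):
    # Tiny demo dump: an empty activities tarball plus one profile summary.
    activities = tmp_path / 'activities.tar.gz'
    with tarfile.open(activities, 'w:gz'):
        pass  # a real demo dump would contain a few hand-trimmed work files
    summaries = tmp_path / 'summaries'
    summaries.mkdir()
    # A hand-trimmed profile summary from the real dump would go here.
    (summaries / '0000-0002-1825-0097.json').write_text('{}')

    OrcidPaperSource().bulk_import(str(activities), str(summaries))  # signature assumed
    assert Paper.objects.exists()
```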
```diff
         else:
             put_codes.append(dois_and_putcodes[idx][1])

         # 2nd attempt with ORCID's own crappy metadata
-        works = profile.fetch_works(put_codes)
+        works = profile.fetch_works(
```
Question arising from the developer meeting: when we ingest the ORCID dump, we want to do it offline, so the DOIs are not taken into account and the whole preceding block is skipped. We only use the ORCID metadata (no DOIs), which is incomplete.
So the ORCID dump should probably be imported after the Crossref import is done and up to date. We could then look papers up by DOI (without querying Crossref), since we do have the DOIs.
Any idea on the strategy here?
I feel like we need to rethink the general architecture to handle this in a satisfactory way… If we can rely on the fact that Crossref is already imported, then ORCID should not fetch its metadata a second time from the API, but only add the ORCID information to the authors of the papers (adding an OaiRecord to keep track of the provenance).
But that means moving away from the architecture where each paper importer emits a pure Python object (not linked to the database) representing all the paper metadata, and generic backend code then deals with saving to the DB. We don't currently follow that architecture completely, but it is the general direction I have been pushing towards.
Brainstorming session required!
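To make the discussion concrete, here is a sketch of that pure-object direction, with illustrative names only (this is not the current dissemin code):

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

@dataclass
class PaperStub:
    """Pure Python metadata object, deliberately detached from the DB."""
    title: str
    authors: List[str]
    doi: Optional[str] = None
    orcids: List[Optional[str]] = field(default_factory=list)

def orcid_importer(dump_path: str) -> Iterator[PaperStub]:
    """Each importer only yields metadata; it never touches the database."""
    ...

def save_papers(papers: Iterator[PaperStub]) -> None:
    """Single generic backend: deduplicate (e.g. against already-imported
    Crossref papers), attach ORCID iDs to matching authors, create an
    OaiRecord for provenance, and save."""
    for paper in papers:
        ...
```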
```python
    os.listdir(summaries_path),
    name='ORCID profiles'
)
for summary_file in summaries_files_with_speed:
```
Would it make sense to have a parallel `for` loop here?
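A parallel variant is easy to sketch. This is a sketch only: `ingest_summary` is a hypothetical worker, and with Django each forked worker needs its own database connection, hence the `connections.close_all()` before forking:

```python
import os
from concurrent.futures import ProcessPoolExecutor

from django.db import connections

def ingest_summary(summary_file):
    """Hypothetical per-file worker: parse one profile summary and save it."""
    ...

def ingest_all(summaries_path, workers=4):
    files = os.listdir(summaries_path)
    connections.close_all()  # don't share DB connections across forks
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize keeps inter-process overhead low with many small files
        for _ in pool.map(ingest_summary, files, chunksize=100):
            pass
```

Note that the ORCID API rate limits mentioned earlier would still cap the useful parallelism for any step that queries the live API; the offline dump processing is where parallelism pays off.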
```diff
@@ -0,0 +1,13 @@
+from django.core.management.base import BaseCommand
```
Same should be done for BASE import, see #760 (comment).
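For reference, a minimal skeleton of such a management command (the import path, method name, and argument order are assumptions based on the documented invocation below):

```python
from django.core.management.base import BaseCommand

from backend.orcid import OrcidPaperSource  # assumed import path

class Command(BaseCommand):
    help = 'Ingest the ORCID activities and summaries dumps'

    def add_arguments(self, parser):
        parser.add_argument('activities_path')
        parser.add_argument('summaries_path')

    def handle(self, *args, **options):
        # Delegate to the importer; bulk_import signature assumed.
        OrcidPaperSource().bulk_import(options['activities_path'],
                                       options['summaries_path'])
```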
```rst
``/home/orcid/activities.tar.gz``.

The dumps can then be imported using ``python manage.py ingest_orcid
/home/orcid/activities.tar.gz /home/orcid/summaries``.
```
Add doc about BASE dump import, see #760 (comment).
Alternative view: if we had started 6 months ago, we'd be half done!
Fix 'OrcidWork' object has no attribute 'reason' in ORCID dump ingestion, fix #723.
This is tested and makes the ORCID dump ingestion more robust. I added a bit of logging to have better output and feedback on the progress.
I am currently running it on tulpan in a screen session.
TODO: