This repository has been archived by the owner on Feb 27, 2021. It is now read-only.

WIP - Fix ORCID ingestion #754

Open

Phyks wants to merge 2 commits into dissemin:master from Phyks:orcid-no-reason

Conversation

@Phyks (Member) commented May 9, 2019

Fix 'OrcidWork' object has no attribute 'reason' in ORCID dump
ingestion, fix #723.

This is tested and makes the ORCID dump ingestion more robust. I added a bit of logging to give better output and feedback on progress.

I am currently running it on tulpan in a screen session.

TODO:

  • Some doc.

@wetneb (Member) commented May 9, 2019

To report progress and speed, I have made a generator wrapper that you could use:

def with_speed_report(generator, name=None, report_delay=timedelta(seconds=10)):

The idea is that you can use it with any generator - it returns the same generator, with the additional reporting logic.
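
The body of the wrapper is not shown above; a minimal sketch of what it could look like, matching the signature and the log format seen later in this thread (the actual implementation in dissemin's backend utils may differ):

import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

def with_speed_report(generator, name=None, report_delay=timedelta(seconds=10)):
    # Yield every item unchanged, logging the running count and throughput
    # at most once per report_delay.
    last_time = datetime.utcnow()
    last_count = 0
    for count, item in enumerate(generator, start=1):
        yield item
        elapsed = datetime.utcnow() - last_time
        if elapsed >= report_delay:
            speed = (count - last_count) / elapsed.total_seconds()
            logger.info('%s: %s, %s records/sec', name or '', count, speed)
            last_time = datetime.utcnow()
            last_count = count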

@coveralls

Coverage Status

Coverage decreased (-2.9%) to 75.199% when pulling 62f819c on Phyks:orcid-no-reason into 96b83f8 on dissemin:master.

1 similar comment

@coveralls commented May 9, 2019

Coverage Status

Coverage decreased (-3.2%) to 74.709% when pulling fa42b9b on Phyks:orcid-no-reason into f2dd21d on dissemin:master.

backend/orcid.py: outdated review comment (resolved)
backend/orcid.py: outdated review comment (resolved)
@Phyks changed the title from "Fix ORCID ingestion" to "WIP: Fix ORCID ingestion" on May 9, 2019
@Phyks (Member Author) commented May 9, 2019

The latest commit uses the ORCID API dump to fill in work-summaries, and thereby avoids one API call for empty profiles.

The remaining issue is the works themselves. We don't use DOIs and Crossref when importing the ORCID dump, so we make a bunch of API calls to ORCID to fetch the works for each profile; the works are not included in the JSON conversion of the API dump.

They are, however, available in a separate dump, in XML format. Maybe our best bet would be to preprocess this XML dump, convert it to JSON, and do the whole import process offline.

I don't have numbers for querying the Crossref API (it is currently disabled for ORCID ingestion), but with all the queries to ORCID to fetch the works, it takes about 1500 s to ingest 1k profiles, which is way too long.
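
For illustration, a hedged sketch of how the converted summary can be used to decide whether a profile has any works at all (and hence whether the live API needs to be queried); the JSON layout (activities-summary, works, group, work-summary, put-code) is assumed from the ORCID 2.0 record schema, and the function is hypothetical, not dissemin's actual code:

import json

def put_codes_from_summary(summary_path):
    # Return the put codes of all works listed in a converted profile summary.
    # An empty list means an empty profile: no works, so no API call is needed.
    with open(summary_path) as f:
        record = json.load(f)
    groups = (record.get('activities-summary', {})
                    .get('works', {})
                    .get('group', []))
    return [work['put-code']
            for group in groups
            for work in group.get('work-summary', [])]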

@nemobis (Member) commented May 9, 2019

Is it useful/expected to create an ORCID profile without any content other than the ORCID ID? https://dissem.in/r/187946/ursula-ellinghaus

Excluding the "empty" profiles would cut the number of profile creations in half, if I read the statistics right:

Live ORCID iDs: 6,424,438
iDs with external identifiers (person, org, funding, work, peer review work): 2,651,584

https://orcid.org/statistics

@Phyks (Member Author) commented May 9, 2019

Not sure about the usefulness of having empty profiles. That said, the really costly part here is the requests made to the ORCID API (to fetch works). These do not happen for empty profiles, which with this PR are already created without making any HTTP request, so they basically take milliseconds to create.

Given the number of ORCID profiles to process, these milliseconds might matter. But anyway, 25M ORCID profiles (half of them) with the current setup would take more than 400 days to ingest (at the observed ~1.5 s per profile, 25M profiles amounts to roughly 434 days) :/

@beckstefan (Member) commented:

They are, however, available in a separate dump, in XML format. Maybe our best bet would be to preprocess this XML dump, convert it to JSON, and do the whole import process offline.

Maybe I am too naive here. What about importing from this dump once and then querying the ORCID API on demand when necessary, maybe with a "last updated" field in our DB?
What exactly makes the ORCID API import that slow? 1.5 seconds per profile is a lot.

@wetneb (Member) commented May 10, 2019

@beckstefan the ORCID API used to return an entire profile in one HTTP request. That is no longer the case: you need to retrieve publications from them in batches.

Yes, it would make sense to import from the dump. It's just annoying that only the XML format (and not JSON) is available now, so it means duplicating our parsing code for XML.
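
For context, fetching works in batches means one HTTP call per group of put codes rather than one per work. A minimal sketch of such batched fetching, assuming the ORCID public API v2.0 bulk endpoint /{orcid}/works/{put-codes}, a 100 put-code limit per call, and a {"bulk": [...]} response shape; this is not dissemin's actual fetch_works implementation:

import requests

PUB_API = 'https://pub.orcid.org/v2.0'
BATCH_SIZE = 100  # assumed maximum number of put codes per bulk request

def fetch_works_in_batches(orcid_id, put_codes):
    # One HTTP request per batch of put codes, instead of one per work.
    works = []
    for start in range(0, len(put_codes), BATCH_SIZE):
        batch = put_codes[start:start + BATCH_SIZE]
        url = '{}/{}/works/{}'.format(PUB_API, orcid_id,
                                      ','.join(str(code) for code in batch))
        response = requests.get(url, headers={'Accept': 'application/json'})
        response.raise_for_status()
        works.extend(response.json().get('bulk', []))
    return works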

@Phyks (Member Author) commented May 10, 2019

What exactly makes the ORCID API import that slow? 1.5 seconds per profile is a lot.

In addition to @wetneb's answer: the ORCID API dump provides everything in XML (whereas the live API we use returns JSON). Only part of it (the profile summaries) can be converted to JSON with the conversion lib https://github.com/ORCID/orcid-conversion-lib/. Therefore, we use the local dump converted to JSON for the first steps of the import (base profile info, work summaries) and then have to query the live API for batches of works. This is the slow part (and not easily parallelizable because of API rate limits anyway); we should use the dump for everything.

Yes, it would make sense to import from the dump. It's just annoying that only the XML format (and not JSON) is available now, so it means duplicating our parsing code for XML.

Or preprocessing the dump. Given the size of the dump and the amount of data that is useless for us so far (education info, employment info, etc.), I'd rather have a pre-processing chain for extracting/filtering/JSONizing the XML dump.

@wetneb (Member) commented May 10, 2019

Or preprocessing the dump. Given the size of the dump and the amount of data that is useless for us so far (education info, employment info, etc.), I'd rather have a pre-processing chain for extracting/filtering/JSONizing the XML dump.

Yeah, and the easiest way to do this might actually be to extend their conversion utility… :-/

@Phyks (Member Author) commented May 10, 2019

Note to self: it is not really practical to expand the full ORCID activities dump.

There are 1100 folders in the summaries dump:

$ ls /home/…/summaries | wc -l
1100

Expanding the ORCID activities dump (filtering on works only and skipping education, employment, etc.; one log line every 10k files extracted; stopped before finishing):

$ tar zxvf /home/…/orcidapi2.0_activities_xml_10_2018.tar.gz --wildcards activities/*/*/works/ | awk 'NR == 1 || NR % 10000 == 0'
$ ls activities/| wc -l
120
$ du -h --summarize activities
72GB

The whole ORCID API dump is then expected to be about 720GB once expanded. The best approach would be to process it folder by folder, I think:

  1. Iterate on the folders in the ORCID API summary dump.
  2. For each folder, extract the corresponding activities, pre-process them (convert them from XML to JSON).
  3. Then, ingest the folder in Python, offline. Move on to the next folder.

I'll start by writing some quick and dirty script to convert the XML files to JSON.
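
Such a quick-and-dirty conversion could look roughly like this (a sketch using the third-party xmltodict package; the directory layout and file naming are assumptions, and this is separate from ORCID's own conversion lib used for the summaries):

import json
import os

import xmltodict  # third-party: pip install xmltodict

def convert_works_folder(xml_dir, json_dir):
    # Convert every ORCID work XML file in xml_dir into a JSON file in json_dir.
    os.makedirs(json_dir, exist_ok=True)
    for name in os.listdir(xml_dir):
        if not name.endswith('.xml'):
            continue
        with open(os.path.join(xml_dir, name), 'rb') as xml_file:
            record = xmltodict.parse(xml_file)
        with open(os.path.join(json_dir, name[:-4] + '.json'), 'w') as json_file:
            json.dump(record, json_file)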

@Phyks force-pushed the orcid-no-reason branch 4 times, most recently from 5f2aafe to 5d25981 on May 13, 2019 at 13:51
Fix 'OrcidWork' object has no attribute 'reason' in ORCID dump
ingestion, fix dissemin#723.

Rework the whole ingestion process to:
- [x] Provide a `manage.py` command
- [x] Fix a bug at Orcid profile creation. It was lacking an ORCID id,
therefore sending invalid queries to ORCID, with `None` instead of the
ORCID id.
- [x] Use activities-summary from the ORCID API dump instead of making
another ORCID API query
- [x] Rework the ingestion code to process the ORCID XML dump of all
works.
@Phyks changed the title from "WIP: Fix ORCID ingestion" to "Fix ORCID ingestion" on May 13, 2019
@Phyks (Member Author) commented May 13, 2019

So, I found a trick to overcome the tar.gz limitations :) One of the main issues was that the top-level directories (both dumps are of the form TOP_LEVEL/<ORCID_ID>, with 1100 top-level directories) are sorted in a weird order in the activities (works) dump, which I cannot extract in full anyway. I iterate over this order and match it with the summaries, which should be totally fine now.

Current run time is about 10 s to parse a top-level directory (get a list of members) in Python, and then about 30 s to extract the whole top-level directory. We cannot reduce this much. Once this step is done, the whole import runs offline, without any query to ORCID, so it should run at maximum speed.
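
Not the actual implementation, but a sketch of the trick described above, under the assumption that members of a given top-level directory are stored contiguously in the archive and that paths look like activities/<TOP_LEVEL>/<ORCID_ID>/works/<file>.xml:

import tarfile

def extract_works_by_top_level(activities_tarball, out_dir):
    # Stream the tarball once ('r|gz'), extract only the */works/* members, and
    # yield each top-level directory name once all of its members have been
    # seen, so it can be matched with the corresponding summaries folder.
    current = None
    with tarfile.open(activities_tarball, 'r|gz') as tar:
        member = tar.next()
        while member is not None:
            parts = member.name.split('/')
            top_level = parts[1] if len(parts) > 1 else None
            if top_level != current:
                if current is not None:
                    yield current
                current = top_level
            if '/works/' in member.name and member.isfile():
                tar.extract(member, path=out_dir)
            member = tar.next()
    if current is not None:
        yield current

Each yielded top-level directory name can then drive the offline ingestion of the matching folder in the summaries dump.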

Here is a trace of the import running:

sandbox@tulpan:~/dissemin$ ./.venv/bin/python manage.py ingest_orcid /home/pintoch/orcidapi2.0_activities_xml_10_2018.tar.gz /home/pintoch/summaries/
/home/sandbox/dissemin/.venv/lib/python3.5/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
2019-05-13 17:06:36 INFO dissemin.backend.orcid:339  Extracting folder 79X from activities archive.
2019-05-13 17:06:55 INFO dissemin.backend.utils:138  : 25, 2.409660663244269 records/sec
2019-05-13 17:07:05 INFO dissemin.backend.utils:138  : 61, 3.599820368963589 records/sec
2019-05-13 17:07:17 INFO dissemin.backend.utils:138  : 66, 0.4137381247849079 records/sec
2019-05-13 17:07:20 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0002-3392-079X.json
2019-05-13 17:07:30 INFO dissemin.backend.utils:138  : 104, 2.945755387418384 records/sec
2019-05-13 17:07:39 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:39 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:40 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:40 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:40 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:40 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:42 WARNING dissemin.backend.orcid:252  Total ignored papers: 6
2019-05-13 17:07:42 INFO dissemin.backend.utils:138  : 106, 0.17625940872723786 records/sec
2019-05-13 17:07:44 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0003-1391-279X.json
2019-05-13 17:08:13 INFO dissemin.backend.utils:138  : 122, 0.5120079627478367 records/sec
2019-05-13 17:08:33 INFO dissemin.backend.utils:138  : 141, 0.9288404570833667 records/sec
2019-05-13 17:08:44 INFO dissemin.backend.utils:138  : 167, 2.5277321381697275 records/sec
2019-05-13 17:09:00 INFO dissemin.backend.utils:138  : 178, 0.6751602646333622 records/sec
2019-05-13 17:09:26 INFO dissemin.backend.utils:138  : 189, 0.4213578478818494 records/sec
2019-05-13 17:09:38 INFO dissemin.backend.utils:138  : 198, 0.7554859613080382 records/sec
2019-05-13 17:09:48 INFO dissemin.backend.utils:138  : 239, 4.065844870168153 records/sec
2019-05-13 17:09:58 INFO dissemin.backend.utils:138  : 281, 4.141751238925993 records/sec
2019-05-13 17:10:06 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0003-2711-879X.json
2019-05-13 17:10:14 INFO dissemin.backend.utils:138  : 321, 2.47817210521154 records/sec
2019-05-13 17:10:22 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0003-4072-679X.json
2019-05-13 17:10:25 INFO dissemin.backend.utils:138  : 343, 2.0066571763943784 records/sec
2019-05-13 17:10:41 INFO dissemin.backend.utils:138  : 383, 2.5607607712960228 records/sec
2019-05-13 17:10:51 INFO dissemin.backend.utils:138  : 421, 3.7918868788340467 records/sec
2019-05-13 17:10:56 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0001-8483-179X.json
2019-05-13 17:11:04 INFO dissemin.backend.utils:138  : 468, 3.5839377669795343 records/sec
2019-05-13 17:12:01 INFO dissemin.backend.utils:138  : 484, 0.2807896711098298 records/sec

where "records" here stands for ORCID profiles. The numbers are not very high, but the bottleneck is downstream, in the ES / psql querying; this is an issue cha and @wetneb are working on at the moment, which also affects the other import sources.

https://sandbox.dissem.in/r/1636/teresa-donateo?page=1 was imported with this code for instance.

In summary, I think this one should be mergeable now. It fixes #723. Not sure though that it is yet doable to import all of the ORCID dump.

Rough estimate of the time needed:

  • 1100 * 10 seconds to list all files from the ORCID tar gz dump file.
  • 1100 * 30 seconds to extract them.
  • Each top level directory has about 5k profiles. At an (optimistic) rate of 2 profiles per second, this would take 1100 * 5000 * 0.5 seconds.

Total: about 2.8 million seconds, i.e. roughly 32 days.

@wetneb (Member) left a review comment:

How about adding a quick test for the bulk_import method? How hard would it be to make a small demo ORCID dump for that? I'm trying to do it for BASE in #760.

@Phyks (Member Author) commented May 14, 2019

How about adding a quick test for the bulk_import method?

Sure, will have a look.

else:
put_codes.append(dois_and_putcodes[idx][1])

# 2nd attempt with ORCID's own crappy metadata
works = profile.fetch_works(put_codes)
works = profile.fetch_works(
@Phyks (Member Author) commented:

Question arising from the developer meeting: when we ingest the ORCID dump, we want to do it offline, so we don't take the DOIs into account and all the preceding lines are skipped. We only use ORCID metadata (no DOI), which is incomplete.

So, the ORCID dump should probably be imported after the Crossref import is done and up to date. We could then look up papers by DOI (without queries to Crossref) when we have DOIs as well.

Any idea on the strategy here?

Reply from a Member:

I feel like we need to rethink the general architecture to handle this in a satisfactory way… If we can rely on the fact that Crossref is already imported, then ORCID should not be fetching its metadata a second time from the API, but only adding the ORCID information to the authors of the papers (adding an OaiRecord to keep track of the provenance).

But that means moving away from the architecture where each paper importer emits a pure Python object (not linked to the database) representing all the paper metadata, and generic backend code then deals with saving to the DB. We currently don't follow that architecture completely, but it is the general direction I have been pushing towards.

Brainstorming session required!

os.listdir(summaries_path),
name='ORCID profiles'
)
for summary_file in summaries_files_with_speed:
@Phyks (Member Author) commented:

Would it make sense to have a parallel for loop here?
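
If it were parallelized, a rough sketch with concurrent.futures could look like this (process-based; ingest_profile is a hypothetical per-file function standing in for the loop body, and inherited Django DB connections are closed before forking so each worker reconnects cleanly):

import os
from concurrent.futures import ProcessPoolExecutor

from django import db

def ingest_summaries_in_parallel(summaries_path, ingest_profile, workers=4):
    # Close inherited DB connections so each forked worker opens its own.
    db.connections.close_all()
    files = [os.path.join(summaries_path, name)
             for name in os.listdir(summaries_path)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(ingest_profile, files, chunksize=100):
            pass  # consume the iterator so worker exceptions surface here

Whether this actually helps depends on whether the downstream ES / psql writes are the real bottleneck, as discussed above.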

@@ -0,0 +1,13 @@
from django.core.management.base import BaseCommand
@Phyks (Member Author) commented:

Same should be done for BASE import, see #760 (comment).

``/home/orcid/activities.tar.gz``.

The dumps can then be imported using ``python manage.py ingest_orcid
/home/orcid/activities.tar.gz /home/orcid/summaries``.
@Phyks (Member Author) commented:

Add doc about BASE dump import, see #760 (comment).

@Phyks changed the title from "Fix ORCID ingestion" to "WIP - Fix ORCID ingestion" on May 16, 2019
@nemobis (Member) commented Oct 24, 2019

25M ORCID profiles (half of them) with the current setup would take more than 400 days to ingest :/

Alternative view: if we had started 6 months ago, we'd be half done!

Successfully merging this pull request may close these issues: No reason set for skipped works in ORCID importer (#723).