This repository has been archived by the owner on Feb 27, 2021. It is now read-only.

WIP - Fix ORCID ingestion #754

Open

Phyks wants to merge 2 commits into dissemin:master from Phyks:orcid-no-reason

Conversation

@Phyks (Member) commented May 9, 2019

Fix 'OrcidWork' object has no attribute 'reason' in ORCID dump
ingestion, fix #723.

This is tested and makes the ORCID dump ingestion more robust. I added a bit of logging to give better output and feedback on progress.

I am currently running it on tulpan in a screen session.

TODO:

  • Some doc.

@wetneb (Member) commented May 9, 2019

To report progress and speed, I have made a generator wrapper that you could use:

def with_speed_report(generator, name=None, report_delay=timedelta(seconds=10)):

The idea is that you can use it with any generator - it returns the same generator, with the additional reporting logic.
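
The body of the wrapper is not shown above; a minimal sketch of what it could look like, matching the signature and the log format seen later in this thread (the actual implementation in dissemin's backend utils may differ):

import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

def with_speed_report(generator, name=None, report_delay=timedelta(seconds=10)):
    # Yield every item unchanged, logging the running count and throughput
    # at most once per report_delay.
    last_time = datetime.utcnow()
    last_count = 0
    for count, item in enumerate(generator, start=1):
        yield item
        elapsed = datetime.utcnow() - last_time
        if elapsed >= report_delay:
            speed = (count - last_count) / elapsed.total_seconds()
            logger.info('%s: %s, %s records/sec', name or '', count, speed)
            last_time = datetime.utcnow()
            last_count = count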

@coveralls

Coverage Status

Coverage decreased (-2.9%) to 75.199% when pulling 62f819c on Phyks:orcid-no-reason into 96b83f8 on dissemin:master.

1 similar comment

@coveralls commented May 9, 2019

Coverage Status

Coverage decreased (-3.2%) to 74.709% when pulling fa42b9b on Phyks:orcid-no-reason into f2dd21d on dissemin:master.

backend/orcid.py: outdated review comment (resolved)
backend/orcid.py: outdated review comment (resolved)
@Phyks changed the title from "Fix ORCID ingestion" to "WIP: Fix ORCID ingestion" on May 9, 2019
@Phyks (Member Author) commented May 9, 2019

The latest commit uses the ORCID API dump to fill in work-summaries, and thereby avoids one API call for empty profiles.

The remaining issue is the works themselves. We don't use DOIs and Crossref when importing the ORCID dump, so we make a bunch of API calls to ORCID to fetch the works for each profile; the works are not included in the JSON conversion of the API dump.

They are, however, available in a separate dump, in XML format. Maybe our best bet would be to preprocess this XML dump, convert it to JSON, and do the whole import process offline.

I don't have numbers for querying the Crossref API (it is currently disabled for ORCID ingestion), but with all the queries to ORCID to fetch the works, it takes about 1500 s to ingest 1k profiles, which is way too long.
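
For illustration, a hedged sketch of how the converted summary can be used to decide whether a profile has any works at all (and hence whether the live API needs to be queried); the JSON layout (activities-summary, works, group, work-summary, put-code) is assumed from the ORCID 2.0 record schema, and the function is hypothetical, not dissemin's actual code:

import json

def put_codes_from_summary(summary_path):
    # Return the put codes of all works listed in a converted profile summary.
    # An empty list means an empty profile: no works, so no API call is needed.
    with open(summary_path) as f:
        record = json.load(f)
    groups = (record.get('activities-summary', {})
                    .get('works', {})
                    .get('group', []))
    return [work['put-code']
            for group in groups
            for work in group.get('work-summary', [])]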

@nemobis (Member) commented May 9, 2019

Is it useful/expected to create an ORCID profile without any content other than the ORCID ID? https://dissem.in/r/187946/ursula-ellinghaus

Excluding the "empty" profiles would cut the number of profile creations in half, if I read the statistics right:

Live ORCID iDs: 6,424,438
iDs with external identifiers (person, org, funding, work, peer review work): 2,651,584

https://orcid.org/statistics

@Phyks (Member Author) commented May 9, 2019

Not sure about the usefulness of having empty profiles. That said, the really costly part here is the requests made to the ORCID API (to fetch works). These do not happen for empty profiles, which with this PR are already created without making any HTTP request, so they basically take milliseconds to create.

Given the number of ORCID profiles to process, these milliseconds might matter. But anyway, 25M ORCID profiles (half of them) with the current setup would take more than 400 days to ingest (at the observed ~1.5 s per profile, 25M profiles amounts to roughly 434 days) :/

@beckstefan (Member) commented:

They are, however, available in a separate dump, in XML format. Maybe our best bet would be to preprocess this XML dump, convert it to JSON, and do the whole import process offline.

Maybe I am too naive here. What about importing from this dump once and then querying the ORCID API on demand when necessary, maybe with a "last updated" field in our DB?
What exactly makes the ORCID API import that slow? 1.5 seconds per profile is a lot.

@wetneb (Member) commented May 10, 2019

@beckstefan the ORCID API used to return an entire profile in one HTTP request. That is no longer the case: you need to retrieve publications from them in batches.

Yes, it would make sense to import from the dump. It's just annoying that only the XML format (and not JSON) is available now, so it means duplicating our parsing code for XML.
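
For context, fetching works in batches means one HTTP call per group of put codes rather than one per work. A minimal sketch of such batched fetching, assuming the ORCID public API v2.0 bulk endpoint /{orcid}/works/{put-codes}, a 100 put-code limit per call, and a {"bulk": [...]} response shape; this is not dissemin's actual fetch_works implementation:

import requests

PUB_API = 'https://pub.orcid.org/v2.0'
BATCH_SIZE = 100  # assumed maximum number of put codes per bulk request

def fetch_works_in_batches(orcid_id, put_codes):
    # One HTTP request per batch of put codes, instead of one per work.
    works = []
    for start in range(0, len(put_codes), BATCH_SIZE):
        batch = put_codes[start:start + BATCH_SIZE]
        url = '{}/{}/works/{}'.format(PUB_API, orcid_id,
                                      ','.join(str(code) for code in batch))
        response = requests.get(url, headers={'Accept': 'application/json'})
        response.raise_for_status()
        works.extend(response.json().get('bulk', []))
    return works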

@Phyks (Member Author) commented May 10, 2019

What exactly makes the ORCID API import that slow? 1.5 seconds per profile is a lot.

In addition to @wetneb's answer: the ORCID API dump provides everything in XML (whereas the live API we use returns JSON). Only part of it (the profile summaries) can be converted to JSON with the conversion lib https://github.com/ORCID/orcid-conversion-lib/. Therefore, we use the local dump converted to JSON for the first steps of the import (base profile info, work summaries) and then have to query the live API for batches of works. This is the slow part (and not easily parallelizable because of API rate limits anyway); we should use the dump for everything.

Yes, it would make sense to import from the dump. It's just annoying that only the XML format (and not JSON) is available now, so it means duplicating our parsing code for XML.

Or preprocessing the dump. Given the size of the dump and the amount of data that is useless for us so far (education info, employment info, etc.), I'd rather have a pre-processing chain for extracting/filtering/JSONizing the XML dump.

@wetneb (Member) commented May 10, 2019

Or preprocessing the dump. Given the size of the dump and the amount of data that is useless for us so far (education info, employment info, etc.), I'd rather have a pre-processing chain for extracting/filtering/JSONizing the XML dump.

Yeah, and the easiest way to do this might actually be to extend their conversion utility… :-/

@Phyks (Member Author) commented May 10, 2019

Note to self: it is not really practical to expand the full ORCID activities dump.

There are 1100 folders in the summaries dump:

$ ls /home/…/summaries | wc -l
1100

Expanding the ORCID activities dump (filtering on works only and skipping education, employment, etc.; one log line every 10k files extracted; stopped before finishing):

$ tar zxvf /home/…/orcidapi2.0_activities_xml_10_2018.tar.gz --wildcards activities/*/*/works/ | awk 'NR == 1 || NR % 10000 == 0'
$ ls activities/| wc -l
120
$ du -h --summarize activities
72GB

The whole ORCID API dump is then expected to be about 720GB once expanded. The best approach would be to process it folder by folder, I think:

  1. Iterate on the folders in the ORCID API summary dump.
  2. For each folder, extract the corresponding activities, pre-process them (convert them from XML to JSON).
  3. Then, ingest the folder in Python, offline. Move on to the next folder.

I'll start by writing some quick and dirty script to convert the XML files to JSON.
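
Such a quick-and-dirty conversion could look roughly like this (a sketch using the third-party xmltodict package; the directory layout and file naming are assumptions, and this is separate from ORCID's own conversion lib used for the summaries):

import json
import os

import xmltodict  # third-party: pip install xmltodict

def convert_works_folder(xml_dir, json_dir):
    # Convert every ORCID work XML file in xml_dir into a JSON file in json_dir.
    os.makedirs(json_dir, exist_ok=True)
    for name in os.listdir(xml_dir):
        if not name.endswith('.xml'):
            continue
        with open(os.path.join(xml_dir, name), 'rb') as xml_file:
            record = xmltodict.parse(xml_file)
        with open(os.path.join(json_dir, name[:-4] + '.json'), 'w') as json_file:
            json.dump(record, json_file)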

@Phyks force-pushed the orcid-no-reason branch 4 times, most recently from 5f2aafe to 5d25981 on May 13, 2019 at 13:51
Fix 'OrcidWork' object has no attribute 'reason' in ORCID dump
ingestion, fix dissemin#723.

Rework the whole ingestion process to:
- [x] Provide a `manage.py` command
- [x] Fix a bug at Orcid profile creation. It was lacking an ORCID id,
therefore sending invalid queries to ORCID, with `None` instead of the
ORCID id.
- [x] Use activities-summary from the ORCID API dump instead of making
another ORCID API query
- [x] Rework the ingestion code to process the ORCID XML dump of all
works.
@Phyks changed the title from "WIP: Fix ORCID ingestion" to "Fix ORCID ingestion" on May 13, 2019
@Phyks (Member Author) commented May 13, 2019

So, I found a trick to overcome the tar.gz limitations :) One of the main issues was that the top-level directories (both dumps are of the form TOP_LEVEL/<ORCID_ID>, with 1100 top-level directories) are sorted in a weird order in the activities (works) dump, which I cannot extract in full anyway. I iterate over this order and match it with the summaries, which should be totally fine now.

Current run time is about 10 s to parse a top-level directory (get a list of members) in Python, and then about 30 s to extract the whole top-level directory. We cannot reduce this much. Once this step is done, the whole import runs offline, without any query to ORCID, so it should run at maximum speed.
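
Not the actual implementation, but a sketch of the trick described above, under the assumption that members of a given top-level directory are stored contiguously in the archive and that paths look like activities/<TOP_LEVEL>/<ORCID_ID>/works/<file>.xml:

import tarfile

def extract_works_by_top_level(activities_tarball, out_dir):
    # Stream the tarball once ('r|gz'), extract only the */works/* members, and
    # yield each top-level directory name once all of its members have been
    # seen, so it can be matched with the corresponding summaries folder.
    current = None
    with tarfile.open(activities_tarball, 'r|gz') as tar:
        member = tar.next()
        while member is not None:
            parts = member.name.split('/')
            top_level = parts[1] if len(parts) > 1 else None
            if top_level != current:
                if current is not None:
                    yield current
                current = top_level
            if '/works/' in member.name and member.isfile():
                tar.extract(member, path=out_dir)
            member = tar.next()
    if current is not None:
        yield current

Each yielded top-level directory name can then drive the offline ingestion of the matching folder in the summaries dump.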

Here is a trace of the import running:

sandbox@tulpan:~/dissemin$ ./.venv/bin/python manage.py ingest_orcid /home/pintoch/orcidapi2.0_activities_xml_10_2018.tar.gz /home/pintoch/summaries/
/home/sandbox/dissemin/.venv/lib/python3.5/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
2019-05-13 17:06:36 INFO dissemin.backend.orcid:339  Extracting folder 79X from activities archive.
2019-05-13 17:06:55 INFO dissemin.backend.utils:138  : 25, 2.409660663244269 records/sec
2019-05-13 17:07:05 INFO dissemin.backend.utils:138  : 61, 3.599820368963589 records/sec
2019-05-13 17:07:17 INFO dissemin.backend.utils:138  : 66, 0.4137381247849079 records/sec
2019-05-13 17:07:20 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0002-3392-079X.json
2019-05-13 17:07:30 INFO dissemin.backend.utils:138  : 104, 2.945755387418384 records/sec
2019-05-13 17:07:39 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:39 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:40 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:40 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:40 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:40 WARNING dissemin.backend.orcid:242  Work skipped due to incorrect metadata.
 NO_TITLE
2019-05-13 17:07:42 WARNING dissemin.backend.orcid:252  Total ignored papers: 6
2019-05-13 17:07:42 INFO dissemin.backend.utils:138  : 106, 0.17625940872723786 records/sec
2019-05-13 17:07:44 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0003-1391-279X.json
2019-05-13 17:08:13 INFO dissemin.backend.utils:138  : 122, 0.5120079627478367 records/sec
2019-05-13 17:08:33 INFO dissemin.backend.utils:138  : 141, 0.9288404570833667 records/sec
2019-05-13 17:08:44 INFO dissemin.backend.utils:138  : 167, 2.5277321381697275 records/sec
2019-05-13 17:09:00 INFO dissemin.backend.utils:138  : 178, 0.6751602646333622 records/sec
2019-05-13 17:09:26 INFO dissemin.backend.utils:138  : 189, 0.4213578478818494 records/sec
2019-05-13 17:09:38 INFO dissemin.backend.utils:138  : 198, 0.7554859613080382 records/sec
2019-05-13 17:09:48 INFO dissemin.backend.utils:138  : 239, 4.065844870168153 records/sec
2019-05-13 17:09:58 INFO dissemin.backend.utils:138  : 281, 4.141751238925993 records/sec
2019-05-13 17:10:06 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0003-2711-879X.json
2019-05-13 17:10:14 INFO dissemin.backend.utils:138  : 321, 2.47817210521154 records/sec
2019-05-13 17:10:22 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0003-4072-679X.json
2019-05-13 17:10:25 INFO dissemin.backend.utils:138  : 343, 2.0066571763943784 records/sec
2019-05-13 17:10:41 INFO dissemin.backend.utils:138  : 383, 2.5607607712960228 records/sec
2019-05-13 17:10:51 INFO dissemin.backend.utils:138  : 421, 3.7918868788340467 records/sec
2019-05-13 17:10:56 WARNING dissemin.backend.orcid:380  Invalid profile: /home/pintoch/summaries/79X/0000-0001-8483-179X.json
2019-05-13 17:11:04 INFO dissemin.backend.utils:138  : 468, 3.5839377669795343 records/sec
2019-05-13 17:12:01 INFO dissemin.backend.utils:138  : 484, 0.2807896711098298 records/sec

where "records" here stands for ORCID profiles. The numbers are not very high, but the bottleneck is downstream, in the ES / psql querying; this is an issue cha and @wetneb are working on at the moment, which also affects the other import sources.

https://sandbox.dissem.in/r/1636/teresa-donateo?page=1 was imported with this code for instance.

In summary, I think this one should be mergeable now. It fixes #723. Not sure though that it is yet doable to import all of the ORCID dump.

Rough estimate of the time needed:

  • 1100 * 10 seconds to list all files from the ORCID tar gz dump file.
  • 1100 * 30 seconds to extract them.
  • Each top level directory has about 5k profiles. At an (optimistic) rate of 2 profiles per second, this would take 1100 * 5000 * 0.5 seconds.

Total: about 2.8 million seconds, i.e. roughly 32 days.

@wetneb (Member) left a review comment:

How about adding a quick test for the bulk_import method? How hard would it be to make a small demo ORCID dump for that? I'm trying to do it for BASE in #760.

@Phyks (Member Author) commented May 14, 2019

How about adding a quick test for the bulk_import method?

Sure, will have a look.

else:
put_codes.append(dois_and_putcodes[idx][1])

# 2nd attempt with ORCID's own crappy metadata
works = profile.fetch_works(put_codes)
works = profile.fetch_works(
@Phyks (Member Author) commented:

Question arising from the developer meeting: when we ingest the ORCID dump, we want to do it offline, so we don't take the DOIs into account and all the preceding lines are skipped. We only use ORCID metadata (no DOI), which is incomplete.

So, the ORCID dump should probably be imported after the Crossref import is done and up to date. We could then look up papers by DOI (without queries to Crossref) when we have DOIs as well.

Any idea on the strategy here?

Reply from a Member:

I feel like we need to rethink the general architecture to handle this in a satisfactory way… If we can rely on the fact that Crossref is already imported, then ORCID should not be fetching its metadata a second time from the API, but only adding the ORCID information to the authors of the papers (adding an OaiRecord to keep track of the provenance).

But that means moving away from the architecture where each paper importer emits a pure Python object (not linked to the database) representing all the paper metadata, and generic backend code then deals with saving to the DB. We currently don't follow that architecture completely, but it is the general direction I have been pushing towards.

Brainstorming session required!

os.listdir(summaries_path),
name='ORCID profiles'
)
for summary_file in summaries_files_with_speed:
@Phyks (Member Author) commented:

Would it make sense to have a parallel for loop here?
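
If it were parallelized, a rough sketch with concurrent.futures could look like this (process-based; ingest_profile is a hypothetical per-file function standing in for the loop body, and inherited Django DB connections are closed before forking so each worker reconnects cleanly):

import os
from concurrent.futures import ProcessPoolExecutor

from django import db

def ingest_summaries_in_parallel(summaries_path, ingest_profile, workers=4):
    # Close inherited DB connections so each forked worker opens its own.
    db.connections.close_all()
    files = [os.path.join(summaries_path, name)
             for name in os.listdir(summaries_path)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(ingest_profile, files, chunksize=100):
            pass  # consume the iterator so worker exceptions surface here

Whether this actually helps depends on whether the downstream ES / psql writes are the real bottleneck, as discussed above.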

@@ -0,0 +1,13 @@
from django.core.management.base import BaseCommand
@Phyks (Member Author) commented:

Same should be done for BASE import, see #760 (comment).

``/home/orcid/activities.tar.gz``.

The dumps can then be imported using ``python manage.py ingest_orcid
/home/orcid/activities.tar.gz /home/orcid/summaries``.
@Phyks (Member Author) commented:

Add doc about BASE dump import, see #760 (comment).

@Phyks changed the title from "Fix ORCID ingestion" to "WIP - Fix ORCID ingestion" on May 16, 2019
@nemobis (Member) commented Oct 24, 2019

25M ORCID profiles (half of them) with the current setup would take more than 400 days to ingest :/

Alternative view: if we had started 6 months ago, we'd be half done!

Successfully merging this pull request may close these issues: No reason set for skipped works in ORCID importer (#723).