[API Integration] Wellcome Collection #243

annatuma · 2020-01-13T21:51:09Z

Provider API Endpoint / Documentation

https://developers.wellcomecollection.org

Provider description

Most of the metadata we need is readily available (the license, attribution info, a link, a thumbnail, etc.). They have something we could use for the description meta_data field (which we like for search indexing).

Two considerations to look into further prior to integration:

Not much by way of tags.
Unclear if/how we can get only the newest data (vs having to pull the entire DB for every sync, which would mean less frequent syncs).

SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION

Licenses Provided

Provider API Technical info

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). Calling main should do the same thing as
running the script from the CLI.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).
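
As a rough illustration of the --date and main(date) recommendations above, here is a minimal sketch of how the CLI entry point and the importable main function could share one code path. The helper names valid_date, parse_args, and run_from_cli are hypothetical, and the body of main is a placeholder: a real script would use ImageStore and DelayedRequester there, which are left out so the example stays self-contained.

```python
import argparse
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"


def valid_date(text):
    """Return text unchanged if it is a strict YYYY-MM-DD date."""
    try:
        parsed = datetime.strptime(text, DATE_FORMAT)
    except ValueError:
        parsed = None
    # Round-tripping rejects loosely formatted input such as 2018-1-1.
    if parsed is None or parsed.strftime(DATE_FORMAT) != text:
        raise argparse.ArgumentTypeError(
            f"expected YYYY-MM-DD, got {text!r}"
        )
    return text


def parse_args(argv=None):
    """Parse CLI arguments; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Collect images for a given date."
    )
    parser.add_argument(
        "--date",
        required=True,
        type=valid_date,
        help="Date to collect images for, in YYYY-MM-DD form",
    )
    return parser.parse_args(argv)


def main(date):
    """Entry point shared by the CLI and by code importing the module."""
    # Placeholder only: a real script would request and store records here.
    return f"Would collect images for {date}"


def run_from_cli(argv=None):
    """Do exactly what main(date) does, with date taken from the CLI."""
    args = parse_args(argv)
    return main(args.date)


print(run_from_cli(["--date", "2018-01-01"]))
```

In an actual script, the module would end with the usual `if __name__ == "__main__":` guard calling run_from_cli(), so that running the file and calling main(date) behave the same way.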

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.


ChariniNana · 2020-03-17T11:03:26Z

@mathemancer Can I work on this as part of my GSoC project?

mathemancer · 2020-03-19T11:02:24Z

@ChariniNana Yes, but before beginning development, it's important to answer the following:

What licenses do they use, and how can we find that info?
What metadata do they provide, and how well does it match up with the interface of the ImageStore class?
@annatuma mentioned that it's not clear how to pull just the info updated on a given day from their API. Can you confirm this? What is the total volume of the collection? Is it feasible to pull the whole thing, e.g., monthly?
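
To make the feasibility question concrete, a back-of-the-envelope estimate like the one below may help. It is only a sketch: the function name is hypothetical, and the record count, page size, and per-request delay are made-up figures, not measurements of the Wellcome Collection API.

```python
import math


def estimate_full_pull_hours(total_records, page_size, seconds_per_request):
    """Estimate wall-clock hours needed to page through a whole collection."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # One request per page, rounding up for a final partial page.
    requests_needed = math.ceil(total_records / page_size)
    return requests_needed * seconds_per_request / 3600


# Hypothetical figures, for illustration only.
hours = estimate_full_pull_hours(
    total_records=1_000_000, page_size=100, seconds_per_request=1.0
)
print(f"{hours:.2f} hours")
```

With the real totals plugged in, the result gives a first indication of whether a full monthly pull is realistic.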

ChariniNana · 2020-03-24T11:45:14Z

They seem to provide images under different licenses; the license details are provided in the JSON response as follows (please try the GET request https://api.wellcomecollection.org/catalogue/v2/works/mtkdctvn):

"license": {
    "id": "cc-by",
    "label": "Attribution 4.0 International (CC BY 4.0)",
    "type": "License",
    "url": "http://creativecommons.org/licenses/by/4.0/"
}
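
For what it's worth, the license type and version could probably be derived from the url field of that object. The following is a minimal sketch under that assumption; parse_cc_license_url is a hypothetical helper, not something that exists in the catalog code, and it only handles URLs of the creativecommons.org/licenses/<license>/<version>/ form.

```python
from urllib.parse import urlparse

CC_HOSTS = ("creativecommons.org", "www.creativecommons.org")


def parse_cc_license_url(url):
    """Split a Creative Commons license URL into (license, version).

    Returns (None, None) when the URL does not have the expected shape.
    """
    parsed = urlparse(url)
    parts = [part for part in parsed.path.split("/") if part]
    if (
        parsed.netloc.lower() not in CC_HOSTS
        or len(parts) < 3
        or parts[0] != "licenses"
    ):
        return None, None
    return parts[1], parts[2]


# The license object quoted from the API response above.
license_block = {
    "id": "cc-by",
    "label": "Attribution 4.0 International (CC BY 4.0)",
    "type": "License",
    "url": "http://creativecommons.org/licenses/by/4.0/",
}
print(parse_cc_license_url(license_block["url"]))  # ('by', '4.0')
```

Public domain dedications use a different URL path and are deliberately not handled in this sketch.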

Apart from the image description, I cannot see anything in the response that could go into the meta_data field (it does not return information such as date created, views, etc.).

I need to explore a bit more to find out whether it's possible to pull just the info updated on a given day (there is no obvious method).

mathemancer · 2020-03-25T09:52:39Z

@ChariniNana It's okay if that's the case, as long as the overall volume isn't too large, and the speed is fast enough. The main thing is to have some strategy to get all the data over time.
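
One possible shape for such a strategy, if no date filter turns up, is to walk the whole result set page by page. The sketch below assumes each page of the response is a JSON object with a "results" list and a "nextPage" URL that is absent on the last page; those field names are assumptions to verify against the API docs. The fetch_json argument is a hypothetical stand-in for whatever performs the request (in a real script, probably something built on DelayedRequester).

```python
def iter_all_works(first_url, fetch_json):
    """Yield every record, following next-page links until none remain.

    fetch_json is any callable that maps a URL to a decoded JSON dict.
    """
    url = first_url
    seen = set()
    # Stop when there is no next page, or if a page links back to one
    # already visited (guards against an endless loop).
    while url and url not in seen:
        seen.add(url)
        page = fetch_json(url)
        yield from page.get("results", [])
        url = page.get("nextPage")  # assumed field name


# Fake two-page response, standing in for real HTTP calls.
fake_pages = {
    "page-1": {"results": [{"id": "a"}, {"id": "b"}], "nextPage": "page-2"},
    "page-2": {"results": [{"id": "c"}]},
}
ids = [work["id"] for work in iter_all_works("page-1", fake_pages.__getitem__)]
print(ids)  # ['a', 'b', 'c']
```

Injecting the fetch function keeps the paging logic small and testable without network access.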

mathemancer · 2020-03-25T09:54:04Z

@annatuma I'm a bit concerned about this line in their developer docs:

There are some licensing restrictions, as different parts of the data may have different licenses. If it’s data that has been created by us, it’s CC0; if it’s not created by us, then it isn’t. We are working to make data licensing clear on a per work basis; in the meantime, if this is a concern, please get in touch.
(emphasis mine)

Have we been in touch with them regarding that?

allen505 · 2020-03-30T01:31:25Z

In this section, it says that Europeana has used their APIs to include the images in Europeana. If this is the case, there could be data duplication once Europeana is integrated into cc-search.

mathemancer · 2020-04-01T08:06:45Z

@allen505 It's totally possible to have data duplication with our current setup. We'd need to choose on a case-by-case basis whether to keep the data from Europeana (if it's usefully enriched, for example), or from the upstream provider. Does the Europeana API give an ID of the upstream provider? That would make future deduplication comparatively easy.

allen505 · 2020-04-02T04:11:18Z

@mathemancer The provider field in the response gives the value of the Provider which should be
"provider": [ "Wellcome Collection" ] in this case.

The following query gives all the items which belong to Wellcome Collection:
https://www.europeana.eu/api/v2/search.json?wskey=API_KEY&query=*:*&qf=PROVIDER:%22Wellcome+Collection%22
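
As a small sketch of how this could be used for deduplication later, the query URL above can be built from its parts, and items can be checked against the provider field. Both function names are hypothetical, and the item shape (a "provider" key holding a list of names) is taken only from the snippet quoted above.

```python
from urllib.parse import urlencode

EUROPEANA_SEARCH_URL = "https://www.europeana.eu/api/v2/search.json"


def build_provider_query_url(api_key, provider_name):
    """Build a search URL for all items from one upstream provider."""
    params = {
        "wskey": api_key,
        "query": "*:*",
        "qf": f'PROVIDER:"{provider_name}"',
    }
    # Keep '*' and ':' unescaped so the URL matches the form shown above.
    query_string = urlencode(params, safe="*:")
    return f"{EUROPEANA_SEARCH_URL}?{query_string}"


def is_from_provider(item, provider_name):
    """True if the item's 'provider' list names the given provider."""
    return provider_name in item.get("provider", [])


print(build_provider_query_url("API_KEY", "Wellcome Collection"))
print(is_from_provider(
    {"provider": ["Wellcome Collection"]}, "Wellcome Collection"
))
```

Items for which is_from_provider returns True would be the candidates to compare against records pulled directly from Wellcome.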

mathemancer · 2020-04-03T19:42:01Z

That's awesome. @annatuma I believe I remember that the folks at Europeana said that some of their providers were more reliable than others when it came to license labeling. Do we have records about which providers those are?

We could use the same 'provider' vs 'source' scheme that we do for commoncrawl-sourced images for these aggregators.

@allen505 This is great info, thanks!

annatuma · 2020-06-04T11:40:14Z

@mathemancer sorry I missed this - we don't have information on that, but we'll check the record count for Wellcome once our Europeana integration is live, and ensure we're getting a full collection.

annatuma created this issue from a note in CC Catalog Pipeline (Prioritized) Jan 13, 2020

annatuma added this to To Be Prioritized in Backlog via automation Jan 13, 2020

annatuma moved this from To Be Prioritized to Q2 2020 in Backlog Jan 13, 2020

mathemancer added the not ready for work label Feb 19, 2020

annatuma added ticket work required and removed not ready for work labels Feb 24, 2020

annatuma changed the title ~~Wellcome Collection~~ [API Integration] Wellcome Collection Feb 24, 2020

annatuma added enhancement providers labels Feb 24, 2020

kgodey removed this from Q2 2020 in Backlog Feb 28, 2020

annatuma added the blocked label Jun 4, 2020

kgodey added 🚧 status: blocked Blocked & therefore, not ready for work ✨ goal: improvement Improvement to an existing feature 🧹 status: ticket work required Needs more details before it can be worked on and removed blocked labels Sep 22, 2020

cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020

kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey added this to Pending Review in Backlog Dec 2, 2020

kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020

kgodey closed this as completed Dec 16, 2020

kgodey moved this from Pending Review to Done in Backlog Dec 16, 2020

obulat mentioned this issue Apr 17, 2023

Wellcome Collection WordPress/openverse#1753

Open

3 tasks

TimidRobot removed this from New Integrations (Ready for Work) in CC Catalog Pipeline Jan 12, 2022
