This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

[API Integration] Wellcome Collection #243

Closed
annatuma opened this issue Jan 13, 2020 · 10 comments
Labels
✨ goal: improvement Improvement to an existing feature providers 🙅 status: discontinued Not suitable for work as repo is in maintenance

Comments

@annatuma
Contributor

annatuma commented Jan 13, 2020

Provider API Endpoint / Documentation

https://developers.wellcomecollection.org

Provider description

Most of the metadata we need is readily available (the license, attribution info, a link, a thumbnail, etc.). They have something we could use for the description meta_data field (which we like for search indexing).

Two considerations to look into further prior to integration:

  1. Not much by way of tags.
  2. Unclear if/how we can get only the newest data (vs having to pull the entire DB for every sync, which would mean less frequent syncs).

SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION

Licenses Provided

Provider API Technical info

General Recommendations for implementation

  • The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
  • The script should have a test suite in the same directory.
  • The script must use the ImageStore class (Import this from
    src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
  • The script should use the DelayedRequester class (Import this from
    src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
  • The script must not use anything from
    src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
    that module is deprecated.
  • If the provider API can be queried by 'upload date' or something similar,
    the script should take a --date parameter when run as a script, giving the
    date for which we should collect images. The form should be YYYY-MM-DD (so
    the script can be run via python my_favorite_provider.py --date 2018-01-01).
  • The script must provide a main function that takes the same parameters as
    the CLI. In our example from above, we'd then have a main function
    my_favorite_provider.main(date). The main function should do the same thing
    that calling the script from the CLI would do.
  • The script must conform to PEP8. Please use pycodestyle (available via
    pip install pycodestyle) to check for compliance.
  • The script should use small, testable functions.
  • The test suite for the script may break PEP8 rules regarding long lines where
    appropriate (e.g., long strings for testing).
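
As a rough illustration of the --date and main(date) recommendations above, a provider script skeleton might look like the following. This is only a sketch: the names run_from_cli and _normalized_date are hypothetical, the body of main is a placeholder, and the ImageStore and DelayedRequester classes are deliberately left out because their interfaces are not described in this issue.

```python
import argparse
from datetime import datetime


def _normalized_date(text):
    """Parse a YYYY-MM-DD string and return it in zero-padded ISO form."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected a date of the form YYYY-MM-DD, got {text!r}"
        ) from None
    return parsed.date().isoformat()


def main(date):
    """Collect images for the given date (placeholder body only)."""
    # A real script would query the provider API for this date and hand
    # each image to the shared storage class; that part is omitted here.
    print(f"Would collect images for {date}")
    return date


def run_from_cli(argv=None):
    """Parse CLI arguments and call main with the same parameters."""
    parser = argparse.ArgumentParser(description="Hypothetical provider script")
    parser.add_argument(
        "--date",
        type=_normalized_date,
        required=True,
        help="date to collect images for, as YYYY-MM-DD",
    )
    args = parser.parse_args(argv)
    return main(args.date)
```

A real script would call run_from_cli() under the usual `if __name__ == "__main__":` guard, so that `python my_favorite_provider.py --date 2018-01-01` and `my_favorite_provider.main("2018-01-01")` do the same work.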

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

  • src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
  • src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
  • src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
  • src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.
@annatuma annatuma created this issue from a note in CC Catalog Pipeline (Prioritized) Jan 13, 2020
@annatuma annatuma added this to To Be Prioritized in Backlog via automation Jan 13, 2020
@annatuma annatuma moved this from To Be Prioritized to Q2 2020 in Backlog Jan 13, 2020
@annatuma annatuma changed the title Wellcome Collection [API Integration] Wellcome Collection Feb 24, 2020
@kgodey kgodey removed this from Q2 2020 in Backlog Feb 28, 2020
@ChariniNana
Contributor

@mathemancer Can I work on this as part of my GSoC project?

@mathemancer
Contributor

@ChariniNana Yes, but before beginning development, it's important to answer the following:

  • What licenses do they use, and how can we find that info?
  • What metadata do they provide, and how well does it match up with the interface of the ImageStore class?
  • @annatuma mentioned that it's not clear how to pull just the info updated on a given day from their API. Can you confirm this? What is the total volume of the collection? Is it feasible to pull the whole thing, e.g., monthly?

@ChariniNana
Contributor

They seem to provide images under different licenses, and the license details are provided in the JSON response as follows (please try the GET request https://api.wellcomecollection.org/catalogue/v2/works/mtkdctvn):

"license": {
  "id": "cc-by",
  "label": "Attribution 4.0 International (CC BY 4.0)",
  "type": "License",
  "url": "http://creativecommons.org/licenses/by/4.0/"
}
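
To make the license lookup concrete, here is a minimal sketch that pulls every license object out of a parsed response. Where the "license" object sits inside the full response is not shown above, so the sketch walks the whole structure and collects any object whose "type" is "License". The find_licenses helper and the nesting in the sample are assumptions for illustration only.

```python
def find_licenses(node):
    """Recursively collect every dict whose "type" field is "License"."""
    found = []
    if isinstance(node, dict):
        if node.get("type") == "License":
            found.append(node)
        for value in node.values():
            found.extend(find_licenses(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_licenses(item))
    return found


# Hypothetical response shape: only the inner "license" object is taken
# from the API output quoted above; the wrapping keys are made up.
sample_response = {
    "id": "mtkdctvn",
    "wrapper": [
        {
            "license": {
                "id": "cc-by",
                "label": "Attribution 4.0 International (CC BY 4.0)",
                "type": "License",
                "url": "http://creativecommons.org/licenses/by/4.0/",
            }
        }
    ],
}

licenses = find_licenses(sample_response)
license_urls = [entry["url"] for entry in licenses]
```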

Apart from the image description, I can't see anything in the response that could go into the meta_data field (it doesn't return information such as date created, views, etc.).

I need to explore a bit more to find out whether it's possible to pull just the info updated on a given day (there is no obvious method).

@mathemancer
Contributor

@ChariniNana It's okay if that's the case, as long as the overall volume isn't too large, and the speed is fast enough. The main thing is to have some strategy to get all the data over time.
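
One possible shape for such a strategy, sketched under assumptions: walk the collection page by page until an empty page comes back. The fetch_page callable, the 1-based page numbers, and the "results" key are all hypothetical stand-ins, not confirmed details of the Wellcome API; a real script would do the HTTP request (with rate limiting) inside fetch_page.

```python
def iter_all_records(fetch_page, first_page=1):
    """Yield every record, requesting pages until one comes back empty.

    fetch_page is any callable that takes a page number and returns the
    parsed JSON body for that page as a dict.
    """
    page = first_page
    while True:
        body = fetch_page(page)
        # "results" is an assumed key name for the list of records.
        results = body.get("results", [])
        if not results:
            break
        yield from results
        page += 1


# Stand-in for the real API, used only to demonstrate the loop.
_FAKE_PAGES = {
    1: {"results": ["work-a", "work-b"]},
    2: {"results": ["work-c"]},
}


def fake_fetch_page(page):
    """Return a canned page, or an empty page past the end."""
    return _FAKE_PAGES.get(page, {"results": []})


all_records = list(iter_all_records(fake_fetch_page))
```

Because the generator yields records as each page arrives, a full pull can be interrupted and the volume measured before committing to a sync schedule.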

@mathemancer
Contributor

@annatuma I'm a bit concerned about this line in their developer docs:

There are some licensing restrictions, as different parts of the data may have different licenses. If it’s data that has been created by us, it’s CC0; if it’s not created by us, then it isn’t. We are working to make data licensing clear on a per work basis; in the meantime, if this is a concern, please get in touch.
(emphasis mine)

Have we been in touch with them regarding that?

@allen505
Contributor

In this section, it says that Europeana has used their APIs to include the images in Europeana. If that's the case, there could be duplicate data once Europeana is integrated into cc-search.

@mathemancer
Contributor

@allen505 It's totally possible to have data duplication with our current setup. We'd need to choose on a case-by-case basis whether to keep the data from Europeana (if it's usefully enriched, for example), or from the upstream provider. Does the Europeana API give an ID of the upstream provider? That would make future deduplication comparatively easy.

@allen505
Contributor

allen505 commented Apr 2, 2020

@mathemancer The provider field in the response gives the value of the Provider which should be
"provider": [ "Wellcome Collection" ] in this case.

The following query gives all the items which belong to Wellcome Collection:
https://www.europeana.eu/api/v2/search.json?wskey=API_KEY&query=*:*&qf=PROVIDER:%22Wellcome+Collection%22
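
For reference, the same query can be assembled in code so the provider name gets escaped automatically. This is a small sketch using only the standard library; build_provider_query is a hypothetical helper, and "API_KEY" is a placeholder as in the URL above. The encoder also percent-encodes the `*:*` query, which should be equivalent to the literal form.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EUROPEANA_SEARCH = "https://www.europeana.eu/api/v2/search.json"


def build_provider_query(api_key, provider):
    """Build a search URL restricted to one PROVIDER value."""
    params = {
        "wskey": api_key,
        "query": "*:*",
        # The provider name is wrapped in double quotes, as in the URL above.
        "qf": f'PROVIDER:"{provider}"',
    }
    return f"{EUROPEANA_SEARCH}?{urlencode(params)}"


url = build_provider_query("API_KEY", "Wellcome Collection")
# Decode the query string again to check what the server would receive.
decoded = parse_qs(urlsplit(url).query)
```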

@mathemancer
Contributor

That's awesome. @annatuma I believe I remember that the folks at Europeana said that some of their providers were more reliable than others when it came to license labeling. Do we have records about which providers those are?

We could use the same 'provider' vs 'source' scheme that we do for commoncrawl-sourced images for these aggregators.

@allen505 This is great info, thanks!

@annatuma
Contributor Author

annatuma commented Jun 4, 2020

> That's awesome. @annatuma I believe I remember that the folks at Europeana said that some of their providers were more reliable than others when it came to license labeling. Do we have records about which providers those are?
>
> We could use the same 'provider' vs 'source' scheme that we do for commoncrawl-sourced images for these aggregators.
>
> @allen505 This is great info, thanks!

@mathemancer sorry I missed this - we don't have information on that, but we'll check the record count for Wellcome once our Europeana integration is live, and ensure we're getting a full collection.

@kgodey kgodey added 🚧 status: blocked Blocked & therefore, not ready for work ✨ goal: improvement Improvement to an existing feature 🧹 status: ticket work required Needs more details before it can be worked on and removed blocked labels Sep 22, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey added this to Pending Review in Backlog Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey added 🙅 status: discontinued Not suitable for work as repo is in maintenance and removed 🏷 status: label work required Needs proper labelling before it can be worked on 🚧 status: blocked Blocked & therefore, not ready for work 🧹 status: ticket work required Needs more details before it can be worked on labels Dec 16, 2020
@kgodey kgodey closed this as completed Dec 16, 2020
@kgodey kgodey moved this from Pending Review to Done in Backlog Dec 16, 2020
@TimidRobot TimidRobot removed this from New Integrations (Ready for Work) in CC Catalog Pipeline Jan 12, 2022
Development

No branches or pull requests

6 participants