[API Integration] Wellcome Collection #243
Comments
@mathemancer Can I work on this as part of my GSoC project? |
@ChariniNana Yes, but before beginning development, it's important to answer the following:
|
They seem to provide images under different licenses; the license details are provided in the JSON response as follows (please try the GET request https://api.wellcomecollection.org/catalogue/v2/works/mtkdctvn): "license": { Apart from the image description, I cannot see anything in the response that could go into the meta_data field (it does not return information such as date created, views, etc.). I need to explore a bit more to find out whether it's possible to pull just the records updated on a given day (there is no obvious method). |
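To check where license information shows up in a response like the one mentioned above, a small helper can walk the parsed JSON and collect every "license" object. This is only a sketch: the sample document is hand-written and hypothetical (it is not a real API response, and the real shape may differ), and `find_licenses` is a name introduced here for illustration.

```python
import json


def find_licenses(node):
    """Recursively collect every value stored under a "license" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "license":
                found.append(value)
            else:
                found.extend(find_licenses(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_licenses(item))
    return found


# Hypothetical, hand-written sample; the real response shape may differ.
sample = json.loads("""
{
  "id": "example-work",
  "items": [
    {"locations": [{"license": {"id": "cc-by", "label": "Example label"}}]},
    {"locations": [{"url": "https://example.org/no-license"}]}
  ]
}
""")

licenses = find_licenses(sample)
print(licenses)
```

Running the same helper over a real response (fetched with any HTTP client) would show whether every image location carries a license, without assuming where in the document it sits.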
@ChariniNana It's okay if that's the case, as long as the overall volume isn't too large, and the speed is fast enough. The main thing is to have some strategy to get all the data over time. |
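If there is no way to filter by date, one possible strategy for getting all the data over time is to walk the full result set page by page. The sketch below assumes each page is a JSON object with a `results` list and an optional `nextPage` URL; those field names, the `iter_all_results` function, and the `fetch_page` callable are assumptions made for illustration and should be checked against the provider's documentation. Fetching is passed in as a function so that a rate-limited requester can be plugged in.

```python
def iter_all_results(fetch_page, first_url, max_pages=None):
    """Yield every record, following next-page links until none is left.

    fetch_page: callable taking a URL and returning the parsed page (a dict).
    max_pages: optional cap, useful for spreading a full crawl over many runs.
    """
    url = first_url
    pages_seen = 0
    while url is not None:
        if max_pages is not None and pages_seen >= max_pages:
            break
        page = fetch_page(url)
        pages_seen += 1
        for record in page.get("results", []):
            yield record
        # Assumed field name; a missing key ends the loop.
        url = page.get("nextPage")


# Fake two-page "API" standing in for real HTTP requests.
fake_pages = {
    "page-1": {"results": [{"id": "a"}, {"id": "b"}], "nextPage": "page-2"},
    "page-2": {"results": [{"id": "c"}]},
}

all_ids = [r["id"] for r in iter_all_results(fake_pages.__getitem__, "page-1")]
print(all_ids)
```

The `max_pages` cap is one way to keep each run short if the overall volume turns out to be large.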
@annatuma I'm a bit concerned about this line in their developer docs:
Have we been in touch with them regarding that? |
In this section, it says that Europeana has used their APIs to include the images into Europeana. If this is the case, when Europeana is integrated into cc-search there could be data duplication. |
@allen505 It's totally possible to have data duplication with our current setup. We'd need to choose on a case-by-case basis whether to keep the data from Europeana (if it's usefully enriched, for example), or from the upstream provider. Does the Europeana API give an ID of the upstream provider? That would make future deduplication comparatively easy. |
@mathemancer The provider field in the response gives the value of the Provider which should be The following query gives all the items which belong to Wellcome Collection: |
That's awesome. @annatuma I believe I remember that the folks at Europeana said that some of their providers were more reliable than others when it came to license labeling. Do we have records about which providers those are? We could use the same 'provider' vs 'source' scheme that we do for commoncrawl-sourced images for these aggregators. @allen505 This is great info, thanks! |
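As a rough illustration of how a 'provider' vs 'source' labeling could support deduplication, the sketch below keeps one record per upstream item and prefers the copy ingested directly from the provider over the aggregator's copy. The record fields (`provider`, `source`, `upstream_id`) and the `deduplicate` function are hypothetical, and always preferring the direct copy is a simplification: as noted above, the real choice would be made case by case.

```python
def deduplicate(records):
    """Keep one record per (provider, upstream_id).

    A record counts as "direct" when its source equals its provider;
    direct records replace aggregator copies of the same item.
    """
    chosen = {}
    for record in records:
        key = (record["provider"], record["upstream_id"])
        current = chosen.get(key)
        is_direct = record["source"] == record["provider"]
        current_is_direct = (
            current is not None and current["source"] == current["provider"]
        )
        if current is None or (is_direct and not current_is_direct):
            chosen[key] = record
    return list(chosen.values())


# Hypothetical records: the same Wellcome item seen via two routes.
records = [
    {"provider": "wellcome", "source": "europeana", "upstream_id": "w1"},
    {"provider": "wellcome", "source": "wellcome", "upstream_id": "w1"},
    {"provider": "wellcome", "source": "europeana", "upstream_id": "w2"},
]

unique = deduplicate(records)
print(unique)
```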
@mathemancer sorry I missed this - we don't have information on that, but we'll check the record count for Wellcome once our Europeana integration is live, and ensure we're getting a full collection. |
Provider API Endpoint / Documentation
https://developers.wellcomecollection.org
Provider description
Most of the metadata we need is readily available (the license, attribution info, a link, a thumbnail, etc.). They have something we could use for the description meta_data field (which we like for search indexing).
Two considerations to look into further prior to integration:
SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION
Licenses Provided
Provider API Technical info
General Recommendations for implementation
The script should live in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should use the ImageStore class (import this from src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (import this from src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script should not use anything from src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since that module is deprecated.
If the provider API can be queried by date, the script should take a --date parameter when run as a script, giving the date for which we should collect images. The form should be YYYY-MM-DD (so the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script should provide a main function that takes the same parameters as the CLI. In our example from above, we'd then have a main function my_favorite_provider.main(date). The main function should do the same thing that calling from the CLI would do.
Use pycodestyle (available via pip install pycodestyle) to check for PEP 8 compliance.
The test suite may break line-length rules where appropriate (e.g., long strings for testing).
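The --date parameter and the matching main(date) function described above could be wired together roughly as follows. This is a minimal skeleton under stated assumptions, not the project's actual script: `valid_date` and `build_parser` are helper names introduced here, --date is treated as optional, and the body of `main` is a placeholder where a real script would call the provider API and store images.

```python
import argparse
from datetime import datetime


def valid_date(text):
    """Validate a YYYY-MM-DD string and return it in normalized form."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date().isoformat()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {text!r}")


def build_parser():
    parser = argparse.ArgumentParser(
        description="Hypothetical provider API script skeleton."
    )
    parser.add_argument(
        "--date",
        type=valid_date,
        default=None,
        help="Date for which to collect images, in YYYY-MM-DD form.",
    )
    return parser


def main(date=None):
    """Do the same work whether called from the CLI or from other code."""
    # Placeholder: a real script would request the provider API here
    # and hand each image record to the image store.
    if date is None:
        message = "Collecting all images"
    else:
        message = f"Collecting images for {date}"
    print(message)
    return message


if __name__ == "__main__":
    args = build_parser().parse_args()
    main(args.date)
```

Keeping all argument parsing outside `main` is what lets the workflow call `main(date)` directly and get the same behavior as the command line.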
Examples of other Provider API Scripts
For example Provider API Scripts and accompanying test suites, please see src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.