[API Integration] Europeana #279

annatuma · 2020-02-24T20:26:57Z

Provider API Endpoint / Documentation

https://pro.europeana.eu/resources/apis

For this provider, an API key is required. Community contributors cannot use CC’s org key. The preferred approach is to store the key in an environment variable, such as PROVIDERNAME_API_KEY, and read it in the script. CC has obtained a key for community contributors to run tests with. Contact @mathemancer, @annatuma, or @kgodey to obtain the key when necessary.
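
A minimal sketch of reading the key from the environment. The variable name EUROPEANA_API_KEY is hypothetical; it only follows the PROVIDERNAME_API_KEY pattern mentioned above.

```python
import os

# Hypothetical variable name, following the PROVIDERNAME_API_KEY pattern.
API_KEY_VARIABLE = "EUROPEANA_API_KEY"


def get_api_key(environ=os.environ):
    """Return the API key from the environment, or None if it is unset."""
    key = environ.get(API_KEY_VARIABLE)
    if not key:
        return None
    return key
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.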

Provider description

Europeana is home to over 58M artworks, artefacts, books, films and music from European museums, galleries, libraries, and archives.

Licenses Provided

Europeana integrates with dozens of EU GLAM institutions. It is possible that some of these institutions have objects that are not CC-licensed. Our integration with Europeana must be restricted to those objects that have a CC license (including CC0 or the CC Public Domain Mark) indicated in the "Rights" field.

SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION

Provider API Technical info

General Recommendations for implementation

The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). Calling main should do the same thing as
running the script from the CLI.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).
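
The --date and main(date) recommendations above could be sketched roughly as follows. This is only an outline under stated assumptions: the helper names parse_date, build_parser and run_cli are hypothetical, and the ImageStore and DelayedRequester classes are deliberately left out because their interfaces are not described in this issue.

```python
import argparse
from datetime import datetime


def parse_date(date_string):
    """Validate a YYYY-MM-DD string and return it unchanged."""
    # strptime raises ValueError if the string is malformed.
    datetime.strptime(date_string, "%Y-%m-%d")
    return date_string


def main(date):
    """Entry point shared by the CLI and by callers importing the module."""
    date = parse_date(date)
    # Placeholder: a real script would request and store records here.
    return f"Would collect images for {date}"


def build_parser():
    """Build the command-line parser with the --date parameter."""
    parser = argparse.ArgumentParser(description="Hypothetical provider")
    parser.add_argument("--date", required=True, help="Date as YYYY-MM-DD")
    return parser


def run_cli(argv=None):
    """Parse CLI arguments and call main with the same parameters."""
    args = build_parser().parse_args(argv)
    return main(args.date)
```

With this layout, run_cli(["--date", "2018-01-01"]) and main("2018-01-01") take the same path, which is the behaviour the recommendations ask for.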

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.


gauravahlawat81 · 2020-02-24T21:14:13Z

What does "ticket work required" mean?

kgodey · 2020-02-24T21:15:44Z

@gauravahlawat81 it means the issue needs more detail fleshed out. if you hover over the label, you can see the description in the tooltip.

gauravahlawat81 · 2020-02-24T21:17:08Z

Okay, thanks for the info

mathemancer · 2020-02-25T09:59:16Z

@gauravahlawat81 If you're feeling adventurous, feel free to help us flesh out the API technical info. In particular, we would like to know

rate limits
overall volume of the collection
verify the API is designed such that we can ingest the whole collection
determine if it's possible to ingest the collection in a scheduled way (e.g., ingest only the info for images updated on a given day).
using the rate limits and overall volume, determine about what percentage of the collection it's possible to ingest per day.

allen505 · 2020-03-06T01:56:48Z

Hey @mathemancer, I've checked the Europeana link provided for info on the Europeana REST API:

No rate limits currently, but it has previously been limited to 10,000 per 24 hours
Volume: 50 million cultural heritage items, which include images, text, video, sound, etc.

I am interested in working on this issue. Can I work on it?

mathemancer · 2020-03-06T12:39:37Z

@allen505 Have you determined how many works can be scraped per request? At a rate of 10,000 requests per 24 hours and one record per request, 50 million records would take 5,000 days to retrieve (more than 13.5 years). We'll therefore need to determine if it's possible to get, say, 100 records per request (reducing the time to 50 days, which is manageable).
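
The estimate above can be checked with a few lines of arithmetic. This sketch assumes the 10,000 figure is a limit on requests per day and that the first case returns one record per request.

```python
import math

TOTAL_RECORDS = 50_000_000  # collection size quoted in this thread
REQUESTS_PER_DAY = 10_000   # previously reported daily limit


def days_to_ingest(total_records, records_per_request, requests_per_day):
    """Return the number of whole days needed to pull every record once."""
    records_per_day = records_per_request * requests_per_day
    return math.ceil(total_records / records_per_day)


one_per_request = days_to_ingest(TOTAL_RECORDS, 1, REQUESTS_PER_DAY)
hundred_per_request = days_to_ingest(TOTAL_RECORDS, 100, REQUESTS_PER_DAY)

print(one_per_request, round(one_per_request / 365.25, 1))  # 5000 13.7
print(hundred_per_request)  # 50
```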

mathemancer · 2020-03-06T12:41:24Z

@allen505 Do you have an idea of whether it's possible to systematically retrieve their entire collection (i.e., separate by date uploaded)?

allen505 · 2020-03-06T15:48:10Z

I have been researching the Europeana Search API.
The following timestamp fields are available:
timestamp_created and timestamp_update, both formatted as ISO 8601 or UNIX epoch timestamps.
Sorting by timestamp is possible in both ascending and descending order.

With respect to the number of results per request, regular pagination is restricted to 1,000 results. However, there is also cursor-based pagination, which returns all the results. Reference
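
A rough sketch of how cursor-based pagination is typically consumed. The response shape used here (an "items" list plus a "nextCursor" value, starting from the cursor "*") is an assumption and should be checked against the Europeana reference; fetch_page and FAKE_PAGES are hypothetical stand-ins so the example runs without network access.

```python
def iterate_records(fetch_page, first_cursor="*"):
    """Yield records page by page until no next cursor is returned.

    fetch_page is any callable taking a cursor and returning a dict shaped
    like {"items": [...], "nextCursor": "..."} (assumed response shape).
    """
    cursor = first_cursor
    while cursor is not None:
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("nextCursor")


# Stand-in for the real HTTP call, so the sketch runs without a network.
FAKE_PAGES = {
    "*": {"items": [1, 2], "nextCursor": "abc"},
    "abc": {"items": [3], "nextCursor": "def"},
    "def": {"items": []},
}

records = list(iterate_records(FAKE_PAGES.__getitem__))
print(records)  # [1, 2, 3]
```

In a real script, fetch_page would wrap the HTTP request and pass the cursor along as a query parameter.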

mathemancer · 2020-03-06T18:25:17Z

@allen505 Have you looked at the licenses offered, and how we can get that data? Is it possible to filter for them? @annatuma, @kgodey Do I remember correctly that there was some problem there?

allen505 · 2020-03-07T01:34:13Z

The Search API's reusability parameter can be set to

open which gives:
PDM, CC0, CC BY, CC BY-SA

restricted which gives:
CC BY-NC, CC BY-NC-SA, CC BY-NC-ND, CC BY-ND, CC OOC-NC
But it also gives resources with other licenses such as InC-EDU, which will have to be filtered out.

Reference
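
One possible way to filter out non-CC records after the request, sketched under the assumption that each record's rights value is a URL and that CC licenses, CC0 and the Public Domain Mark all live under creativecommons.org. The function name is hypothetical.

```python
from urllib.parse import urlparse


def is_cc_rights_url(rights_url):
    """Return True if the rights statement points at creativecommons.org."""
    host = urlparse(rights_url).netloc.lower()
    return (
        host == "creativecommons.org"
        or host.endswith(".creativecommons.org")
    )


examples = [
    "http://creativecommons.org/licenses/by-nc/4.0/",
    "http://creativecommons.org/publicdomain/mark/1.0/",
    "http://rightsstatements.org/vocab/InC-EDU/1.0/",
]
kept = [url for url in examples if is_cc_rights_url(url)]
```

Here kept would contain only the first two URLs, so a statement such as InC-EDU is dropped.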

mathemancer · 2020-03-09T20:13:59Z

This is great info! I think that's everything needed to get started on it, go for it! Thanks for gathering the necessary pieces.

If you have any questions, just @ me on this issue. In particular, let me know if something about the general instructions doesn't seem applicable. Thanks!

mathemancer · 2020-03-09T20:14:40Z

Removing ticket work required since all necessary information has been gathered.

allen505 · 2020-03-10T17:28:21Z

Thank you @mathemancer. Just to clarify, is this issue assigned to me now? It doesn't show as assigned on the issue.

mathemancer · 2020-03-11T16:01:56Z

@allen505 We're not always assigning issues that are in progress, in the hope that people will work together on them when possible :)

allen505 · 2020-03-30T05:45:03Z

I was testing out the Europeana API, and I found out that not all resources have a direct link to the image.
The total number of records with a direct link to the resource (images in our case) is 20,526,325, and 9,817,237 records lack one. That means about 32% of resources don't have a direct link to the resource.

mathemancer · 2020-04-01T08:08:29Z

I think that's okay, we can just filter out those that don't have a direct link. Is it possible to filter them out in the request, or would we have to do the filtering after pulling the data?

allen505 · 2020-04-02T03:09:27Z

Yes, passing the parameter 'provider_aggregation_edm_isShownBy:*' returns only those records with a direct link.
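
A hedged sketch of how the request parameters might be assembled. Only the reusability values and the provider_aggregation_edm_isShownBy:* filter come from this thread; the parameter names wskey, query, qf and rows, the TYPE:IMAGE refinement, and the function name are assumptions to verify against the Europeana documentation.

```python
def build_query_params(api_key, rows=100):
    """Assemble search parameters restricted to records with a direct link.

    Parameter names other than reusability are assumptions, not confirmed
    in this issue.
    """
    return {
        "wskey": api_key,  # assumed name of the API key parameter
        "query": "*",
        "qf": [
            "TYPE:IMAGE",  # assumed refinement for image records
            "provider_aggregation_edm_isShownBy:*",
        ],
        "reusability": ["open", "restricted"],
        "rows": rows,  # hypothetical default page size
    }


params = build_query_params("my-key")
```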

mathemancer · 2020-04-08T08:02:57Z

Research by @allen505 indicates that we can actually only get 100 records per request.

allen505 · 2020-04-08T17:04:36Z

Testing the API for the date 2014-02-26 (random date), I got 11,157 as the number of image records.
When queried for the entire collection with direct links to the resource, 20,544,899 was returned as the number of image records.
After discussions with @mathemancer it was concluded that 10,000 requests per 24 hours with 100 records per request is sufficient for our use case, and that at this rate we can pull the whole collection every month if needed.

allen505 · 2020-04-20T16:09:51Z

Several fields in the response are represented as arrays, such as dcDescription, edmIsShownAt, edmIsShownBy, and rights, which are used when saving the metadata. After discussion with @mathemancer, the following was concluded:

For LangAware fields such as dcLanguageLangAware, the English (en) version is to be taken when available.
The original language (def) can optionally be saved.
For other fields like edmIsShownAt, which don't have a corresponding LangAware field, the zeroth element of the array is to be chosen.
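
The selection rules above could look roughly like this. The sketch assumes a LangAware field is a mapping from language code to a list of strings, for example {"en": [...], "def": [...]}; falling back to def when en is missing is a design choice made here, not something decided in the thread. Both function names are hypothetical.

```python
def pick_lang_aware(field, preferred="en", fallback="def"):
    """Pick the first value in the preferred language, else the fallback."""
    if not isinstance(field, dict):
        return None
    for language in (preferred, fallback):
        values = field.get(language)
        if values:
            return values[0]
    return None


def first_element(values):
    """Return the zeroth element of a list field, or None if it is empty."""
    return values[0] if values else None


description = pick_lang_aware({"def": ["Gemälde"], "en": ["Painting"]})
landing_url = first_element(["https://example.org/item/1"])
```

Returning None for missing or empty fields lets the caller decide whether to skip the record or store it without that value.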

annatuma created this issue from a note in CC Catalog Pipeline (New Integrations (Ready for Work)) Feb 24, 2020

annatuma added enhancement providers labels Feb 24, 2020

mathemancer removed the ticket work required label Mar 9, 2020

mathemancer added this to Ready for Development in Active Sprint via automation Mar 9, 2020

mathemancer moved this from Ready for Development to In Progress (Community) in Active Sprint Mar 9, 2020

mathemancer added the in progress label Mar 9, 2020

kgodey assigned allen505 Mar 11, 2020

allen505 mentioned this issue Apr 21, 2020

API Integration for Europeana #368

Merged

7 tasks

mathemancer closed this as completed in #368 Jun 2, 2020

Active Sprint automation moved this from In Progress (Community) to Done Jun 2, 2020

kss682 mentioned this issue Jul 20, 2020

[Infrastructure] Implement reingestion strategy for Europeana #412

Closed

TimidRobot removed this from Done in Active Sprint Jan 12, 2022

TimidRobot removed this from New Integrations (Ready for Work) in CC Catalog Pipeline Jan 12, 2022
