This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

[API Integration] Europeana #279

Closed
annatuma opened this issue Feb 24, 2020 · 20 comments · Fixed by #368

@annatuma
Contributor

annatuma commented Feb 24, 2020

Provider API Endpoint / Documentation

https://pro.europeana.eu/resources/apis

For this provider, an API key is required. Community contributors cannot use CC’s org key. The preferred approach is to store the key in an environment variable, such as PROVIDERNAME_API_KEY, and read it in the script. CC has obtained a key that community contributors can use to run tests. Contact @mathemancer, @annatuma, or @kgodey to obtain the key when necessary.
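
As a minimal sketch of reading the key from the environment: the variable name EUROPEANA_API_KEY is only an assumption that follows the PROVIDERNAME_API_KEY pattern above, and get_api_key is a hypothetical helper, not part of the existing codebase.

```python
import os

# Hypothetical variable name following the PROVIDERNAME_API_KEY pattern.
API_KEY_ENV_VAR = "EUROPEANA_API_KEY"


def get_api_key(env=os.environ):
    """Return the API key from the given environment, or None if unset."""
    return env.get(API_KEY_ENV_VAR)
```

Passing the environment in as an argument keeps the helper easy to test without touching the real process environment.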

Provider description

Europeana is home to over 58M artworks, artefacts, books, films and music from European museums, galleries, libraries, and archives.

Licenses Provided

Europeana integrates with dozens of EU GLAM institutions. It is possible that some of these institutions have objects that are not CC-licensed. Our integration with Europeana must be restricted to those objects that have a CC license (including CC0 or the CC Public Domain Mark) indicated in the "Rights" field.

SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION

Provider API Technical info

General Recommendations for implementation

  • The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
  • The script should have a test suite in the same directory.
  • The script must use the ImageStore class (Import this from
    src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
  • The script should use the DelayedRequester class (Import this from
    src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
  • The script must not use anything from
    src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
    that module is deprecated.
  • If the provider API can be queried by 'upload date' or something similar,
    the script should take a --date parameter when run as a script, giving the
    date for which we should collect images. The form should be YYYY-MM-DD (so,
    the script can be run via python my_favorite_provider.py --date 2018-01-01).
  • The script must provide a main function that takes the same parameters as
    the CLI. In our example from above, we'd then have a main function
    my_favorite_provider.main(date). Calling main should do the same thing as
    running the script from the CLI.
  • The script must conform to PEP8. Please use pycodestyle (available via
    pip install pycodestyle) to check for compliance.
  • The script should use small, testable functions.
  • The test suite for the script may break PEP8 rules regarding long lines where
    appropriate (e.g., long strings for testing).
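
The --date and main(date) recommendations above could be wired together roughly as follows. This is only a sketch: parse_date and parse_args are hypothetical helper names, and the body of main is a placeholder, since the real work would go through ImageStore and DelayedRequester.

```python
import argparse
from datetime import datetime


def parse_date(date_string):
    """Validate a YYYY-MM-DD string and return it unchanged."""
    # strptime raises ValueError if the string is malformed.
    datetime.strptime(date_string, "%Y-%m-%d")
    return date_string


def main(date):
    """Entry point shared by the CLI and by code importing the module."""
    date = parse_date(date)
    # Placeholder: a real script would request and store the images
    # for this date here.
    return date


def parse_args(argv=None):
    """Parse CLI arguments; argv defaults to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Provider script sketch")
    parser.add_argument(
        "--date",
        required=True,
        help="Date for which to collect images, as YYYY-MM-DD",
    )
    return parser.parse_args(argv)


# In a real script, the module would end with something like:
#     if __name__ == "__main__":
#         main(parse_args().date)
```

Keeping argument parsing separate from main means main("2018-01-01") and python my_favorite_provider.py --date 2018-01-01 follow the same code path.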

Examples of other Provider API Scripts

For example Provider API Scripts and accompanying test suites, please see

  • src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
  • src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
  • src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
  • src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.
@annatuma annatuma created this issue from a note in CC Catalog Pipeline (New Integrations (Ready for Work)) Feb 24, 2020
@gauravahlawat81

What does "ticket work required" mean?

@kgodey
Contributor

kgodey commented Feb 24, 2020

@gauravahlawat81 It means the issue needs more detail fleshed out. If you hover over the label, you can see the description in the tooltip.

@gauravahlawat81

Okay, thanks for the info

@mathemancer
Contributor

mathemancer commented Feb 25, 2020

@gauravahlawat81 If you're feeling adventurous, feel free to help us flesh out the API technical info. In particular, we would like to know

  • rate limits
  • overall volume of the collection
  • verify the API is designed such that we can ingest the whole collection
  • determine if it's possible to ingest the collection in a scheduled way (e.g., ingest only the info for images updated on a given day).
  • using the rate limits and overall volume, determine about what percentage of the collection it's possible to ingest per day.

@allen505
Contributor

allen505 commented Mar 6, 2020

Hey @mathemancer, I've checked the Europeana link provided for info on the Europeana REST API,

  • No rate limits currently, but the API was previously limited to 10,000 requests per 24 hours.

  • Volume: 50 million cultural heritage items, including images, text, video, sound, etc.

I am interested in working on this issue. Can I work on it?

@mathemancer
Contributor

@allen505 Have you determined how many works can be scraped per request? At a rate of 10,000 per 24 hours, 50 million records would take 5000 days to retrieve (more than 13.5 years). We'll therefore need to determine if it's possible to get, say, 100 records per request (reducing the time to 50 days, which is manageable).
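
The arithmetic behind those estimates can be checked with a short sketch. The function name days_to_ingest is made up for illustration, and the 5000-day figure assumes one record per request.

```python
import math


def days_to_ingest(total_records, requests_per_day, records_per_request):
    """Return the number of whole days needed to retrieve every record."""
    records_per_day = requests_per_day * records_per_request
    return math.ceil(total_records / records_per_day)


# One record per request: 5000 days, i.e. roughly 13.7 years.
single = days_to_ingest(50_000_000, 10_000, 1)

# One hundred records per request: 50 days.
batched = days_to_ingest(50_000_000, 10_000, 100)
```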

@mathemancer
Contributor

@allen505 Do you have an idea of whether it's possible to systematically retrieve their entire collection (i.e., separate by date uploaded)?

@allen505
Contributor

allen505 commented Mar 6, 2020

I have been researching the Europeana Search API.
The following timestamp details are available:
timestamp_created and timestamp_update, both formatted as ISO 8601 or UNIX epoch timestamps.
Sorting by timestamp is possible in both ascending and descending order.

With respect to the number of results per request, regular pagination is restricted to 1,000 results. However, there is also cursor-based pagination, which returns all the results (Reference).
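
Cursor-based pagination could be consumed along these lines. This is a sketch under assumptions not stated in this thread: that the first request uses a cursor value of "*", and that each response carries its records under "items" and the next cursor under "nextCursor". The fetch_page callable is hypothetical and stands in for the actual HTTP request.

```python
def iter_records(fetch_page, max_pages=None):
    """Yield records page by page, following cursors until none remains.

    fetch_page is any callable that takes a cursor string and returns
    the decoded JSON response as a dict.
    """
    cursor = "*"  # Assumed initial cursor value.
    pages = 0
    while cursor is not None:
        response = fetch_page(cursor)
        yield from response.get("items", [])
        # Assumed field name; a missing value ends the loop.
        cursor = response.get("nextCursor")
        pages += 1
        if max_pages is not None and pages >= max_pages:
            break
```

Injecting fetch_page keeps the paging logic testable with canned responses, and max_pages gives a simple way to cap a run while experimenting.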

@mathemancer
Contributor

@allen505 Have you looked at the licenses offered, and how we can get that data? Is it possible to filter for them? @annatuma, @kgodey Do I remember correctly that there was some problem there?

@allen505
Contributor

allen505 commented Mar 7, 2020

The Search API's reusability parameter can be set to

  • open which gives:
    PDM, CC0, CC BY, CC BY-SA
  • restricted which gives:
    CC BY-NC, CC BY-NC-SA, CC BY-NC-ND, CC BY-ND, CC OOC-NC
    But it also gives resources with other licenses such as InC-EDU, which will have to be filtered out.

Reference
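
Filtering out the non-CC results after the request might look like the sketch below. It assumes that each item's rights field is a list of license URLs and that CC licenses, CC0, and the Public Domain Mark all live on creativecommons.org. The function names and example URLs are illustrative only.

```python
from urllib.parse import urlparse


def is_cc_rights_url(rights_url):
    """Return True if the URL points at creativecommons.org."""
    host = urlparse(rights_url).netloc.lower()
    return host == "creativecommons.org" or host.endswith(
        ".creativecommons.org"
    )


def filter_cc_items(items):
    """Keep only items with at least one CC rights URL."""
    return [
        item
        for item in items
        if any(is_cc_rights_url(url) for url in item.get("rights", []))
    ]
```

Comparing the parsed host rather than searching the raw string avoids matching a URL that merely mentions creativecommons.org in its path.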

@mathemancer
Contributor

This is great info! I think that's everything needed to get started on it, go for it! Thanks for gathering the necessary pieces.

If you have any questions, just @ me on this issue. In particular, let me know if something about the general instructions doesn't seem applicable. Thanks!

@mathemancer
Contributor

Removing ticket work required since all necessary information has been gathered.

@mathemancer mathemancer added this to Ready for Development in Active Sprint via automation Mar 9, 2020
@mathemancer mathemancer moved this from Ready for Development to In Progress (Community) in Active Sprint Mar 9, 2020
@allen505
Contributor

allen505 commented Mar 10, 2020

Thank you @mathemancer. Just to clarify, is this issue assigned to me now? It doesn't show as assigned on the issue.

@mathemancer
Contributor

@allen505 We're not always assigning issues that are in progress, in the hope that people will work together on them when possible :)

@allen505
Contributor

allen505 commented Mar 30, 2020

I was testing out the Europeana API and found that not all resources have a direct link to the image.
The total number of records with a direct link to the resource (images in our case) is 20,526,325, and 9,817,237 records do not have one. That means about 32% of resources don't have a direct link to the resource.

@mathemancer
Contributor

I think that's okay, we can just filter out those that don't have a direct link. Is it possible to filter them out in the request, or would we have to do the filtering after pulling the data?

@allen505
Contributor

allen505 commented Apr 2, 2020

Yes, there is a parameter, 'provider_aggregation_edm_isShownBy:*', which, when passed, returns only those records with a direct link.
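
One way the request parameters might be assembled is sketched below. Only reusability and the isShownBy filter come from this thread. Passing the filter through a qf refinement, and the wskey, rows, and cursor parameter names, are assumptions that should be checked against the Europeana documentation.

```python
from urllib.parse import urlencode


def build_query_params(api_key, rows=100):
    """Return query parameters for one search request (assumed names)."""
    return {
        "wskey": api_key,  # Assumed name of the API key parameter.
        "reusability": ["open", "restricted"],
        # Assumed: the direct-link filter is passed as a refinement.
        "qf": ["provider_aggregation_edm_isShownBy:*"],
        "rows": rows,  # Assumed name of the page size parameter.
        "cursor": "*",  # Assumed initial cursor value.
    }


def build_query_string(params):
    """Encode the parameters, repeating keys for list values."""
    return urlencode(params, doseq=True)
```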

@mathemancer
Contributor

Research by @allen505 indicates we can only get 100 records per request in practice.

@allen505
Contributor

allen505 commented Apr 8, 2020

Testing the API for the date 2014-02-26 (random date), I got 11,157 as the number of image records.
When queried for the entire collection with direct links to the resource, 20,544,899 was returned as the number of image records.
After discussions with @mathemancer it was concluded that 10,000 requests per 24 hours with 100 records per request is sufficient for our use case, and that at this rate we can pull the whole collection every month if needed.
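
As a rough check of those figures (the helper names are invented for illustration): a day with 11,157 records needs 112 requests, well under the 10,000-request budget, and the 20,544,899 records with direct links need about 21 days at full rate.

```python
import math

REQUESTS_PER_DAY = 10_000
RECORDS_PER_REQUEST = 100


def requests_needed(record_count, per_request=RECORDS_PER_REQUEST):
    """Return the number of requests needed to fetch record_count records."""
    return math.ceil(record_count / per_request)


def days_needed(record_count):
    """Return the number of whole days needed at the daily request budget."""
    return math.ceil(requests_needed(record_count) / REQUESTS_PER_DAY)


one_day_requests = requests_needed(11_157)
whole_collection_days = days_needed(20_544_899)
```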

@allen505
Contributor

Several fields in the response are represented as arrays, including dcDescription, edmIsShownAt, edmIsShownBy, and rights, which are used when saving the metadata. After discussion with @mathemancer, we concluded the following:

  • For LangAware fields such as dcLanguageLangAware, the English (en) version is to be taken when available.
  • The original language (def) version can optionally be saved.
  • For other fields like edmIsShownAt, which don't have a corresponding LangAware field, the zeroth element of the array is to be chosen.
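
The selection rules above could be sketched as follows. The sketch assumes a LangAware field is a dict mapping a language code to a list of strings, and that the zeroth entry of that list is the one to keep. Both helper names are hypothetical.

```python
def pick_lang_aware(value_by_lang, lang="en"):
    """Return the first value for the given language, or None if absent."""
    values = (value_by_lang or {}).get(lang)
    return values[0] if values else None


def first_element(values):
    """Return the zeroth element of an array field, or None if empty."""
    return values[0] if values else None
```

With these helpers, the English value would come from pick_lang_aware(field) and the optional original-language value from pick_lang_aware(field, "def").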

Active Sprint automation moved this from In Progress (Community) to Done Jun 2, 2020
@TimidRobot TimidRobot removed this from Done in Active Sprint Jan 12, 2022
@TimidRobot TimidRobot removed this from New Integrations (Ready for Work) in CC Catalog Pipeline Jan 12, 2022

5 participants