[API Integration] Europeana #279
Comments
What does "ticket work required" mean? |
@gauravahlawat81 it means the issue needs more detail fleshed out. if you hover over the label, you can see the description in the tooltip. |
Okay, thanks for the info |
@gauravahlawat81 If you're feeling adventurous, feel free to help us flesh out the API technical info. In particular, we would like to know
|
Hey @mathemancer, I've checked the Europeana link provided for info on the Europeana REST API,
I am interested in working on this issue. Can I work on it? |
@allen505 Have you determined how many works can be scraped per request? At a rate of 10,000 per 24 hours, 50 million records would take 5000 days to retrieve (more than 13.5 years). We'll therefore need to determine if it's possible to get, say, 100 records per request (reducing the time to 50 days, which is manageable). |
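The estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 10,000-per-24-hours figure is a cap on requests, and the function name is illustrative.

```python
import math


def days_to_retrieve(total_records, requests_per_day, records_per_request):
    """Return the number of whole days needed to fetch every record."""
    records_per_day = requests_per_day * records_per_request
    return math.ceil(total_records / records_per_day)


# Figures quoted in the comment above.
print(days_to_retrieve(50_000_000, 10_000, 1))    # one record per request
print(days_to_retrieve(50_000_000, 10_000, 100))  # 100 records per request
```

With one record per request this gives 5000 days, and with 100 records per request it gives 50 days, matching the numbers in the comment.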
@allen505 Do you have an idea of whether it's possible to systematically retrieve their entire collection (i.e., separate by date uploaded)? |
I have been researching the Europeana Search API. Regarding the number of results per request: regular pagination is restricted to 1,000 results. However, there is also cursor-based pagination, which returns all the results. Reference |
The Search APIs reusability parameter can be set to
|
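A rough sketch of how a cursor-based loop could look is given here. It is not taken from the Europeana documentation: the initial cursor value "*", the "items" and "nextCursor" response fields, and the fetch_page helper are all assumptions that should be checked against the Search API reference before use. The HTTP call is replaced by a stand-in so the sketch runs without network access.

```python
def iter_records(fetch_page, rows=100):
    """Yield records page by page until the API stops returning a cursor.

    fetch_page(cursor, rows) is a hypothetical caller-supplied function that
    performs one request and returns the decoded JSON response as a dict.
    """
    cursor = "*"  # assumed initial cursor value
    while cursor is not None:
        page = fetch_page(cursor, rows)
        yield from page.get("items", [])  # assumed field holding the records
        cursor = page.get("nextCursor")   # assumed field; missing on last page


# Stand-in for a real HTTP call (hypothetical data).
def fake_fetch_page(cursor, rows):
    pages = {
        "*": {"items": [{"id": 1}, {"id": 2}], "nextCursor": "abc"},
        "abc": {"items": [{"id": 3}]},
    }
    return pages[cursor]


print([record["id"] for record in iter_records(fake_fetch_page)])
```

Passing the fetch function in as an argument keeps the paging logic testable without an API key.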
This is great info! I think that's everything needed to get started on it, go for it! Thanks for gathering the necessary pieces. If you have any questions, just @ me on this issue. In particular, let me know if something about the general instructions doesn't seem applicable. Thanks! |
Removing |
Thank you @mathemancer. Just to clarify, is this issue assigned to me now? It doesn't show as assigned on the issue. |
@allen505 We're not always assigning issues that are in progress, in the hopes that people will work together on them when possible :) |
I was testing out the Europeana API, and I found out that not all resources have a direct link to the image |
I think that's okay, we can just filter out those that don't have a direct link. Is it possible to filter them out in the request, or would we have to do the filtering after pulling the data? |
Yes, there is a parameter, 'provider_aggregation_edm_isShownBy:*', which returns only those with a direct link when passed. |
Research by @allen505 indicates we can actually only get 100 records per request. |
Testing the API for the date 2014-02-26 (random date), I got 11,157 as the number of image records. |
Several fields in the response that are used while saving the metadata are represented as arrays, such as dcDescription, edmIsShownAt, edmIsShownBy, and rights. After discussion with @mathemancer, the following was concluded:
|
Provider API Endpoint / Documentation
https://pro.europeana.eu/resources/apis
For this provider, an API key is required. As a community contributor, it is not possible to use CC's org key. The preferred way to work with the key is to store it in an environment variable, such as PROVIDERNAME_API_KEY, and read it in the script. CC has obtained a key for a community contributor to run tests with. Contact @mathemancer or @annatuma or @kgodey to obtain the key when necessary.
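Reading the key from the environment might look like the sketch below. The variable name EUROPEANA_API_KEY is an assumption that follows the PROVIDERNAME_API_KEY pattern, and the helper name is illustrative only.

```python
import os

# Assumed variable name, following the PROVIDERNAME_API_KEY pattern.
API_KEY_VARIABLE = "EUROPEANA_API_KEY"


def get_api_key(environ=os.environ):
    """Return the API key from the environment, or fail with a clear message."""
    key = environ.get(API_KEY_VARIABLE)
    if not key:
        raise RuntimeError(f"Set the {API_KEY_VARIABLE} environment variable")
    return key
```

Failing early when the variable is missing keeps the key out of the source code and avoids sending unauthenticated requests.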
Provider description
Europeana is home to over 58M artworks, artefacts, books, films and music from European museums, galleries, libraries, and archives.
Licenses Provided
Europeana integrates with dozens of EU GLAM institutions. It is possible that some of these institutions have objects that are not CC-licensed. Our integration with Europeana must be restricted to those objects that have a CC license (including CC0 or the CC Public Domain Mark) indicated in the "Rights" field.
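One way to apply this restriction after pulling the data is sketched below. It assumes each record carries a "rights" list of URLs, and that CC licenses and public domain tools are identified by creativecommons.org URLs under /licenses/ or /publicdomain/. The function names are illustrative.

```python
from urllib.parse import urlparse


def is_cc_rights_url(url):
    """Return True for creativecommons.org license or public domain URLs."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host != "creativecommons.org":
        return False
    # /licenses/ covers the CC licenses; /publicdomain/ covers CC0 and the
    # Public Domain Mark.
    return parsed.path.startswith(("/licenses/", "/publicdomain/"))


def has_cc_license(record):
    """Return True if any entry in the record's assumed 'rights' list is CC."""
    return any(is_cc_rights_url(url) for url in record.get("rights", []))
```

Records whose rights statements point anywhere else, or that have no rights entry at all, are rejected by this check.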
SECTIONS BELOW THIS POINT NEED TO BE WORKED ON PRIOR TO INTEGRATION
Provider API Technical info
General Recommendations for implementation
[...] src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
[...] ImageStore class (Import this from src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
[...] DelayedRequester class (Import this from src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
[...] src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since that module is deprecated.
[...] the script should take a --date parameter when run as a script, giving the date for which we should collect images. The form should be YYYY-MM-DD (so, the script can be run via python my_favorite_provider.py --date 2018-01-01).
[...] the CLI. In our example from above, we'd then have a main function my_favorite_provider.main(date). The main should do the same thing calling from the CLI would do.
[...] pycodestyle (available via pip install pycodestyle) to check for compliance.
[...] appropriate (e.g., long strings for testing).
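The --date parameter and the matching main(date) function could be wired together roughly as follows. This is a sketch using the standard argparse module. The body of main is a placeholder, and the helper names parse_date and parse_args are illustrative.

```python
import argparse
from datetime import datetime


def parse_date(value):
    """Reject values that do not parse as a year-month-day date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")
    return value


def main(date):
    """Placeholder entry point: a real script would collect images here."""
    print(f"Collecting images for {date}")
    return date


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Provider API script sketch")
    parser.add_argument("--date", type=parse_date, required=True,
                        help="date to collect images for, as YYYY-MM-DD")
    return parser.parse_args(argv)


# A real script would call parse_args() with no arguments inside an
# `if __name__ == "__main__":` guard; the list here stands in for the CLI.
args = parse_args(["--date", "2018-01-01"])
main(args.date)
```

Because main takes the same argument the CLI does, it can be called directly from other Python code with the same effect as running the script.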
Examples of other Provider API Scripts
For example Provider API Scripts and accompanying test suites, please see src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.