Add new image sources to the API #3393

Closed
obulat opened this issue Nov 23, 2023 · 3 comments
Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🔒 staff only Restricted to staff members

Comments

@obulat
Contributor

obulat commented Nov 23, 2023

Problem

Some image sources were added to the catalog, but they do not appear at https://api.openverse.engineering/v1/images/stats because they were not added to the API as ContentProvider records.

Description

The following sources exist in Elasticsearch, but not as ContentProvider models.

| added | source_name | API result count | display_name | collection_url | source_url |
|---|---|---|---|---|---|
| N/A | archief_alkmaar | 73 | Regionaal Archief Alkmaar | https://www.flickr.com/photos/public-domain-archief-alkmaar/ | https://www.regionaalarchiefalkmaar.nl/ |
| [x] | bib_gulbenkian | 6104 | Gulbenkian Art Library | https://www.flickr.com/photos/biblarte/ | https://gulbenkian.pt/biblioteca-arte/ |
| N/A | east_riding | 11 | East Riding Archives | https://www.flickr.com/photos/erarchives/ | http://www2.eastriding.gov.uk/leisure/archives-family-and-local-history/ |
| [x] | finnish_heritage_agency | 266832 | Finnish Heritage Agency | No URL for collection | https://www.museovirasto.fi/en/ |
| [x] | finnish_satakunnan_museum | 12517 | Finnish Satakunnan Museum | No URL for collection | https://satakunnanmuseo.pori.fi/en/ |
| [x] | justtakeitfree | 166 | Just take it free | https://justtakeitfree.com/ | https://justtakeitfree.com/ |
| [x] | national_museum_of_finland | 407 | National Museum of Finland | No URL for collection | https://www.kansallismuseo.fi/en/kansallismuseo |
| N/A | waltersartmuseum | 16948 | The Walters Art Museum | https://art.thewalters.org/ | https://thewalters.org/ |
| [x] | wellcome_collection | 92478 | Wellcome Collection | https://wellcomecollection.org/search/images | https://wellcomecollection.org/ |

Note: The Walters Art Museum has items in ES, but there's no ContentProvider, and no results are returned by the API. Why?
Flora-on content is hidden in the Django Admin.

Description

The https://api.openverse.engineering/v1/images/stats endpoint returns stats only for sources that have a ContentProvider record in the database.
The ES stats are saved to the Redis cache. Here's the current value:

{'wordpress': 10506, 'woc_tech': 268, 'wikimedia': 62655099, 'wellcome_collection': 92478, 'waltersartmuseum': 16948, 'thingiverse': 32395, 'svgsilh': 358942, 'stocksnap': 38802, 'spacex': 1360, 'smk': 39954, 'smithsonian_zoo_and_conservation': 462, 'smithsonian_postal_museum': 7155, 'smithsonian_portrait_gallery': 16212, 'smithsonian_national_museum_of_natural_history': 3687662, 'smithsonian_libraries': 55, 'smithsonian_institution_archives': 8313, 'smithsonian_hirshhorn_museum': 878, 'smithsonian_gardens': 8387, 'smithsonian_freer_gallery_of_art': 7086, 'smithsonian_cooper_hewitt_museum': 75037, 'smithsonian_anacostia_museum': 601, 'smithsonian_american_indian_museum': 246, 'smithsonian_american_history_museum': 13718, 'smithsonian_american_art_museum': 12498, 'smithsonian_air_and_space_museum': 6513, 'smithsonian_african_art_museum': 365, 'smithsonian_african_american_history_museum': 10895, 'sketchfab': 37872, 'sciencemuseum': 107979, 'rijksmuseum': 29999, 'rawpixel': 177650, 'phylopic': 8394, 'nypl': 1277, 'national_museum_of_finland': 407, 'nasa': 125092, 'nappy': 2211, 'museumsvictoria': 168080, 'met': 401340, 'justtakeitfree': 167, 'inaturalist': 158267579, 'geographorguk': 1090119, 'floraon': 55010, 'flickr': 505853539, 'finnish_satakunnan_museum': 12517, 'finnish_heritage_agency': 266832, 'europeana': 9453878, 'east_riding': 11, 'digitaltmuseum': 289769, 'clevelandmuseum': 39138, 'brooklynmuseum': 69924, 'bio_diversity': 247665, 'bib_gulbenkian': 6104, 'archief_alkmaar': 73, 'animaldiversity': 15554, 'WoRMS': 19783, 'CAPL': 15143}

Alternatives

Create an automated process that regularly compares the results of the ES stats query with the Django API ContentProvider records and sends a notification when a new provider needs to be added. Adding the provider would still require manual intervention to supply the correct URL.
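As a rough illustration of the comparison step, here is a minimal sketch in plain Python. It assumes the ES source counts are already available as a dict (like the cached value above) and that the identifiers of existing ContentProvider records can be listed. The function name `find_missing_sources` and the sample `provider_identifiers` list are hypothetical and are not taken from the Openverse codebase.

```python
def find_missing_sources(es_counts, provider_identifiers):
    """Return ES sources that have no matching provider, sorted by name."""
    known = set(provider_identifiers)
    return {
        source: count
        for source, count in sorted(es_counts.items())
        if source not in known
    }


# A few values copied from the cached ES stats quoted in this issue.
es_counts = {
    "wellcome_collection": 92478,
    "flickr": 505853539,
    "east_riding": 11,
    "bib_gulbenkian": 6104,
}

# Hypothetical: identifiers that already have a ContentProvider record.
provider_identifiers = ["flickr"]

missing = find_missing_sources(es_counts, provider_identifiers)
for source, count in missing.items():
    # A real process would send a notification here instead of printing.
    print(f"{source}: {count} items in ES, no ContentProvider")
```

The notification and the lookup of the correct URLs are left out on purpose, since those would still need manual review.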

Additional context

Some sources have very few items in ES (archief_alkmaar: 73, east_riding: 11). @WordPress/openverse-catalog, do you think we should remove them as separate sources? They might have more items available that are not yet ingested, so I'm not sure.

@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🔒 staff only Restricted to staff members 🧱 stack: api Related to the Django API labels Nov 23, 2023
@AetherUnbound
Contributor

Thanks for identifying this @obulat! Can I ask, what sort of process did you use for finding that these providers were missing?

We talked in the weekly community meeting this week - I think that it's worth enabling separate providers when the providers themselves have over 150 images. While looking this over, I made #3470 to try and make it easier for us to remember why certain providers are disabled in the future.
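To make the 150-image rule of thumb concrete, here is a small sketch applied to a few counts from the table in the issue description. The constant `MIN_IMAGES` and the helper `meets_threshold` are hypothetical names used only for illustration, and "over 150" is read as strictly greater than 150.

```python
MIN_IMAGES = 150  # threshold suggested in the comment above

# API result counts copied from the table in the issue description.
counts = {
    "archief_alkmaar": 73,
    "east_riding": 11,
    "justtakeitfree": 166,
    "national_museum_of_finland": 407,
}


def meets_threshold(count, minimum=MIN_IMAGES):
    """Return True when a source has more than `minimum` images."""
    return count > minimum


enabled = sorted(name for name, n in counts.items() if meets_threshold(n))
skipped = sorted(name for name, n in counts.items() if not meets_threshold(n))

print("enable as separate providers:", enabled)
print("below threshold:", skipped)
```

With these counts, archief_alkmaar and east_riding fall below the threshold, which matches the two small sources raised in the original question.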

I see two additional issues that could be made from this, too:

  • The addition of a collection URL field, as mentioned in the community meeting as well (and separate from the source URL)
  • A DAG which follows the same process you did to arrive at this issue in an automated way, and reports it to the maintainers to check on.

@obulat
Contributor Author

obulat commented Dec 6, 2023

> Can I ask, what sort of process did you use for finding that these providers were missing?

The stats endpoint checks the Redis cache for the value of the sources-<mediaType> key. If the value is not available, it sends an ES request for a source aggregation and caches the result.

sources = cache.get(key=source_cache_name)

I got the value that's saved in the cache, and then compared it to the value returned by the /images/stats endpoint (which only returns sources that have a ContentProvider and do not have "Hide Content" set to true).
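The lookup described above follows a cache-aside pattern, which can be sketched roughly as follows. This is not the actual API code: `InMemoryCache` is a stand-in for the Redis-backed cache, `fetch_source_counts_from_es` is a hypothetical placeholder for the ES aggregation request, and the media type value "image" is an assumption.

```python
class InMemoryCache:
    """Stand-in for the Redis-backed cache (not the Django cache API)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


def fetch_source_counts_from_es(media_type):
    """Hypothetical placeholder for the ES source aggregation request."""
    fetch_source_counts_from_es.calls += 1
    return {"flickr": 505853539, "east_riding": 11}


fetch_source_counts_from_es.calls = 0  # counts how often "ES" is queried


def get_sources(cache, media_type):
    """Return source counts, querying ES only when the cache is empty."""
    source_cache_name = f"sources-{media_type}"
    sources = cache.get(key=source_cache_name)
    if sources is None:
        sources = fetch_source_counts_from_es(media_type)
        cache.set(source_cache_name, sources)
    return sources


cache = InMemoryCache()
first = get_sources(cache, "image")   # cache miss: falls through to "ES"
second = get_sources(cache, "image")  # cache hit: no second "ES" request
print(fetch_source_counts_from_es.calls)
```

Reading the cached value directly, as described above, skips the ContentProvider filtering that the /images/stats endpoint applies, which is why the two results differ.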

@obulat
Contributor Author

obulat commented Dec 18, 2023

I added ContentProvider records to the API. You can see the updated list of image sources here: https://api.openverse.engineering/v1/images/stats/?format=api

I also opened issues for investigating the Walters Art Museum and for creating the DAG that suggests new sources.
