
Backfill Internet Archive Book Provider source #3490

Closed
stacimc opened this issue Dec 7, 2023 · 4 comments
Assignees
Labels
- 💻 aspect: code (Concerns the software code in the repository)
- 🌟 goal: addition (Addition of new feature)
- 🟨 priority: medium (Not blocking but should be addressed soon)
- 🧱 stack: api (Related to the Django API)
- 🧱 stack: catalog (Related to the catalog and Airflow DAGs)

Comments

stacimc commented Dec 7, 2023

Problem

Blocked by #3489

#3441 added Internet Archive Book Images as a subprovider for Flickr. Now that it has been merged, any newly ingested records for this user will get the new source. Records already ingested for this user, however, are not automatically updated, and the Flickr reingestion DAG is paused due to rate-limiting issues.

Description

We should do a batched_update to select all records with this creator (126377022@N07) and set their source to the new source string internet_archive_book_images. We should run a test locally before doing the production batched update.

Additional context

This batched update is blocked on the batched update in #3489, which also updates Flickr. That one takes priority as it fixes a production bug.

@stacimc stacimc added the 🟨 priority: medium, 🌟 goal: addition, 💻 aspect: code, 🧱 stack: api, and 🧱 stack: catalog labels Dec 7, 2023
@stacimc stacimc self-assigned this Dec 13, 2023

stacimc commented Dec 13, 2023

Tested locally with this batched_update configuration:

{
  "batch_size": 10000,
  "dry_run": false,
  "query_id": "flickr_internet_archive",
  "resume_update": false,
  "select_query": "WHERE provider='flickr' AND creator_url='https://www.flickr.com/photos/126377022@N07';",
  "table_name": "image",
  "update_query": "SET source='internet_archive_book_images'",
  "update_timeout": 3600
}
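For reference, here is a rough sketch of how the conf fields above compose into the final SQL. This is my assumption about the shape of the statement, not the batched_update DAG's actual implementation (which also batches by identifier ranges):

```python
# Sketch: how the conf above composes into an UPDATE statement.
# Assumption for illustration only; the real batched_update DAG
# splits the work into batches rather than running one statement.
conf = {
    "table_name": "image",
    "select_query": "WHERE provider='flickr' AND creator_url='https://www.flickr.com/photos/126377022@N07';",
    "update_query": "SET source='internet_archive_book_images'",
}

update_sql = (
    f"UPDATE {conf['table_name']} "
    f"{conf['update_query']} "
    f"{conf['select_query'].rstrip(';')};"
)
print(update_sql)
```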

Note that I did the select using the creator_url. For Flickr we don't store the Flickr username directly as creator; instead we use the ownername field, which I'm not certain is unique. The creator_url, however, contains the user id, so we can uniquely select these records from it.
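As a sanity check, the user id can be recovered from the creator_url. A minimal sketch (the helper name is mine, and it assumes URLs of the form https://www.flickr.com/photos/<user_id>):

```python
from urllib.parse import urlparse

def flickr_user_id(creator_url: str) -> str:
    """Extract the Flickr user id (NSID) from a creator_url.

    Hypothetical helper for illustration; assumes the URL path
    ends with the user id, e.g. /photos/126377022@N07.
    """
    return urlparse(creator_url).path.rstrip("/").split("/")[-1]

print(flickr_user_id("https://www.flickr.com/photos/126377022@N07"))
```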


stacimc commented Dec 15, 2023

The backfill ran, and identified 0 records in our existing catalog from this user.

I did some investigation and found that this user reports a total of 4,767,558 CC0-licensed images. The most recently uploaded image was posted on September 11, 2015, so it makes sense that none of these records have been ingested previously (Flickr was only backfilled to 2020, although we also have some older results I know less about). Unfortunately, since nothing new has been uploaded since then, it's very unlikely that the DAG will add any results moving forward; our best bet would be to try to ingest the existing data.

Due to the unreliability of the API, that's going to be extremely difficult using the DAG in its current state. For example, my initial idea was to manually run the DAG for each day in September 2015 to get at least some of the data. The Flickr API reports ~100k records for this user from September. I tried running the DAG locally for September 10 and ingested 81k records, but 0 for this user. Meanwhile, if I request records from the API for that date and specify the user id in the query params, I get ~10k records 🤷‍♀️
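For the curious, the direct API request described above looks roughly like this. It uses the real flickr.photos.search parameters (user_id, min_upload_date, max_upload_date); the api_key value is a placeholder, and this sketch only builds the request URL rather than calling the API:

```python
from urllib.parse import urlencode

FLICKR_API = "https://api.flickr.com/services/rest"

# flickr.photos.search scoped to this user and a single day.
# The api_key value is a placeholder; an actual call needs a real key.
params = {
    "method": "flickr.photos.search",
    "api_key": "YOUR_API_KEY",
    "user_id": "126377022@N07",
    "min_upload_date": "2015-09-10",
    "max_upload_date": "2015-09-11",
    "format": "json",
    "nojsoncallback": "1",
}
url = f"{FLICKR_API}?{urlencode(params)}"
print(url)
```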

I think we could get a good amount of data by adjusting the DAG to run over a longer period (rather than a single day) and include the user in the query params. It would even be pretty cool to add a feature to all provider DAGs: a conf option for extra query params that get merged into the params on each request. While I think it could be done, it's also much more work than expected, and I can't guarantee it will work well. We certainly won't get the full ~4 million records advertised.
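The "extra query params" idea could look something like this. The function and conf key names are mine, not existing provider-DAG code; in a real DAG the conf dict would come from the Airflow dag_run conf:

```python
def build_query_params(base_params: dict, conf: dict) -> dict:
    """Merge conf-supplied extra query params into each request's params.

    Sketch of the proposed feature (names are hypothetical).
    Extra params win on key conflicts, letting a manual run narrow
    the request, e.g. to a single user.
    """
    extra = conf.get("extra_query_params") or {}
    return {**base_params, **extra}

merged = build_query_params(
    {"min_upload_date": "2015-09-01", "per_page": 500},
    {"extra_query_params": {"user_id": "126377022@N07"}},
)
print(merged)
```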

Consequently, I think the best move is to revert the change adding this user as a source, at least for now, and add the user id to the nsids_to_skip list here so that the sub-provider audit DAG doesn't keep alerting on it. We should include a comment explaining the reasoning.
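The skip-list change would be a one-line addition along these lines (a sketch; the real list lives in the Flickr sub-provider audit code):

```python
# NSIDs excluded from the sub-provider audit, with the reasoning inline.
nsids_to_skip = [
    # internet_archive_book_images: no new uploads since 2015, so the
    # Flickr DAG will not pick up its records; skip until a backfill
    # of the existing data is feasible.
    "126377022@N07",
]
print(nsids_to_skip)
```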

This is unfortunate because this really does seem like a great subprovider, and I'd love to get that data. I'll also create issues for the features we'd need to be able to modify the Flickr DAG to ingest them more easily, and hopefully we'll get around to them.

@AetherUnbound does that sound reasonable?

@AetherUnbound
That sounds like the right move, thank you for laying it all out Staci!


stacimc commented Dec 15, 2023

Issues #3533 and #3534 created for the steps we'd need to do a backfill. #3532 opened to remove the source for now.

@stacimc stacimc closed this as completed Dec 15, 2023