Backfill Internet Archive Book Provider source #3490
Comments
Tested locally with this batched_update configuration:
Note that I did the select using the creator_url. That's because for Flickr we don't directly store the Flickr username as the creator.
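As a point of reference, a `batched_update` trigger conf along these lines might look like the sketch below. This is a hypothetical illustration, not the configuration that was actually run; the `query_id`, the exact `creator_url` value, and the field values are assumptions.

```python
# Hypothetical trigger conf for the batched_update DAG. Field values are
# assumptions based on the discussion above, not the exact configuration used.
conf = {
    "query_id": "backfill_internet_archive_book_images_source",
    "table_name": "image",
    # Select by creator_url rather than creator, since the Flickr
    # username is not stored directly in the creator field.
    "select_query": (
        "WHERE provider = 'flickr' "
        "AND creator_url = 'https://www.flickr.com/photos/126377022@N07/'"
    ),
    "update_query": "SET source = 'internet_archive_book_images'",
    "batch_size": 10_000,
    "dry_run": True,  # flip to False for the production run
}
```

Running with `dry_run` enabled first mirrors the "test locally before the production update" step described in the issue body.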
The backfill ran and identified 0 records in our existing catalog from this user. I did some investigation and found that this user reports a total of 4,767,558 CC0-licensed images, the most recent of which was uploaded on September 11, 2015 -- so it makes sense that none of these records have been ingested previously (Flickr was only backfilled to 2020, although we also have some older results I know less about). Unfortunately, since nothing new has been ingested since then, it's very unlikely that we'll see any results added by the DAG moving forward -- our best bet would be to try to ingest the existing data.

Due to the unreliability of the API, that's going to be extremely difficult using the DAG in its current state. For example, my initial idea was to suggest that we manually run the DAG for each day in September of 2015 to get at least some of the data. The Flickr API reports ~100k records for this user from September. I tried running the DAG locally for September 10 and ingested 81k records -- but 0 for this user. Meanwhile, if I request records from the API for that date and specify the user id in the query params, I get ~10k records 🤷‍♀️

I think we could get a good amount of data by adjusting the DAG to run over a longer period (rather than a single day) and to include the user in the query params. It would even be pretty cool to add a feature to all provider DAGs for a conf option for extra query params that gets merged with the params on each request. But while I think it could be done, it's also much more work than expected, and I can't guarantee it will work well. We certainly won't get the full ~4 million records advertised.

Consequently, I think the best move is to revert the change to add this user as a source, at least for now, and add the user id to the

This is unfortunate because this really does seem like a great subprovider, and I'd love to get that data.
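To make the discrepancy above concrete, here is a rough sketch of the kind of direct API query involved. `flickr.photos.search` and its parameters are real Flickr API surface, but the helper name, the timestamp values, and the license id are illustrative assumptions rather than the exact request that was made:

```python
# Sketch of querying the Flickr API for one user's uploads on a single day.
# The helper name and parameter values are illustrative assumptions.
def build_search_params(api_key: str, user_id: str, start_ts: int, end_ts: int) -> dict:
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "user_id": user_id,            # e.g. "126377022@N07"
        "min_upload_date": start_ts,   # Unix timestamps bounding the day
        "max_upload_date": end_ts,
        "license": "9",                # CC0 in Flickr's license id numbering
        "format": "json",
        "nojsoncallback": 1,
    }

# Roughly September 10, 2015 (UTC) as Unix timestamps.
params = build_search_params("YOUR_API_KEY", "126377022@N07", 1441843200, 1441929600)
# requests.get("https://api.flickr.com/services/rest/", params=params)
```

A "merge extra conf query params into every request" feature would amount to updating a dict like this with user-supplied params before each call.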
I'll also create issues for the features we'd need to be able to modify the Flickr DAG to ingest them more easily, and hopefully we'll get around to them. @AetherUnbound does that sound reasonable?
That sounds like the right move, thank you for laying it all out Staci!
Problem
Blocked by #3489
#3441 added Internet Archive Book Images as a subprovider for Flickr. Now that it has been merged, any new records ingested for this user will have it set as their source. Old records that have already been ingested for this user, however, are not automatically updated -- and the Flickr reingestion DAG is paused due to rate limiting issues.
Description
We should do a `batched_update` to select all records with this creator (`126377022@N07`) and set their source to the new source string `internet_archive_book_images`. We should run a test locally before doing the production batched update.
Additional context
This batched update is blocked on the batched update in #3489, which also updates Flickr. That one takes priority as it fixes a production bug.