
Backfill Internet Archive Book Provider source #3490

Closed
stacimc opened this issue Dec 7, 2023 · 4 comments
Assignees
Labels
- 💻 aspect: code (Concerns the software code in the repository)
- 🌟 goal: addition (Addition of new feature)
- 🟨 priority: medium (Not blocking but should be addressed soon)
- 🧱 stack: api (Related to the Django API)
- 🧱 stack: catalog (Related to the catalog and Airflow DAGs)

Comments

stacimc commented Dec 7, 2023

Problem

Blocked by #3489

#3441 added Internet Archive Book Images as a subprovider for Flickr. Now that it has been merged, any newly ingested records for this user will get the new source. Records already ingested for this user, however, are not automatically updated, and the Flickr reingestion DAG is paused due to rate-limiting issues.

Description

We should do a batched_update to select all records with this creator (126377022@N07) and set their source to the new source string internet_archive_book_images. We should run a test locally before doing the production batched update.

Additional context

This batched update is blocked on the batched update in #3489, which also updates Flickr. That one takes priority as it fixes a production bug.

@stacimc stacimc added the 🟨 priority: medium, 🌟 goal: addition, 💻 aspect: code, 🧱 stack: api, and 🧱 stack: catalog labels Dec 7, 2023
@stacimc stacimc self-assigned this Dec 13, 2023

stacimc commented Dec 13, 2023

Tested locally with this batched_update configuration:

{
  "batch_size": 10000,
  "dry_run": false,
  "query_id": "flickr_internet_archive",
  "resume_update": false,
  "select_query": "WHERE provider='flickr' AND creator_url='https://www.flickr.com/photos/126377022@N07';",
  "table_name": "image",
  "update_query": "SET source='internet_archive_book_images'",
  "update_timeout": 3600
}
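For reference, here is a rough sketch of how the conf fields above compose into the final SQL. This is my assumption about the shape of the statement, not the batched_update DAG's actual implementation (which also batches by identifier ranges):

```python
# Sketch: how the conf above composes into an UPDATE statement.
# Assumption for illustration only; the real batched_update DAG
# splits the work into batches rather than running one statement.
conf = {
    "table_name": "image",
    "select_query": "WHERE provider='flickr' AND creator_url='https://www.flickr.com/photos/126377022@N07';",
    "update_query": "SET source='internet_archive_book_images'",
}

update_sql = (
    f"UPDATE {conf['table_name']} "
    f"{conf['update_query']} "
    f"{conf['select_query'].rstrip(';')};"
)
print(update_sql)
```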

Note that I did the select using the creator_url. For Flickr we don't store the Flickr username directly as creator; instead we use the ownername field, which I'm not certain is unique. The creator_url, however, contains the user id, so we can uniquely select these records from it.
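As a sanity check, the user id can be recovered from the creator_url. A minimal sketch (the helper name is mine, and it assumes URLs of the form https://www.flickr.com/photos/<user_id>):

```python
from urllib.parse import urlparse

def flickr_user_id(creator_url: str) -> str:
    """Extract the Flickr user id (NSID) from a creator_url.

    Hypothetical helper for illustration; assumes the URL path
    ends with the user id, e.g. /photos/126377022@N07.
    """
    return urlparse(creator_url).path.rstrip("/").split("/")[-1]

print(flickr_user_id("https://www.flickr.com/photos/126377022@N07"))
```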


stacimc commented Dec 15, 2023

The backfill ran, and identified 0 records in our existing catalog from this user.

I did some investigation and found that this user reports a total of 4,767,558 CC0-licensed images. The most recently uploaded image was posted on September 11, 2015, so it makes sense that none of these records have been ingested previously (Flickr was only backfilled to 2020, although we also have some older results I know less about). Unfortunately, since nothing new has been uploaded since then, it's very unlikely that the DAG will add any results moving forward; our best bet would be to try to ingest the existing data.

Due to the unreliability of the API, that's going to be extremely difficult using the DAG in its current state. For example, my initial idea was to manually run the DAG for each day in September 2015 to get at least some of the data. The Flickr API reports ~100k records for this user from September. I tried running the DAG locally for September 10 and ingested 81k records, but 0 for this user. Meanwhile, if I request records from the API for that date and specify the user id in the query params, I get ~10k records 🤷‍♀️
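For the curious, the direct API request described above looks roughly like this. It uses the real flickr.photos.search parameters (user_id, min_upload_date, max_upload_date); the api_key value is a placeholder, and this sketch only builds the request URL rather than calling the API:

```python
from urllib.parse import urlencode

FLICKR_API = "https://api.flickr.com/services/rest"

# flickr.photos.search scoped to this user and a single day.
# The api_key value is a placeholder; an actual call needs a real key.
params = {
    "method": "flickr.photos.search",
    "api_key": "YOUR_API_KEY",
    "user_id": "126377022@N07",
    "min_upload_date": "2015-09-10",
    "max_upload_date": "2015-09-11",
    "format": "json",
    "nojsoncallback": "1",
}
url = f"{FLICKR_API}?{urlencode(params)}"
print(url)
```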

I think we could get a good amount of data by adjusting the DAG to run over a longer period (rather than a single day) and include the user in the query params. It would even be pretty cool to add a feature to all provider DAGs: a conf option for extra query params that get merged into the params on each request. While I think it could be done, it's also much more work than expected, and I can't guarantee it will work well. We certainly won't get the full ~4 million records advertised.
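The "extra query params" idea could look something like this. The function and conf key names are mine, not existing provider-DAG code; in a real DAG the conf dict would come from the Airflow dag_run conf:

```python
def build_query_params(base_params: dict, conf: dict) -> dict:
    """Merge conf-supplied extra query params into each request's params.

    Sketch of the proposed feature (names are hypothetical).
    Extra params win on key conflicts, letting a manual run narrow
    the request, e.g. to a single user.
    """
    extra = conf.get("extra_query_params") or {}
    return {**base_params, **extra}

merged = build_query_params(
    {"min_upload_date": "2015-09-01", "per_page": 500},
    {"extra_query_params": {"user_id": "126377022@N07"}},
)
print(merged)
```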

Consequently, I think the best move is to revert the change adding this user as a source, at least for now, and add the user id to the nsids_to_skip list here so that the sub-provider audit DAG doesn't keep alerting on it. We should include a comment explaining the reasoning.
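The skip-list change would be a one-line addition along these lines (a sketch; the real list lives in the Flickr sub-provider audit code):

```python
# NSIDs excluded from the sub-provider audit, with the reasoning inline.
nsids_to_skip = [
    # internet_archive_book_images: no new uploads since 2015, so the
    # Flickr DAG will not pick up its records; skip until a backfill
    # of the existing data is feasible.
    "126377022@N07",
]
print(nsids_to_skip)
```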

This is unfortunate because this really does seem like a great subprovider, and I'd love to get that data. I'll also create issues for the features we'd need to be able to modify the Flickr DAG to ingest them more easily, and hopefully we'll get around to them.

@AetherUnbound does that sound reasonable?

@AetherUnbound
That sounds like the right move, thank you for laying it all out Staci!


stacimc commented Dec 15, 2023

Issues #3533 and #3534 created for the steps we'd need to do a backfill. #3532 opened to remove the source for now.

@stacimc stacimc closed this as completed Dec 15, 2023