Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Popularity calculation optimizations (Matview refresh) #433

Closed
2 tasks done
obulat opened this issue Feb 18, 2023 · 18 comments
Closed
2 tasks done

Popularity calculation optimizations (Matview refresh) #433

obulat opened this issue Feb 18, 2023 · 18 comments
Assignees
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🧭 project: thread An issue used to track a project and its progress 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@obulat
Copy link
Contributor

obulat commented Feb 18, 2023

Start Date ETA Project Lead Actual Ship Date
2023-03-13 @stacimc 2023-09-13

Description

Currently, re-calculating popularity values during data refresh takes a long time. We need to optimize the calculations to make data refresh shorter. This is currently scheduled for December, but if some data refresh threshold is crossed, we might have to re-prioritize this.

Documents

Issues

Prior Art

@obulat obulat added the 🧭 project: thread An issue used to track a project and its progress label Feb 18, 2023
@AetherUnbound
Copy link
Contributor

One thought I had while thinking about this for the iNaturalist data refresh complications is that we could potentially split this up from a single matview into a separate table that gets joined on in a non materialized view which is used during the data refresh. Right now, the popularity matview is the one that the data refresh uses to copy data from the catalog to the API. This means that if the matview is not refreshed, data that exists in the underlying table will not make it to the API. This makes the popularity calculation a prerequisite to any new data being added to the API, which seems like it might not be necessary.

Instead, we could have the image_view be a standard view which is a left join between image and image_popularity. The latter would be a materialized view which would contain popularity values, but the difference here is that it could be refreshed independently of the data refresh itself! This would free us up from depending on the popularity calculation for bringing new data into the API.

@rwidom
Copy link
Collaborator

rwidom commented Feb 25, 2023

I thought that part of why we don't, for example, have a separate tags table, is because we don't want to join image to anything. Is the idea that this would be faster because image_popularity is much smaller?

@AetherUnbound
Copy link
Contributor

AetherUnbound commented Feb 25, 2023

I thought that part of why we don't, for example, have a separate tags table, is because we don't want to join image to anything. Is the idea that this would be faster because image_popularity is much smaller?

That's a really good point I had overlooked, I ran a query and found that the standardized popularity field is way more populated than I had initially expected...Perhaps a join like that will have a higher impact than is desirable 😕

deploy@localhost:openledger> select count(*) from image_view where standardized_popularity is not NULL;
+-----------+
| count     |
|-----------|
| 523503895 |
+-----------+
SELECT 1
Time: 1725.256s (28 minutes 45 seconds), executed in: 1725.251s (28 minutes 45 seconds)

Edit: Okay this may just be because the calculation inserts a value for all possible records where we have any popularity info, I'm running a different query to exclude 0 values as well and will post those results!

Edit2: Still high!

deploy@localhost:openledger> select count(*) from image_view where standardized_popularity is not NULL and standardized_popularity != 0;
+-----------+
| count     |
|-----------|
| 461473575 |
+-----------+
SELECT 1
Time: 1795.650s (29 minutes 55 seconds), executed in: 1795.647s (29 minutes 55 seconds)

@zackkrida zackkrida changed the title Popularity calculation optimizations Popularity calculation optimizations (Matview refresh) Mar 8, 2023
@stacimc
Copy link
Contributor

stacimc commented Apr 11, 2023

Update 2023-04-11

The project proposal has been posted and is in the Clarification round.

Done

  • Project proposal drafted

Next

This week I will be making revisions to the project proposal and putting it into the Decision round. I am working on prototyping/testing some ideas which will be detailed in the implementation plan.

Blockers

None.

@stacimc
Copy link
Contributor

stacimc commented Apr 27, 2023

Update 2023-04-27

The implementation plan is in the Clarification round and has already received some good feedback.

Because the image data refresh was last successfully run 2 months ago, and the most recent attempt at a run timed out after 21 days trying to refresh the materialized view, I have also initiated a new strategy for getting an image data refresh to complete. The recreate_image_popularity_calculation DAG has been triggered, which will drop and fully recreate the matview instead of attempting to refresh it. This plan can be tried in the background while we work on the IP and subsequent tickets.

@stacimc
Copy link
Contributor

stacimc commented May 23, 2023

Update 2023-05-23

All issues are available in the milestone created for this project.

Work is well under way. The bulk of the implementation for this project is in two tickets:

Most of the remaining issues are for allowing DAGs to run in the background (to backfill data), or small cleanup steps. The first of these issues has been completed, and standardized_popularity is now a column on the media tables which is being calculated at ingestion!

I am working on creating the popularity refresh DAG, and hoping to do so in such a way that we can reuse the logic for other batched updates.

There was a catalog mini-meet-up Monday, May 22nd at which we began investigating some hypotheses for improving catalog performance, including potentially speeding up the batched updates. That investigation is still under way, with notes to come soon.

@stacimc
Copy link
Contributor

stacimc commented Jun 5, 2023

Update 2023-06-05

Calculate popularity at ingestion was merged, so standardized popularity is now being calculated at ingestion 🎉 What remains is to refresh existing records.

I am working on creating the popularity refresh DAG, and hoping to do so in such a way that we can reuse the logic for other batched updates.

This work is being resumed after a delay to investigate performance issues in the catalog. We will be moving forward with the batched update DAGs as planned and potentially iterating further, pending an analysis of their performance.

A batched_update DAG which can be configured for popularity refreshes as well as arbitrary updates will be the first step, followed by a second PR to generate the popularity DAGs themselves.

@openverse-bot
Copy link
Collaborator

Hi @stacimc, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@stacimc
Copy link
Contributor

stacimc commented Jun 29, 2023

Work has been paused on this while some other investigations, including catalog performance testing, continue.

However the batched_update DAG was merged in #2331 🎉

I've added a new issue to the milestone (#2507), after realizing that we'll need to make the batched_update DAG resilient to failures caused by catalog deployments, as it's unrealistic to wait to deploy the catalog until long-running batch updates are finished. Other than that, all that remains is to generate the popularity refresh DAGs themselves.

@openverse-bot
Copy link
Collaborator

Hi @stacimc, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@zackkrida
Copy link
Member

Update: Before leaving for a week of AFK @stacimc produced two PRs which are ready for review:

@openverse-bot
Copy link
Collaborator

Hi @stacimc, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@stacimc
Copy link
Contributor

stacimc commented Aug 1, 2023

The batched_update DAG was updated to be resilient to mid-update failures (this allows it to resume safely after deploying the catalog, for instance), and the popularity refresh DAGs were also merged. The popularity refreshes were both triggered. The audio popularity refresh completed in about 5 minutes! The image popularity refresh has completed for all providers except Flickr, which is currently 45% done. Based on its current rate of progress, I hope that it will finish sometime this weekend, and take about 9 days total.

I have a PR as well for some quick fast-follows for issues identified during these initial runs (mostly adjusting poke_intervals and updating logging and message formatting).

New update: In local testing, I realized something that was missed during the implementation planning -- the audioset_view is created from the audio_view, which means that we cannot simply drop the matview without addressing that. Work is currently halted for this week and likely next week due to other team obligations, but I'll need to do a little bit of work to clarify this before we can continue with cleanup steps. I do not anticipate this being a large concern.

@obulat obulat added 🧱 stack: catalog Related to the catalog and Airflow DAGs 💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature labels Aug 8, 2023
@openverse-bot
Copy link
Collaborator

Hi @stacimc, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@stacimc
Copy link
Contributor

stacimc commented Aug 29, 2023

All planned code for this project has been merged, including a fix to the audioset_view issue mentioned in the previous update. All that remains for code changes is a final cleanup issue for removing dead code/etc.

The project was delayed by an unanticipated performance degradation in refreshing the popularity constants views, addressed in this quick fix. The PR adds popularity constants to the metrics tables instead of having a separate constants view, and updates the popularity refresh DAGs to update the metrics table directly. This is an alteration from the original plan and needs to be monitored closely.

An audio data refresh, decoupled from popularity, was successfully run from the audio table!

What is left:

  • Run an image data refresh (potentially blocked by a catalog deploy)
  • Code cleanup
  • Merge the constants fix and try running the popularity refreshes again, using the new workflow

@openverse-bot
Copy link
Collaborator

Hi @stacimc, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@stacimc
Copy link
Contributor

stacimc commented Sep 13, 2023

I believe this project can be closed!

A late, unplanned addition was made to the popularity constants due to an (unrelated) issue, so we're now running an additional image popularity refresh to ensure this all works.

This project is complete, but will be considered definitely successful once an additional audio and image popularity refresh run (fully testing the constants work and refactors).

@stacimc stacimc closed this as completed Sep 13, 2023
@stacimc
Copy link
Contributor

stacimc commented Sep 18, 2023

Update:

We're currently running an image popularity and data refresh concurrently, to observe whether there is any performance degradation to either batched updates or the copy data step caused by running at the same time.

The copy data step for the image data refresh completed in 6 hours and 40 minutes. That’s very similar to past results (6 hrs 33 mins was the last one). The batch update time for the concurrently running Flickr update also did not seem to suffer; I checked in periodically over the day and they were consistently taking ~15 seconds for 10k records, which is what we’ve seen in previous runs. So it looks like if there is any performance degradation, it’s negligible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🧭 project: thread An issue used to track a project and its progress 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

No branches or pull requests

6 participants