Popularity calculation optimizations (Matview refresh) #433
Comments
One thought I had while thinking about this for the iNaturalist data refresh complications is that we could potentially split this up from a single matview into a separate table that gets joined on in a non materialized view which is used during the data refresh. Right now, the popularity matview is the one that the data refresh uses to copy data from the catalog to the API. This means that if the matview is not refreshed, data that exists in the underlying table will not make it to the API. This makes the popularity calculation a prerequisite to any new data being added to the API, which seems like it might not be necessary. Instead, we could have the |
I thought that part of why we don't, for example, have a separate tags table, is because we don't want to join |
That's a really good point I had overlooked. I ran a query and found that the standardized popularity field is way more populated than I had initially expected... Perhaps a join like that will have a higher impact than is desirable 😕
Edit: Okay, this may just be because the calculation inserts a value for all possible records where we have any popularity info. I'm running a different query to exclude
Edit2: Still high!
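For illustration, the idea discussed above (popularity kept in its own table and joined through a non-materialized view) can be sketched as follows. This is only a sketch under stated assumptions: SQLite stands in for the real database, and the names `image`, `image_popularity`, and `image_view` are hypothetical, not the catalog's actual schema. It shows that a new record is visible through the view before any popularity value exists for it; it does not measure the join cost raised as a concern in this thread.

```python
import sqlite3

# Hypothetical schema: table, view, and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE image (identifier TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE image_popularity (
        identifier TEXT PRIMARY KEY REFERENCES image (identifier),
        standardized_popularity REAL
    );
    -- A plain (non-materialized) view: nothing has to be refreshed
    -- before new rows in the base table become visible.
    CREATE VIEW image_view AS
    SELECT i.identifier, i.title, p.standardized_popularity
    FROM image AS i
    LEFT JOIN image_popularity AS p ON p.identifier = i.identifier;
    """
)

# Two records exist, but only one has a popularity value so far.
conn.execute("INSERT INTO image VALUES ('a', 'Cat'), ('b', 'Dog')")
conn.execute("INSERT INTO image_popularity VALUES ('a', 0.9)")

rows = conn.execute(
    "SELECT identifier, title, standardized_popularity "
    "FROM image_view ORDER BY identifier"
).fetchall()
print(rows)
```

The record without a popularity row still appears in the view, with a NULL popularity, so copying data would no longer depend on the popularity calculation having run first.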
Update 2023-04-11
The project proposal has been posted and is in the Clarification round.
Done
Next
This week I will be making revisions to the project proposal and putting it into the Decision round. I am working on prototyping/testing some ideas which will be detailed in the implementation plan.
Blockers
None.
Update 2023-04-27
The implementation plan is in the Clarification round and has already received some good feedback. Because the image data refresh was last successfully run 2 months ago, and the most recent attempt at a run timed out after 21 days trying to refresh the materialized view, I have also initiated a new strategy for getting an image data refresh to complete. The |
Update 2023-05-23
All issues are available in the milestone created for this project. Work is well under way. The bulk of the implementation for this project is in two tickets. Most of the remaining issues are for allowing DAGs to run in the background (to backfill data), or small cleanup steps. The first of these issues has been completed, and I am working on creating the popularity refresh DAG, and hoping to do so in such a way that we can reuse the logic for other batched updates. There was a catalog mini-meet-up on Monday, May 22nd, at which we began investigating some hypotheses for improving catalog performance, including potentially speeding up the batched updates. That investigation is still under way, with notes to come soon.
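As a rough illustration of what a reusable batched update could look like, here is a minimal sketch. It is not the project's DAG: SQLite stands in for the real database, the `image` table and its columns are hypothetical, and the update expression is a placeholder rather than the actual popularity formula. The point is only the shape of the technique: update a bounded range of rows, commit, and move on, so no single statement has to touch the whole table.

```python
import sqlite3


def batched_update(conn: sqlite3.Connection, batch_size: int) -> int:
    """Update rows in fixed-size id ranges; return the number of batches run."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM image").fetchone()[0]
    batches = 0
    for start in range(1, max_id + 1, batch_size):
        # Placeholder expression, NOT the project's popularity formula.
        conn.execute(
            "UPDATE image "
            "SET standardized_popularity = views / (views + 100.0) "
            "WHERE id >= ? AND id < ?",
            (start, start + batch_size),
        )
        conn.commit()  # Commit per batch so each transaction stays small.
        batches += 1
    return batches


# Hypothetical data: 25 rows with ids 1..25.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image ("
    "id INTEGER PRIMARY KEY, views INTEGER, standardized_popularity REAL)"
)
conn.executemany("INSERT INTO image (views) VALUES (?)", [(i * 10,) for i in range(25)])
conn.commit()

batches = batched_update(conn, batch_size=10)
remaining = conn.execute(
    "SELECT COUNT(*) FROM image WHERE standardized_popularity IS NULL"
).fetchone()[0]
print(batches, remaining)
```

Keeping the table name and update expression as parameters, rather than hard-coding them as done here, is one way the same loop could serve other batched updates.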
Update 2023-06-05
Calculate popularity at ingestion was merged, so standardized popularity is now being calculated at ingestion 🎉 What remains is to refresh existing records.
This work is being resumed after a delay to investigate performance issues in the catalog. We will be moving forward with the batched update DAGs as planned and potentially iterating further, pending an analysis of their performance. A |
Hi @stacimc, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information. |
Work has been paused on this while some other investigations, including catalog performance testing, continue. However the batched_update DAG was merged in #2331 🎉 I've added a new issue to the milestone (#2507), after realizing that we'll need to make the |
Update: Before leaving for a week of AFK @stacimc produced two PRs which are ready for review: |
The I have a PR as well for some quick fast-follows for issues identified during these initial runs (mostly adjusting New update: In local testing, I realized something that was missed during the implementation planning -- the |
All planned code for this project has been merged, including a fix to the audioset_view issue mentioned in the previous update. All that remains for code changes is a final cleanup issue for removing dead code/etc. The project was delayed by an unanticipated performance degradation in refreshing the popularity constants views, addressed in this quick fix. The PR adds popularity constants to the metrics tables instead of having a separate constants view, and updates the popularity refresh DAGs to update the metrics table directly. This is an alteration from the original plan and needs to be monitored closely. An audio data refresh, decoupled from popularity, was successfully run from the What is left:
I believe this project can be closed! A late, unplanned addition was made to the popularity constants due to an (unrelated) issue, so we're now running an additional image popularity refresh to ensure this all works. This project is complete, but will be considered definitively successful once additional audio and image popularity refreshes have run (fully testing the constants work and refactors).
Update: We're currently running an image popularity and data refresh concurrently, to observe whether there is any performance degradation to either batched updates or the copy data step caused by running at the same time. The copy data step for the image data refresh completed in 6 hours and 40 minutes. That’s very similar to past results (6 hrs 33 mins was the last one). The batch update time for the concurrently running Flickr update also did not seem to suffer; I checked in periodically over the day and they were consistently taking ~15 seconds for 10k records, which is what we’ve seen in previous runs. So it looks like if there is any performance degradation, it’s negligible. |
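The comparison above can be checked with a little arithmetic. The sketch below uses only the figures quoted in this update (6 h 40 min versus 6 h 33 min for the copy data step, and roughly 15 seconds per 10k-record batch); the variable names are illustrative.

```python
from datetime import timedelta

# Copy data step durations quoted in the update above.
concurrent_run = timedelta(hours=6, minutes=40)
previous_run = timedelta(hours=6, minutes=33)

# Dividing two timedeltas yields a plain float ratio.
relative_change = (concurrent_run - previous_run) / previous_run

# Approximate batched update throughput: ~15 seconds per 10k records.
records_per_second = 10_000 / 15

print(f"copy data step slowdown: {relative_change:.1%}")
print(f"batched update throughput: ~{records_per_second:.0f} records/s")
```

A 7-minute difference on a run of about 6.5 hours is under 2%, which is consistent with describing any degradation as negligible.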
Description
Currently, re-calculating popularity values during data refresh takes a long time. We need to optimize the calculations to make data refresh shorter.
This is currently scheduled for December, but if some data refresh threshold is crossed, we might have to re-prioritize this.
Documents
Issues
Prior Art