Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report data refresh count change by provider #1404

Closed
1 task
AetherUnbound opened this issue Oct 14, 2022 · 3 comments · Fixed by #4067
Closed
1 task

Report data refresh count change by provider #1404

AetherUnbound opened this issue Oct 14, 2022 · 3 comments · Fixed by #4067
Assignees
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature good first issue New-contributor friendly help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 💾 tech: postgres Involves PostgreSQL 🐍 tech: python Involves Python
Projects

Comments

@AetherUnbound
Copy link
Contributor

Description

PR WordPress/openverse-catalog#636 added record difference reporting for changes before & after the data refresh. These stats are very useful, but it may also be helpful to get more specific information on which providers contributed to the change. In addition to including the total change, we could report the number of new records per-provider.

This query would need to be updated:

https://github.com/WordPress/openverse-catalog/blob/d4dbf4d0617aeee9610adadbbee12be641174c0b/openverse_catalog/dags/data_refresh/dag_factory.py#L184-L196

Unfortunately this more optimized query would make grouping by provider impossible. We'd need to go with a query like the following, which is not optimized in the same way:

SELECT provider, count(*)
FROM {data_refresh.media_type}
GROUP BY 1;

It's worth noting that this query runs against the API database, not the catalog DB, and we do have an index on provider in that database. The other good news is that this step gets run at the start of the data refresh DAG, concurrently with the matview refresh (which for our larger table, takes about 12 hours). The rest of the steps are dependent on this task, but even if the query takes several hours it will complete before the matview refresh and not block any downstream tasks. The final reporting of the record counts will take longer after the refresh is complete, but the improved information from the report is probably worth it 🙂

Additional context

On a code level, I'd consider this an easier issue to dive into. The difficult comes when we want to test this locally, as one also must set up the API stack in order to trigger the data refresh. If you're interested in taking on this issue please let us know! We're happy to help you walk through the steps to get the API set up 🙂

Implementation

  • 🙋 I would be interested in implementing this feature.
@AetherUnbound AetherUnbound added good first issue New-contributor friendly help wanted Open to participation from the community ✨ goal: improvement Improvement to an existing user-facing feature 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 💾 tech: postgres Involves PostgreSQL 🔧 tech: airflow Involves Apache Airflow 🟩 priority: low Low priority and doesn't need to be rushed labels Oct 14, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@AetherUnbound
Copy link
Contributor Author

Alternatively, we could use the stats page for this instead! https://api.openverse.engineering/v1/images/stats/

Querying that before and after the data refresh would potentially be the easiest, quickest, and most accurate way to get media counts!

@mjtung
Copy link
Contributor

mjtung commented Apr 8, 2024

At PyCascades 2023, would love to work on this!

@mjtung
Copy link
Contributor

mjtung commented Apr 11, 2024

@AetherUnbound I've completed the work and submitted a pull request, though I don't seem to be able to add the stack:catalog label to it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature good first issue New-contributor friendly help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 💾 tech: postgres Involves PostgreSQL 🐍 tech: python Involves Python
Projects
Archived in project
Openverse
  
Backlog
Development

Successfully merging a pull request may close this issue.

3 participants