Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken relationship between audio and audio_set at ingestion in the production API DB #1534

Closed
obulat opened this issue Jun 21, 2022 · 8 comments
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🟧 priority: high Stalls work on the project or its dependents 🟨 priority: medium Not blocking but should be addressed soon
Projects

Comments

@obulat
Copy link
Contributor

obulat commented Jun 21, 2022

Problem

The API was expecting audio_set_foreign_id at the top level, but the catalog did not expose it. It was a part of the audio_set JSONB column.

Description

We should add the audio_set_foreign_id to the materialized view.

@krysal krysal added the 🧹 status: ticket work required Needs more details before it can be worked on label Jun 21, 2022
@obulat obulat self-assigned this Jun 23, 2022
@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents 🟨 priority: medium Not blocking but should be addressed soon 💻 aspect: code Concerns the software code in the repository labels Jun 23, 2022
@obulat obulat removed the 🧹 status: ticket work required Needs more details before it can be worked on label Jun 23, 2022
@krysal krysal changed the title Presumably, we have no data of any audio set (?) Broken relationship between audio and audio_set at ingestion in the production API DB Jun 24, 2022
@krysal
Copy link
Member

krysal commented Jun 24, 2022

@obulat Thanks for investigating and finding the root cause of the issue! I changed the linking of issues before noticing you had updated this. Do you believe that the data refresh process will automatically update the existing data or do we have to do it manually? We can close this once confirming the existing audio files are correctly linked.

@obulat
Copy link
Contributor Author

obulat commented Jun 28, 2022

I'm not sure if the data refresh DAG updates the materialized view. I'm sure @stacimc knows, though :) If it does, then a data refresh will fix the issue automatically. Otherwise, we will need to refresh the matview separately, before doing the data refresh.

@stacimc
Copy link
Contributor

stacimc commented Jun 28, 2022

On a regular data refresh, the matview gets refreshed. This is the SQL that runs: https://github.com/WordPress/openverse-catalog/blob/main/openverse_catalog/dags/common/popularity/sql.py#L374

I'm not sure I follow what needs to happen here -- do you just need the view refreshed, or does a column need to be added? The audio matview is created/defined here and the audioset matview here. If changes need to be made to the views structure, we would make those changes there and then run the recreate_audio_popularity_calculation DAG to completely recreate everything.

If a view refresh is all that's needed, then yes this would be resolved with a data refresh. The data refreshes are not currently running, however (pending API latency investigation).

@obulat
Copy link
Contributor Author

obulat commented Jun 29, 2022

Thank you for clarification, @stacimc!
We will need to run recreate_audio_popularity_calculation, not simply refresh the matview. This is because we will need to recreate the columns to add the new audio_set_foreign_id column. I ran the recreate_audio_popularity_calculation DAG locally to test that this works.

Do we normally run this manually?

@zackkrida
Copy link
Member

Do we normally run this manually?

@obulat yes that's not scheduled, only triggered manually in airflow.

@zackkrida
Copy link
Member

Is this resolved from a code perspective? Or do we want to wait to close this until we've run recreate_audio_popularity_calculation in production?

@krysal
Copy link
Member

krysal commented Jul 6, 2022

Yes, it remains open until confirmed it's fixed once we run the data refresh again.

@AetherUnbound
Copy link
Contributor

We've added the column to the upstream catalog view, run an audio data refresh, and can now see audio_set_foreign_identifier values in the production API DB!

deploy@localhost:openledger> select count(*) from audio;
+--------+
| count  |
|--------|
| 817198 |
+--------+
SELECT 1
Time: 0.110s
deploy@localhost:openledger> select count(*) from audio where audio_set_foreign_identifier IS NOT NULL;
+--------+
| count  |
|--------|
| 705981 |
+--------+
SELECT 1
Time: 0.242s

With that I believe this issue can be closed 😄

@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🟧 priority: high Stalls work on the project or its dependents 🟨 priority: medium Not blocking but should be addressed soon
Projects
Archived in project
Openverse
  
Done!
Development

No branches or pull requests

5 participants