Create the new data refresh DAG factory and move initial steps into Airflow #4146

Closed
Tracked by #3925
stacimc opened this issue Apr 17, 2024 · 0 comments · Fixed by #4259
stacimc commented Apr 17, 2024

Problem

This issue tracks creating a new data refresh DAG factory to generate the new data refresh DAGs, which will not rely on the ingestion server, and moving the initial steps (described below) into the DAGs. At the end of this step the DAGs will not yet be functional or able to run a full refresh.

Description

We’ll create a new data refresh DAG factory to generate data refresh DAGs for each existing media_type and environment. Initially, these four will be generated (a minimal factory sketch follows the list):

  • staging_audio_data_refresh
  • staging_image_data_refresh
  • production_audio_data_refresh
  • production_image_data_refresh
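
As a rough illustration of the factory pattern (not the actual catalog module layout; the schedule, start date, and task contents are placeholders):

```python
from datetime import datetime

from airflow.decorators import dag

MEDIA_TYPES = ("audio", "image")
ENVIRONMENTS = ("staging", "production")


def create_data_refresh_dag(media_type: str, environment: str):
    @dag(
        dag_id=f"{environment}_{media_type}_data_refresh",
        schedule=None,  # placeholder; the real schedule lives in the DAG config
        start_date=datetime(2024, 1, 1),
        catchup=False,
        tags=["data_refresh"],
    )
    def data_refresh():
        # The tasks described below (record count, concurrency checks,
        # copy data, create index) would be wired up here.
        ...

    return data_refresh()


# Register one DAG per (environment, media_type) pair in the module's
# global namespace so the Airflow scheduler discovers all four DAGs.
for environment in ENVIRONMENTS:
    for media_type in MEDIA_TYPES:
        dag_id = f"{environment}_{media_type}_data_refresh"
        globals()[dag_id] = create_data_refresh_dag(media_type, environment)
```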

Because the environment is added as a prefix, there will be no collision with the existing DAG ids. In this initial step, we will add only a small portion of the logic in order to make the PR easier to review. The first steps are already implemented in the current data refresh and can simply be copied (a rough sketch of two of them appears after the list):

  • Get the current record count from the target API table; this must be modified to take the environment as an argument
  • Perform concurrency checks on the other data refreshes and conflicting DAGs; this must be modified to include the now larger list of data refresh DAG ids
  • Get the name of the Elasticsearch index currently mapped to the target_alias
  • Generate the new index suffix
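
For example, the first and last of these might look roughly like the following; the connection ID scheme, table naming, and suffix format are assumptions for illustration:

```python
import uuid

from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task
def get_current_record_count(media_type: str, environment: str) -> int:
    # The environment argument selects which API database to query;
    # the connection ID naming scheme here is hypothetical.
    postgres = PostgresHook(postgres_conn_id=f"{environment}_api_postgres")
    # Table name interpolation is illustrative only.
    return postgres.get_first(f"SELECT count(*) FROM {media_type};")[0]


@task
def generate_index_suffix() -> str:
    # One possible scheme: a short random hex suffix for the new index.
    return uuid.uuid4().hex[:8]
```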

We will include new tasks to perform the initial few steps of the ingestion server’s work:

  • Copy Data: this should be a TaskGroup with multiple tasks for creating the FDW from the upstream DB to the downstream DB, running the copy_data query, and so on. It should fully replace the implementation of refresh_api_table in the ingestion server. All steps in this section are SQL queries that can be implemented using the existing PostgresHook and PGExecuteQueryOperator (see the sketch after this list).
  • Create Index: we can use our existing Elasticsearch tasks to create the new elasticsearch index with the index suffix generated in the previous task.
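
A hedged sketch of the Copy Data group, using the stock SQLExecuteQueryOperator in place of the catalog's PGExecuteQueryOperator; the connection ID, table names, and SQL bodies are illustrative stand-ins for the real refresh_api_table queries:

```python
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.utils.task_group import TaskGroup


def copy_data_task_group(media_type: str, environment: str) -> TaskGroup:
    # Hypothetical connection ID for the downstream (API) database.
    conn_id = f"{environment}_api_postgres"

    with TaskGroup(group_id="copy_data") as copy_data:
        # Set up the foreign data wrapper pointing at the upstream DB.
        create_fdw = SQLExecuteQueryOperator(
            task_id="create_fdw",
            conn_id=conn_id,
            sql="CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        )
        # Run the copy; the real SQL mirrors refresh_api_table and copies
        # from the upstream foreign table into the API table. Both table
        # names below are placeholders.
        run_copy_data = SQLExecuteQueryOperator(
            task_id="run_copy_data",
            conn_id=conn_id,
            sql=f"INSERT INTO api_{media_type} SELECT * FROM upstream_{media_type};",
        )
        create_fdw >> run_copy_data

    return copy_data
```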

Additional context

See this section of the IP

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 17, 2024
@stacimc stacimc self-assigned this Apr 23, 2024