Add catalog_cleaner DAG #4610 (Merged)
Changes from 6 commits (8 commits total, all by krysal):

dec0b52  Add straightforward batched catalog_cleaner DAG (old)
d600457  Load S3 file into temporary table
b5256ff  Update batches with dynamic task mapping
9de628a  Cleanup
bf5b67d  Add notifications and expand DAG description
e480fce  Add variable to set max concurrent update tasks
a6faadc  Add notification with count before the update
fa00e5e  Add target column in initial notification
catalog/dags/database/catalog_cleaner/catalog_cleaner.py (216 additions, 0 deletions)

@@ -0,0 +1,216 @@
""" | ||
Catalog Data Cleaner DAG | ||
|
||
Use the TSV files created during the cleaning step of the ingestion process to bring | ||
the changes into the catalog database and make the updates permanent. | ||
|
||
The DAG has a structure similar to the batched_update DAG, but with a few key | ||
differences: | ||
1. Given the structure of the TSV, it updates a single column at a time. | ||
2. The batch updates are parallelized to speed up the process. The maximum number of | ||
active tasks is limited to 3 (at first to try it out and) to avoid overwhelming | ||
the database. | ||
3. It needs slightly different SQL queries to update the data. One change for example, | ||
is that it only works with the `image` table given that is the only one where the | ||
cleaning steps are applied to in the ingestion server. | ||
""" | ||

import logging
from datetime import timedelta

from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.models.abstractoperator import AbstractOperator
from airflow.models.param import Param
from airflow.operators.python import get_current_context
from airflow.utils.trigger_rule import TriggerRule

from common import slack
from common.constants import DAG_DEFAULT_ARGS, POSTGRES_CONN_ID
from common.sql import (
    RETURN_ROW_COUNT,
    PGExecuteQueryOperator,
    PostgresHook,
    single_value,
)
from database.batched_update.batched_update import run_sql
from database.catalog_cleaner import constants


logger = logging.getLogger(__name__)


@task
def count_dirty_rows(temp_table_name: str, task: AbstractOperator = None):
    """Get the number of rows in the temp table before the updates."""
    count = run_sql.function(
        dry_run=False,
        sql_template=f"SELECT COUNT(*) FROM {temp_table_name}",
        query_id=f"{temp_table_name}_count",
        handler=single_value,
        task=task,
    )
    logger.info(f"Found {count:,} rows in the `{temp_table_name}` table.")
    return count


@task
def get_batches(total_row_count: int, batch_size: int) -> list[tuple[int, int]]:
    """Return a list of tuples with the start and end row_id for each batch."""
    return [(i, i + batch_size) for i in range(0, total_row_count, batch_size)]


@task(map_index_template="{{ index_template }}")
def update_batch(
    batch: tuple[int, int],
    temp_table_name: str,
    column: str,
    task: AbstractOperator = None,
):
    batch_start, batch_end = batch
    logger.info(f"Going through row_id {batch_start:,} to {batch_end:,}.")

    # Include the formatted batch range in the context so it can be used as the
    # index template, for easier identification of the tasks in the UI.
    context = get_current_context()
    context["index_template"] = f"{batch_start}__{batch_end}"

[Comment thread on lines +77 to +80]
Reviewer: Woah, this is neat!! I didn't know this could be evaluated after the task had started 😮
Reviewer: Same, TIL!!
(A short sketch of this pattern appears after the file, below.)

    pg = PostgresHook(
        postgres_conn_id=POSTGRES_CONN_ID,
        default_statement_timeout=PostgresHook.get_execution_timeout(task),
    )
    query = constants.UPDATE_SQL.format(
        column=column,
        temp_table_name=temp_table_name,
        batch_start=batch_start,
        batch_end=batch_end,
    )
    count = pg.run(query, handler=RETURN_ROW_COUNT)
    return count


@task
def sum_up_counts(counts: list[int]) -> int:
    return sum(counts)


@task
def notify_slack(text):
    slack.send_message(
        text=text,
        username=constants.SLACK_USERNAME,
        icon_emoji=constants.SLACK_ICON,
        dag_id=constants.DAG_ID,
    )
|
||
|
||
@dag( | ||
dag_id=constants.DAG_ID, | ||
default_args={ | ||
**DAG_DEFAULT_ARGS, | ||
"retries": 0, | ||
"execution_timeout": timedelta(days=7), | ||
}, | ||
schedule=None, | ||
catchup=False, | ||
tags=["database"], | ||
doc_md=__doc__, | ||
render_template_as_native_obj=True, | ||
params={ | ||
"s3_bucket": Param( | ||
default="openverse-catalog", | ||
type="string", | ||
description="The S3 bucket where the TSV file is stored.", | ||
), | ||
"s3_path": Param( | ||
default="shared/data-refresh-cleaned-data/<file_name>.tsv", | ||
type="string", | ||
description="The S3 path to the TSV file within the bucket.", | ||
), | ||
"column": Param( | ||
type="string", | ||
enum=["url", "creator_url", "foreign_landing_url"], | ||
description="The column of the table to apply the updates.", | ||
), | ||
"batch_size": Param( | ||
default=10000, | ||
type="integer", | ||
description="The number of records to update per batch.", | ||
), | ||
}, | ||
) | ||
def catalog_cleaner(): | ||
aws_region = Variable.get("AWS_DEFAULT_REGION", default_var="us-east-1") | ||
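    # With deserialize_json=True the stored Variable value is parsed with
    # json.loads, so e.g. "3" comes back as the integer 3 (an unset Variable
    # returns default_var as-is).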
    max_concurrent_tasks = Variable.get(
        "CLEANER_MAX_CONCURRENT_DB_UPDATE_TASKS", default_var=3, deserialize_json=True
    )

    column = "{{ params.column }}"
    temp_table_name = f"temp_cleaned_image_{column}"

    create_table = PGExecuteQueryOperator(
        task_id="create_temp_table",
        postgres_conn_id=POSTGRES_CONN_ID,
        sql=constants.CREATE_TEMP_TABLE_SQL.format(
            temp_table_name=temp_table_name, column=column
        ),
    )

    load = PGExecuteQueryOperator(
        task_id="load_temp_table_from_s3",
        postgres_conn_id=POSTGRES_CONN_ID,
        sql=constants.IMPORT_SQL.format(
            temp_table_name=temp_table_name,
            column=column,
            bucket="{{ params.s3_bucket }}",
            s3_path_to_file="{{ params.s3_path }}",
            aws_region=aws_region,
        ),
    )

    create_index = PGExecuteQueryOperator(
        task_id="create_temp_table_index",
        postgres_conn_id=POSTGRES_CONN_ID,
        sql=constants.CREATE_INDEX_SQL.format(temp_table_name=temp_table_name),
    )

    count = count_dirty_rows(temp_table_name)

    batches = get_batches(total_row_count=count, batch_size="{{ params.batch_size }}")

    updates = (
        update_batch.override(
            max_active_tis_per_dag=max_concurrent_tasks,
            retries=0,
        )
        .partial(temp_table_name=temp_table_name, column=column)
        .expand(batch=batches)
    )

    total = sum_up_counts(updates)

    drop = PGExecuteQueryOperator(
        task_id="drop_temp_tables",
        postgres_conn_id=POSTGRES_CONN_ID,
        sql=constants.DROP_SQL.format(temp_table_name=temp_table_name),
        execution_timeout=timedelta(minutes=1),
    )

    notify_success = notify_slack.override(task_id="notify_success")(
        f"Upstream cleaning was completed successfully, updating column `{column}` for"
        f" {total} rows.",
    )

    notify_failure = notify_slack.override(
        task_id="notify_failure", trigger_rule=TriggerRule.ONE_FAILED
    )("Upstream cleaning failed. Check the logs for more information.")

    create_table >> load >> create_index >> count

    # Make the dependency on total (the sum_up_counts task) explicit so it
    # shows up in the graph.
    updates >> [drop, total] >> notify_success

    drop >> notify_failure


catalog_cleaner()
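
A minimal sketch of the map_index_template pattern from the review thread above, assuming Airflow 2.9+ (the template is rendered after the task finishes, which is why a value set from inside the running task is still picked up). The task and key names here are illustrative, not part of the PR:

from airflow.decorators import task
from airflow.operators.python import get_current_context


@task(map_index_template="{{ record_label }}")
def process(record: str):
    # Rendered after execution, so this runtime assignment becomes the map
    # index shown for this mapped task instance in the UI.
    context = get_current_context()
    context["record_label"] = f"record__{record}"
    return record.upper()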

catalog/dags/database/catalog_cleaner/constants.py (34 additions, 0 deletions)

@@ -0,0 +1,34 @@
DAG_ID = "catalog_cleaner" | ||
SLACK_USERNAME = "Catalog Cleaner DAG" | ||
SLACK_ICON = ":disk-cleanup:" | ||
|
||
CREATE_TEMP_TABLE_SQL = """ | ||
DROP TABLE IF EXISTS {temp_table_name}; | ||
CREATE UNLOGGED TABLE {temp_table_name} ( | ||
row_id SERIAL, | ||
identifier uuid NOT NULL, | ||
{column} TEXT NOT NULL | ||
); | ||
""" | ||

# See https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.S3Import.html#aws_s3.table_import_from_s3
IMPORT_SQL = """
SELECT aws_s3.table_import_from_s3(
    '{temp_table_name}', 'identifier, {column}', 'DELIMITER E''\t'' CSV',
    '{bucket}', '{s3_path_to_file}', '{aws_region}'
);
"""

CREATE_INDEX_SQL = "CREATE INDEX ON {temp_table_name}(row_id);"
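
# FOR UPDATE SKIP LOCKED (below) locks the rows selected for the batch and
# skips any rows already locked by another transaction instead of waiting
# on them.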
UPDATE_SQL = """ | ||
UPDATE image SET {column} = tmp.{column}, updated_on = NOW() | ||
FROM {temp_table_name} AS tmp | ||
WHERE image.identifier = tmp.identifier AND image.identifier IN ( | ||
SELECT identifier FROM {temp_table_name} | ||
WHERE row_id > {batch_start} AND row_id <= {batch_end} | ||
FOR UPDATE SKIP LOCKED | ||
); | ||
""" | ||
|
||
DROP_SQL = "DROP TABLE {temp_table_name};" |
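
To make the templating concrete, this is what UPDATE_SQL would render to for the first batch of a hypothetical `url` cleanup (column="url", temp table temp_cleaned_image_url, batch (0, 10000); the values are illustrative):

UPDATE image SET url = tmp.url, updated_on = NOW()
FROM temp_cleaned_image_url AS tmp
WHERE image.identifier = tmp.identifier AND image.identifier IN (
    SELECT identifier FROM temp_cleaned_image_url
    WHERE row_id > 0 AND row_id <= 10000
    FOR UPDATE SKIP LOCKED
);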
[Review comment on count_dirty_rows]
Reviewer: Nit: since `run_sql` is already a `@task` and this merely adds a `logger.info` line (with information which could be pieced together from the XComs and arguments), maybe it makes more sense to use `run_sql` directly with a `.override(task_id="count_dirty_rows")`?

Author (krysal): Good observation! This reminds me that I left that to be developed later. I can make that change and add the Slack notification before the update.
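
For reference, a rough sketch of the suggested alternative inside the DAG body, reusing the same arguments that count_dirty_rows passes to run_sql.function above (hypothetical, not the merged code):

    count = run_sql.override(task_id="count_dirty_rows")(
        dry_run=False,
        sql_template=f"SELECT COUNT(*) FROM {temp_table_name}",
        query_id=f"{temp_table_name}_count",
        handler=single_value,
    )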