Modify add_license_url DAG to use batched_update #4370

Merged: 6 commits, May 28, 2024

Conversation

krysal
Member

@krysal krysal commented May 21, 2024

Fixes

Fixes #4348 by @krysal

Description

As expressed in the title, this PR changes the add_license_url DAG to trigger one batched_update DAG run per group of licenses found to backfill. It also installs python-tabulate to improve the format of Slack messages and logs, so numbers are more easily readable, and the license group is clearly identified with its mapped tasks index number.
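
As a rough illustration of the readability goal, the sketch below formats per-license-group counts with thousands separators and shows each group's mapped task index. It uses only the standard library rather than python-tabulate, and the group names and counts are made-up sample data, not results from this DAG.

```python
def format_report(rows: list[tuple[str, str, int]]) -> str:
    """Render (license, version, count) rows as an aligned plain-text table.

    The row position doubles as the mapped task index in this sketch.
    """
    header = f"{'idx':>3}  {'license':<10} {'version':<7} {'records':>12}"
    lines = [header, "-" * len(header)]
    for idx, (license_, version, count) in enumerate(rows):
        # The "," in the format spec adds thousands separators.
        lines.append(f"{idx:>3}  {license_:<10} {version:<7} {count:>12,}")
    return "\n".join(lines)


# Made-up sample data for illustration only.
sample = [("by", "2.0", 1234567), ("by-nc-sa", "4.0", 8901)]
report = format_report(sample)
print(report)
```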

Testing Instructions

  1. just build
  2. just catalog/pgcli
  3. Remove the license_url field from some rows by clearing their meta_data. Pick any identifiers you want, e.g.:
UPDATE image SET meta_data = '{}' WHERE identifier IN (
	'cdbd3bf6-1745-45bb-b399-61ee149cd58a',
	'b840de61-fb9d-4ec5-9572-8d778875869f',
	'0e3315c5-3328-4a99-80ab-567ac32f685f',
	'3c98150c-51a8-4175-a47f-acef10e784f7',
 	'aeba0547-61da-42ee-b561-27c8fc817d5a'
);
  4. Unpause the batched_update DAG in the Airflow UI
  5. Trigger the add_license_url DAG and wait for it to finish
  6. Verify that the previous rows now have the meta_data->license_url field:
SELECT identifier, license, license_version, meta_data, created_on, updated_on FROM image
WHERE identifier IN (
	'cdbd3bf6-1745-45bb-b399-61ee149cd58a',
	'b840de61-fb9d-4ec5-9572-8d778875869f',
	'0e3315c5-3328-4a99-80ab-567ac32f685f',
	'3c98150c-51a8-4175-a47f-acef10e784f7',
 	'aeba0547-61da-42ee-b561-27c8fc817d5a'
);

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal requested a review from a team as a code owner May 21, 2024 21:29
@krysal krysal requested review from obulat and stacimc May 21, 2024 21:29
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label May 21, 2024
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels May 21, 2024
@krysal krysal marked this pull request as draft May 21, 2024 21:31
@krysal krysal marked this pull request as ready for review May 23, 2024 21:33
Contributor

@AetherUnbound AetherUnbound left a comment

This looks good! Only a few comments, nothing blocking. Thanks for the helpful testing instructions, so glad we can leverage the batched update for this 🚀

catalog/dags/maintenance/add_license_url.py (outdated)
),
# Merge existing metadata with the new license_url
"update_query": f"SET meta_data = ({Json(license_url_dict)}::jsonb || meta_data), updated_on = now()",
"update_timeout": 259200, # 3 days in seconds
Contributor

This could maybe be better expressed as:

Suggested change
- "update_timeout": 259200, # 3 days in seconds
+ "update_timeout": 60 * 60 * 24 * 3, # 3 days in seconds
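
To make the two details in this thread concrete, here is a small plain-Python sketch. It mirrors the top-level key precedence of PostgreSQL's jsonb || operator (values from the right-hand operand win on duplicate keys), so with the new license_url on the left, any license_url already present in meta_data is kept. It also checks that both spellings of the timeout agree. The function name merge_license_url and the sample values are assumptions made for illustration and do not come from the DAG.

```python
from datetime import timedelta


def merge_license_url(meta_data: dict, license_url: str) -> dict:
    """Mimic `{"license_url": ...}::jsonb || meta_data` for top-level keys.

    The right-hand operand (the existing meta_data) wins on key conflicts.
    """
    return {**{"license_url": license_url}, **meta_data}


# Sample URL used purely for illustration.
url = "https://creativecommons.org/licenses/by/4.0/"

# A row whose meta_data was reset to '{}' gains the new field.
empty_row = merge_license_url({}, url)
# A row that already has the field keeps its existing value.
existing_row = merge_license_url({"license_url": "kept", "foo": 1}, url)

# Both ways of writing "3 days in seconds" give the same number.
three_days = 60 * 60 * 24 * 3
assert three_days == 259200 == int(timedelta(days=3).total_seconds())
```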

Comment on lines 191 to 198
trigger = TriggerDagRunOperator.partial(
    task_id="trigger_batched_update",
    trigger_dag_id=BATCHED_UPDATE_DAG_ID,
    wait_for_completion=True,
    execution_timeout=timedelta(hours=5),
    max_active_tis_per_dag=1,
    retries=0,
).expand(conf=get_confs(licenses, batch_size="{{ params.batch_size }}"))
Contributor

Do you think we could do map_index_template here? Similar to this:

@task(map_index_template="{{ task.op_kwargs['upstream_table_name'] }}")

That way we could see the licenses as part of the mapped index/task name!

Member Author

@krysal krysal May 27, 2024

I tried it in the previous DAG version as suggested in the Airflow documentation, but an upstream issue prevented it from working with the parameters of the task: apache/airflow#29366.

Do you see a workaround here?

Edit: I must have been doing something wrong before because now I made it work 😄

krysal and others added 2 commits May 27, 2024 14:25
Co-authored-by: Madison Swain-Bowden <bowdenm@spu.edu>
Contributor

@obulat obulat left a comment

Thank you for the detailed testing instructions! They worked well locally.
I added a non-blocking suggestion for code clarity inline.

@task
def get_confs(licenses, batch_size: int) -> list[dict]:
    if not licenses:
        raise AirflowSkipException("No config required.")
Contributor

This sounds confusing. What does "No config required." mean here? Should it be the opposite, "License config required."?

Member Author

If there are no licenses to backfill, then the DAG stops here. There is no need to create a set of configurations for the batched_update DAG. I rephrased it; I hope it's clearer now!

Contributor

Definitely clearer :) Thank you!
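
The behaviour discussed in this thread can be sketched without Airflow as follows: build one configuration per license group, and stop early when there is nothing to backfill. SkipRun is a hypothetical stand-in for AirflowSkipException, and the shape of the licenses input, the query_id key, the skip message wording, and the simplified SQL quoting are all assumptions for illustration rather than the DAG's actual code. Only update_query, update_timeout, and batch_size appear in the snippets quoted in this review.

```python
import json


class SkipRun(Exception):
    """Hypothetical stand-in for airflow.exceptions.AirflowSkipException."""


def get_confs(licenses: list[tuple[str, str, str]], batch_size: int) -> list[dict]:
    """Build one batched_update conf per (license, version, url) group."""
    if not licenses:
        # Nothing to backfill, so no batched_update runs are needed.
        raise SkipRun("No licenses to backfill; skipping batched updates.")

    confs = []
    for license_, version, url in licenses:
        payload = json.dumps({"license_url": url})
        confs.append(
            {
                # "query_id" is an assumed key that names the group.
                "query_id": f"add_license_url_{license_}_{version}",
                "batch_size": batch_size,
                # Simplified quoting; the quoted diff uses Json(...) instead.
                "update_query": (
                    f"SET meta_data = ('{payload}'::jsonb || meta_data), "
                    "updated_on = now()"
                ),
                "update_timeout": 60 * 60 * 24 * 3,  # 3 days in seconds
            }
        )
    return confs


# Sample input for illustration only.
confs = get_confs(
    [("by", "4.0", "https://creativecommons.org/licenses/by/4.0/")],
    batch_size=10_000,
)
```

In the DAG itself, a list like this is what .expand(conf=...) fans out into one TriggerDagRunOperator task per element.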

)
report_completion(updated, query)
updated >> report_failed_license_pairs()
licenses = get_licenses(query)
Contributor

Nit: I think the code flow would be easier to follow if the query were moved inside the get_licenses function.

Member Author

Right, the previous version of the DAG reused this query for two tasks. Now this can be in there as you suggest 👍

@krysal krysal merged commit 3329d30 into main May 28, 2024
43 checks passed
@krysal krysal deleted the fix/add_license_dag branch May 28, 2024 14:54
@krysal
Member Author

krysal commented May 28, 2024

Thanks for the reviews and suggestions folks!

@krysal krysal mentioned this pull request Jun 14, 2024
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

The add_license_url DAG keeps timing out
4 participants