Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAG regex flag in backfill command #23870

Merged
merged 7 commits into from
Aug 8, 2022

Conversation

domagojrazum
Copy link
Contributor

When this parameter is set, the dag_id string will be used as a regular expression to match the available DAGs.
This enables multiple DAG backfills to be consecutively executed under the same backfill session.
This is often used in the DAG factory based approach where more than one DAG have same task names, which are based on some ID, therefore it's commonly combined with task-regex.

@boring-cyborg boring-cyborg bot added area:CLI area:Scheduler including HA (high availability) scheduler labels May 23, 2022
@potiuk
Copy link
Member

potiuk commented May 31, 2022

There are few more tests will fail if you change "dag" into "dags".

I think this is a bit of compatibility issue to replace it this way - if anyone used that command externally this is a bit unexpected change. I think better approach will be to add separate "dags" field rather than replace "dag" and fail the command if both parameters are specified.

Also it would be great to set some deterministic order of processing the dags. Right now it's a bit random (depends on sequence of hashmap of values in dagbag which depends on sequence of dags returned by SQL query which is not deterministic). Likely sortting the dag_id would be a bit more "stable".

@domagojrazum domagojrazum reopened this Jun 1, 2022
@domagojrazum
Copy link
Contributor Author

There are few more tests will fail if you change "dag" into "dags".

I think this is a bit of compatibility issue to replace it this way - if anyone used that command externally this is a bit unexpected change. I think better approach will be to add separate "dags" field rather than replace "dag" and fail the command if both parameters are specified.

Also it would be great to set some deterministic order of processing the dags. Right now it's a bit random (depends on sequence of hashmap of values in dagbag which depends on sequence of dags returned by SQL query which is not deterministic). Likely sortting the dag_id would be a bit more "stable".

Thank you for the reply.
Required modifications are applied.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@potiuk
Copy link
Member

potiuk commented Jun 6, 2022

But another approval woudl be great as this is part of the core

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 6, 2022
@github-actions
Copy link

github-actions bot commented Jun 6, 2022

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@potiuk potiuk force-pushed the dag-regex-in-backfills branch 2 times, most recently from 9c5e582 to 4499490 Compare June 13, 2022 22:02
@potiuk
Copy link
Member

potiuk commented Jun 19, 2022

Rebased to check if Helm tests are fixed.

@potiuk
Copy link
Member

potiuk commented Jun 20, 2022

Docs failing

@domagojrazum
Copy link
Contributor Author

Docs failing

Hello @potiuk, after running build-docs locally I'm getting:
image

Incorrect spelling in hooks/oracle.py

@potiuk
Copy link
Member

potiuk commented Jun 28, 2022

yes you need to fix the incorrect spelling

@potiuk
Copy link
Member

potiuk commented Jun 28, 2022

or (as usual) REBASE when you see unrelated problem.

image

@domagojrazum
Copy link
Contributor Author

@potiuk, thank you. All checks have passed now.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should check if we have more than one dag returned and ask for confirmation (optionally with a flag that should assume you are ok with it).

I also think --dag-regex flag is wrong name. It is supprising that this is a boolean flag, rather than string argument (as opposed to --task-regex ). This is very inconsistent and confusing.

I think it should be maybe --treat-dag-as-regex - yeah, longer name, but far less confusing - maybe you will come up with better name that will indicate better its boolean flag nature.

Or maybe it should be --dag-regex REGEX in which case dag id should not be specified (and become optional).

Both solutions are better than the current one.

@uranusjr
Copy link
Member

uranusjr commented Jun 30, 2022

I wonder if it’s even viable to automatically detect whether the passed value is regex. DAG IDs can only contain a very limited set of characters, and the only character that has special meanings is ., which is not that useful on its own either. So we could try to detect whether the user input looks like a regex, and eliminate the need of --treat-dag-as-regex (or whatever) most of the time. Of course, we can still keep the flag for explicit override.

@domagojrazum
Copy link
Contributor Author

@potiuk @uranusjr Thank you for the feedback.
@uranusjr regarding your comment, regex is possible to be set as prefix of a DAG (without any special characters), and it would be impossible to detect if user meant it to set as regex, or as an exact dag_id. If I understood it correctly.
@potiuk I added the suggestion with --treat-dag-as-regex flag. It seems to me that it's more intuitive for users to set a flag for already needed dag_id, instead of excluding and writing it (dag_id) as an argument for dag-regex flag, or writing dag_id twice.
And also, yes/no question is added for detected dag list by regex search.

@domagojrazum domagojrazum requested a review from potiuk July 1, 2022 11:56
@potiuk
Copy link
Member

potiuk commented Jul 4, 2022

LGTM. WDYT @uranusjr ? I think "Explicit is better than implicit" so even if we could detect it automatically, I think we should not.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are just doing this sequentially (rather than all dags in a single BackfillJob) I'm not even all that sure we need this feature. Isn't this very very similar to a loop in bash:

for dag_id in dag1 dag2 dag3 do;
    airflow dags backfil "$dag_id"
done

(Slight difference of explicit list of IDs vs a regex, yes)

(This point is a discussion to talk about, the comment below is a must fix)

airflow/cli/commands/dag_command.py Outdated Show resolved Hide resolved
@potiuk potiuk requested a review from ashb July 7, 2022 18:50
@domagojrazum domagojrazum requested review from Bowrna and potiuk and removed request for Bowrna August 4, 2022 07:26
@potiuk
Copy link
Member

potiuk commented Aug 4, 2022

@ashb?

@potiuk
Copy link
Member

potiuk commented Aug 4, 2022

I think you need to rebase and solve conflicts @domagojrazum - there were some main fixes that need to be taken into account.

@domagojrazum
Copy link
Contributor Author

@potiuk, rebased and pushed.

@potiuk
Copy link
Member

potiuk commented Aug 8, 2022

The failing test is fixed in main already (was broken by me :( )

@potiuk potiuk merged commit 72a6ac5 into apache:main Aug 8, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Aug 8, 2022

Awesome work, congrats on your first merged pull request!

@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Aug 15, 2022
@ephraimbuddy ephraimbuddy added this to the Airflow 2.4.0 milestone Sep 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:CLI area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge type:new-feature Changelog: New Features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants