Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed #9387: Add State types for tasks and DAGs #15285

Merged
merged 1 commit into from Jul 6, 2021

Conversation

andrewgodwin
Copy link
Contributor

@andrewgodwin andrewgodwin commented Apr 8, 2021

This adds TaskState and DagState enum types that contain all possible states, makes all other core state constants derive their values from them, and adds a couple of initial type hints that use the new enums (with the plan being that we can add signficantly more later).

closes: #9387

@uranusjr
Copy link
Member

uranusjr commented Apr 8, 2021

Switching TaskState and DagState to actually be an enum, and have to go rewrite a large chunk of core code and potentially break compatability with many providers since it'll change the values of all the states to be of type EnumValue.

Would it help to initially make them a StrEnum, and slowly convert the type hints to take an enum value instead of str?

@dstandish
Copy link
Contributor

dstandish commented Apr 14, 2021

it's possible to have a stringy enum class, did you explore that?

see here https://stackoverflow.com/a/63028809/2761682

one concern i have with this PR is the creation of a TaskState class.

there has been periodic discussion on mailing list re adding support for stateful tasks and of late some renewed interest in the topic. adding DagState and TaskState classes would potentially be in conflict with that. if we do create stateful tasks, i think that TaskState as the model through which we store state (i.e. an artifact representin the task's state) would be a better usage of the word than an enum representing the execution state.

i should add, it would probably be more accurate to describe these as TaskInstanceState and DagRunState -- the task and dag themselves don't actually have states of any kind at this time. but the same concern applies.

if you want to make this an enum, why can't you just keep the single State class and make it enum, or enum-like?

@uranusjr
Copy link
Member

if you want to make this an enum, why can't you just keep the single State class and make it enum, or enum-like?

I think that’d take away most of the reason of having an enum class. The main advantage of enums is to prevent invalid values from being passed (they’d result in a typing or runtime error). Combining all the possible state values into one enum means not all state values make sense for all contexts (some states are impossible on task instances, for example), and kind of defeats the purpose.

@ashb
Copy link
Member

ashb commented Apr 14, 2021

if you want to make this an enum, why can't you just keep the single State class and make it enum, or enum-like?

I think that’d take away most of the reason of having an enum class. The main advantage of enums is to prevent invalid values from being passed (they’d result in a typing or runtime error). Combining all the possible state values into one enum means not all state values make sense for all contexts (some states are impossible on task instances, for example), and kind of defeats the purpose.

Exactly this -- there are some states that only apply to TaskInstances, but that don't apply to DagRuns or (Base)Job classes -- "upstream_failed" for instance (there are more).

@ashb
Copy link
Member

ashb commented Apr 14, 2021

there has been periodic discussion on mailing list re adding support for stateful tasks and of late some renewed interest in the topic. adding DagState and TaskState classes would potentially be in conflict with that. if we do create stateful tasks, i think that TaskState as the model through which we store state (i.e. an artifact representin the task's state) would be a better usage of the word than an enum representing the execution state.

I think the "stateful" tasks doesn't conflict with this -- Airflow essentially has a state-machine (if a very poorly defined one) for what states a TI/DR can be in, so adding a whole new state is more work than just changing the value in the DB column, so by making this an explicit enum class it would hopefully make it clearer that you can't make up your own values to store in ti.state and expect it to Just Work.

@dstandish
Copy link
Contributor

dstandish commented Apr 14, 2021

I think the "stateful" tasks doesn't conflict with this -- Airflow essentially has a state-machine (if a very poorly defined one) for what states a TI/DR can be in, so adding a whole new state is more work than just changing the value in the DB column, so by making this an explicit enum class it would hopefully make it clearer that you can't make up your own values to store in ti.state and expect it to Just Work.

I don't mean that the two concepts conflict just that I would like to reserve the name TaskState and DagState, potentially, for use with "stateful" tasks.

just trying to think if there's a name we could use for the execution states enum which would not clobber that.

but, maybe owing to the fact that there already is an "execution state" concept, if "stateful tasks" is ever implememted it might be prudent to use the name "watermark" or "artifact" instead.

But I think potentially TaskState could be a good name for a stateful tasks framework, and if we called this enum something like TaskRunState or TaskExecutionState -- or perhaps more accurately TaskInstanceExecutionState there would be no conflict.

Another idea would be to nest the different execution state enums under a States class like States.dag_run.RUNNING etc

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably would have make State a str-enum class like we did here

class DagRunType(str, enum.Enum):
"""Class with DagRun types"""
BACKFILL_JOB = "backfill"
SCHEDULED = "scheduled"
MANUAL = "manual"
def __str__(self) -> str: # pylint: disable=invalid-str-returned
return self.value
@staticmethod
def from_run_id(run_id: str) -> "DagRunType":
"""Resolved DagRun type from run_id."""
for run_type in DagRunType:
if run_id and run_id.startswith(f"{run_type.value}__"):
return run_type
return DagRunType.MANUAL

Is there any downside to doing that?

@andrewgodwin
Copy link
Contributor Author

Not at all - I was not familiar with the ability to add arbitrary type bases to Enum! I'll try that, it should be significantly better.

@andrewgodwin
Copy link
Contributor Author

I've reworked this to use a str-type Enum, but the enum values are still odd - calling str() on them outputs the name of the value as TaskState.FAILED, not just failed. I don't think the code is going to like it.

I suspect it's going to massively blow up the test suite, but I want to push it up and run here just to be sure.

@dstandish
Copy link
Contributor

No interest in calling it TaskExecutionState instead of TaskState? 🙂

@andrewgodwin
Copy link
Contributor Author

No interest in calling it TaskExecutionState instead of TaskState? 🙂

I'd rather we call it TaskInstanceState honestly, but right now I'm just trying to see if this will even work type-wise :)

@dstandish
Copy link
Contributor

dstandish commented Apr 27, 2021

I'm just trying to see if this will even work type-wise

Totally makes sense :)

I'd rather we call it TaskInstanceState honestly

Let me offer an argument for not this.

It's possible to use xcom as a store of task instance state, because it is at the grain of task instance (task + execution date).

So we can imagine that XCom could have reasonably been called TaskInstanceState.

What this enum represents is something very different. It's the TaskInstance's execution state -- not a way of storing state data for a task instance.

That's why I like the idea of adding the word execution or run.

Separately, I think we could justify going without the word instance because this enum represents possible values for task execution states in the abstract, and it is maybe irrelevant that the when a task gets combined with an execution date it is called a task instance.

@andrewgodwin
Copy link
Contributor Author

I get your point, but we also can't escape that it is the possible values of the state column on TaskInstance. I agree the naming that's there already isn't amazing, but there's also something to be said for the principle of least surprise :)

airflow/utils/state.py Outdated Show resolved Hide resolved
@ashb
Copy link
Member

ashb commented Jun 10, 2021

@andrewgodwin ping -- when you have a few spare cycles would be good to revisit this PR

@andrewgodwin andrewgodwin force-pushed the enum-states branch 2 times, most recently from 77abeba to 101b424 Compare June 11, 2021 19:48
@ashb ashb self-requested a review June 16, 2021 08:31
@ashb
Copy link
Member

ashb commented Jun 16, 2021

Code looks good now -- @dstandish Are you happy with these names (TaskState and DagState)?

@dstandish
Copy link
Contributor

dstandish commented Jun 16, 2021

Honestly do not like them for the same reasons, namely that I think there wants to be a mechanism of state persistence in airflow and these names could collide with that

But I concede the alternative names for the enums are maybe not the most appealing e.g. TaskExecutionState or DagExecutionState (though at the moment they seem prefereable to me).

And if anything materializes re state persistence mechanism, I'm sure we'll be able to figure something out. Perhaps that would mean calling it Watermark or Artifact. Or ProcessState. Or perhaps there would not need to be a class like that.

Anyway carry on, if you guys think it's the right idea, it's probably not that bad of an idea :)

@andrewgodwin
Copy link
Contributor Author

Yeah, I personally think the boat has sailed on reclaiming the word "State" in the codebase right now - these enums reflect the possible values of the state columns on their models, so I think the name is "good enough" given that.

@dstandish
Copy link
Contributor

these enums reflect the possible values of the state columns on their models

By that logic shouldn't these enums be TaskInstanceState and DagRunState? There actually is no Task model and I think DAG doesn't have a state.

@andrewgodwin
Copy link
Contributor Author

That is an excellent point. Happy to change them if we all agree.

Copy link
Contributor

@dstandish dstandish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticed few small little docstring things while looking it over

i'm deferring to you and others on the class naming question

airflow/utils/state.py Outdated Show resolved Hide resolved
airflow/utils/state.py Outdated Show resolved Hide resolved
@andrewgodwin
Copy link
Contributor Author

I decided to rename the classes under the banner of being more explicit.

@andrewgodwin andrewgodwin force-pushed the enum-states branch 2 times, most recently from dbf3508 to 0776ded Compare June 29, 2021 22:20
Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -96,17 +127,17 @@ def color_fg(cls, state):
return 'white'
return 'black'

running = frozenset([RUNNING, SENSING])
running: FrozenSet[TaskInstanceState] = frozenset([TaskInstanceState.RUNNING, TaskInstanceState.SENSING])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, surprised this type hint is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be, I just like adding explicit type hints where I think it helps the reader of the code understand what's going on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I also often do that even if not technically needed to make it explicit. I think code should be much more optimised for reader's time than writer's time to write it ;).

@github-actions
Copy link

github-actions bot commented Jul 6, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jul 6, 2021
@kaxil kaxil merged commit 2b7c596 into apache:main Jul 6, 2021
@ashb
Copy link
Member

ashb commented Jul 7, 2021

D'oh, we had two changes in flight -- this one and @ephraimbuddy's #16401, so we'll need to add Queued state to DagRuns too

@ashb ashb mentioned this pull request Jul 7, 2021
jhtimmins pushed a commit that referenced this pull request Aug 12, 2021
This adds TaskState and DagState enum types that contain all possible states, makes all other core state constants derive their values from them, and adds a couple of initial type hints that use the new enums (with the plan being that we can add signficantly more later).

closes: #9387
(cherry picked from commit 2b7c596)
@jhtimmins jhtimmins added this to the Airflow 2.1.3 milestone Aug 12, 2021
jhtimmins pushed a commit that referenced this pull request Aug 12, 2021
This adds TaskState and DagState enum types that contain all possible states, makes all other core state constants derive their values from them, and adds a couple of initial type hints that use the new enums (with the plan being that we can add signficantly more later).

closes: #9387
(cherry picked from commit 2b7c596)
jhtimmins pushed a commit that referenced this pull request Aug 13, 2021
This adds TaskState and DagState enum types that contain all possible states, makes all other core state constants derive their values from them, and adds a couple of initial type hints that use the new enums (with the plan being that we can add signficantly more later).

closes: #9387
(cherry picked from commit 2b7c596)
kaxil pushed a commit that referenced this pull request Aug 17, 2021
This adds TaskState and DagState enum types that contain all possible states, makes all other core state constants derive their values from them, and adds a couple of initial type hints that use the new enums (with the plan being that we can add signficantly more later).

closes: #9387
(cherry picked from commit 2b7c596)
jhtimmins pushed a commit that referenced this pull request Aug 17, 2021
This adds TaskState and DagState enum types that contain all possible states, makes all other core state constants derive their values from them, and adds a couple of initial type hints that use the new enums (with the plan being that we can add signficantly more later).

closes: #9387
(cherry picked from commit 2b7c596)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use enum for task_states/dag_states
7 participants