Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent templated field logic checks in __init__ of operators automatically #33786

Merged
merged 2 commits into from Jan 20, 2024

Conversation

shahar1
Copy link
Contributor

@shahar1 shahar1 commented Aug 26, 2023

Edited: Dec. 16, 2023

closes: #29069

Concept

We want to add a pre-commit that checks for the templated field "invalid logic" within the constructor of providers' operators to avoid issues like #27328. "Invalid logic" is defined according to the following rules (thanks to @potiuk for the clarification):

  1. Using template field argument for something other* than just setting it as a template field.
  2. A template field is defined for an operator but not set in the constructor as a field.

Scope

We'll check that defined template fields are directly assigned with a corresponding parameter with an identical name passed from the constructor.

Implementation

Creating a pre-commit Python script that will enforce the limitations below.

Added limitations

  1. Templated fields' corresponding parameters passed into the constructor must be named exactly
    as the fields. The following example is invalid, as the parameter passed into the constructor is not the same as the
    templated field:
        class HelloOperator(BaseOperator):
            template_fields = ("field_a")

            def __init__(field_a_id)-> None: # <- should be def __init__(field_a)-> None
              self.field_a = field_a_id # <- should be self.field_a = field_a
  1. Templated fields' instance members must be assigned with their corresponding parameter from the constructor,
    either by a direct assignment or by calling the parent's constructor (in which these fields are
    defined as template_fields) with a direct assignment of the parameter.
    The following example is invalid, as the instance member self.field_a is not assigned at all, despite being a templated field:
        class HelloOperator(BaseOperator):
            template_fields = ("field_a", "field_b")

            def __init__(field_a, field_b)-> None:
              self.field_b = field_b

The following example is also invalid, as the instance member self.field_a of MyHelloOperator is initialized
implicitly as part of the kwargs passed to its parent constructor:

        class HelloOperator(BaseOperator):
            template_fields = ("field_a")

            def __init__(field_a)-> None:
              self.field_a = field_a

        class MyHelloOperator(HelloOperator):
            template_fields = ("field_a", "field_b")

            def __init__(field_b, **kwargs)-> None: # <- should be def __init__((field_a, field_b, **kwargs)
              super().__init__(**kwargs) # <- should be super().__init__(field_a=field_a, **kwargs)
              self.field_b = field_b
  1. Applying actions on the parameter during the assignment in the constructor is not allowed. Any action on the value should be applied in the execute() method. For example, the following example is invalid:
        class HelloOperator(BaseOperator):
            template_fields = ("field_a")

            def __init__(field_a)-> None:
              self.field_a = field_a.lower() # <- assignment should be only self.field_a = field_a

When an operator inherits from a base operator and does not have a constructor defined on its own, the limitations above do not apply. However, the templated fields must be set properly in the parent according to those limitations (currently not enforced by the pre-commit, see caveats).

Thus, the following example is valid:

        class HelloOperator(BaseOperator):
            template_fields = ("field_a")

            def __init__(field_a)-> None:
              self.field_a = field_a

        class MyHelloOperator(HelloOperator):
            template_fields = ("field_a")

Caveats

  • Creating an inheritance tree in AST using modules from different files is rather complicated; therefore:
    • Detection of a "base operator" instance in the pre-commit is done by checking the parent's name ending with BaseOperator (simple string comparison).
    • We assume that when calling the parent's constructor - then the parent's templated_fields are defined properly.

Future work

  • Using dynamic imports to handle inheritance caveats better.
  • Detecting if hooks are initialized in the constructors.

TODOs

  • Implement
  • Add documentation
  • Patch existing operators with invalid template field definitions - will be done in separate PRs, after agreeing to the limitations introduced by this PR.

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@shahar1 shahar1 changed the title Validate operators init Prevent templated field logic checks in __init__ of operators automatically Aug 26, 2023
@potiuk
Copy link
Member

potiuk commented Aug 26, 2023

One that includes the pre-commit Python script and its YAML definition in pre-commit-config.yaml - I excluded files that should be fixed in later PR; see next.
Another that includes modifications to specific operators, either to align with the requirements immediately or by inserting a FIXME comment to do it later (in case the complexity seems too high to address at the moment). If the reviewers deem it necessary, I can create a separate pull request for this. Regardless, I'll create a new issue for the necessary fixes.

I think it should be approached differently.. This PR should fail. It should not have exclusioins. It should only be merged when we (or others) fixed all the problems. This is what we do normally. Then this PR open should serve as reminder and incentivisation to fix all the places. So I think in general, the sequence should be reverse:

  1. add PR that fails and reminds others when they cross the border
  2. make sure (it does not have to be you - you can ping other people to do it) to fix alll the places
  3. continue rebasing this PR seeing less and less problems as you iterate
  4. merge it once all the problems are fixed.

This is the aproach we've used for other cases that were similar.

@shahar1
Copy link
Contributor Author

shahar1 commented Aug 26, 2023

One that includes the pre-commit Python script and its YAML definition in pre-commit-config.yaml - I excluded files that should be fixed in later PR; see next.
Another that includes modifications to specific operators, either to align with the requirements immediately or by inserting a FIXME comment to do it later (in case the complexity seems too high to address at the moment). If the reviewers deem it necessary, I can create a separate pull request for this. Regardless, I'll create a new issue for the necessary fixes.

I think it should be approached differently.. This PR should fail. It should not have exclusioins. It should only be merged when we (or others) fixed all the problems. This is what we do normally. Then this PR open should serve as reminder and incentivisation to fix all the places. So I think in general, the sequence should be reverse:

  1. add PR that fails and reminds others when they cross the border
  2. make sure (it does not have to be you - you can ping other people to do it) to fix alll the places
  3. continue rebasing this PR seeing less and less problems as you iterate
  4. merge it once all the problems are fixed.

This is the aproach we've used for other cases that were similar.

Sounds good to me :)
I removed the exceptions, yet I kept the second commit for inspiration - I'd be happy for feedback regarding the logic that should (not) be in the constructors.

@potiuk
Copy link
Member

potiuk commented Aug 27, 2023

I am not sure if that's the right set of validations. I do not find loops or if/return particularly troubling. What I was really hoping for was a check that will check for "typical" errors made in operators that are not often caught during review, but they have some negative side effects.

I am not sure if we want to limit operator's authors from raising warnings/running returns etc. This is far too much IMHO. We do not want to tell the authors "this is how you shoudl do your logic in constructor" - we want to make them aware when they made an obvious mistake. Generally I think we should only run validations that are very clearly violating some not even best but "reasonbly only possible" practice in Airflow.

I think the rule of thumb should be "You should be able to tell the author EXACTLY how to correct their code and explain why it is wrong". And it should be a mistake that "newbie" contributors are likely to make.

The issue #29069 was really about this one:

  • when you are using templated arg for something else than just settng it as templated field - Generally speaking the only thing you should do for templated field is you should set it as field. The reason is that this field will not have the right value before "execute" command is fired.

otherwise having some logic for non-templated fields is perfectly fine. I would not want us to be a "dictactor" to tell people how to write their constructors.

And there are few other things that are OBVIOUSLY wrong:

  • when you have templated_filed defined for operator but it is not set in the constructor as field, the field is not really templated - it will be missing when templated fields are processed. This is obviously wrong

  • when you try to instantiate Hook in the operator's constructor, that's also obviously wrong. You should never do that, because such Hook instantiation will also happen in the DAGFileProcessor when the DAG is parsed.

I'd start with these three rules, rather than trying to present some constructs that are not obviously wrong.

@shahar1 shahar1 marked this pull request as draft August 27, 2023 18:38
@shahar1
Copy link
Contributor Author

shahar1 commented Aug 27, 2023

I am not sure if that's the right set of validations. I do not find loops or if/return particularly troubling. What I was really hoping for was a check that will check for "typical" errors made in operators that are not often caught during review, but they have some negative side effects.

I am not sure if we want to limit operator's authors from raising warnings/running returns etc. This is far too much IMHO. We do not want to tell the authors "this is how you shoudl do your logic in constructor" - we want to make them aware when they made an obvious mistake. Generally I think we should only run validations that are very clearly violating some not even best but "reasonbly only possible" practice in Airflow.

I think the rule of thumb should be "You should be able to tell the author EXACTLY how to correct their code and explain why it is wrong". And it should be a mistake that "newbie" contributors are likely to make.

The issue #29069 was really about this one:

  • when you are using templated arg for something else than just settng it as templated field - Generally speaking the only thing you should do for templated field is you should set it as field. The reason is that this field will not have the right value before "execute" command is fired.

otherwise having some logic for non-templated fields is perfectly fine. I would not want us to be a "dictactor" to tell people how to write their constructors.

And there are few other things that are OBVIOUSLY wrong:

  • when you have templated_filed defined for operator but it is not set in the constructor as field, the field is not really templated - it will be missing when templated fields are processed. This is obviously wrong
  • when you try to instantiate Hook in the operator's constructor, that's also obviously wrong. You should never do that, because such Hook instantiation will also happen in the DAGFileProcessor when the DAG is parsed.

I'd start with these three rules, rather than trying to present some constructs that are not obviously wrong.

Thanks for the detailed feedback - now I have a better sense of the issue.
Drafting this PR for now and I'll try fixing it according to these rules.

Copy link
Member

@hussein-awala hussein-awala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is too big, IMHO it should be split in small PRs before adding the pre-commit hook.

airflow/providers/amazon/aws/operators/ecs.py Outdated Show resolved Hide resolved
airflow/providers/amazon/aws/operators/sagemaker.py Outdated Show resolved Hide resolved
airflow/providers/amazon/aws/operators/sagemaker.py Outdated Show resolved Hide resolved
@shahar1
Copy link
Contributor Author

shahar1 commented Sep 8, 2023

The PR is too big, IMHO it should be split in small PRs before adding the pre-commit hook.

Yes, I'll split it - the files from the providers were here temporarily only for demonstration.
As I got the original idea quite wrong, the current implementation is rather excessive. Therefore, I converted it to a draft - I'll convert it back to "ready" after I fix the pre commit and separate the relevant providers to another PR (if it would still be necessary).

@shahar1 shahar1 marked this pull request as ready for review September 9, 2023 23:28
@potiuk
Copy link
Member

potiuk commented Dec 23, 2023

@shahar1 I see that the pre-commit found some invalid usages which is awesome.
I suggest to handle them in separated PR(s) then once all merged you can rebase this one and it will be green

I need some help from the community to implement all required fixes :)

Absolutely. The best thing to do in this case is to somehow extract all those cases automatically and build an issue with

  • problem 1
  • problem 2

... and raise it to devlist. and mark it as a good first issue.

We have quite a number of the "code quality freaks" that would likely jump into it and start fixing those. Myself including :D

@shahar1
Copy link
Contributor Author

shahar1 commented Dec 29, 2023

@shahar1 I see that the pre-commit found some invalid usages which is awesome.
I suggest to handle them in separated PR(s) then once all merged you can rebase this one and it will be green

I need some help from the community to implement all required fixes :)

Absolutely. The best thing to do in this case is to somehow extract all those cases automatically and build an issue with

  • problem 1
  • problem 2

... and raise it to devlist. and mark it as a good first issue.

We have quite a number of the "code quality freaks" that would likely jump into it and start fixing those. Myself including :D

Eventually, I created this list manually in #36484 :)

@eladkal
Copy link
Contributor

eladkal commented Jan 20, 2024

I think we can also improve the process here as it takes longer to finish the list of operators to handle.

@shahar1 you can add exclusion list of operators to the pre-commit (the ones that are not ready). This should give us green build here thus we can merge this PR and enjoy the protection it brings.

In the mean time @romsharon98 can continue to work on removing items from the exclusion list (we won't need to maintain #36484 any longer as the exclusion list will be on the pre-commit itself)

@eladkal eladkal added this to the Airflow 2.8.2 milestone Jan 20, 2024
@eladkal eladkal added the type:doc-only Changelog: Doc Only label Jan 20, 2024
@shahar1
Copy link
Contributor Author

shahar1 commented Jan 20, 2024

d exclusion list of operators to the pre-commit (the ones that are not ready). This should give us green build here thus we can merge this PR and enjoy the protection it brings.

Done: https://github.com/apache/airflow/blob/de6a4e0e68f5464ae2a921f8501dce2440f52beb/.pre-commit-config.yaml

@eladkal eladkal merged commit 8961bab into apache:main Jan 20, 2024
80 checks passed
flacode pushed a commit to flacode/airflow that referenced this pull request Jan 22, 2024
…ically (apache#33786)

* Implement validation of missing template fields

* Update output static checks
potiuk added a commit to potiuk/airflow that referenced this pull request Jan 26, 2024
potiuk pushed a commit that referenced this pull request Feb 7, 2024
…ically (#33786)

* Implement validation of missing template fields

* Update output static checks

(cherry picked from commit 8961bab)
ephraimbuddy pushed a commit that referenced this pull request Feb 22, 2024
…ically (#33786)

* Implement validation of missing template fields

* Update output static checks

(cherry picked from commit 8961bab)
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
…ically (apache#33786)

* Implement validation of missing template fields

* Update output static checks
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevent templated field logic checks in __init__ of operators automatically
5 participants