Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a check for not templateable fields #29821

Merged
merged 7 commits into from Mar 4, 2023

Conversation

hussein-awala
Copy link
Member

@hussein-awala hussein-awala commented Feb 28, 2023

closes: #29819


There are some fields processed by the scheduler and they are not templateable, but currently we don't have any check for these fields. In this PR extract the parameters list for the method __init__ of BaseOperator and I check if there is a field in the template_fields in this list to raise an exception, where all the BaseOperator parameters are processed and used by the scheduler during processing time.

>>> import inspect
>>> from airflow.models import BaseOperator
>>> set(inspect.signature(BaseOperator.__init__).parameters.keys())
{'start_date', 'doc', 'executor_config', 'pre_execute', 'email', 'retry_exponential_backoff', 'depends_on_past', 'inlets', 'wait_for_downstream', 'email_on_failure', 'outlets', 'doc_json', 'execution_timeout', 'task_id', 'sla', 'on_success_callback', 'trigger_rule', 'ignore_first_depends_on_past', 'doc_md', 'kwargs', 'pool', 'doc_rst', 'on_failure_callback', 'email_on_retry', 'weight_rule', 'self', 'priority_weight', 'pool_slots', 'run_as_user', 'task_group', 'doc_yaml', 'on_retry_callback', 'do_xcom_push', 'max_retry_delay', 'retries', 'owner', 'params', 'default_args', 'queue', 'resources', 'task_concurrency', 'max_active_tis_per_dag', 'wait_for_past_depends_before_skipping', 'retry_delay', 'on_execute_callback', 'dag', 'end_date', 'post_execute'}

@pierrejeambrun pierrejeambrun added the type:misc/internal Changelog: Misc changes that should appear in change log label Mar 1, 2023
@potiuk potiuk merged commit 7963360 into apache:main Mar 4, 2023
@potiuk
Copy link
Member

potiuk commented Mar 4, 2023

Nice one!

pierrejeambrun pushed a commit that referenced this pull request Mar 7, 2023
pierrejeambrun pushed a commit that referenced this pull request Mar 8, 2023
@Inetov
Copy link

Inetov commented Jun 29, 2023

@hussein-awala what if my custom operator needs to use templating for params field?
how to bypass the restriction introduced in this commit?

@hussein-awala
Copy link
Member Author

@hussein-awala what if my custom operator needs to use templating for params field?
how to bypass the restriction introduced in this commit?

@Inetov This is not possible for now, I will check if templating params (and other parameters) is 100% safe, and if it's the case, maybe we can remove them from the excluded fields list.

@eladkal
Copy link
Contributor

eladkal commented Jun 29, 2023

@Inetov @hussein-awala see feature request for this #31801

@gdavoian
Copy link
Contributor

gdavoian commented Nov 6, 2023

@hussein-awala Unfortunately, this change is going to break our Airflow environment.

Is there any reason why BaseOperator.email can't be templated? E.g., we store different email lists as Airflow variables, and we don't want to use Variable.get(...) on the module-level of our scripts because it's actually an anti-pattern (opening a DB connection, making a request, etc.). So that's why we had to implement a function for adding email to template_fields for all subclasses of BaseOperator dynamically at run-time, which allowed us to use Jinja templates in default_args["email"] of our DAGs.

Actually, it's working just fine! But you're going to prohibit that in future versions, and I'd like to know the rationale behind that decision. Thank you!

@potiuk
Copy link
Member

potiuk commented Nov 6, 2023

I think that's a good idea to make a PR to allow to exclude some of the fields from that check. I think PR for that would be good @gdavoian

@gdavoian
Copy link
Contributor

gdavoian commented Nov 8, 2023

@potiuk do you have any ideas on how it would fit into BaseOperator? Because I can only think about something like templateable_fields (the exact naming isn't important at the moment) to exclude them from the check. But simultaneously having two entities like template_fields and templateable_fields would be very misleading: the former is a list of fields that can be templated and the latter is a list of fields that are allowed to be templated, meaning that some fields (e.g., BaseOperator.email) have to be added to both lists at the same time... Again, the exact naming doesn't matter here as soon as the whole idea is weird.

And that actually leads to my initial question: why did we ever need that backward-incompatible check in the first place? BaseOperator doesn't define any template_fields by default, leaving that up to developers of third-party/custom operators to do that, which is completely fine. Thus, it's the responsibility of those developers to ensure their operators are valid and serializable, isn't it? If their operators are working properly and they are happy with that, why break their code?

@hussein-awala
Copy link
Member Author

@gdavoian I agree with Jarek.

This PR was created to avoid the rendering issues with BaseOperator attributes like execution_timeout.

IMHO, you can create a test that loops over the excluded attributes, try to use a templated value, and check the value in runtime after rendering it. To know what attributes could be excluded, you can start by excluding all the params and keeping only the params that pass the test.

@gdavoian
Copy link
Contributor

gdavoian commented Nov 8, 2023

This PR was created to avoid the rendering issues with BaseOperator attributes like execution_timeout.

But how does this PR help solve the rendering issue? I understand that it tries to warn the users against their potentially mistaken intentions, though in fact it simply replaces one type of exception raised with another, reduces flexibility, and introduces a backward-incompatible change as a side effect. So the person who tried to add execution_timeout to template_fields still won't be able to do that, whereas those who are using Jinja templates for rendering some of the BaseOperator fields (e.g., email, pool, doc_md) in their custom operators simply won't be able to upgrade their Airflow environments since their working code will suddenly turn out to be broken. Thus, honestly, I don't see any benefits from this change at all...

IMHO, you can create a test that loops over the excluded attributes, try to use a templated value, and check the value in runtime after rendering it. To know what attributes could be excluded, you can start by excluding all the params and keeping only the params that pass the test.

This sounds like a workaround for a workaround :) For some reason, you've prohibited the rendering of all the BaseOperator fields (I'm still not convinced of the need for this change), and now you're asking me to exclude some of the fields from that rule (the rule I don't even agree and the rule you can't really justify)...

@potiuk
Copy link
Member

potiuk commented Nov 8, 2023

@gdavoidan - what I proposed and I believe @hussein-awala agrees with is that you define which fields of BaseOperator you exclude from the check. This is what my proposal is about. Find out which of those fields CAN be templated and remove them from the check. Basically allow-list of fields you want to allow templating. Rather than allow all by default. There is no need to "specify" them as extra field - just hard-code them.

What @hussein-awala proposes is that you attempt to test it and figure out which fields can be templated by templating then one by one and trying to serialize the DAG. because not all of them CAN be serialized and you need to serialize DAG to DB to execute it. And not all of them will take effect when templated.

See description of the original error #29819 which explains the reason for doing it. Try to do what the author of the orignal issue and see it for yourself when you remove the protection. You don't need to believe anyone's word just try it and read the issue descrption.

This sounds like a workaround for a workaround :) For some reason, you've prohibited the rendering of all the BaseOperator fields (I'm still not convinced of the need for this change), and now you're asking me to exclude some of the fields from that rule (the rule I don't even agree and the rule you can't really justify)...

No. You simply didn't dig deep enough. Look at the issue and try to understand what happened there. This is solution to a real problem that overcautiously excluded all the fields. What we are kindly asking you to do is to find out which fields out of those can be templated.

Just looking at your examples neither pool nor doc_md should be allowed to be tempalted.

Thepool field is evaluated by the scheduler when scheduler scheduler tasks for execution, and then when celery worker picks up the task it only selects the tasks belonging to it's queue. And it happens long before template rendering happens - template rendering happpen in the worker of cellery AFTER the worker already used queue field to pick tasks.

Similary doc_md should not be templated, because it is used by the UI to display task details NOT task instance details - and in order to do do it, cannot render the field, because again - rendering happens at execution time by the task, not when the UI decides to display description of the task.

Similarly execution_timeout cannot be rendered, because it is serialized to the DB by DAG file processor, WAY BEFORE the task gets executed, and there it is stored as INT not string (which is the original issue that this PR solved). And it makes completely no sense to render execution_timeout because again, it is used BEFORE the task starts and rendering is happening in the worker - because execution_timout is applied at the monitoring level of the task in teh process that starts the actual "local task process" where rendering happen. So basically (if you look at the code) SIGARM is started and alarm on timeout is set before JINJA rendering happens. You likely would not like to do:

signal.setitimer(signal.ITIMER_REAL, '{{ render_timer }}')

That simply would not work. You need int not string.

So assumption when that PR got merged that most (if not all) of the fields of BaseOperators are like that and all of them should be excluded by default. That was goo d assumption, but over-cautious. Some parameters can be templated it seems.

You seem to find one that might be allow-listed. Maybe more can be as well. What we are kindly asking you to review all of those, possibly test (as @hussein-awala proposed) and come with PR where you specifically allow-list those fields that CAN be allowlisted and you tested that they work.

Is it too much of an ask?

@hussein-awala
Copy link
Member Author

But how does this PR help solve the rendering issue?

Most of these attributes should not be used as a templated field, so the PR helped quickly fail the dag parsing instead of falling in runtime.

I'm still not convinced of the need for this change

Most of the BaseOperator fields are used in parsing time and not runtime, for example, execution_date, priority_weight, the different callbacks, the different dependencies params, retry params, and the pool (I don't know why you think it could be templated, it's used to see if we can queue the task or not, so before executing the job), and even doc_md which is a tasks param more than task instance param.

For the email, we used it in the TaskInstance class when the TI fails, so after executing it, it's ok to exclude it. I'm not sure if we can exclude the other params; as I mentioned, most of them are used by the scheduler, executor, or webserver before executing the task.

@gdavoian
Copy link
Contributor

gdavoian commented Nov 9, 2023

@potiuk @hussein-awala Thank you for the explanations! I'm starting to understand the problem and what I'm supposed to do. I'll check and get back to you soon. My apologies for any earlier misunderstandings!

@potiuk
Copy link
Member

potiuk commented Nov 9, 2023

@potiuk @hussein-awala Thank you for the explanations! I'm starting to understand the problem and what I'm supposed to do. I'll check and get back to you soon. My apologies for any earlier misunderstandings!

No worries :). Airflow is pretty ... complicated. ... and even after years of working on it every day I find something new every day or make wrong assumption. See that one for example #35539 (comment) just about 2 minutes ago. ...

@gdavoian
Copy link
Contributor

gdavoian commented Nov 9, 2023

Hm, you were right, looks like only email is a field that is 100% safe to template (as I mentioned before, the field is heavily used by our team as default_args["email"] for each and every DAG). There is also run_as_user that seems to be safe as well, but I've never used it, so can't really comment on its usage. If you're fine with that, then I can open a PR to exclude email (and optionally run_as_user if you think it makes sense) from that check.

@potiuk
Copy link
Member

potiuk commented Nov 9, 2023

Hm, you were right, looks like only email is a field that is 100% safe to template (as I mentioned before, the field is heavily used by our team as default_args["email"] for each and every DAG). There is also run_as_user that seems to be safe as well, but I've never used it, so can't really comment on its usage. If you're fine with that, then I can open a PR to exclude email (and optionally run_as_user if you think it makes sense) from that check.

I would say - just email. Because security. I have not realized it before, but this PR also had some interesting security implications.

I believe it is better to not allow to modify run_as_user because it is a potential security risk. The run_as_user is used to do impersonation, so basically run the final task as a specific linux user (via sudo basically). You can easily imagine someone tricking the code of task A to pass root there to run task B with root priviledges where someone wanted to use nobody. And usually impersonation is done because of security concerns actually, so dynamically generating it is quite a bit risky

Of course that would require several layers of other security issues, but generally speaking (similarly to SQL INJECTION kind of vulnerabilities) - if you have a content generated from (potentially) user input, you should always sanitize it and not use it any context that is anywhere near security - bound decisions.

@gdavoian
Copy link
Contributor

gdavoian commented Nov 9, 2023

@potiuk @hussein-awala You're welcome to review #35546 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:serialization type:misc/internal Changelog: Misc changes that should appear in change log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DAG fails serialization if template_field contains execution_timeout
6 participants