This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Templating (like {{ ds }} ) stopped working in papermill after upgrade from 2.3.x to 2.4.x #26931

Closed
1 of 2 tasks
tpapaj-CKPL opened this issue Oct 7, 2022 · 20 comments

Comments

@tpapaj-CKPL

Apache Airflow version

2.4.1

What happened

I am using CeleryKubernetesExecutor with a Celery worker, running a Papermill task.
Suddenly, after upgrading from Airflow 2.3.4 to 2.4.1, the {{ ds }} template stopped being recognized on the worker in Papermill notebooks. I can see it rendered properly in the UI, but on the worker the parameters contain the literal {{ ds }}, and the operator fails with ValueError: time data '{{ ds }}' does not match format '%Y-%m-%d'

    featuresNumber = PapermillOperator(
        task_id='features_number',
        input_nb=scriptsPath("features/features_number/features_number.ipynb"),
        output_nb=scriptsPath("features/features_number/features_number_rendered.ipynb"),
        parameters={
            'processingDate': "{{ ds }}",
            'dailyFeaturesInputPath': outputDataPath("daily/features_number"),
            'workdir': dgxPath("secrets/feature-matrix/")
        },
        queue="dgx"
    )

Now all papermill tasks using {{ ds }} are broken.
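For context, the ValueError itself is just what happens when the un-rendered placeholder reaches a datetime.strptime call (presumably inside the notebook; the exact parsing code is an assumption here). A minimal reproduction of the error message:

```python
from datetime import datetime

# If templating fails, the notebook receives the literal string "{{ ds }}",
# and any strptime-style parsing of it raises the reported error.
try:
    datetime.strptime("{{ ds }}", "%Y-%m-%d")
except ValueError as exc:
    message = str(exc)

print(message)  # time data '{{ ds }}' does not match format '%Y-%m-%d'
```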

What you think should happen instead

{{ ds }} should be properly templated.

How to reproduce

Run an Airflow instance with a Celery worker, both on 2.4.1. Try running PapermillOperator on the worker with {{ ds }} as a notebook parameter. Check that the output notebook contains the literal {{ ds }} in the parameter instead of the rendered value.

Operating System

docker image nvidia/cuda:10.1-cudnn8-devel-ubuntu18.04 (Ubuntu 18.04)

Versions of Apache Airflow Providers

apache-airflow-providers-apache-beam 3.1.0 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-apache-cassandra 2.0.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-apache-hive 2.0.2 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-apache-spark 2.0.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-celery 2.1.0 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-cncf-kubernetes 2.0.2 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-ftp 2.0.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-google 5.1.0 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-http 2.0.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-imap 2.0.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-jdbc 2.0.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-mysql 2.1.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-papermill 3.0.0 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-postgres 2.2.0 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-sftp 2.1.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-sqlite 2.0.1 pyhd8ed1ab_0 conda-forge
apache-airflow-providers-ssh 2.1.1 pyhd8ed1ab_0 conda-forge

Deployment

Other Docker-based deployment

Deployment details

Google Kubernetes Engine for Airflow, regular Docker for celery worker

Anything else

It happened when I upgraded from Airflow 2.3.4 to 2.4.1; no other libraries were changed. I tried both apache-airflow-providers-papermill 3.0.0 and apache-airflow-providers-papermill 2.2.3.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

@tpapaj-CKPL tpapaj-CKPL added the area:core and kind:bug labels Oct 7, 2022
@tpapaj-CKPL
Author

tpapaj-CKPL commented Oct 7, 2022

It looks like no templates are working, because I also have problems getting ds_nodash to work. Templating of the output notebook path, however, works.

Also, I can't understand how templating works in Airflow now. Before 2.3.x there was a dedicated function in BaseOperator, but that got removed. How does templating work now? I can see that the execute function already receives partially templated data, but I can't see where that function is called.

@uranusjr
Member

uranusjr commented Oct 7, 2022

All the rendering functions are still there; some of them were moved up to AbstractOperator (the parent class).

@tpapaj-CKPL
Author

I copy-pasted the entire BaseOperator code from older versions into the new one and it still didn't work. Thanks to the new downgrade command I was able to downgrade to 2.3.0, which was the last stable version in my environment (with small CeleryKubernetesExecutor script fixes), and now templating works correctly. I couldn't find the source of the issue. I had problems with k8s pod overrides in 2.3.4, so I went back to 2.3.0.

@marvinfretly

marvinfretly commented Oct 11, 2022

Same here: after upgrading to 2.4.1, the parameters in my Papermill tasks are no longer rendered.

Here is my code:

task_notebook = PapermillOperator(
    task_id=path.split('/')[-1].replace('.ipynb', ''),
    input_nb=path,
    output_nb="/output/{}".format(path.split('/')[-1]),
    parameters={
        "execution_date": "{{ execution_date }}",
        "filePath": "{{ti.xcom_pull(task_ids=['downloadFile'])[0]}}",
        'fileExtension': "{{dag_run.conf.get('document')['extension']}}",
        'numeroProcess': "{{ti.xcom_pull(task_ids=['getProcess'])[0]['id']}}"
    })

@tpapaj-CKPL tpapaj-CKPL changed the title {{ ds }} template stopped working in papermill after upgrade to 2.4.x Templating (like {{ ds }} ) stopped working in papermill after upgrade from 2.3.x to 2.4.x Oct 12, 2022
@marvinfretly

marvinfretly commented Oct 12, 2022

In case you want to downgrade to an older version:

  1. Go inside your container using docker exec -it <container_name> bash
  2. Execute airflow db downgrade --to-revision <revision_id>
  3. Downgrade the Airflow version in your Dockerfile
     (see the Database Migrations reference for revision IDs)

That's it

@uranusjr
Member

I don’t have a setup to run Papermill, but a test with BashOperator(..., env={"FOO": "{{ ds }}"}) indicates that the template rendering mechanism is running correctly, so this is likely specific to the Papermill operator. But the operator is very straightforward, and I can’t see how it could be problematic at all. We’ll need someone who can reproduce this to debug the issue, I’m afraid 😞

@eladkal
Contributor

eladkal commented Oct 12, 2022

I don't think this is a bug in the operator. We even have coverage for templating in the tests:

def test_render_template(self):
    args = {'owner': 'airflow', 'start_date': DEFAULT_DATE}
    dag = DAG('test_render_template', default_args=args)
    operator = PapermillOperator(
        task_id="render_dag_test",
        input_nb="/tmp/{{ dag.dag_id }}.ipynb",
        output_nb="/tmp/out-{{ dag.dag_id }}.ipynb",
        parameters={"msgs": "dag id is {{ dag.dag_id }}!"},
        kernel_name="python3",
        language_name="python",
        dag=dag,
    )

so I doubt the issue is in the operator itself.
@marvinfretly Please try a minimal example:

task_notebook = PapermillOperator(
    task_id="hello",
    input_nb=path,
    output_nb=path2,
    parameters={
        "execution_date": "{{ execution_date }}",
    })

I suspect your issue is not a bug in the operator but incorrect usage in how you pull from XCom / extract values from conf.
Can you check these templated values with another operator to see if they render correctly?

@marvinfretly

I downgraded to version 2.3.4 and it's working perfectly fine. I don't see how the problem could be in my script rather than in the operator, since it works on the older version.

@eladkal
Contributor

eladkal commented Oct 12, 2022

Hi @marvinfretly, can you please test the same templating code with other operators?
We need to confirm whether this is a Papermill issue or a general issue.

@marvinfretly

marvinfretly commented Oct 12, 2022

I tried this and it's actually working:

task_test = PythonOperator(
    task_id='test',
    python_callable=test,
    do_xcom_push=True,
    provide_context=True,
    trigger_rule='none_failed',
    op_args=["{{ execution_date }}"]
)

I tried this minimal task and it's still not working.
So yes, it looks like a PapermillOperator issue:

task_notebook = PapermillOperator(
    task_id="oklm",
    input_nb="/opt/airflow/dags/modules/etl/preprocess/test.ipynb",
    output_nb="/output/test2.ipynb",
    parameters={
        "execution_date": "{{ execution_date }}",
    })

@tpapaj-CKPL
Author

I checked old logs yesterday and found something interesting.
It looks like templating of the output_nb argument worked on 2.4.1 and everything was fine there; however, nothing in the parameters argument was templated. The "Rendered Template" tab in the Airflow webserver also showed the correct value. I had a BigQueryToGCSOperator with {{ ds_nodash }} used in the source_project_dataset_table and destination_cloud_storage_uris arguments, and those were working.

So it seems the only problem is with the parameters argument in Papermill on the executor node. Maybe the parameters argument was blacklisted from being templated?

@eladkal
Contributor

eladkal commented Oct 12, 2022

Cool. So this narrows the issue down to this class:

class NoteBook(File):
    """Jupyter notebook"""

    type_hint: str | None = "jupyter_notebook"
    parameters: dict | None = {}
    meta_schema: str = __name__ + '.NoteBook'

This is the difference between PapermillOperator and the other operators, and this class also utilizes parameters, which is probably why the issue is localized to parameters and doesn't affect other templated fields.

As for the cause of the bug, it's still a mystery to me. @uranusjr, given the above information, maybe you can speculate better?

@uranusjr
Member

uranusjr commented Oct 13, 2022

Hmm, I can see why this doesn’t work, but the problem is, I can’t understand how it ever worked before 2.4! The template rendering code should never have been able to descend into a nested class like this. I reduced the code to this, and the result corroborates that (I tested on main, 2.3.3, and 2.2.2):

from airflow.providers.papermill.operators.papermill import PapermillOperator

context = {"ds": "2022-10-13"}

pp = PapermillOperator(task_id="pp", input_nb="my_nb", parameters={"x": "{{ ds }}"})
pp.render_template_fields(context)

# This passes because 'parameters' is templated.
assert pp.parameters["x"] == "2022-10-13"

# This does not work because the 'NoteBook' class is not renderable.
assert pp.inlets[-1].parameters["x"] == "2022-10-13"

We can fix the apparent issue here (by making NoteBook renderable), but I’m uneasy about this since we still seem to be missing something.

@Nino171

Nino171 commented Nov 15, 2022

Is there any update on this? I'm facing the same issue regarding the parameters right now.

@Nino171

Nino171 commented Nov 15, 2022

For anyone interested, I worked around this issue with the PythonOperator: I take **kwargs in the callable and, with e.g. kwargs['execution_date'].isoformat(), I am able to pass the variables inside my parameters dict/JSON.
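A hedged sketch of that workaround (the helper name and notebook paths are illustrative, not from the thread): build the parameters dict from values that are already rendered in the callable's context, bypassing operator-level templating entirely.

```python
from datetime import datetime

# Hypothetical helper: turn already-rendered context values into the
# papermill parameters dict.
def build_notebook_params(execution_date: datetime) -> dict:
    return {"execution_date": execution_date.isoformat()}

# Inside a PythonOperator callable it would be used roughly like this
# (requires Airflow and papermill, so sketched as a comment):
#
# def run_notebook(**kwargs):
#     import papermill as pm
#     pm.execute_notebook(
#         "/opt/airflow/dags/nb.ipynb",   # illustrative input path
#         "/output/nb_out.ipynb",         # illustrative output path
#         parameters=build_notebook_params(kwargs["execution_date"]),
#     )

print(build_notebook_params(datetime(2022, 10, 13))["execution_date"])
# 2022-10-13T00:00:00
```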

@tpapaj-CKPL
Author

We can fix the apparent issue here (by making NoteBook renderable), but I’m uneasy about this since we still seem to be obviously missing something.

@uranusjr Could you please provide a hint on how to do that as a dirty fix? This issue prevents anyone using the Papermill operator from upgrading Airflow.

@uranusjr
Member

You can override _render_nested_template_fields to provide rendering support for custom objects. See operators like KubernetesPodOperator (which uses it to render V1EnvVar objects) for examples.

@tpapaj-CKPL
Author

tpapaj-CKPL commented Nov 24, 2022

Thanks! I did that, but there was no change. I added some console prints in AbstractOperator, and it seems _render_nested_template_fields is not called at all in PapermillOperator. I will look into it more.

EDIT:
I used the code from here: #26931 (comment)

Here are the calls made to the render_template method, showing the context and content used and the type of each input:

Context: {"ds": "2022-10-13"}
Content: "my_nb" (string)
Output: "my_nb"

Context: {"ds": "2022-10-13"}
Content: {"x": "{{ ds }}"} (dict, so next method call on this parameter)
Context: {"ds": "2022-10-13"}
Content: "{{ ds }}" (string)
Output: {"x": "2022-10-13"}

Still, the pp.inlets[-1].parameters["x"] from the example contains only {{ ds }}

@nicnguyen3103

I found the issue in the code of Papermill (or Airflow). The render_template method apparently executes after __init__ but before execute(). Since Papermill populates self.inlets in the constructor from self.parameters, the inlet's NoteBook object is created with the not-yet-rendered parameters and never gets updated. I will visualize the flow with the code here:

class PapermillOperator(BaseOperator):
    """
    Executes a jupyter notebook through papermill that is annotated with parameters

    :param input_nb: input notebook (can also be a NoteBook or a File inlet)
    :param output_nb: output notebook (can also be a NoteBook or File outlet)
    :param parameters: the notebook parameters to set
    :param kernel_name: (optional) name of kernel to execute the notebook against
        (ignores kernel name in the notebook document metadata)
    """

    supports_lineage = True

    template_fields: Sequence[str] = ('input_nb', 'output_nb', 'parameters', 'kernel_name', 'language_name')

    def __init__(
        self,
        *,
        input_nb: Optional[str] = None,
        output_nb: Optional[str] = None,
        parameters: Optional[Dict] = None,
        kernel_name: Optional[str] = None,
        language_name: Optional[str] = None,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)

        self.input_nb = input_nb
        self.output_nb = output_nb
        self.parameters = parameters
        self.kernel_name = kernel_name
        self.language_name = language_name
        
        # self.inlets is populated here with the not-yet-rendered self.parameters
        if input_nb:
            self.inlets.append(NoteBook(url=input_nb, parameters=self.parameters))
        if output_nb:
            self.outlets.append(NoteBook(url=output_nb))

    def execute(self, context: 'Context'):
        # by this point self.parameters has been rendered, but self.inlets never got updated
        if not self.inlets or not self.outlets:
            raise ValueError("Input notebook or output notebook is not specified")

        for i, item in enumerate(self.inlets):
            pm.execute_notebook(
                item.url,
                self.outlets[i].url,
                parameters=item.parameters,
                progress_bar=False,
                report_mode=True,
                kernel_name=self.kernel_name,
                language=self.language_name,
            )

The problem is fixed if you modify the execute method like this, building self.inlets and self.outlets inside execute, after rendering has already happened:

    def execute(self, context: 'Context'):
        if not self.input_nb or not self.output_nb:
            raise ValueError("Input notebook or output notebook is not specified")
        self.inlets.append(NoteBook(url=self.input_nb, 
                                    parameters=self.parameters))
        self.outlets.append(NoteBook(url=self.output_nb))

        for i, item in enumerate(self.inlets):
            pm.execute_notebook(
                item.url,
                self.outlets[i].url,
                parameters=item.parameters,
                progress_bar=False,
                report_mode=True,
                kernel_name=self.kernel_name,
                language=self.language_name,
            )

I am currently on Airflow 2.5.0, and I am looking at the Airflow TaskInstance code now to see if they changed the rendering.
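The staleness described above can be reproduced without Airflow at all. This stripped-down mock (all class names are stand-ins, not actual Airflow or provider code) shows why the inlet keeps the un-rendered dict: the renderer rebinds the operator attribute to a new dict, while the inlet object still references the original one.

```python
class FakeNoteBook:
    """Stand-in for the provider's NoteBook lineage class."""
    def __init__(self, url, parameters=None):
        self.url = url
        self.parameters = parameters

class FakePapermillOperator:
    """Stand-in operator: the inlet is built in __init__, as in the provider."""
    def __init__(self, input_nb, parameters):
        self.parameters = parameters
        self.inlets = [FakeNoteBook(url=input_nb, parameters=self.parameters)]

    def render_template_fields(self, context):
        # Airflow's renderer assigns the rendered result back onto the
        # operator attribute, i.e. it REBINDS self.parameters to a new dict...
        self.parameters = {
            key: value.replace("{{ ds }}", context["ds"])
            for key, value in self.parameters.items()
        }
        # ...so the FakeNoteBook created in __init__ still references the
        # original, un-rendered dict.

op = FakePapermillOperator(input_nb="my_nb", parameters={"x": "{{ ds }}"})
op.render_template_fields({"ds": "2022-10-13"})

print(op.parameters["x"])             # 2022-10-13  (rendered)
print(op.inlets[-1].parameters["x"])  # {{ ds }}    (stale)
```

Building the inlets inside execute, as in the fix sketched earlier in the thread, sidesteps exactly this rebinding problem.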

@uranusjr
Member

Rendering has always happened after __init__ and before execute, since that’s how Airflow runs things. This doesn’t really explain why the mechanism worked before 2.4.

@apache apache locked and limited conversation to collaborators Jan 16, 2023
@potiuk potiuk converted this issue into discussion #28977 Jan 16, 2023
