Correctly store non-default Nones in serialized tasks/dags #8772

Merged: 5 commits, May 10, 2020

Conversation

ashb (Member) commented May 7, 2020

The default schedule_interval for a DAG is `@daily`, so
`schedule_interval=None` is actually not the default, but we were not
storing _any_ null attributes previously.

This meant that upon re-inflating the DAG the schedule_interval would
become `@daily`.

This fixes that problem, and extends the test to look at _all_ the
serialized attributes in our round-trip tests, rather than just the few
that the webserver cared about.

It doesn't change the serialization format, it just changes which
values are stored, and when.

This solution was more complex than I hoped for, but the test case in
test_operator_subclass_changing_base_defaults is a real one that the
round-trip tests discovered from the DatabricksSubmitRunOperator -- I
have just captured it in this test in case that specific operator
changes in the future.
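The description above can be sketched in a few lines. This is an illustrative toy (the names `serialize_buggy`, `serialize_fixed`, `DAG_DEFAULTS`, and `inflate` are hypothetical, not Airflow's real serializer): if serialization drops every None, a non-default None is lost and the default silently comes back on re-inflation.

```python
# Hypothetical sketch of the bug this PR fixes -- not Airflow's actual code.
DAG_DEFAULTS = {"schedule_interval": "@daily"}

def serialize_buggy(dag_attrs):
    # Old behaviour: drop all None values, even when None is NOT the default.
    return {k: v for k, v in dag_attrs.items() if v is not None}

def serialize_fixed(dag_attrs):
    # New behaviour: drop a value only when it equals the known default.
    return {k: v for k, v in dag_attrs.items() if DAG_DEFAULTS.get(k, object()) != v}

def inflate(serialized):
    # Re-inflating fills in defaults for anything that was not stored.
    merged = dict(DAG_DEFAULTS)
    merged.update(serialized)
    return merged

attrs = {"schedule_interval": None}
print(inflate(serialize_buggy(attrs)))  # {'schedule_interval': '@daily'} -- wrong
print(inflate(serialize_fixed(attrs)))  # {'schedule_interval': None} -- correct
```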


Make sure to mark the boxes below before creating the PR:

  • Description above provides context of the change
  • Unit tests coverage for changes (not needed for documentation changes)
  • Target Github ISSUE in description if exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

ashb requested a review from kaxil May 7, 2020 22:17
ashb (Member, Author) commented May 8, 2020

Curious failure:

E                   AssertionError: default_args[start_date] matches
E                   assert <Pendulum [2020-05-07T22:49:48.687518+00:00]> == <Pendulum [2020-05-07T22:49:46.451514+00:00]>

I wonder why I didn't see that locally.

kaxil (Member) commented May 8, 2020

> Curious failure:
>
>     E                   AssertionError: default_args[start_date] matches
>     E                   assert <Pendulum [2020-05-07T22:49:48.687518+00:00]> == <Pendulum [2020-05-07T22:49:46.451514+00:00]>
>
> I wonder why I didn't see that locally.

Somewhere start_date might be .now(), causing it? Not sure.

ashb (Member, Author) commented May 8, 2020

Curious: the DAG with the problem has `'start_date': datetime.utcnow()` -- that's "wrong", but I wouldn't have expected this test to fail. I guess we are loading the dag twice, once with s10n, once without, and _then_ comparing.

ashb (Member, Author) commented May 8, 2020

Oh, cos it's parsing in a subprocess. Hmmmm!

I don't think we need to parse all dags in a subprocess, just one is enough.
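The failure mode discussed here can be reproduced in miniature. A minimal sketch (the helper `parse_dag_file` is a hypothetical stand-in for importing the DAG module): `datetime.utcnow()` runs at parse time, so parsing the same file once in-process and once in a subprocess yields two different "start dates".

```python
from datetime import datetime
import time

def parse_dag_file():
    # Stand-in for executing the DAG file: utcnow() is evaluated at parse
    # time, so every parse produces a different timestamp.
    return {"start_date": datetime.utcnow()}

first = parse_dag_file()
time.sleep(0.01)            # the subprocess parse happens a bit later
second = parse_dag_file()

# The round-trip comparison sees two unequal Pendulum/datetime values,
# exactly like the AssertionError above.
assert first["start_date"] != second["start_date"]
```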

ashb (Member, Author) commented May 8, 2020

Changed it to only parse example_dags (rather than all the provider dags too) in the subprocess, and the rest are just round-tripped. This cut the test time for test_serialized_dags from 8s to 4-5s on my laptop too as a bonus :)

ashb requested reviews from BasPH and kaxil May 8, 2020 20:52
ashb added a commit to astronomer/airflow that referenced this pull request May 8, 2020
The main thing I was fixing here was `start_date=utcnow()`, which is
always going to be wrong (discovered via a test in apache#8772).

While I was updating the DAG I updated it to use a context manager and
shift operators.
ashb mentioned this pull request May 8, 2020
mik-laj pushed a commit that referenced this pull request May 8, 2020
@@ -648,6 +683,23 @@ def test_dag_serialized_fields_with_schema(self):
dag_params: set = set(dag_schema.keys()) - ignored_keys
self.assertEqual(set(DAG.get_serialized_fields()), dag_params)

def test_operator_subclass_changing_base_defaults(self):
ashb (Member, Author) commented:

@kaxil this is the case that means we need to check the MRO for defaults.

I think storing all non-defaults is nicer anyway - we can show these in the UI that way
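The "check the MRO for defaults" idea can be sketched briefly. This is an illustrative toy, not Airflow's actual `_value_is_hardcoded_default` (the names `Base`, `Sub`, and `ctor_default` are hypothetical): a subclass may override a base-class constructor default, so the first class in the MRO that declares the parameter wins.

```python
import inspect

class Base:
    def __init__(self, do_xcom_push=True):
        self.do_xcom_push = do_xcom_push

class Sub(Base):
    def __init__(self, do_xcom_push=False):  # subclass flips the base default
        super().__init__(do_xcom_push=do_xcom_push)

def ctor_default(instance, name):
    # Walk the MRO; the first class whose __init__ declares `name` with a
    # default wins, mirroring the "check all super classes" approach.
    for typ in type(instance).mro():
        try:
            params = inspect.signature(typ.__init__).parameters
        except (TypeError, ValueError):
            continue
        if name in params and params[name].default is not inspect.Parameter.empty:
            return params[name].default
    raise KeyError(name)

print(ctor_default(Sub(), "do_xcom_push"))   # False -- Sub's default, not Base's
print(ctor_default(Base(), "do_xcom_push"))  # True
```

Only looking at `BaseOperator`'s defaults would wrongly treat `do_xcom_push=False` on a `Sub` instance as non-default (or default, depending on direction), which is exactly the subtlety the test captures.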

kaxil (Member) left a comment:

Don't merge it yet, testing something locally.

Just wanted to test it with the following diff:

diff --git a/airflow/serialization/serialized_objects.py b/airflow/serialization/serialized_objects.py
index 0ab1d80ff..d3a1ff54a 100644
--- a/airflow/serialization/serialized_objects.py
+++ b/airflow/serialization/serialized_objects.py
@@ -416,60 +416,6 @@ class SerializedBaseOperator(BaseOperator, BaseSerialization):
 
         return op
 
-    @classmethod
-    def _is_constructor_param(cls, attrname: str, instance: Any) -> bool:
-        # Check all super classes too
-        return any(
-            attrname in cls.__constructor_params_for_subclass(typ)
-            for typ in type(instance).mro()
-        )
-
-    @classmethod
-    def _value_is_hardcoded_default(cls, attrname: str, value: Any, instance: Any) -> bool:
-        """
-        Check if ``value`` is the default value for ``attrname`` as set by the
-        constructor of ``instance``, or any of it's parent classes up
-        to-and-including BaseOperator.
-
-        .. seealso::
-
-            :py:meth:`BaseSerialization._value_is_hardcoded_default`
-        """
-
-        def _is_default(ctor_params, attrname, value):
-            if attrname not in ctor_params:
-                return False
-            ctor_default = ctor_params[attrname].default
-
-            # Also returns True if the value is an empty list or empty dict.
-            # This is done to account for the case where the default value of
-            # the field is None but has the ``field = field or {}`` set.
-            return ctor_default is value or (ctor_default is None and value in [{}, []])
-
-        for typ in type(instance).mro():
-            ctor_params = cls.__constructor_params_for_subclass(typ)
-
-            if _is_default(ctor_params, attrname, value):
-                if typ is BaseOperator:
-                    return True
-                # For added fun, if a subclass sets a different default value to the
-                # same argument, (i.e. a subclass changes default of do_xcom_push from
-                # True to False), we then do want to include it.
-                #
-                # This is because we set defaults based on BaseOperators
-                # defaults, so if we didn't set this when inflating we'd
-                # have the wrong value
-
-                base_op_ctor_params = cls.__constructor_params_for_subclass(BaseOperator)
-                if attrname not in base_op_ctor_params:
-                    return True
-                return base_op_ctor_params[attrname].default == value
-
-            if typ is BaseOperator:
-                break
-
-        return False
-
     @classmethod
     def _is_excluded(cls, var: Any, attrname: str, op: BaseOperator):
         if var is not None and op.has_dag() and attrname.endswith("_date"):

All tests pass except one with the above diff.

Failing test:

    def validate_deserialized_task(self, serialized_task, task,):
        """Verify non-airflow operators are casted to BaseOperator."""
        assert isinstance(serialized_task, SerializedBaseOperator)
        assert not isinstance(task, SerializedBaseOperator)
        assert isinstance(task, BaseOperator)

        fields_to_check = task.get_serialized_fields() - {
            # Checked separately
            '_task_type', 'subdag',

            # Type is excluded, so don't check it
            '_log',

            # List vs tuple. Check separately
            'template_fields',

            # We store the string, real dag has the actual code
            'on_failure_callback', 'on_success_callback', 'on_retry_callback',
        }

        assert serialized_task.task_type == task.task_type

        for field in fields_to_check:
>           assert getattr(serialized_task, field) == getattr(task, field), \
                f'{task.dag.dag_id}.{task.task_id}.{field} does not match'
E           AssertionError: example_gcp_gke.pod_task.resources does not match
E           assert None == []
E             +None
E             -[]

tests/serialization/test_dag_serialization.py:353: AssertionError

That fails because we are not storing `[]` as it is "default"-ish (see the snippet below), but on re-inflating it gets set to None, which isn't the same as the value in the dag.

        if kwargs.get('xcom_push') is not None:
            raise AirflowException("'xcom_push' was deprecated, use 'do_xcom_push' instead")
        super().__init__(*args, resources=None, **kwargs)

        self.resources = self._set_resources(resources)

    @staticmethod
    def _set_resources(resources):
        if not resources:
            return []
        return [Resources(**resources)]

We can still keep the changes in this PR to support serialising all properties of dags in the future. The performance impact should be minimal because of the use of the LRU cache.
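The mismatch kaxil describes can be shown in a few lines. A minimal sketch (the class `FakeOperator` is hypothetical, standing in for the operator whose `_set_resources` is quoted above): the constructor rewrites None to `[]`, so a value skipped as "default-ish" during serialization comes back as None and no longer equals the live task's value.

```python
class FakeOperator:
    def __init__(self, resources=None):
        # Mirrors the `_set_resources` pattern above: falsy input becomes [].
        self.resources = [] if not resources else [resources]

task = FakeOperator()                   # live task: resources == []
serialized = {}                         # [] was treated as a default, so not stored
inflated = serialized.get("resources")  # comes back as None after inflation

assert task.resources == []
assert inflated is None
assert inflated != task.resources       # the round-trip mismatch in the failing test
```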

kaxil (Member) commented May 9, 2020

Feel free to merge once we add a comment in the doc on why we kept the change.

ashb (Member, Author) commented May 9, 2020

I was wrong about where that function was used, so I've removed that complex code and replaced it with this in the test instead:

        if serialized_task.resources is None:
            assert task.resources is None or task.resources == []
        else:
            assert serialized_task.resources == task.resources

ashb (Member, Author) commented May 9, 2020

(We've got that code in this diff comment, so we can bring it back when we want it, but that will need more changes elsewhere to serialize more fields.)

kaxil added this to the Airflow 1.10.11 milestone May 10, 2020
ashb merged commit a715aa6 into apache:master May 10, 2020
ashb deleted the s10n-store-schedule-interval-none branch May 10, 2020 07:57
ashb added a commit to astronomer/airflow that referenced this pull request May 18, 2020
(cherry picked from commit a715aa6)
kaxil pushed a commit that referenced this pull request Jun 26, 2020
(cherry picked from commit a715aa6)
potiuk pushed a commit that referenced this pull request Jun 29, 2020
(cherry picked from commit a715aa6)
kaxil added the type:bug-fix (Changelog: Bug Fixes) label Jul 1, 2020
kaxil pushed a commit that referenced this pull request Jul 1, 2020
(cherry picked from commit a715aa6)
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
(cherry picked from commit a715aa6)