Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Airflow serialization for namedtuple #37168

Merged
merged 10 commits into from Feb 18, 2024

Conversation

Joffreybvn
Copy link
Contributor

@Joffreybvn Joffreybvn commented Feb 4, 2024

This PR fixes: #36839
In some cases, the namedtuple was not recognized as tuple and failed to serialize.


Side note:
When we return a tuple from a task, the tuple reach the XComEncoder and get serialized as a dict - thanks to this line. However, when we return a list of tuples, the tuples get serialized as lists. The above condition is false, and we fallback to the default json encoding behavior. Thus, the way namedtuple / tuple are handled is inconsistent.

This causes some queries of the Databricks Hook and Operator to fail right now, depending on how the Hook / Operator are parameterized.

This PR fix the namedtuple for such queries, but it doesn't fix the inconsistency: sometimes namedtuple will be encoded as dict, sometimes as list. IMO a definitive fix involve another strategy than qualname(). Because namedtuples have dynamic qualnames, wich will never be "builtins.tuple".


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@Joffreybvn Joffreybvn marked this pull request as draft February 4, 2024 17:07
@Joffreybvn Joffreybvn marked this pull request as ready for review February 4, 2024 18:51
Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm picky in this area due for speed and tightness. This gets called a lot. Make sure to check for attributes as much as you can (hasattr > isinstance)

airflow/serialization/serde.py Outdated Show resolved Hide resolved
airflow/serialization/serde.py Outdated Show resolved Hide resolved
@bolkedebruin bolkedebruin added this to the Airflow 2.8.2 milestone Feb 5, 2024
@bolkedebruin
Copy link
Contributor

ping @Joffreybvn :-)

@Joffreybvn
Copy link
Contributor Author

Sorry, busy week. I'm on it !

Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the changes to a minimum. The only two lines that are required are

if _is_namedtuple(o):
  qn = "builtins.tuple"

airflow/serialization/serde.py Show resolved Hide resolved
airflow/serialization/serde.py Show resolved Hide resolved
airflow/serialization/serde.py Show resolved Hide resolved
airflow/serialization/serde.py Show resolved Hide resolved
@Joffreybvn
Copy link
Contributor Author

We do need to specify the classname, which should be "builtins.tuple":

The serialize method (in airflow/serialization/serializers/builtin.py) does qualname() on the namedtuple, which returns "airflow.providers.databricks.hooks.Row" (the namedtuple dynamically created in the hook). If this is used as classname, it will fail to deserialize: there won't be any deserializer for it.

If we hardcode it to "builtins.tuple", it gets deserialized correctly.

@potiuk
Copy link
Member

potiuk commented Feb 15, 2024

Hey @joffreyvbn @bolkedebruin - any chance to get it soon ?

@Joffreybvn
Copy link
Contributor Author

Joffreybvn commented Feb 15, 2024

CI-CD is failing because the generated classname is the name of the namedtuple, instead of "builtins.tuple".
Do we want to serialize namedtuples like tuples ? If yes, we need to specify the classname @bolkedebruin.

Otherwise, how do you want namedtuples to be serialized?

@bolkedebruin
Copy link
Contributor

We do need to specify the classname, which should be "builtins.tuple":

The serialize method (in airflow/serialization/serializers/builtin.py) does qualname() on the namedtuple, which returns "airflow.providers.databricks.hooks.Row" (the namedtuple dynamically created in the hook). If this is used as classname, it will fail to deserialize: there won't be any deserializer for it.

If we hardcode it to "builtins.tuple", it gets deserialized correctly.

Ah sorry @Joffreybvn , I missed that. If you'd like you can revert to the previous / earlier version. That looks and reads better. Add a comment that you are overriding the class name and it makes things clear

Joffreybvn and others added 7 commits February 17, 2024 14:00
namedtuple is serialized like 'builtins.tuple'
In the future, if another object very similar to tuple exists, we can also check for `_make` and `_replace` attributes.
The classname of the namedtuple has to be `"builtins.tuple"`, instead of the qualname (the qualname will refer to the name given to the namedtuple when created).

I assume that `_serializers[qn].serialize(o)` won't always return a `classname` == qualname. Thus, creating a new `classname` variable set to "builtins.tuple", to ensure the namedtuple is considered as tuple.
This reverts commit 9015e3a.
Note: we do need to specify the classname, which should be "builtins.tuple":

The serialize method (in airflow/serialization/serializers/builtin.py) does qualname() on the namedtuple, which returns "airflow.providers.databricks.hooks.Row" (the namedtuple dynamically created in the hook). If this is used as classname, it will fail to deserialize: there won't be any deserializer for it.

If we hardcode it to "builtins.tuple", it gets deserialized correctly.
@Joffreybvn
Copy link
Contributor Author

Thanks @bolkedebruin ! I reverted the commit, and added a comment to clarify it.

@bolkedebruin bolkedebruin merged commit c49f857 into apache:main Feb 18, 2024
57 checks passed
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Feb 19, 2024
ephraimbuddy pushed a commit that referenced this pull request Feb 20, 2024
Namedtuple is serialized like 'builtins.tuple'

The serialize method (in airflow/serialization/serializers/builtin.py) does qualname() on the namedtuple, which returns an arbitrary name. If this is used as classname, it will fail to deserialize: there won't be any deserializer for it.

(cherry picked from commit c49f857)
ephraimbuddy pushed a commit that referenced this pull request Feb 20, 2024
Namedtuple is serialized like 'builtins.tuple'

The serialize method (in airflow/serialization/serializers/builtin.py) does qualname() on the namedtuple, which returns an arbitrary name. If this is used as classname, it will fail to deserialize: there won't be any deserializer for it.

(cherry picked from commit c49f857)
sunank200 pushed a commit to astronomer/airflow that referenced this pull request Feb 21, 2024
Namedtuple is serialized like 'builtins.tuple'

The serialize method (in airflow/serialization/serializers/builtin.py) does qualname() on the namedtuple, which returns an arbitrary name. If this is used as classname, it will fail to deserialize: there won't be any deserializer for it.
ephraimbuddy pushed a commit that referenced this pull request Feb 22, 2024
Namedtuple is serialized like 'builtins.tuple'

The serialize method (in airflow/serialization/serializers/builtin.py) does qualname() on the namedtuple, which returns an arbitrary name. If this is used as classname, it will fail to deserialize: there won't be any deserializer for it.

(cherry picked from commit c49f857)
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
Namedtuple is serialized like 'builtins.tuple'

The serialize method (in airflow/serialization/serializers/builtin.py) does qualname() on the namedtuple, which returns an arbitrary name. If this is used as classname, it will fail to deserialize: there won't be any deserializer for it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DatabricksSQLOperator struggles with parsing the data
5 participants