Fix serialize() usages in custom JSON encoders #28742

uranusjr · 2023-01-05T12:08:43Z

Our custom JSON encoders call serialize() in their default() hooks. However, serialize() does not guarantee to return a JSON-serializable type, and when it does not (e.g. a set object), the JSON encoder would infinitely recurse into serialize() trying to coerce the object into JSON-serializable.

This patch adds an additional ensure_json_compatible parameter to serialize(), so JSON encoders can use it to require serialize() to always return JSON-compatible values.

Some minor refactoring is also applied to serialize() to ensure the flag is correctly passed down through recursive visits, and also hide the depth tracking parameter (which is internal to the function) from outside access.

The _reverse_cache global variable is also deleted since it is not used anywhere, even before this patch.

uranusjr · 2023-01-05T12:09:41Z

airflow/serialization/serde.py

-_iterables = (list, set, tuple)
+_collections = (list, set, tuple)  # dict is treated specially.


Renamed to better reflect these classes. Iterable has special meaning in Python (collection also does, but its meaning is more in-line with what we mean here).

Our custom JSON encoders call serialize() in their default() hooks. However, serialize() does not guarantee to return a JSON-serializable type, and when it does not (e.g. a set object), the JSON encoder would infinitely recurse into serialize() trying to coerce the object into JSON-serializable. This patch adds an additional 'ensure_json_compatible' parameter to serialize(), so JSON encoders can use it to require serialize() to always return JSON-compatible values. Some minor refactoring is also applied to serialize() to ensure the flag is correctly passed down through recursive visits, and also hide the 'depth' tracking parameter (which is internal to the function) from outside access. The _reverse_cache global variable is also deleted since it is not used anywhere, even before this patch.

kaxil · 2023-01-05T16:37:38Z

Running it with full test suite

bolkedebruin · 2023-01-09T15:38:26Z

I do not think it should be solved here, but rather with a custom encoder for json that makes sets etc into a list if JSON is required. Such an encoder is already available.

Explanation: this code was intentionally written to remove all JSONisms and to have the possibility to encose into another format and to have cleaner code - hence the separation between serializer and encode (as it is within the standard library!) JSON does not guarantee order or uniqueness in its lists hence the issue. Three ways to solve this. 1) implement a JSON encoder that takes a collection and always converts it into a list with then (minor) downside the deserialization might fail. 2) adjust the serializer to serialize in such a way that the deserializer deserializes into the right type - e.g. How we do it now with custom objects 3) accept the loss of type information and convert into list in the serializer.

2 is the most robust, 1 is better for just json, 3 is a hack with potential deserialization issues (it most likely requires inspection slowing down the deserializer) but less of a hack than the current implementation.

Finally, importantly do not put any serialization code in a class. Classes are an order of a magnitude slower. This is due to the amount of stack walks the interpreter needs to do.

P.S. Please add a test that covers this both ways (deserialization / serialization) which takes care of uniqueness and ordering

P.P.S. I think the best way is to use option 2 and serialize tuples, lists, sets into something like

mytuple = {"classname": "builtins.tuple", data([a, b, c, d]), "version": "1"}}

thus wrapping it, so we can properly deserialize it.

bolkedebruin · 2023-01-09T18:07:19Z

airflow/serialization/serde.py

        return [deserialize(d) for d in o]

    if not isinstance(o, dict):
-        raise TypeError()
+        raise TypeError("not deserializable")


It is probably better to explain what type we are getting if we are improving the error message,.

bolkedebruin · 2023-01-09T18:07:30Z

airflow/serialization/serde.py

-            return True
-
-    return False
+    return any(p.match(classname) is not None for p in _patterns)


bolkedebruin · 2023-01-09T18:08:03Z

airflow/serialization/serde.py

    elif isinstance(value, dict):
        for k, v in value.items():
-            s += f"{k}={str(serialize(v, False))},"
+            s += f"{k}={str(serialize(v))},"


good catch, i think? :-)

bolkedebruin · 2023-01-09T18:08:52Z

airflow/utils/json.py

@@ -56,12 +56,12 @@ def default(self, o: Any) -> Any:
            return o.strftime("%Y-%m-%d")

        if isinstance(o, Decimal):
-            data = serialize(o)
+            data = serialize(o, ensure_json_compatible=True)


As mentioned, this should not be done this way, but rather be solved here.

bolkedebruin · 2023-01-09T18:10:13Z

airflow/serialization/serde.py

@@ -72,89 +70,104 @@ def decode(d: dict[str, str | int | T]) -> tuple:
    return d[CLASSNAME], d[VERSION], d.get(DATA, None)


-def serialize(o: object, depth: int = 0) -> U | None:
+@dataclasses.dataclass()
+class _Serializer:


Do not put this in a class. Classes are overhead on the stack and significantly slow down serialization/deserialization. The code was implemented on purpose o be as high up in the chain as possible.

bolkedebruin · 2023-01-09T18:15:40Z

airflow/serialization/serde.py

@@ -72,89 +70,104 @@ def decode(d: dict[str, str | int | T]) -> tuple:
    return d[CLASSNAME], d[VERSION], d.get(DATA, None)


-def serialize(o: object, depth: int = 0) -> U | None:
+@dataclasses.dataclass()


This adds calling overhead, which we don't want in this code.

uranusjr · 2023-01-17T00:55:21Z

Rethinking this, I think this can simply be achieved by a better implemented default. I’ll split out that main change and other minor improvements here into two PRs so things are easier to manage.

uranusjr added this to the Airflow 2.6.0 milestone Jan 5, 2023

uranusjr requested a review from bolkedebruin January 5, 2023 12:08

uranusjr requested review from kaxil and ashb as code owners January 5, 2023 12:08

boring-cyborg bot added the area:serialization label Jan 5, 2023

uranusjr commented Jan 5, 2023

View reviewed changes

uranusjr force-pushed the serialize-for-json branch 2 times, most recently from 2d1b83e to 26d8e86 Compare January 5, 2023 12:11

kaxil added the full tests needed We need to run full set of tests for this PR to merge label Jan 5, 2023

uranusjr added 2 commits January 5, 2023 16:37

Drive-by improvement

0446dba

kaxil force-pushed the serialize-for-json branch from 61c034e to 0446dba Compare January 5, 2023 16:37

bolkedebruin reviewed Jan 9, 2023

View reviewed changes

airflow/serialization/serde.py

return True

return False

return any(p.match(classname) is not None for p in _patterns)

Copy link

Contributor

bolkedebruin Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

bolkedebruin reviewed Jan 9, 2023

View reviewed changes

bolkedebruin self-assigned this Jan 9, 2023

bolkedebruin reviewed Jan 9, 2023

View reviewed changes

uranusjr closed this Jan 17, 2023

uranusjr mentioned this pull request Jan 17, 2023

Minor improvements to serde helpers #28978

Merged

ephraimbuddy removed this from the Airflow 2.6.0 milestone Apr 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix serialize() usages in custom JSON encoders #28742

Fix serialize() usages in custom JSON encoders #28742

uranusjr commented Jan 5, 2023

uranusjr Jan 5, 2023

kaxil commented Jan 5, 2023

bolkedebruin commented Jan 9, 2023 •

edited

bolkedebruin Jan 9, 2023

bolkedebruin Jan 9, 2023

bolkedebruin Jan 9, 2023

bolkedebruin Jan 9, 2023

bolkedebruin Jan 9, 2023

bolkedebruin Jan 9, 2023

uranusjr commented Jan 17, 2023

		_iterables = (list, set, tuple)
		_collections = (list, set, tuple) # dict is treated specially.

Fix serialize() usages in custom JSON encoders #28742

Fix serialize() usages in custom JSON encoders #28742

Conversation

uranusjr commented Jan 5, 2023

uranusjr Jan 5, 2023

Choose a reason for hiding this comment

kaxil commented Jan 5, 2023

bolkedebruin commented Jan 9, 2023 • edited

bolkedebruin Jan 9, 2023

Choose a reason for hiding this comment

bolkedebruin Jan 9, 2023

Choose a reason for hiding this comment

bolkedebruin Jan 9, 2023

Choose a reason for hiding this comment

bolkedebruin Jan 9, 2023

Choose a reason for hiding this comment

bolkedebruin Jan 9, 2023

Choose a reason for hiding this comment

bolkedebruin Jan 9, 2023

Choose a reason for hiding this comment

uranusjr commented Jan 17, 2023

bolkedebruin commented Jan 9, 2023 •

edited