Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix serialize() usages in custom JSON encoders #28742

Closed
wants to merge 2 commits into from

Conversation

uranusjr
Copy link
Member

@uranusjr uranusjr commented Jan 5, 2023

Fix #28741.

Our custom JSON encoders call serialize() in their default() hooks. However, serialize() does not guarantee to return a JSON-serializable type, and when it does not (e.g. a set object), the JSON encoder would infinitely recurse into serialize() trying to coerce the object into JSON-serializable.

This patch adds an additional ensure_json_compatible parameter to serialize(), so JSON encoders can use it to require serialize() to always return JSON-compatible values.

Some minor refactoring is also applied to serialize() to ensure the flag is correctly passed down through recursive visits, and also hide the depth tracking parameter (which is internal to the function) from outside access.

The _reverse_cache global variable is also deleted since it is not used anywhere, even before this patch.

_iterables = (list, set, tuple)
_collections = (list, set, tuple) # dict is treated specially.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to better reflect these classes. Iterable has special meaning in Python (collection also does, but its meaning is more in-line with what we mean here).

@uranusjr uranusjr force-pushed the serialize-for-json branch 2 times, most recently from 2d1b83e to 26d8e86 Compare January 5, 2023 12:11
@kaxil kaxil added the full tests needed We need to run full set of tests for this PR to merge label Jan 5, 2023
Our custom JSON encoders call serialize() in their default() hooks.
However, serialize() does not guarantee to return a JSON-serializable
type, and when it does not (e.g. a set object), the JSON encoder would
infinitely recurse into serialize() trying to coerce the object into
JSON-serializable.

This patch adds an additional 'ensure_json_compatible' parameter to
serialize(), so JSON encoders can use it to require serialize() to
always return JSON-compatible values.

Some minor refactoring is also applied to serialize() to ensure the flag
is correctly passed down through recursive visits, and also hide the
'depth' tracking parameter (which is internal to the function) from
outside access.

The _reverse_cache global variable is also deleted since it is not used
anywhere, even before this patch.
@kaxil
Copy link
Member

kaxil commented Jan 5, 2023

Running it with full test suite

@bolkedebruin
Copy link
Contributor

bolkedebruin commented Jan 9, 2023

I do not think it should be solved here, but rather with a custom encoder for json that makes sets etc into a list if JSON is required. Such an encoder is already available.

Explanation: this code was intentionally written to remove all JSONisms and to have the possibility to encose into another format and to have cleaner code - hence the separation between serializer and encode (as it is within the standard library!) JSON does not guarantee order or uniqueness in its lists hence the issue. Three ways to solve this. 1) implement a JSON encoder that takes a collection and always converts it into a list with then (minor) downside the deserialization might fail. 2) adjust the serializer to serialize in such a way that the deserializer deserializes into the right type - e.g. How we do it now with custom objects 3) accept the loss of type information and convert into list in the serializer.

2 is the most robust, 1 is better for just json, 3 is a hack with potential deserialization issues (it most likely requires inspection slowing down the deserializer) but less of a hack than the current implementation.

Finally, importantly do not put any serialization code in a class. Classes are an order of a magnitude slower. This is due to the amount of stack walks the interpreter needs to do.

P.S. Please add a test that covers this both ways (deserialization / serialization) which takes care of uniqueness and ordering

P.P.S. I think the best way is to use option 2 and serialize tuples, lists, sets into something like

mytuple = {"classname": "builtins.tuple", data([a, b, c, d]), "version": "1"}}

thus wrapping it, so we can properly deserialize it.

return [deserialize(d) for d in o]

if not isinstance(o, dict):
raise TypeError()
raise TypeError("not deserializable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably better to explain what type we are getting if we are improving the error message,.

return True

return False
return any(p.match(classname) is not None for p in _patterns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

elif isinstance(value, dict):
for k, v in value.items():
s += f"{k}={str(serialize(v, False))},"
s += f"{k}={str(serialize(v))},"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, i think? :-)

@@ -56,12 +56,12 @@ def default(self, o: Any) -> Any:
return o.strftime("%Y-%m-%d")

if isinstance(o, Decimal):
data = serialize(o)
data = serialize(o, ensure_json_compatible=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, this should not be done this way, but rather be solved here.

@@ -72,89 +70,104 @@ def decode(d: dict[str, str | int | T]) -> tuple:
return d[CLASSNAME], d[VERSION], d.get(DATA, None)


def serialize(o: object, depth: int = 0) -> U | None:
@dataclasses.dataclass()
class _Serializer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not put this in a class. Classes are overhead on the stack and significantly slow down serialization/deserialization. The code was implemented on purpose o be as high up in the chain as possible.

@bolkedebruin bolkedebruin self-assigned this Jan 9, 2023
@@ -72,89 +70,104 @@ def decode(d: dict[str, str | int | T]) -> tuple:
return d[CLASSNAME], d[VERSION], d.get(DATA, None)


def serialize(o: object, depth: int = 0) -> U | None:
@dataclasses.dataclass()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds calling overhead, which we don't want in this code.

@uranusjr
Copy link
Member Author

Rethinking this, I think this can simply be achieved by a better implemented default. I’ll split out that main change and other minor improvements here into two PRs so things are easier to manage.

@uranusjr uranusjr closed this Jan 17, 2023
@ephraimbuddy ephraimbuddy removed this from the Airflow 2.6.0 milestone Apr 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:serialization full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RecursionError with new refactored serialization
4 participants