# Serialization and Deserialization

**Concepts:**
* Motivation - Need for serialization and deserialization of Python objects
* Introduce JSON / YAML / TOML formats
* Serialize Pydantic models to JSON / YAML
* Deserialize Pydantic models from JSON / YAML
* Implementing JSON encoders for custom types
* Fields and extending schema definitions
* Config of serialization, excluding and including fields
* Performance remarks for serialization

To make use of Pydantic models, we need to get data into and out of model instances. These operations are known as deserialization and serialization, respectively. Pydantic was originally designed primarily for web development, where data is frequently serialized and deserialized in order to send and receive it between client and server. But Pydantic models are useful in many other situations including, but not limited to, configuration files and data storage.

The native Python serialization protocol is the pickle format, which is compatible with Pydantic, but pickling only works in a Python-only system. If interfacing with other systems, pickling may not be possible. Furthermore, you may not always have control of the data files you need to validate. Common file formats used in the Python ecosystem are JSON (JavaScript Object Notation), YAML (YAML Ain't Markup Language), and TOML (Tom's Obvious, Minimal Language).

Before we see examples of each of these, let's first create a function that takes in a serialization function and a deserialization function. The function will serialize a global `data` dict to a string and print the string. It will then deserialize the string back to a Python object and print the object.

In [None]:
from typing import Any, Callable, Optional

data = {
    "data": [0, 1, 1, 2, 3, 5],
    "attributes": {
        "is_fibonacci": True,
        "base_cases": {"f0": 0, "f1": 1},
    },
}


def serialize_then_deserialize(
    serialization_function: Callable[[Any], str],
    deserialization_function: Callable[[str], Any],
    serialization_kwargs: Optional[dict[str, Any]] = None,
    deserialization_kwargs: Optional[dict[str, Any]] = None,
) -> None:
    if serialization_kwargs is None:
        serialization_kwargs = {}
    if deserialization_kwargs is None:
        deserialization_kwargs = {}

    data_serialized = serialization_function(
        data,
        **serialization_kwargs,
    )
    print("Serialized data to string:")
    print(data_serialized)

    data_serialized_deserialized = deserialization_function(
        data_serialized,
        **deserialization_kwargs,
    )
    print("\nDeserialized data from string:")
    print(data_serialized_deserialized)

**JSON**

Python has native JSON support in the standard library module `json`. Deserialization can be accomplished using `json.load` and `json.loads`, with the former taking a file pointer (an object with a `.read()` method) and the latter taking a `str`, `bytes`, or `bytearray`. The analogous counterparts for serialization are `json.dump` and `json.dumps`: the former writes to a file pointer (an object with a `.write()` method) and the latter returns a `str`.

In [None]:
import json

serialize_then_deserialize(
    serialization_function=json.dumps,
    deserialization_function=json.loads,
    serialization_kwargs={"indent": 2},
)
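
The file-based functions work the same way as the string-based ones used above. A minimal sketch, assuming a hypothetical `payload` dict and using `io.StringIO` as a stand-in for an open text file:

```python
import io
import json

# Hypothetical payload, used only for this illustration.
payload = {"data": [0, 1, 1, 2], "is_fibonacci": True}

# io.StringIO behaves like an open text file: it has .write() and .read().
buffer = io.StringIO()
json.dump(payload, buffer, indent=2)

# Rewind to the start before reading, much like reopening a real file.
buffer.seek(0)
round_tripped = json.load(buffer)

print(round_tripped == payload)
```

With a real file, the buffer would be replaced by something like `open(path, "w")` for writing and `open(path)` for reading.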

**YAML**

Python does not natively support YAML files, but third-party libraries such as PyYAML exist. Deserialization is accomplished with `yaml.load` and serialization is accomplished with `yaml.dump`. WARNING: YAML is much more flexible than JSON, and PyYAML's full loader can construct arbitrary Python objects, which allows for execution of arbitrary Python functions. Thus it is recommended to use `yaml.load` only if your data comes from a trusted source (recent PyYAML versions also require you to pass an explicit `Loader` argument to `yaml.load`). PyYAML also has `yaml.safe_load` and `yaml.safe_dump`, which do not recognize arbitrary Python objects.

In [None]:
import yaml

serialize_then_deserialize(
    serialization_function=yaml.safe_dump,
    deserialization_function=yaml.safe_load,
    serialization_kwargs={"sort_keys": False},
)
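
To make the warning above concrete, here is a small sketch (assuming PyYAML is installed) of how `yaml.safe_load` reacts to a document that uses a Python-specific tag. The `untrusted_document` string is a made-up example; the function it names is harmless, but the same mechanism could name any callable.

```python
import yaml

# A hypothetical document whose tag asks the loader to call a Python function.
untrusted_document = "!!python/object/apply:os.getcwd []"

try:
    yaml.safe_load(untrusted_document)
    rejected = False
except yaml.YAMLError as error:
    # The safe loader has no constructor for Python-specific tags.
    rejected = True
    print(f"safe_load refused the document: {type(error).__name__}")
```

Plain YAML content (mappings, lists, strings, numbers) is unaffected by this restriction.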

**TOML**

Starting with Python 3.11, the standard library includes the `tomllib` module, which can parse TOML (`load` and `loads`) but cannot write it. The third-party `toml` library works on earlier versions of Python and supports both directions: like the `json` module, deserialization and serialization are accomplished with the `load`, `loads`, `dump`, and `dumps` functions.

In [None]:
import toml

serialize_then_deserialize(
    serialization_function=toml.dumps,
    deserialization_function=toml.loads,
)

## Pydantic integration

For deserialization, Pydantic models have the `parse_raw`, `parse_obj`, and `parse_file` methods for `str`, `dict`, and `pathlib.Path` objects, respectively. Let's see each one in action by using the weather data in `my-data.json`. First we need to inspect the data and create a Pydantic model to represent the schema of the data.

In [None]:
import datetime

from pydantic import BaseModel


class TemperatureSample(BaseModel):
    date: datetime.date
    time: datetime.time
    temperature: float


class TemperatureData(BaseModel):
    data: list[TemperatureSample]

Let's start by reading the data in as a string and deserializing the string...

In [None]:
from pathlib import Path

fpath_temperature_data = Path.cwd() / "my-data.json"

raw_temperature_data = fpath_temperature_data.read_text()
print("Raw, unparsed, unvalidated data in string form:")
display(raw_temperature_data)

temperature_data = TemperatureData.parse_raw(raw_temperature_data)
print("\nDeserialized data as a Pydantic model instance:")
display(temperature_data)

The output is not very human-friendly, but we successfully deserialized the raw JSON string into an instance of `TemperatureData`. Note that because the data is now in a `TemperatureData` instance, the data is also parsed and validated! Now what if we already had the data in memory as a Python object? Then we could use the `parse_obj` method...

In [None]:
raw_temperature_data_dict = json.loads(raw_temperature_data)
print("Raw, unparsed, unvalidated data in dictionary form:")
display(raw_temperature_data_dict)

temperature_data = TemperatureData.parse_obj(raw_temperature_data_dict)
print("\nDeserialized data as a Pydantic model instance:")
display(temperature_data)

Again we have deserialized, parsed, and validated the data into a `TemperatureData` instance. Finally, if we only have the path to a file containing the data, we can use the `parse_file` method...

In [None]:
temperature_data = TemperatureData.parse_file(fpath_temperature_data)
print("Deserialized data as a Pydantic model instance:")
display(temperature_data)

Notes:
* Currently, Pydantic only supports JSON and pickle files in the `parse_file` method. `pydantic-yaml` is an extension to Pydantic that adds YAML support; alternatively, a workaround is to load the data into memory and then use `parse_obj` or `parse_raw`.
* `parse_obj` expects a dictionary, so other Python types cannot be passed to this method.

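The same temperature data exists in `my-data.yaml`, but there the top level is a list of data points rather than a dictionary. Since `parse_obj` expects a dictionary, we wrap the list under the `data` key before parsing. Let's try this out...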

In [None]:
fpath_temperature_data_yaml = Path.cwd() / "my-data.yaml"
raw_temperature_data = yaml.safe_load(fpath_temperature_data_yaml.read_text())
temperature_data = TemperatureData.parse_obj(
    {"data": raw_temperature_data},
)

print("Deserialized data as a Pydantic model instance:")
display(temperature_data)

Now let's try serializing this data. Pydantic models have the `dict` and `json` methods to serialize to dictionaries and JSON strings, respectively. But first let's reduce the dataset to the first 3 entries so we don't flood our screen with a long list of data.

In [None]:
shortened_temperature_data = TemperatureData(data=temperature_data.data[:3])

Now let's serialize the shortened dataset to a JSON string...

In [None]:
print(shortened_temperature_data.json(indent=2))

Once we have the serialized string, we could send the string to an API endpoint or write to disk. Now let's serialize to a Python dictionary...

In [None]:
shortened_temperature_data.dict()

Once again, we can then do with the serialized object as we please. For example, if we wanted to write to a TOML file, we could pass the serialized object into `toml.dump`.

## Serialization of custom types

What if we want to serialize a custom data type? Let's return to the `Point` class from Part-2-Basic-Usage...

In [None]:
class Point:
    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(x={self.x}, y={self.y})"

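    def __eq__(self, other: object) -> bool:
        # Let Python fall back to its default comparison for non-Point objects
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x == other.x) and (self.y == other.y)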

    def distance_to(self, other: "Point") -> float:
        dx = self.x - other.x
        dy = self.y - other.y
        return (dx**2 + dy**2) ** 0.5

If we build a Pydantic model that uses this class as a type hint, we won't be able to serialize using the `json` method...

In [None]:
class LineSegment(BaseModel):
    p1: Point
    p2: Point

    class Config:
        arbitrary_types_allowed = True


line_segment = LineSegment(
    p1=Point(x=1, y=3),
    p2=Point(x=8, y=2),
)

line_segment.json()

We get an error message informing us that `Point` is not JSON serializable. We can fix this using the `json_encoders` attribute of the `Config` class. This attribute is a dictionary that maps types to the functions that serialize them. So we can extend the `Point` class with an instance method that serializes the data in a `Point` instance...

In [None]:
class SerializablePoint(Point):
    def serialize(self) -> dict[str, Any]:
        return {"x": self.x, "y": self.y}

Then we can modify our `LineSegment` model to include this JSON encoder...

In [None]:
class LineSegment(BaseModel):
    p1: SerializablePoint
    p2: SerializablePoint

    class Config:
        arbitrary_types_allowed = True
        json_encoders = {
            SerializablePoint: SerializablePoint.serialize,
        }

Now we can serialize an instance of `LineSegment`...

In [None]:
line_segment = LineSegment(
    p1=SerializablePoint(x=1, y=3),
    p2=SerializablePoint(x=8, y=2),
)

line_segment.json()

We could have chosen to serialize `Point` in any number of reasonable ways. We chose to simply record the coordinates in a dictionary, but the possibilities are endless.
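
For intuition about what a type-to-function mapping does, here is a rough standard-library analogue built on the `default` hook of `json.dumps`. This is a sketch of the idea, not Pydantic's actual implementation; `PlainPoint`, `encoders`, and `encode_unknown` are hypothetical names introduced only for this illustration.

```python
import json
from typing import Any, Callable


class PlainPoint:
    """Hypothetical stand-in for a custom type that json cannot serialize."""

    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y


# Same shape as `json_encoders`: each type maps to a function that encodes it.
encoders: dict[type, Callable[[Any], Any]] = {
    PlainPoint: lambda point: {"x": point.x, "y": point.y},
}


def encode_unknown(obj: Any) -> Any:
    """Called by json.dumps only for objects it cannot serialize on its own."""
    for registered_type, encoder in encoders.items():
        if isinstance(obj, registered_type):
            return encoder(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


segment = {"p1": PlainPoint(1, 3), "p2": PlainPoint(8, 2)}
encoded = json.dumps(segment, default=encode_unknown)
print(encoded)
```

Types that are not registered still raise a `TypeError`, which mirrors the error we saw before adding the encoder.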

What if we want to deserialize data for `Point` fields? Currently, this will not work...

In [None]:
line_segment_data = {"p1": {"x": 1, "y": 3}, "p2": {"x": 8, "y": 2}}

line_segment = LineSegment.parse_obj(line_segment_data)

We need to add validators to the `SerializablePoint` class definition...

In [None]:
from pydantic.validators import float_validator


class DeserializablePoint(SerializablePoint):
    @classmethod
    def __get_validators__(cls):
        yield cls.deserialize

    @classmethod
    def deserialize(cls, data: dict[str, Any]) -> "DeserializablePoint":
        if ("x" not in data) or ("y" not in data):
            raise ValueError("Missing attributes x and/or y")

        # Validate the coordinates and build the point from the validated floats
        x = float_validator(data["x"])
        y = float_validator(data["y"])

        return cls(x=x, y=y)

Now we have a `Point` class that is both serializable and deserializable...

In [None]:
class LineSegment(BaseModel):
    p1: DeserializablePoint
    p2: DeserializablePoint

    class Config:
        json_encoders = {
            DeserializablePoint: DeserializablePoint.serialize,
        }


line_segment_data = {"p1": {"x": 1, "y": 3}, "p2": {"x": 8, "y": 2}}

line_segment = LineSegment.parse_obj(line_segment_data)
display(line_segment)

line_segment.json()

For simplicity, I implemented our deserializer to assume the incoming data is a dictionary. A more flexible deserializer would require more logic.

## Include and exclude

Both `model.dict()` and `model.json()` have `include` and `exclude` parameters that specify which fields to include or exclude when serializing model data. Other parameters exist as well; see [Exporting models](https://docs.pydantic.dev/latest/usage/exporting_models/).

In [None]:
print(line_segment.json(indent=2, exclude={"p1"}))
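
The same parameters work with `model.dict()`, and they can reach into nested models by passing a dictionary instead of a set. A small sketch, assuming the Pydantic v1-style API used in this section; `Location` and `Reading` are hypothetical models made up for the illustration.

```python
from pydantic import BaseModel


class Location(BaseModel):  # hypothetical nested model
    city: str
    latitude: float
    longitude: float


class Reading(BaseModel):  # hypothetical outer model
    location: Location
    temperature: float
    unit: str = "C"


reading = Reading(
    location=Location(city="Oslo", latitude=59.91, longitude=10.75),
    temperature=21.5,
)

# Keep only the listed top-level fields.
only_measurement = reading.dict(include={"temperature", "unit"})
print(only_measurement)

# Drop fields of the nested model while keeping the rest of it.
without_coordinates = reading.dict(exclude={"location": {"latitude", "longitude"}})
print(without_coordinates)
```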

## Fields and extending schema definitions

???

## `Config` related to serialization and deserialization

Certain attributes in the `Config` class relate to serialization and deserialization. A select few are described below:
* `use_enum_values` - For `Enum` fields, the enumeration values will be used (as opposed to the Enum itself) when serializing with `model.dict()`.
* `arbitrary_types_allowed` - Setting this to `True` allows the use of arbitrary types for fields (i.e. classes that do not define the `__get_validators__` method). This could be useful if you want to use `PIL.Image.Image` as a field type, for example.
* `json_loads` - A custom function for decoding JSON; see [custom JSON (de)serialisation](https://docs.pydantic.dev/latest/usage/exporting_models/#custom-json-deserialisation)
* `json_dumps` - A custom function for encoding JSON; see [custom JSON (de)serialisation](https://docs.pydantic.dev/latest/usage/exporting_models/#custom-json-deserialisation)
* `json_encoders` - A dict used to customise the way types are encoded to JSON; see [JSON Serialisation](https://docs.pydantic.dev/latest/usage/exporting_models/#modeljson)
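
As an illustration of the first option, here is a minimal sketch comparing two otherwise identical models, assuming the class-based `Config` style used in this section. `Unit`, `RawReading`, and `ValueReading` are hypothetical names.

```python
from enum import Enum

from pydantic import BaseModel


class Unit(Enum):  # hypothetical enumeration
    CELSIUS = "C"
    FAHRENHEIT = "F"


class RawReading(BaseModel):
    unit: Unit


class ValueReading(BaseModel):
    unit: Unit

    class Config:
        use_enum_values = True


raw = RawReading(unit="C").dict()
value = ValueReading(unit="C").dict()

print(raw)    # the field holds the Enum member
print(value)  # the field holds the member's underlying value
```

The second form is convenient when the dictionary is handed to a serializer that does not know about `Enum` types.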

## Performance remarks for serialization

???