### Dataclasses Explained (Part 1)

The goal of this article/video is to **explain** how data classes work, not just show you how to create data classes.

In this notebook we'll explore dataclasses and their correspondence to code we might write when implementing classes using plain old vanilla Python instead of the dataclass syntax.

So what are dataclasses?

Dataclasses were introduced in Python 3.7, but what are they?

Some new type of data structure?  Some new type of object?

The answer to that is no.

A dataclass is simply a **code generator** that allows us to define custom classes using a different syntax, and generates what is often referred to as "boilerplate" code - code that is repetitive and basically always works the same way. Essentially, `dataclass` is a class **decorator** that either monkey patches an existing class or, when slots are involved, generates a new class based on the old one, with extra functionality injected.
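To make the "code generator" idea concrete, here's a minimal sketch of a hand-rolled class decorator that injects a `__repr__` built from the class annotations. The name `add_repr` is made up for this illustration, and this is not how `dataclass` is actually implemented - it just shows the monkey patching idea on a tiny scale.

```python
def add_repr(cls):
    """Hypothetical mini code generator: inject a __repr__ built from annotations."""
    names = list(getattr(cls, "__annotations__", {}))

    def __repr__(self):
        args = ", ".join(f"{name}={getattr(self, name)!r}" for name in names)
        return f"{type(self).__qualname__}({args})"

    cls.__repr__ = __repr__  # monkey patch the existing class
    return cls               # the very same class object comes back


@add_repr
class Point:
    x: int = 0
    y: int = 0


p = Point()
print(p)
```

The decorated `Point` is still the class we wrote, it just has one extra method attached to it.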

You've seen code generators before if you've worked with named tuples, either `namedtuple` in the `collections` module, or the more modern `NamedTuple` class in the `typing` module.

Before you start using dataclasses, it is really important that you understand how to create your own classes in Python, and how to implement things like equality, hashing, ordering, etc. Although dataclasses hide all this from you, you should know how these things work in order to truly understand what dataclasses are creating for you, and avoid subtle bugs you may create by using dataclasses without understanding what's happening under the hood.

The PEP for dataclasses ([PEP 557](https://peps.python.org/pep-0557)) says this:
> Although they use a very different mechanism, Data Classes can be thought of as “mutable namedtuples with defaults”. Because Data Classes use normal class definition syntax, you are free to use inheritance, metaclasses, docstrings, user-defined methods, class factories, and other Python class features.

and 

> A class decorator is provided which inspects a class definition for variables with type annotations as defined in PEP 526, “Syntax for Variable Annotations”. In this document, such variables are called fields. Using these fields, the decorator adds generated method definitions to the class to support instance initialization, a repr, comparison methods, and optionally other methods as described in the Specification section. Such a class is called a Data Class, but there’s really nothing special about the class: the decorator adds generated methods to the class and returns the same class it was given.


#### Dataclasses and the `attrs` Library

A bit of history.

One of the inspirations for dataclasses is the `attrs` library started by Hynek Schlawack in 2015. 

[https://www.attrs.org/en/stable/index.html](https://www.attrs.org/en/stable/index.html)

The `attrs` library became very popular around 2017, and people started asking for this to be included in the canonical Python standard library.

Prompted by Guido, Python's BDFL at the time, discussions on this started between Hynek and Eric Smith, who volunteered for the project.

In the end, instead of rolling `attrs` into the standard library, a simplified subset of `attrs` was added to the standard library, and became known as **dataclasses**. 

You can see the PEP for it [here](https://peps.python.org/pep-0557).

Does this mean that `attrs` is no longer relevant today?

Not at all!

In fact, `attrs` is still under very active development, and additional libraries that leverage `attrs` are also under continuous development (for example the [cattrs](https://catt.rs/en/stable/) library).

Here is an interesting post by Hynek on `attrs` and dataclasses: 
[https://hynek.me/articles/import-attrs/](https://hynek.me/articles/import-attrs/)


**A big thanks to Hynek Schlawack and the `attrs` repo collaborators for their continued dedication to `attrs`!!**



Now, let's start digging into dataclasses.

I am not going to discuss the differences/similarities of dataclasses vs named tuples vs pydantic vs attrs - I might do another post on that in the future. Simplistically, there is *some* overlap, but also a lot of *differences* between them, and they have different use cases.

I see a lot of click-bait articles/videos out there that spell out how one killed the other - and that's all it is, click bait. Those people either don't truly understand what each of those things is, or are being disingenuous and, unfortunately, ultimately causing confusion.

#### The Basics

> **Note**: I am using Python 3.11 for these examples. Earlier versions of Python may not have all the functionality presented here, as dataclasses are (slowly) evolving and gaining additional functionality from one Python version to the next. That's one advantage of using `attrs` over `dataclasses` - `attrs` has not only a lot more functionality than `dataclasses`, but evolves faster since it is not bound to the Python release cycle.


As the name **dataclasses** would seem to indicate, dataclasses are **classes** used for **data** structures (similar in some ways to named tuples, except that dataclasses offer a lot more since they are regular Python classes, not specializations of the `tuple` class).

Let's take a look at how dataclasses can be used to generate standard Python classes, and see how much boilerplate code they eliminate.

As an example, let's create a two-dimensional `Circle` class that needs attributes for its origin (`x` and `y`) and its `radius`. To avoid complications with `float` comparisons, I'm going to limit these attributes to integers (which is not totally insane given that integers would be just fine to draw such a circle on a screen anyway).

In [1]:
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self.x = x
        self.y = y
        self.radius = radius

In [2]:
c = Circle()
c

<__main__.Circle at 0x10865c3d0>

Let's add some functionality that we usually add (or should add) to our class.

First, let's have a custom `__repr__`

In [3]:
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self.x = x
        self.y = y
        self.radius = radius

    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"

In [4]:
c1 = Circle(0, 0, 1)
c1

Circle(x=0, y=0, radius=1)

Now let's see how we can do the same thing using a dataclass:

In [5]:
from dataclasses import dataclass

In [6]:
@dataclass
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1    

In [7]:
c2 = CircleD()

In [8]:
c1

Circle(x=0, y=0, radius=1)

In [9]:
c2

CircleD(x=0, y=0, radius=1)

Basically the dataclass gave us the `__repr__` for "free". We don't have to type that code ourselves, and the less code we type, the fewer bugs we are likely to introduce. 

Not only that, but dataclasses will generate that code using best practices - how many of us really use `self.__class__.__qualname__`? Many people (myself included I'll confess) just use the hardcoded class name, maybe stretching it to `self.__class__.__name__`, when using `__qualname__` is actually better. (I'll let you do some web searches on your own to figure out why if you don't know already).
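As a quick hint on the `__name__` vs `__qualname__` difference, here's a small sketch using a nested class (the `Outer` and `Inner` names are just placeholders). The outputs in the comments assume the classes are defined at module level.

```python
class Outer:
    class Inner:
        pass


print(Outer.Inner.__name__)      # Inner
print(Outer.Inner.__qualname__)  # Outer.Inner
```

For a top-level class both attributes are the same string, so the difference only shows up for nested classes (or classes defined inside functions).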

Both classes work the same way as far as attribute access goes:

In [10]:
c1.x, c2.radius

(0, 1)

In [11]:
c1.x = 100
c2.radius = 100
c1, c2

(Circle(x=100, y=0, radius=1), CircleD(x=0, y=0, radius=100))

If you think back to the intro of this video, I pulled an excerpt from the PEP:

> the decorator adds generated methods to the class and returns the same class it was given

Let's test that out!

In [12]:
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1   

In [13]:
hex(id(CircleD))

'0x109805770'

In [14]:
c = CircleD()
repr(c)

'<__main__.CircleD object at 0x108666190>'

In [15]:
CircleD = dataclass(CircleD)

In [16]:
hex(id(CircleD))

'0x109805770'

In [17]:
repr(c)

'CircleD(x=0, y=0, radius=1)'

As you can see, `@dataclass` did not create a new object, it basically modified, at run time, a regular Python class by adding attributes and methods to the class.

There has been an update to dataclasses that allows us to use slots instead of a class instance dictionary for maintaining state - in these cases, a new class object **is** generated - it has to be, since you cannot, once a class has been created, change whether you are using slots or not.

##### Equality Comparisons

Something else we get for free is equality comparisons:

In [18]:
c3 = CircleD(1, 1, 5)
c4 = CircleD(1, 1, 5)
c3 == c4

True

Our custom class does not have that functionality. By default, custom classes compare two instances for equality using identity (the object's id):

In [19]:
c1 = Circle(1, 1, 5)
c2 = Circle(1, 1, 5)
c1 == c2

False

That's usually not what we want (it could be, but often not). This means that we need to override the `__eq__` method:

In [20]:
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self.x = x
        self.y = y
        self.radius = radius

    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
        

And let's try it out now:

In [21]:
c1 = Circle(0, 0, 1)
c2 = Circle(0, 0, 1)

c1 is c2, c1 == c2

(False, True)

##### Hashability

Great! But you should know that when we implement a custom `__eq__` method, we often implement hashability as well (I discuss this in my deep dive series, all notebooks are freely available on GitHub [here](https://github.com/fbaptiste/python-deepdive), so I am not going to discuss the reason behind this here).

With our current implementation, the `Circle` class is not even hashable. (And neither is the dataclass)
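A quick way to check this for yourself is sketched below, using two throwaway classes (`WithEq` and `DC` are placeholder names): a plain class that only defines `__eq__`, and a default dataclass. In both cases Python sets `__hash__` to `None`.

```python
from dataclasses import dataclass


class WithEq:
    def __eq__(self, other):
        return isinstance(other, WithEq)


@dataclass
class DC:
    x: int = 0


for obj in (WithEq(), DC()):
    try:
        hash(obj)
    except TypeError as ex:
        print(f"TypeError: {ex}")

print(WithEq.__hash__, DC.__hash__)  # None None
```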

Ok, so let's implement the `__hash__` function, such that the hash of two `Circle` objects that are equal will also have an equal hash:

In [22]:
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self.x = x
        self.y = y
        self.radius = radius

    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))

In [23]:
c1 = Circle(0, 0, 1)
c2 = Circle(0, 0, 1)

In [24]:
c1 == c2, hash(c1) == hash(c2)

(True, True)

Much better - now we can even use instances of our `Circle` class as set elements or dictionary keys.

In [25]:
s = {Circle(), Circle()}
s

{Circle(x=0, y=0, radius=1)}

As you can see, because the `Circle` instances are hashable, and we have equality implemented, the set retained, as we would expect, only one unique element. Same thing with dictionaries:

In [26]:
d = {Circle(): "circle"}
d

{Circle(x=0, y=0, radius=1): 'circle'}

In [27]:
d[Circle()] = 'custom circle'
d

{Circle(x=0, y=0, radius=1): 'custom circle'}

But there is an issue!

Normally, hashable objects should be immutable, otherwise we can run into issues - let's see that.

In [28]:
c1 = Circle(0, 0, 1)
c2 = Circle(1, 1, 1)

In [29]:
d = {
    c1: "circle 1",
    c2: "circle 2"
}

d

{Circle(x=0, y=0, radius=1): 'circle 1',
 Circle(x=1, y=1, radius=1): 'circle 2'}

Now let's mutate that second circle:

In [30]:
c2.x = 0
c2.y = 0
c2

Circle(x=0, y=0, radius=1)

Now `c1` and `c2` are *equal*, and in fact also have equal hashes now:

In [31]:
c1 == c2, hash(c1) == hash(c2)

(True, True)

And let's look at our dictionary:

In [32]:
d

{Circle(x=0, y=0, radius=1): 'circle 1',
 Circle(x=0, y=0, radius=1): 'circle 2'}

Notice something weird? Looks like two duplicate entries in the dictionary - and sometimes we get other odd behavior depending on how we mutate the key objects. So, in general we really need immutability for dictionary keys.
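Here's a sketch of one kind of "other odd behavior": a key that can no longer be found after it was mutated. To keep the outcome easy to reason about, this uses a made-up `Point` class that hashes on a single integer attribute (on CPython, the hash of a small integer is the integer itself).

```python
class Point:
    """Hypothetical mutable-but-hashable class, hashing on x only."""

    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        if type(other) is Point:
            return self.x == other.x
        return NotImplemented

    def __hash__(self):
        return hash(self.x)


p = Point(1)
d = {p: "value"}

p.x = 2  # mutate the key after it has been stored

print(p in d)         # False: the new hash no longer leads to the stored entry
print(Point(1) in d)  # False: the old hash matches, but the stored key is not equal anymore
print(p in list(d))   # True: the object is still sitting in the dictionary
```

The entry is still there, but neither the mutated key nor a fresh key with the original value can reach it through a normal lookup.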

##### Immutability

To make our custom class implementation better we need to make the attributes used in the hash, `x`, `y`, and `radius` immutable.

Let's add even more boilerplate code to our class:

In [33]:
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))

In [34]:
c1 = Circle()
c2 = Circle(1, 1, 1)

In [35]:
d = {
    c1: "circle 1",
    c2: "circle 2",
}

d

{Circle(x=0, y=0, radius=1): 'circle 1', Circle(x=1, y=1, radius=1): 'circle 2'}

And now, if we try to mutate `c2`, we'll get an exception:

In [36]:
try:
    c2.x = 0
except AttributeError as ex:
    print(f"Attribute Error: {ex}")

Attribute Error: property 'x' of 'Circle' object has no setter


Again let me ask you this. How many times have you written a class that implements equality and hashability and forgotten to make read-only properties out of the attributes that are used in your hash function?

Even if we are aware of this, and take the trouble to try and make it difficult for users of our class to inadvertently create problems by mutating the object (like using read-only properties as we did here), it's a lot of code - and more potential for typos and bugs. More unit testing too!

Again, dataclasses can help us here by generating all this code for us. Note that dataclasses do not use properties to make read-only attributes; rather, they override `__setattr__` and `__delattr__` to achieve a similar result. Although we could replicate that approach in our code quite easily, and would avoid writing properties, it involves hardcoding the attributes we want to be "frozen" into the `__setattr__` and `__delattr__` methods. This means every time we add or remove a frozen attribute, we need to remember to modify that code as well. While that works just fine for a code generator that re-builds the code every time the app is re-started (and the dataclass decorator is re-evaluated), it's not great for human developers.
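To give an idea of what that hand-written approach could look like, here's a rough sketch. The class name `FrozenCircle`, the `_frozen` tuple and the error messages are my own choices for illustration; the code that `dataclass` actually generates is different (for one, it raises `FrozenInstanceError`, a subclass of `AttributeError`).

```python
class FrozenCircle:
    # The frozen attribute names are hardcoded here - easy to forget to update.
    _frozen = ("x", "y", "radius")

    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        # Bypass our own __setattr__ to set the initial values.
        object.__setattr__(self, "x", x)
        object.__setattr__(self, "y", y)
        object.__setattr__(self, "radius", radius)

    def __setattr__(self, name, value):
        if name in self._frozen:
            raise AttributeError(f"cannot assign to field {name!r}")
        super().__setattr__(name, value)

    def __delattr__(self, name):
        if name in self._frozen:
            raise AttributeError(f"cannot delete field {name!r}")
        super().__delattr__(name)


c = FrozenCircle(1, 1, 5)
try:
    c.x = 0
except AttributeError as ex:
    print(f"AttributeError: {ex}")
```

Notice that adding a fourth frozen attribute means touching both `__init__` and `_frozen`, which is exactly the kind of bookkeeping a code generator does not forget.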

In [37]:
@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1

In [38]:
c3 = CircleD()
c4 = CircleD(1, 1, 1)
c5 = CircleD()

c3, c4, c5

(CircleD(x=0, y=0, radius=1),
 CircleD(x=1, y=1, radius=1),
 CircleD(x=0, y=0, radius=1))

In [39]:
c3 == c5, c4 == c5

(True, False)

Again, equality still works just fine. But now our dataclass `CircleD` is both immutable and hashable (at least as far as the attributes `x`, `y`, and `radius` go).

In [40]:
hash(c3), hash(c4), hash(c5)

(-1882636517035687140, 5750192569890809213, -1882636517035687140)

In [41]:
from dataclasses import FrozenInstanceError

try:
    c4.x = 0
except FrozenInstanceError as ex:
    print(f"FrozenInstanceError: {ex}")

FrozenInstanceError: cannot assign to field 'x'


This means that we can also safely use it in sets and dictionary keys as well.

In [42]:
s = {c3, c4, c5}
s

{CircleD(x=0, y=0, radius=1), CircleD(x=1, y=1, radius=1)}

##### Ordering

One other thing we do not have in our custom `Circle` class is ordering:

In [43]:
c1 = Circle()
c2 = Circle(1, 1, 2)

try:
    c1 < c2
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: '<' not supported between instances of 'Circle' and 'Circle'


In order to implement ordering we need to override special methods such as `__lt__`, `__le__`, `__gt__`, `__ge__`.

Let's do this for our custom class. Here, we will consider ordering based on whether the tuple (x, y, radius) of one circle is smaller or larger than the same tuple for the other circle. That does not make much sense; we would probably want something more custom - maybe based on the radius only, so we'll come back to that later.

In [44]:
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)

Ok, so that's a start:

In [45]:
c1 = Circle(0, 0, 1)
c2 = Circle(1, 1, 1)

c1 < c2

True

Of course, through reflection we also get `>` for "free":

In [46]:
c2 > c1

True
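If you're curious what "reflection" means here, the sketch below uses a throwaway class (`Num`, a made-up name) that only defines `__lt__` and records every call. Evaluating `b > a` ends up calling `a.__lt__(b)`.

```python
class Num:
    """Hypothetical class that only defines __lt__ and records each call."""

    calls = []

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Num.calls.append((self.value, other.value))
        return self.value < other.value


a, b = Num(1), Num(2)
result = b > a             # no __gt__ on Num, so Python tries a.__lt__(b)
print(result, Num.calls)   # True [(1, 2)]
```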

But we have a few issues:
- we could compare a Circle instance to another data structure inadvertently, and actually get a result
- we still need to implement `>=`, `<=`, etc

In [47]:
try:
    c1 <= c2
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: '<=' not supported between instances of 'Circle' and 'Circle'


In [48]:
from typing import NamedTuple

class CircleNT(NamedTuple):
    x: int = 0
    y: int = 0
    radius: int = 1

In [49]:
c1 = Circle()
c2 = CircleNT(1, 1, 1)

c1, c2

(Circle(x=0, y=0, radius=1), CircleNT(x=1, y=1, radius=1))

In [50]:
c1 < c2

True

This is probably not what we want - we should only allow comparisons between instances of the same class. In fact, that's how our equality function works, both in our custom class and in our dataclass.

In [51]:
c1 = Circle()
c2 = CircleNT()

c1, c2

(Circle(x=0, y=0, radius=1), CircleNT(x=0, y=0, radius=1))

In [52]:
c2 == c1

False

In [53]:
c1 = CircleD()
c2 = CircleNT()

c1, c2

(CircleD(x=0, y=0, radius=1), CircleNT(x=0, y=0, radius=1))

In [54]:
c1 == c2

False

So, again, let me ask you this question. How often do you remember to also check the type in the ordering methods?

So, let's fix our code to account for that.

In [55]:
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented

In [56]:
c1 = Circle()
c2 = CircleNT()

c1, c2

(Circle(x=0, y=0, radius=1), CircleNT(x=0, y=0, radius=1))

In [57]:
c1 == c2

False

Now, we need to implement the other ordering functions.

We could try using the `total_ordering` decorator available in the `functools` module:

In [58]:
from functools import total_ordering

In [59]:
@total_ordering
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented

Let's try it out:

In [60]:
c1 = Circle()
c2 = Circle(1, 1, 1)

In [61]:
c1 < c2

True

In [62]:
c1 <= c2

True

In [63]:
c2 >= c1

True

In [64]:
c2 > c1

True

What about the case where we aren't comparing the same objects (but where they have the same attribute names):

In [65]:
c1 = Circle()
c2 = CircleNT(1, 1, 1)

In [66]:
try:
    c1 < c2
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: '<' not supported between instances of 'Circle' and 'CircleNT'


In [68]:
try:
    c1 <= c2
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: '<=' not supported between instances of 'Circle' and 'CircleNT'


In [69]:
try:
    c1 >= c2
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: '>=' not supported between instances of 'Circle' and 'CircleNT'


Ok, so `total_ordering` worked for us, and saved us writing quite a lot of boilerplate code.
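To see what `total_ordering` actually added, we can look at the class namespace. This sketch uses a small stand-in class (`Version` is a made-up name) where only `__eq__` and `__lt__` are written by hand.

```python
from functools import total_ordering


@total_ordering
class Version:
    """Hypothetical class: only __eq__ and __lt__ are written by hand."""

    def __init__(self, number: int):
        self.number = number

    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return self.number == other.number
        return NotImplemented

    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return self.number < other.number
        return NotImplemented


generated = sorted(
    name for name in ("__le__", "__gt__", "__ge__") if name in vars(Version)
)
print(generated)  # ['__ge__', '__gt__', '__le__']
print(Version(1) <= Version(1), Version(2) >= Version(3))  # True False
```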

What about dataclasses? By default, dataclasses do not implement any ordering.

In [70]:
c1 = CircleD()
c2 = CircleD(1, 1, 1)

In [71]:
try:
    c1 < c2
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: '<' not supported between instances of 'CircleD' and 'CircleD'


But we can easily enable this in our dataclass this way:

In [72]:
@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1

In [73]:
c1 = CircleD()
c2 = CircleD(1, 1, 1)

In [74]:
c1 < c2, c1 <= c2, c2 > c1, c2 >= c1

(True, True, True, True)

And it will not support comparing to a different type:

In [75]:
c1 = CircleD()
c2 = CircleNT(1, 1, 1)

In [76]:
try:
    c1 < c2
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: '<' not supported between instances of 'CircleD' and 'CircleNT'


The only thing to note is how dataclasses define the ordering - it basically uses a tuple made up of the attributes defined in the dataclass, in the order in which they were declared (equality is implemented the same way too).

This means that you are limited in terms of how to define a custom ordering by default. We'll come back to that in a bit.
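Here's a small sketch showing that the declaration order of the fields drives the comparison. The two classes (`XFirst` and `RadiusFirst`, both made-up names) hold the same data, but declare their fields in a different order.

```python
from dataclasses import dataclass, astuple


@dataclass(order=True)
class XFirst:
    x: int
    radius: int


@dataclass(order=True)
class RadiusFirst:
    radius: int
    x: int


a, b = XFirst(x=0, radius=9), XFirst(x=1, radius=1)
print(a < b, astuple(a) < astuple(b))  # True True

c, d = RadiusFirst(radius=9, x=0), RadiusFirst(radius=1, x=1)
print(c < d, astuple(c) < astuple(d))  # False False
```

Same values, different field order, opposite result - in both cases matching a plain tuple comparison.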

##### Serializing to Dictionaries and Tuples

Something that can be convenient sometimes, is the ability to extract the attribute values of an instance of our class into a dictionary (where the keys are the attribute names), or even into a tuple (ordered in some specific way).

Dataclasses have that built-in as well.

In [77]:
from dataclasses import asdict, astuple

In [78]:
c1 = CircleD()

In [79]:
asdict(c1)

{'x': 0, 'y': 0, 'radius': 1}

In [80]:
astuple(c1)

(0, 0, 1)

If we wanted something similar in our custom class, we would have to write that code ourselves.

Let's do it.

In [81]:
@total_ordering
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented
    
    def asdict(self):
        return {
            'x': self.x,
            'y': self.y,
            'radius': self.radius
        }
    
    def astuple(self):
        return self.x, self.y, self.radius

Now this works, but is rather simplistic, and means that if we ever add/remove attributes from the class we also need to update the `asdict` and `astuple` methods - not the end of the world, but can easily lead to bugs.

In [82]:
c1 = Circle()
c1.asdict()

{'x': 0, 'y': 0, 'radius': 1}

In [83]:
c1.astuple()

(0, 0, 1)

We could certainly start writing more complicated code to be more generic, but we'll have to deal with introspection, decide what attributes to include, etc - not simple!
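Just to give a flavor of what a more generic version might involve, here's one possible sketch. It uses a made-up `AsDictMixin` that assumes every `__init__` parameter is exposed as an attribute (or property) with the same name - a big assumption that real classes often break, which is part of why this is not simple.

```python
import inspect


class AsDictMixin:
    """Hypothetical mixin: treats every __init__ parameter as a field."""

    def _field_names(self):
        params = inspect.signature(type(self).__init__).parameters
        return [name for name in params if name != "self"]

    def asdict(self):
        return {name: getattr(self, name) for name in self._field_names()}

    def astuple(self):
        return tuple(getattr(self, name) for name in self._field_names())


class Circle(AsDictMixin):
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x, self._y, self._radius = x, y, radius

    x = property(lambda self: self._x)
    y = property(lambda self: self._y)
    radius = property(lambda self: self._radius)


print(Circle().asdict())          # {'x': 0, 'y': 0, 'radius': 1}
print(Circle(1, 2, 3).astuple())  # (1, 2, 3)
```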

#### Fields Introspection

Speaking of introspection, dataclasses come through on that front too!

In [84]:
from dataclasses import fields

In [85]:
c1 = CircleD()

In [86]:
for field in fields(c1):
    print(field, end='\n---------------------\n')

Field(name='x',type=<class 'int'>,default=0,default_factory=<dataclasses._MISSING_TYPE object at 0x105941050>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)
---------------------
Field(name='y',type=<class 'int'>,default=0,default_factory=<dataclasses._MISSING_TYPE object at 0x105941050>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)
---------------------
Field(name='radius',type=<class 'int'>,default=1,default_factory=<dataclasses._MISSING_TYPE object at 0x105941050>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)
---------------------
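Since each `Field` object exposes attributes such as `name` and `default`, we can pull out just the pieces we care about. A small sketch (note that `fields` accepts either the dataclass itself or an instance of it):

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1


summary = [(f.name, f.default) for f in fields(CircleD)]
print(summary)  # [('x', 0), ('y', 0), ('radius', 1)]
```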


##### Comparing Dataclass to Custom Class Code So Far

Here's our data class:

In [87]:
@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1

And our custom class:

In [88]:
@total_ordering
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented
    
    def asdict(self):
        return {
            'x': self.x,
            'y': self.y,
            'radius': self.radius
        }
    
    def astuple(self):
        return self.x, self.y, self.radius

So, for very little effort we get quite a lot of functionality out of dataclasses. And what we have looked at so far will probably cover 80% of anything you'd ever need out of dataclasses (yes, I made that number up!).

#### Adding Methods and Properties to Dataclasses

Remember that a dataclass is just a code generator that generates a standard Python class. This means that we can choose to add additional properties, methods, and even override special dunder methods however we want.

In [89]:
from math import pi

@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius

In [90]:
c = CircleD()
c.area, c.circumference()

(3.141592653589793, 6.283185307179586)

##### Custom Ordering

We can even override the special dunder methods. Let's go back to our ordering of the Circles.

We saw how the default ordering that dataclasses defined for our Circles was not ideal. Let's say I really want to define ordering between circles in one of two ways:
- based on the radius **only**
- based on the distance from the origin

There is no way (currently, that I know of) to completely customize a sort order key function in dataclasses.

You can specify whether a field should be included or not in the comparison tuple though - so doing a comparison based only on the radius would be possible using the native functionality of dataclasses. The second sort option would not, and you therefore need to implement your own `__lt__`, `__le__`, etc. 

You can, however, completely customize the sort key in the more powerful `attrs` library (using the [cmp_using](https://www.attrs.org/en/stable/api.html#attrs.cmp_using) helper).

Let's implement a custom sort order based on the distance from the origin (we'll circle back to the other sort order, which is easy to achieve using native dataclasses functionality).

In [91]:
from math import pi, dist

@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented

We now should have `<` implemented:

In [92]:
c1 = CircleD(2, 2, 10)
c2 = CircleD(3, 3, 100)

In [93]:
c1 < c2

True

Of course we need to implement the other methods too:

In [94]:
try:
    c1 <= c2
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: '<=' not supported between instances of 'CircleD' and 'CircleD'


We can actually do this using the `total_ordering` decorator!

In [95]:
@total_ordering
@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented

In [96]:
c1 = CircleD(2, 2, 10)
c2 = CircleD(3, 3, 100)

c1 <= c2

True

So, the question I have here, is in which order should we decorate with `total_ordering`? The way I have it here? Or this way?

In [97]:
@dataclass(frozen=True)
@total_ordering
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented

Both approaches seem to work, but because `dataclass` is a code generator, and a fairly opaque one, I have no real idea whether one way is better than the other.

My reasoning is that the second approach is preferable. Suppose I did not use that `total_ordering` decorator - I would want to define all the `__lt__`, `__le__` methods in my class that then gets decorated with `@dataclass`.

So, my reasoning is that I should first apply the `@total_ordering` decorator, and **then** apply the `@dataclass` decorator. If you know different, please let us know in the comments!
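One way to probe this question is to build the class both ways and compare the results. The sketch below uses two hypothetical class names (`OuterTO` and `InnerTO`, which are not part of the original notebook) and a shared `__lt__` helper. As far as I can tell, with the default `order=False` the `@dataclass` decorator does not generate any ordering methods, and `total_ordering` only fills in the ordering methods that are missing, so the two orders should behave the same in this case:

```python
from dataclasses import dataclass
from functools import total_ordering
from math import dist


def _closer_to_origin(self, other):
    # Shared __lt__: compare distance from the origin, ignoring the radius
    if self.__class__ == other.__class__:
        return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
    return NotImplemented


@total_ordering
@dataclass(frozen=True)
class OuterTO:  # hypothetical name: total_ordering is applied last
    x: int = 0
    y: int = 0
    radius: int = 1
    __lt__ = _closer_to_origin  # not annotated, so not a dataclass field


@dataclass(frozen=True)
@total_ordering
class InnerTO:  # hypothetical name: total_ordering is applied first
    x: int = 0
    y: int = 0
    radius: int = 1
    __lt__ = _closer_to_origin


ordering_methods = ("__lt__", "__le__", "__gt__", "__ge__")
results = {}
for cls in (OuterTO, InnerTO):
    # Which ordering methods ended up defined directly on the class?
    defined = [name for name in ordering_methods if name in vars(cls)]
    a, b = cls(2, 2, 10), cls(3, 3, 100)
    results[cls.__name__] = (defined, a <= b, a >= b)
    print(cls.__name__, results[cls.__name__])
```

Note that passing `order=True` to `@dataclass` is a different situation: the decorator raises a `TypeError` if the class already defines any of the ordering methods, so it cannot be combined with a hand-written `__lt__` in either order.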

**NOTE**:

With all these examples we looked at for customizing the sort order, there is actually a problem - and that's because our equality definition and ordering comparisons (<, etc) are not really compatible!

Take a look at this example:

In [98]:
c1 = CircleD(1, 1, 10)
c2 = CircleD(1, 1, 20)

From a sorting perspective we would consider these two circles to be equal (they are the same distance from the origin) - but from an actual equality standpoint they are obviously not equal since the radius is different:

In [99]:
c1 == c2

False

In [100]:
c1 <= c2

False

Hmm... That's obviously wrong!

To fix this we really should not use `@total_ordering`, since it uses `__eq__` for the equality part of the derived comparisons (given only `__lt__`, it computes `a <= b` as `a < b or a == b`) - which means we'll need to define all the ordering special methods ourselves. Even more code!! For simplicity, and to keep this video down to about an hour, I did not do this - so here's the actual code we would need to fix this issue (and it would be needed for both the dataclass and the regular custom class of course).

In [101]:
@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
    
    @staticmethod
    def _dist_from_origin(c):
        return dist((0, 0), (c.x, c.y))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return self._dist_from_origin(self) < self._dist_from_origin(other)
        return NotImplemented
    
    def __le__(self, other):
        if self.__class__ == other.__class__:
            return self._dist_from_origin(self) <= self._dist_from_origin(other)
        return NotImplemented
    
    def __gt__(self, other):
        if self.__class__ == other.__class__:
            return self._dist_from_origin(self) > self._dist_from_origin(other)
        return NotImplemented
    
    def __ge__(self, other):
        if self.__class__ == other.__class__:
            return self._dist_from_origin(self) >= self._dist_from_origin(other)
        return NotImplemented


Now we should get more consistent results:

In [102]:
c1 = CircleD(1, 1, 10)
c2 = CircleD(1, 1, 20)

The circles are still not equal:

In [103]:
c1 == c2

False

But the ordering comparisons work more as expected:

In [104]:
c1 < c2

False

In [105]:
c1 <= c2

True

In [106]:
c1 > c2

False

In [107]:
c1 >= c2

True
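To see why the earlier `total_ordering` version went wrong, here is a minimal sketch that compares the derived `<=` against the expression `a < b or a == b` written out by hand. The class name `CircleTO` is hypothetical (a trimmed copy of the earlier `CircleD`), and the hand-written expression is my approximation of what `total_ordering` derives from `__lt__`, not a copy of its source:

```python
from dataclasses import dataclass
from functools import total_ordering
from math import dist


@total_ordering
@dataclass(frozen=True)
class CircleTO:  # hypothetical name, mirrors the earlier total_ordering version
    x: int = 0
    y: int = 0
    radius: int = 1

    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented


c1 = CircleTO(1, 1, 10)
c2 = CircleTO(1, 1, 20)

derived = c1 <= c2                 # generated by total_ordering
by_hand = (c1 < c2) or (c1 == c2)  # same distance, different radius

print(derived, by_hand)
```

Both values come out `False`: the circles are the same distance from the origin, so `<` is false, and the dataclass-generated `__eq__` compares all three fields, so `==` is false too.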

#### Keyword-Only Initializer Arguments

When we write a custom class, we can customize our `__init__` method to require certain arguments to be keyword-only.

We might want to do something like this for our circle class (not sure why we would want to do this in this example, but this is just to illustrate things):

In [108]:
@total_ordering
class Circle:
    def __init__(self, x: int = 0, y: int = 0, *, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented
    
    def asdict(self):
        return {
            'x': self.x,
            'y': self.y,
            'radius': self.radius
        }
    
    def astuple(self):
        return self.x, self.y, self.radius

So now, the only way to specify a custom radius is to pass it as a named argument:

In [109]:
c = Circle(radius=2)
c

Circle(x=0, y=0, radius=2)

And we can no longer do this:

In [110]:
try:
    Circle(0, 0, 2)
except TypeError as ex:
    print(f"TypeError:{ex}")

TypeError:Circle.__init__() takes from 1 to 3 positional arguments but 4 were given


We can achieve the same thing in dataclasses by using a special marker in our attribute declarations that indicates the boundary between positional and keyword-only arguments (just like the `*` did in our `__init__` method's definition).

In [111]:
from dataclasses import KW_ONLY

In [112]:
@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    _: KW_ONLY
    radius: int = 1

And now we get the same functionality:

In [113]:
c = CircleD(0, 0, radius=2)

In [114]:
try:
    CircleD(0, 0, 2)
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: CircleD.__init__() takes from 1 to 3 positional arguments but 4 were given


If we wanted to make all the arguments in our `__init__` keyword-only arguments, it's even simpler:

In [115]:
@dataclass(frozen=True, order=True, kw_only=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1

In [116]:
c = CircleD(x=0, y=0, radius=1)

In [117]:
try:
    CircleD(0, y=0, radius=1)
except TypeError as ex:
    print(f"TypeError: {ex}")

TypeError: CircleD.__init__() takes 1 positional argument but 2 positional arguments (and 2 keyword-only arguments) were given


Why does this error message state that `__init__` takes 1 positional argument and 2 were provided? 

It should take none, and we only gave it one.

Don't forget that an instance method always receives one positional argument, `self`, because it is called as a bound method. So `__init__` expects one positional argument (the bound object, which Python supplies for us), and since we also passed in `0` as a positional argument, it ends up with two positional arguments, when only one is allowed.

#### Resource Utilization / Performance - Dataclasses vs NamedTuples

I stated right in the beginning that I would not compare and contrast dataclasses, named tuples, attrs and Pydantic objects. 

However, I have come across people that categorically reject named tuples and will only use dataclasses. 

My initial, knee-jerk reaction was along the lines of this old saying: _when you have a shiny new hammer, everything looks like a nail_. (I often use that saying when talking about metaclasses also!)

My initial reaction was that named tuples are more "lightweight" and provide better performance. 

But this was just a (not totally unfounded) guess. 

So, I wanted to look into that a bit more and see for myself.

Let's create a simple data class:

In [118]:
@dataclass
class PointD1:
    x: int = 0
    y: int = 0

And let's create a named tuple:

In [119]:
from collections import namedtuple

PointNT1 = namedtuple("PointNT1", "x y")

In [120]:
from typing import NamedTuple

class PointNT2(NamedTuple):
    x: int = 0
    y: int = 0

Remember, I am looking at the use case of a callable returning multiple results.

Usually this is done by simply returning a plain tuple - but the named tuple gives us the advantage of being able to reference values by name, as well as by index.

Of course this can be done using a dataclass - we could even make the dataclass immutable if we wanted to, and we could access values by index by using the `astuple` function - but this use case really only cares about accessing the fields by name (otherwise why not just use a plain tuple?).
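As a quick illustration of that last point, here is a minimal sketch of a function returning a frozen dataclass, with access by name and, via `astuple`, by index or unpacking. The names `PointResult` and `compute_point` are hypothetical and only serve the example:

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class PointResult:  # hypothetical return type
    x: int = 0
    y: int = 0


def compute_point():
    # Hypothetical callable returning multiple results as one object
    return PointResult(1, 2)


result = compute_point()

by_name = result.x              # the usual way: access by field name
by_index = astuple(result)[0]   # index access needs a conversion first
x, y = astuple(result)          # so does tuple-style unpacking

print(by_name, by_index, x, y)
```

A named tuple supports all three forms directly, without the extra `astuple` call.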

Before someone accuses me of cheating, there is a way to make storage of dataclasses a bit more efficient, by using slots. I won't get into what slots are here (I do in my deep dive course), but let's use this as well.

In [121]:
@dataclass(slots=True)
class PointD2:
    x: int = 0
    y: int = 0

And lastly, since immutability may be something you want from a function's return values, let's do that variant too with data classes:

In [122]:
@dataclass(frozen=True)
class PointD3:
    x: int = 0
    y: int = 0

In [123]:
@dataclass(frozen=True, slots=True)
class PointD4:
    x: int = 0
    y: int = 0

Ok, so to recap what we have:

 - `PointNT1`: named tuple created using `collections.namedtuple`
 - `PointNT2`: named tuple created using the more modern (with type hints) `typing.NamedTuple` class
 - `PointD1`: dataclass, mutable, no slots
 - `PointD2`: dataclass, mutable, slots
 - `PointD3`: dataclass, frozen, no slots
 - `PointD4`: dataclass, frozen, slots

Let's create an instance of each one:

In [124]:
pnt1 = PointNT1(1, 2)
pnt2 = PointNT2(1, 2)
pd1 = PointD1(1, 2)
pd2 = PointD2(1, 2)
pd3 = PointD3(1, 2)
pd4 = PointD4(1, 2)

Let's look at the size of each one of those objects:

Getting the "size" of an object can be tricky, since we need to not only look at the object, but the data it contains too.

To make life simpler, I am going to use the [objsize](https://github.com/liran-funaro/objsize) library (you'll need to pip install it if you're following along):

In [125]:
!pip install objsize



And we can now use it this way:

In [126]:
from objsize import get_deep_size

In [127]:
get_deep_size((1, 2, 3)), get_deep_size([1, 2, 3])

(148, 172)

Let's look at the memory for each variant:

In [128]:
print("NT1", get_deep_size(pnt1))
print("NT2", get_deep_size(pnt2))
print("D1", get_deep_size(pd1))
print("D2", get_deep_size(pd2))
print("D3", get_deep_size(pd3))
print("D4", get_deep_size(pd4))

NT1 112
NT2 112
D1 112
D2 104
D3 112
D4 104


As we can observe, and as we might have expected, the slotted dataclasses are a bit more efficient when it comes to storage (in our scenario we are saving 8 bytes per instance - not exactly earth shattering, one way or the other).

But what really surprised me was that the named tuples were no more memory efficient than the more full-featured classes.

Next, what about timings for attribute access (by name)? How do the two compare?

In [138]:
from timeit import timeit

In [139]:
read_attrib_pnt1 = timeit("pnt1.x", globals=globals(), number=50_000_000)
read_attrib_pnt2 = timeit("pnt2.x", globals=globals(), number=50_000_000)
read_attrib_pd1 = timeit("pd1.x", globals=globals(), number=50_000_000)
read_attrib_pd2 = timeit("pd2.x", globals=globals(), number=50_000_000)
read_attrib_pd3 = timeit("pd3.x", globals=globals(), number=50_000_000)
read_attrib_pd4 = timeit("pd4.x", globals=globals(), number=50_000_000)

In [140]:
print(f"pnt1: {read_attrib_pnt1:.5f}")
print(f"pnt2: {read_attrib_pnt2:.5f}")
print(f"pd1: {read_attrib_pd1:.5f}")
print(f"pd2: {read_attrib_pd2:.5f}")
print(f"pd3: {read_attrib_pd3:.5f}")
print(f"pd4: {read_attrib_pd4:.5f}")

pnt1: 0.87488
pnt2: 0.84304
pd1: 0.43091
pd2: 0.41454
pd3: 0.42733
pd4: 0.41714


So, the interesting result here is that dataclasses do not seem to incur much, if any, memory overhead, and attribute access by name appears faster for dataclasses than for named tuples (roughly twice as fast in this run).

The other thing that's kind of important is the amount of time it takes to create a named tuple instance vs a dataclass instance.

In [141]:
create_pnt1 = timeit("PointNT1(1, 2)", globals=globals(), number=1_000_000)
create_pnt2 = timeit("PointNT2(1, 2)", globals=globals(), number=1_000_000)
create_pd1 = timeit("PointD1(1, 2)", globals=globals(), number=1_000_000)
create_pd2 = timeit("PointD2(1, 2)", globals=globals(), number=1_000_000)
create_pd3 = timeit("PointD3(1, 2)", globals=globals(), number=1_000_000)
create_pd4 = timeit("PointD4(1, 2)", globals=globals(), number=1_000_000)

In [142]:
print(f"pnt1: {create_pnt1:.5f}")
print(f"pnt2: {create_pnt2:.5f}")
print(f"pd1: {create_pd1:.5f}")
print(f"pd2: {create_pd2:.5f}")
print(f"pd3: {create_pd3:.5f}")
print(f"pd4: {create_pd4:.5f}")

pnt1: 0.15040
pnt2: 0.12031
pd1: 0.08607
pd2: 0.07842
pd3: 0.19190
pd4: 0.18190


So, this is interesting too - creating instances of a mutable dataclass is also faster than creating a named tuple instance, although the frozen dataclasses were the slowest to create in this run.

Let's compare all the different timings in one table:

In [143]:
!pip install tabulate



In [144]:
from tabulate import tabulate, tabulate_formats

In [145]:
data = [
    ['Object', 'Size', 'Create', 'Read Attrib'],
    ['collections.namedtuple', get_deep_size(pnt1), create_pnt1, read_attrib_pnt1],
    ['typing.NamedTuple', get_deep_size(pnt2), create_pnt2, read_attrib_pnt2],
    ['dataclass (mutable)', get_deep_size(pd1), create_pd1, read_attrib_pd1],
    ['dataclass (mutable, slots)', get_deep_size(pd2), create_pd2, read_attrib_pd2],
    ['dataclass (frozen)', get_deep_size(pd3), create_pd3, read_attrib_pd3],
    ['dataclass (frozen, slots)', get_deep_size(pd4), create_pd4, read_attrib_pd4],
]

print(tabulate(data, headers="firstrow", tablefmt="fancy_outline"))

╒════════════════════════════╤════════╤═══════════╤═══════════════╕
│ Object                     │   Size │    Create │   Read Attrib │
╞════════════════════════════╪════════╪═══════════╪═══════════════╡
│ collections.namedtuple     │    112 │ 0.150396  │      0.874879 │
│ typing.NamedTuple          │    112 │ 0.120309  │      0.843042 │
│ dataclass (mutable)        │    112 │ 0.0860746 │      0.430909 │
│ dataclass (mutable, slots) │    104 │ 0.0784211 │      0.414541 │
│ dataclass (frozen)         │    112 │ 0.191903  │      0.427334 │
│ dataclass (frozen, slots)  │    104 │ 0.181898  │      0.417137 │
╘════════════════════════════╧════════╧═══════════╧═══════════════╛


Overall, it seems like the better option, if I were loading something like rows from a database into some structure, or returning structured data from a callable, would be to use a mutable dataclass with slots.

Not a conclusion I was expecting when I first started looking at this to be honest.

So, am I going to switch to dataclasses instead of named tuples for function return values? Probably not, I'm set in my ways, and I do like the ability to create a named tuple using a single line of code:

In [146]:
Point = namedtuple('Point', 'x y z')

But I have to admit this has made me rethink named tuples vs dataclasses!

Did I get something wrong with these comparisons? Let me know in the comments.

#### Conclusion and Future Videos

We saw how to create dataclasses, and customize them quite a bit.

But there are many more fine-grained customizations we can make to dataclasses by adding declarations to the fields defined in the dataclass body, a few more options in the `@dataclass` decorator, as well as an interesting special method available for dataclasses, called `__post_init__`. We also have the option to add metadata to each field in the dataclass, as well as a few other odds and ends.

For all the functionality we saw with dataclasses, keep in mind that this is basically a subset of `attrs`. 

But this is more than enough for a single video, so I'm going to leave things here and come back with a second installment on more advanced features of dataclasses in the near future. Stay tuned!