# Dataclasses

Dataclasses are primarily aimed at creating data-oriented classes. Think of a class that represents a point, a vector, or any other simple data structure. This is very different from behavior-oriented classes, such as a payment service that exposes a number of methods that you call in order to process payments.

Python dataclasses are a relatively new feature introduced in Python 3.7 that provide a way to create classes that are primarily used to store data, with less boilerplate code than regular classes. Dataclasses can be thought of as a way to define classes that are similar to named tuples, but with added functionality and the ability to define methods.

How does a dataclass help in representing data-oriented classes? Dataclasses automatically generate a constructor (`__init__`), a representation (`__repr__`), an equality comparison (`__eq__`), and, on request, other special methods that data-oriented classes typically need. This makes it easier to write classes that are focused on storing and working with data, rather than implementing complex behaviors.

To create a dataclass, you simply use the `@dataclass` decorator and define the fields that the class should have.

#### *Just a side note on decorators:*
    
A decorator is a function that takes another function or class and returns a modified or wrapped version of it. A decorator is applied by writing the `@` symbol followed by the decorator's name on the line above the definition.

Decorators allow you to add functionality to existing code without modifying the original code. This is particularly useful when you want to add functionality to a library or module that you don't have control over.

Here's a simple example of a decorator that adds timing information to a function:

In [None]:
import time

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"Execution time: {end_time - start_time} seconds")
        return result
    return wrapper

@timer
def my_function():
    print("I am using the 'timer' decorator")

my_function()

The `timer` function is a decorator: it takes a function as an argument and returns a new function, `wrapper`, that runs the original function with timing code around it. The `@timer` syntax applies the decorator to `my_function`.

When `my_function` is called, it now also prints the timing information added by the `timer` decorator.
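
To make the `@` syntax less magical, here is a minimal sketch showing that applying a decorator is just a function call whose result replaces the original name. The `add` function and the `timed_add` name are hypothetical, and the use of `functools.wraps` and `time.perf_counter` is a choice made for this sketch rather than something the example above requires.

```python
import functools
import time


def timer(func):
    """Variant of the timer decorator that also preserves the function's metadata."""
    @functools.wraps(func)  # copy __name__, __doc__, etc. from func onto wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # monotonic clock, suited to measuring durations
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Execution time: {elapsed:.6f} seconds")
        return result
    return wrapper


def add(a, b):
    """Return the sum of a and b (hypothetical example function)."""
    return a + b


# Applying the decorator by hand; writing `@timer` above `def add` does the same
# thing and binds the result back to the name `add`.
timed_add = timer(add)

print(timed_add(2, 3))      # prints the timing line, then 5
print(timed_add.__name__)   # add
```

Without `functools.wraps`, the last line would print `wrapper`, which can make debugging output confusing.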

Let's go back to our topic.

In [None]:
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int


This creates a `Person` class with two fields: `name` (a string) and `age` (an integer). The `@dataclass` decorator automatically generates a constructor and other methods like `__repr__` and `__eq__` based on the defined fields.

This means that you can create instances of the Person class like this:
> `p = Person("Alice", 30)`

Dataclasses also support default values for fields, as well as type annotations and other features that make working with data structures in Python more convenient.
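
As a quick check of what the generated methods do, the following sketch creates two `Person` instances with the same field values and compares them. The variable names `alice` and `same_alice` are only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


alice = Person("Alice", 30)
same_alice = Person(name="Alice", age=30)

print(alice)                 # Person(name='Alice', age=30)
print(alice == same_alice)   # True: the generated __eq__ compares the fields
print(alice is same_alice)   # False: they are still two distinct objects
```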

Another Example:

In [None]:
import random
import string

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

class Person:
    def __init__(self, name: str, address: str):
        self.name = name
        self.address = address
        
person = Person(name="Alvin", address="143 Singapore")
print(person)


Unfortunately, the printed output is not very useful: it is just the class name and the object's memory address, something like `<__main__.Person object at 0x...>`. Ideally, printing the person would show the name and the address.

We can add a `__str__` dunder method to our class to define what is shown when we print an instance.

In [None]:
import random
import string

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

class Person:
    def __init__(self, name: str, address: str):
        self.name = name
        self.address = address

    def __str__(self) -> str:
        return f"name={self.name}, address={self.address}"
        
person = Person(name="Alvin", address="143 Singapore")
print(person)


The problem with this approach is that every time we add an attribute to the `Person` class:
- we need to add it as an argument to `__init__`
- we need to store the argument in an attribute of the object
- we need to add it to the `__str__` dunder method
- we need to take the new field into account when comparing `Person` objects, if applicable

So, it complicates things. This is where dataclasses come into the picture.
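
To make the bookkeeping in the list above concrete, here is a rough sketch of a hand-written class with one extra attribute. The class name `ManualPerson` and the `active` attribute are assumptions made for this illustration; note how `active` has to appear in the signature, the assignments, `__repr__`, and `__eq__`.

```python
class ManualPerson:
    """Hypothetical hand-written counterpart; every new field touches four places."""

    def __init__(self, name: str, address: str, active: bool = True):
        self.name = name
        self.address = address
        self.active = active

    def __repr__(self) -> str:
        return (
            f"ManualPerson(name={self.name!r}, "
            f"address={self.address!r}, active={self.active!r})"
        )

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ManualPerson):
            return NotImplemented
        return (self.name, self.address, self.active) == (
            other.name, other.address, other.active
        )


a = ManualPerson("Alvin", "143 Singapore")
b = ManualPerson("Alvin", "143 Singapore")
print(a)        # ManualPerson(name='Alvin', address='143 Singapore', active=True)
print(a == b)   # True, but only because __eq__ was written by hand
```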

In [None]:
import random
import string
from dataclasses import dataclass

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
        
person = Person(name="Alvin", address="143 Singapore")
print(person)

Dataclasses also let us assign default values to attributes.

In [None]:
import random
import string
from dataclasses import dataclass

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
        
person = Person(name="Alvin", address="143 Singapore")
print(person)

For simple immutable types like booleans, integers, floats and strings, this way of setting defaults works well. But what if we need something more complicated? For example, we would like to add a list of email addresses that defaults to an empty list.

In [None]:
import random
import string
from dataclasses import dataclass

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = []  # mutable default: @dataclass raises ValueError here
        
person = Person(name="Alvin", address="143 Singapore")
print(person)

The problem is that Python evaluates a default value once, when the class is defined, not each time an instance is created. If this were allowed, every `Person` would refer to the same list, so an email address added to one person would show up on all of them. To protect against this, `@dataclass` rejects the `list` default and raises a `ValueError` when the class is defined.
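
The sketch below shows both halves of this in isolation: the error that `@dataclass` raises for a list default, and the shared-list bug it guards against, reproduced with a plain class attribute. The class names `BrokenPerson` and `SharedListPerson` and the example email address are hypothetical.

```python
from dataclasses import dataclass

error_message = ""
try:
    @dataclass
    class BrokenPerson:
        name: str
        email_addresses: list[str] = []   # mutable default: rejected by dataclasses
except ValueError as exc:
    error_message = str(exc)
    print(f"ValueError: {error_message}")


# The bug that the check guards against, shown with a plain class attribute.
class SharedListPerson:
    email_addresses: list[str] = []       # one list object shared by every instance

    def __init__(self, name: str):
        self.name = name


first = SharedListPerson("Alvin")
second = SharedListPerson("Anumanth")
first.email_addresses.append("alvin@example.com")
print(second.email_addresses)   # ['alvin@example.com'], shared by accident
```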

To solve that, the `dataclasses` module provides the `field()` function, whose `default_factory` parameter we can use instead.

In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
        
person = Person(name="Alvin", address="143 Singapore")
print(person)

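When an instance is created without a value for the field, the generated `__init__` calls the factory to produce a fresh default. Note that we pass a callable (here `list` itself), not a value such as `[]`. Any function that takes no arguments works, so we can also use `generate_id` to give every person an id.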
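
A small sketch can make the timing of the factory call visible. The `new_email_list` factory and the `factory_calls` log are hypothetical helpers that exist only to record when the factory runs.

```python
from dataclasses import dataclass, field

factory_calls: list[str] = []   # hypothetical log, only here to make the calls visible


def new_email_list() -> list[str]:
    """Hypothetical factory: returns a fresh list and records that it ran."""
    factory_calls.append("called")
    return []


@dataclass
class Person:
    name: str
    email_addresses: list[str] = field(default_factory=new_email_list)


calls_after_class_definition = len(factory_calls)   # defining the class calls nothing

alvin = Person("Alvin")
anu = Person("Anumanth")
alvin.email_addresses.append("alvin@example.com")

print(calls_after_class_definition)   # 0
print(len(factory_calls))             # 2, one call per instance
print(anu.email_addresses)            # [], not affected by the append above
```

Passing `email_addresses` explicitly when creating a `Person` skips the factory for that instance.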

In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
    id: str = field(default_factory=generate_id)
        
person = Person(name="Alvin", address="143 Singapore")
print(person)

With the default values, we can still set them as part of the initializer.

In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
    id: str = field(default_factory=generate_id)
        
person = Person(name="Alvin", address="143 Singapore", active=False, id="ABCD123")
print(person)

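Let's say we don't want to allow the `id` field to be set through the initializer. We can pass `init=False` to `field()`. The field is then left out of the generated `__init__`, so passing `id` to the constructor, as in the cell below, raises a `TypeError`.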
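
As a trimmed-down sketch of this behavior, the example below keeps only the `name` and `id` fields and catches the error so that the script keeps running. The `rejected` flag is a hypothetical helper for this illustration.

```python
import random
import string
from dataclasses import dataclass, field


def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))


@dataclass
class Person:
    name: str
    id: str = field(init=False, default_factory=generate_id)


person = Person(name="Alvin")
print(len(person.id))   # 25: the factory still runs for every new instance

rejected = False
try:
    Person(name="Alvin", id="ABCD123")   # id is no longer an __init__ parameter
except TypeError as exc:
    rejected = True
    print(f"TypeError: {exc}")
```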

In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
    id: str = field(init=False, default_factory=generate_id)
        
person = Person(name="Alvin", address="143 Singapore", active=False, id="ABCD123")  # raises TypeError: id is not accepted by __init__
print(person)

Sometimes we would like to derive a value from the other instance variables during initialization. A `default_factory` cannot do that, because it is called without arguments and has no access to the other fields. That is where the `__post_init__` method comes into play: the generated `__init__` calls it after all the fields have been set.

As an example, we can add a `search_string` field that combines the name and the address, so that we can search for persons later.



In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
    id: str = field(init=False, default_factory=generate_id)
    search_string: str = field(init=False)
    
    def __post_init__(self) -> None:
        self.search_string = f"{self.name} {self.address}"
        
person = Person(name="Alvin", address="143 Singapore", active=False)
print(person)

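What if we want `search_string` to be treated as internal? We can prefix the variable name with an underscore. By convention, a single leading underscore `_` marks a name as non-public (roughly what other languages call `protected`), while two leading underscores `__` trigger name mangling, which is the closest Python comes to `private`. Neither is enforced by the language.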
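
As a rough illustration of how Python treats the two prefixes, the sketch below uses a plain class. The `Account` class and its attributes are hypothetical and unrelated to the `Person` example.

```python
class Account:
    """Hypothetical class used only to show the underscore conventions."""

    def __init__(self) -> None:
        self._balance = 100   # single underscore: "internal" by convention only
        self.__pin = "1234"   # double underscore: stored as _Account__pin


account = Account()
print(account._balance)            # 100, nothing stops access from outside
print(hasattr(account, "__pin"))   # False: the attribute was stored under a mangled name
print(account._Account__pin)       # 1234, so even this is reachable on purpose
```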


In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
    id: str = field(init=False, default_factory=generate_id)
    _search_string: str = field(init=False)
    
    def __post_init__(self) -> None:
        self._search_string = f"{self.name} {self.address}"
        
person = Person(name="Alvin", address="143 Singapore", active=False)
print(person)

In the above case, we might not want to print `_search_string`, as it only repeats the `name` and `address`. We can pass `repr=False` to `field()` to leave it out of the generated `__repr__`.

In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
    id: str = field(init=False, default_factory=generate_id)
    _search_string: str = field(init=False, repr=False)
    
    def __post_init__(self) -> None:
        self._search_string = f"{self.name} {self.address}"
        
person = Person(name="Alvin", address="143 Singapore", active=False)
print(person)

What if we want our object to be immutable? We can pass `frozen=True` to the `dataclass` decorator. Assigning to a field of a frozen instance, as in the cell below, raises a `FrozenInstanceError`.
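
Here is a minimal sketch of the frozen behavior on a two-field class, with the error caught so that the rest of the script still runs. The `assignment_blocked` flag is a hypothetical helper for this illustration.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Person:
    name: str
    address: str


person = Person(name="Alvin", address="143 Singapore")

assignment_blocked = False
try:
    person.name = "Anumanth"   # not allowed on a frozen instance
except FrozenInstanceError as exc:
    assignment_blocked = True
    print(f"FrozenInstanceError: {exc}")

print(person)   # Person(name='Alvin', address='143 Singapore')

# With the default eq=True, frozen instances are also hashable,
# so they can be used as dictionary keys or set members.
print(len({person, Person(name="Alvin", address="143 Singapore")}))   # 1
```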

In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass(frozen=True)
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
    id: str = field(init=False, default_factory=generate_id)
    _search_string: str = field(init=False, repr=False)
            
person = Person(name="Alvin", address="143 Singapore", active=False)
person.name = "Anumanth"  # raises FrozenInstanceError
print(person)

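Freezing protects the fields of an instance, not the variable that refers to it. If we assign a new `Person` to the name `person`, Python creates a new object with the new values, and the old object becomes eligible for garbage collection once nothing else refers to it.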
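
One possible way to get a changed copy without retyping every field is `dataclasses.replace`, sketched below under the assumption that the derived `_search_string` should be recomputed for the copy. Because a frozen instance blocks normal assignment even inside `__post_init__`, this sketch sets the derived field through `object.__setattr__`; that is a common workaround, not something the cells in this notebook do.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Person:
    name: str
    address: str
    _search_string: str = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Plain `self._search_string = ...` would raise FrozenInstanceError,
        # so bypass the frozen __setattr__ for this one-time derived field.
        object.__setattr__(self, "_search_string", f"{self.name} {self.address}")


original = Person(name="Alvin", address="143 Singapore")
renamed = replace(original, name="Anumanth")   # new object, original untouched

print(original)                 # Person(name='Alvin', address='143 Singapore')
print(renamed)                  # Person(name='Anumanth', address='143 Singapore')
print(renamed._search_string)   # Anumanth 143 Singapore
```

`replace` builds the copy by calling `__init__` again, which is why `__post_init__` runs and the search string matches the new name.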

In [None]:
import random
import string
from dataclasses import dataclass, field

def generate_id() -> str:
    return "".join(random.choices(string.ascii_letters, k=25))

@dataclass(frozen=True)
class Person:
    name: str
    address: str
    active: bool = True
    email_addresses: list[str] = field(default_factory=list)
    id: str = field(init=False, default_factory=generate_id)
    _search_string: str = field(init=False, repr=False)
            
person = Person(name="Alvin", address="143 Singapore", active=False)
person = Person(name="Anumanth", address="143 Singapore", active=False)
print(person)