# Introduction to the `dataclasses` module

The `dataclasses` module was introduced in Python 3.7. Its main feature is the `dataclass` decorator, which generates common dunder methods such as `__init__`, `__repr__`, and `__eq__` from a class's annotated fields.

In [127]:
from dataclasses import dataclass, field

In [14]:
help(dataclass)

Help on function dataclass in module dataclasses:

dataclass(cls=None, /, *, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False)
    Returns the same class as was passed in, with dunder methods
    added based on the fields defined in the class.
    
    Examines PEP 526 __annotations__ to determine fields.
    
    If init is true, an __init__() method is added to the class. If
    repr is true, a __repr__() method is added. If order is true, rich
    comparison dunder methods are added. If unsafe_hash is true, a
    __hash__() method function is added. If frozen is true, fields may
    not be assigned to after instance creation.


### Example: Defining a class to track companies

Suppose you want to create a class to keep track of information about companies.



In [284]:

import pandas as pd

class Company:
    name: str = ''
    profits: float = 0.
    id: str

In [285]:
print(Company())

<__main__.Company object at 0x7fc60b2ef7c0>


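By adding `@dataclass` we get a `__repr__` method for free, which `print` uses to show the contents of the instance. Note that `id` now needs a default value: the decorator generates an `__init__`, and fields without a default cannot follow fields with one.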

In [407]:
@dataclass
class Company:
    name: str = ''
    profits: float = 0. 
    id: str = ''

In [408]:
print(Company())

Company(name='', profits=0.0, id='')
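
As a rough illustration of what the decorator adds, the sketch below compares a hand-written class with a dataclass. `PlainCompany` is a hypothetical name introduced only for this comparison; it is not part of the original example.

```python
from dataclasses import dataclass


class PlainCompany:
    """Hand-written class (hypothetical, for comparison only)."""

    def __init__(self, name='', profits=0.):
        self.name = name
        self.profits = profits


@dataclass
class Company:
    name: str = ''
    profits: float = 0.


# The generated __repr__ lists every field with its value.
company_repr = repr(Company('Intel', 100.0))
print(company_repr)

# The generated __eq__ compares instances field by field.
dataclass_equal = Company('Intel', 100.0) == Company('Intel', 100.0)

# Without __eq__, Python falls back to identity comparison.
plain_equal = PlainCompany('Intel', 100.0) == PlainCompany('Intel', 100.0)

print(dataclass_equal, plain_equal)
```

The dataclass version gets `__init__`, `__repr__`, and `__eq__` without any of them being written by hand.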


This class has some potential problems. For example, the user can set an arbitrary `id` for each company:

In [409]:
c1 = Company(id='1234')
c1

Company(name='', profits=0.0, id='1234')


#### Prevent users from setting an attribute using `field` with `init=False`

We can prevent users from passing an `id` by declaring it as a `field` with `init=False`. The `default_factory` is then called to generate a value for each new instance.

In [413]:
import uuid

def generate_uuid4():
    return str(uuid.uuid4())

@dataclass
class Company:
    name: str = ''
    profits: float = 0. 
    id: str = field(init=False, default_factory=generate_uuid4)

Trying to pass `id` to the constructor now raises an error.

In [414]:
c1 = Company(id='1234')
c1

TypeError: __init__() got an unexpected keyword argument 'id'

But it works as expected if the user does not specify the `id`.

In [415]:
c1 = Company(name='Intel', profits=100)
c1

Company(name='Intel', profits=100, id='5c304230-8701-4e3e-a36a-bcefce1378be')
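
Two details are worth checking here. The following sketch, which reuses the `Company` definition from above, suggests that `default_factory` runs once per instance and that `init=False` only removes the parameter from `__init__`: assigning to `id` after creation is still allowed.

```python
import uuid
from dataclasses import dataclass, field


def generate_uuid4():
    return str(uuid.uuid4())


@dataclass
class Company:
    name: str = ''
    profits: float = 0.
    id: str = field(init=False, default_factory=generate_uuid4)


a = Company(name='Intel')
b = Company(name='Intel')

# default_factory is called for each new instance, so the ids differ.
ids_differ = a.id != b.id
print(ids_differ)

# init=False does not make the attribute read-only.
a.id = '1234'
print(a.id)
```

If the `id` must never change after creation, `init=False` alone is therefore not enough.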

#### Exclude attributes from being printed using `field` with `repr=False`

Consider the case where there is internal information that, even though accessible, might not be interesting to users when printing a variable. One can use `field(repr=False)` to avoid printing it.

In [432]:
import uuid

def generate_uuid4():
    return str(uuid.uuid4())

@dataclass
class Company:
    name: str = ''
    profits: float = 0. 
    id: str = field(init=False, default_factory=generate_uuid4, repr=False)
        
c1 = Company(name='Intel', profits=100)
c1

Company(name='Intel', profits=100)

Note, however, that the attributes of a dataclass instance are mutable by default.
We could change the name of `c1`, which might be undesired behaviour.

In [434]:
c1.name = 'ACS'
print(c1)

Company(name='ACS', profits=100)


#### Freeze a Dataclass using `frozen=True` in the dataclass decorator


In [435]:
import uuid

def generate_uuid4():
    return str(uuid.uuid4())

@dataclass(frozen=True)
class Company:
    name: str = ''
    profits: float = 0. 
    id: str = field(init=False, default_factory=generate_uuid4, repr=False)
        
c1 = Company(name='Intel', profits=100)
c1

Company(name='Intel', profits=100)

Now we are no longer able to mutate the attributes of an instance:

In [437]:
c1.name = 'ACS'

FrozenInstanceError: cannot assign to field 'name'
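
When a modified version of a frozen instance is needed, one option is `dataclasses.replace`, which builds a new instance instead of mutating the old one. The sketch below uses a simplified `Company` without the `id` field, since fields declared with `init=False` are not copied by `replace` but re-initialised.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Company:
    name: str = ''
    profits: float = 0.


c1 = Company(name='Intel', profits=100)

# Direct assignment is rejected on a frozen dataclass.
try:
    c1.name = 'ACS'
    assignment_rejected = False
except FrozenInstanceError as err:
    assignment_rejected = True
    print(f'Rejected: {err}')

# replace() returns a new instance with the requested changes applied.
c2 = replace(c1, name='ACS')
print(c1, c2)
```

The original instance `c1` is left untouched, so any code holding a reference to it keeps seeing the same values.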

#### Forcing users to specify keyword arguments with `kw_only=True`

So far we can create a `Company` instance using positional arguments. In some cases we might want to force users to pass arguments by keyword, to avoid mistakes when creating instances.

In [445]:
c1 = Company('Intel', 100)
c1

Company(name='Intel', profits=100)

In [444]:
# kw_only is only available on Python >= 3.10
import sys
sys.version

'3.9.12 (main, Apr  5 2022, 01:53:17) \n[Clang 12.0.0 ]'

In [447]:
import uuid

def generate_uuid4():
    return str(uuid.uuid4())

@dataclass(kw_only=True)
class Company:
    name: str = ''
    profits: float = 0. 
    id: str = field(init=False, default_factory=generate_uuid4, repr=False)
        
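# On Python >= 3.10 this raises a TypeError because positional arguments are rejected.
# On Python 3.9 (used here) the decorator call itself fails, since kw_only is not supported.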
c1 = Company('Intel', 100)
c1

TypeError: dataclass() got an unexpected keyword argument 'kw_only'
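
One way to keep the same code working on both older and newer interpreters is to pass `kw_only` only when the running Python supports it. This is a minimal sketch; the `options` dictionary and the `positional_allowed` flag are names introduced here for illustration.

```python
import sys
from dataclasses import dataclass

# kw_only was added to dataclass() in Python 3.10; pass no options on older versions.
options = {'kw_only': True} if sys.version_info >= (3, 10) else {}


@dataclass(**options)
class Company:
    name: str = ''
    profits: float = 0.


# Keyword arguments work on every supported version.
c1 = Company(name='Intel', profits=100)
print(c1)

# Positional arguments are rejected only when kw_only is in effect.
try:
    Company('Intel', 100)
    positional_allowed = True
except TypeError as err:
    positional_allowed = False
    print(f'Rejected: {err}')
```

On Python 3.9 the class silently falls back to accepting positional arguments, so this pattern trades strictness for portability.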

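#### Supporting pattern matching with @dataclass(`match_args=True`)

`match_args=True` (the default on Python >= 3.10) generates a `__match_args__` tuple from the parameters of `__init__`, so that instances can be used with positional patterns in `match` statements.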

In [523]:
import uuid

def generate_uuid4():
    return str(uuid.uuid4())

#@dataclass(match_args=True) # python >= 3.10
@dataclass
class Company:
    name: str = ''
    profits: float = 0. 
    id: str = field(init=False, default_factory=generate_uuid4, repr=False)
        
# positional arguments work as usual
c1 = Company('Intel', 100)
c1

Company(name='Intel', profits=100)

#### @dataclass(slots=True): Using slots instead of `__dict__`

By default, the attributes of a Python object are stored in a per-instance dictionary, for example `company.__dict__`. With `slots=True` (Python >= 3.10) the dataclass defines `__slots__` instead, which typically gives faster attribute access and lower memory use.
Be careful: slots can break multiple inheritance.

This can be done as follows:

In [453]:
import uuid

def generate_uuid4():
    return str(uuid.uuid4())

@dataclass(slots=True)
class Company:
    name: str = ''
    profits: float = 0. 
    id: str = field(init=False, default_factory=generate_uuid4, repr=False)
        
# slots=True requires Python >= 3.10; on Python 3.9 the decorator call itself fails
c1 = Company('Intel', 100)
c1

TypeError: dataclass() got an unexpected keyword argument 'slots'
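
On an interpreter that supports `slots=True`, the difference can be observed directly: instances no longer carry a `__dict__`, and attributes that were not declared as fields cannot be added. The sketch below guards the option by version so that it also runs on older interpreters; `options`, `uses_slots`, and the `ceo` attribute are names introduced for illustration.

```python
import sys
from dataclasses import dataclass

# slots was added to dataclass() in Python 3.10; pass no options on older versions.
options = {'slots': True} if sys.version_info >= (3, 10) else {}


@dataclass(**options)
class Company:
    name: str = ''
    profits: float = 0.


c1 = Company('Intel', 100)

# A slotted instance has no per-instance __dict__.
uses_slots = not hasattr(c1, '__dict__')
print(uses_slots)

# With slots, assigning an undeclared attribute raises AttributeError.
try:
    c1.ceo = 'unknown'
    extra_attribute_allowed = True
except AttributeError:
    extra_attribute_allowed = False
print(extra_attribute_allowed)
```

Code that relies on `instance.__dict__` (as some of the introspection further down does) will not work on slotted instances.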

### Example: Creating a dataclass for a Dataset

Suppose you want to wrap a pandas dataframe into a class `Dataset` containing:

- **name**: a name for the dataset
- **version**: a version identifier of the dataset (just in case the dataset changes over time)
- **dataframe**: the dataframe associated with the dataset
- **hash**: a hash of the dataframe to compare whether two `Dataset` objects are the same

In [387]:
from hashlib import sha256

def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()

@dataclass
class Dataset:
    name: str = ''
    version: int = 0
    dataframe: pd.DataFrame = field(default_factory=pd.DataFrame) 
    hash: str = field(init=False)
        
    def __post_init__(self) -> None:
        self.hash = hash_df(self.dataframe)

Now let's create two `Dataset` objects that share the same `.dataframe` and differ only slightly in `name`.

In [389]:
df1 = pd.DataFrame({'name':['John','David'], 'age':[52,20]})
d1 = Dataset('clients', version=1, dataframe = df1)
d2 = Dataset('clientes', version=1, dataframe = df1)

print(d1, d2)

print(f'\nEqual Datasets = {d1 == d2}')

Dataset(name='clients', version=1, dataframe=    name  age
0   John   52
1  David   20, hash='93524f2aff2f443700f6116f143f7ed99e24852d4b73827c863b26133c8f06b0') Dataset(name='clientes', version=1, dataframe=    name  age
0   John   52
1  David   20, hash='93524f2aff2f443700f6116f143f7ed99e24852d4b73827c863b26133c8f06b0')

Equal Datasets = False


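We can see that the two `Dataset` objects are not considered equal: the generated `__eq__` compares all fields, and the names differ. If we only care about `dataframe` when deciding whether two `Dataset` objects are the same, we can define an `__eq__` method in the class so that they are considered equal whenever the hashes of their data match. `@dataclass` does not overwrite an `__eq__` that the class defines explicitly.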

In [405]:

@dataclass
class Dataset:
    name: str = ''
    version: int = 0
    dataframe: pd.DataFrame = field(default_factory=pd.DataFrame) 
    hash: str = field(init=False)
        
    def __post_init__(self) -> None:
        self.hash = hash_df(self.dataframe)
        
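    def __eq__(self, other: object) -> bool:
        # Only compare against other Dataset instances.
        if not isinstance(other, Dataset):
            return NotImplemented
        return hash_df(self.dataframe) == hash_df(other.dataframe)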

In [406]:
df1 = pd.DataFrame({'name':['John','David'], 'age':[52,20]})

d1 = Dataset('clients', version=1, dataframe = df1)
d2 = Dataset('clientes', version=1, dataframe = df1)

print(d1, d2)

print(f'\nEqual Datasets = {d1 == d2}')

Dataset(name='clients', version=1, dataframe=    name  age
0   John   52
1  David   20, hash='93524f2aff2f443700f6116f143f7ed99e24852d4b73827c863b26133c8f06b0') Dataset(name='clientes', version=1, dataframe=    name  age
0   John   52
1  David   20, hash='93524f2aff2f443700f6116f143f7ed99e24852d4b73827c863b26133c8f06b0')

Equal Datasets = True
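
An alternative to writing `__eq__` by hand is to exclude fields from the generated comparison with `field(compare=False)`. The sketch below is a simplified stand-in, not the class above: it replaces the pandas dataframe with a plain list called `rows` (a hypothetical field) so that it runs with the standard library only.

```python
from dataclasses import dataclass, field
from hashlib import sha256


@dataclass
class Dataset:
    # compare=False removes a field from the generated __eq__.
    name: str = field(default='', compare=False)
    version: int = field(default=0, compare=False)
    rows: list = field(default_factory=list, compare=False)  # stand-in for the dataframe
    hash: str = field(init=False)

    def __post_init__(self) -> None:
        # The hash is derived from the data once the other fields are set.
        self.hash = sha256(repr(self.rows).encode()).hexdigest()


rows = [('John', 52), ('David', 20)]
d1 = Dataset('clients', version=1, rows=rows)
d2 = Dataset('clientes', version=1, rows=rows)
d3 = Dataset('clients', version=1, rows=[('Ana', 31)])

# Only the hash field takes part in the comparison.
print(d1 == d2)
print(d1 == d3)
```

Unlike the hand-written `__eq__`, this compares the hash stored at creation time, so it would not notice data that is mutated afterwards.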


In [514]:
d1.__dict__

{'name': 'clients',
 'version': 1,
 'dataframe':     name  age
 0   John   52
 1  David   20,
 'hash': '93524f2aff2f443700f6116f143f7ed99e24852d4b73827c863b26133c8f06b0'}

In [480]:
d1.__dataclass_fields__.keys()

dict_keys(['name', 'version', 'dataframe', 'hash'])

In [493]:
type(d1.__dataclass_fields__['version'])

dataclasses.Field

In [510]:
d1.__dataclass_fields__['version']

Field(name='version',type=<class 'int'>,default=0,default_factory=<dataclasses._MISSING_TYPE object at 0x7fc5e83f57c0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),_field_type=_FIELD)

In [513]:
d1.__dataclass_params__

_DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False)
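
The attributes `__dataclass_fields__` and `__dataclass_params__` are internal details. The module also offers public helpers for the same kind of introspection: `fields()` returns the `Field` objects and `asdict()` converts an instance to a dictionary. The sketch below uses a simplified `Dataset` with a hypothetical `tags` field instead of the dataframe.

```python
from dataclasses import dataclass, field, fields, asdict


@dataclass
class Dataset:
    name: str = ''
    version: int = 0
    tags: list = field(default_factory=list, repr=False)


d = Dataset('clients', version=1)

# fields() returns one Field object per declared field, in definition order.
field_names = [f.name for f in fields(d)]
for f in fields(d):
    print(f.name, f.type, f.repr)

# asdict() recursively converts the instance into plain dictionaries.
as_dictionary = asdict(d)
print(as_dictionary)
```

Both helpers also work on slotted dataclasses, where `instance.__dict__` is not available.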