# Zen of Python: The structure of your data should be explicit
Here I show a few design patterns you can use in your code. Refer back to the main article for the reasoning behind each one.

In [18]:
from __future__ import annotations
import seaborn
import pandas as pd
import typing

iris_df = seaborn.load_dataset("iris")
iris_df = iris_df[['sepal_length', 'sepal_width', 'species']]
assert (iris_df['sepal_length'] > 0).all()
print(iris_df.shape)
print(iris_df.head())

(150, 3)
   sepal_length  sepal_width species
0           5.1          3.5  setosa
1           4.9          3.0  setosa
2           4.7          3.2  setosa
3           4.6          3.1  setosa
4           5.0          3.6  setosa


In [21]:
iris_list = iris_df.to_dict(orient='records')
iris_list[:3]

[{'sepal_length': 5.1, 'sepal_width': 3.5, 'species': 'setosa'},
 {'sepal_length': 4.9, 'sepal_width': 3.0, 'species': 'setosa'},
 {'sepal_length': 4.7, 'sepal_width': 3.2, 'species': 'setosa'}]

## Example 1: Dataclasses

The `dataclasses` module has become popular because it is one of the easiest ways to create data containers. The `dataclasses.dataclass` decorator generates an `__init__` method that accepts the three declared fields, and we use `frozen=True` to indicate that instances should not be modified after they are created. This is a good failsafe to have when working with your data: unless you have a very good reason to expect modification, I recommend designing your code around this constraint.

In [2]:
import dataclasses

@dataclasses.dataclass(frozen=True)
class IrisEntryDataclass:
    # measurements are floats (e.g. 5.1), so hint them as float rather than int
    sepal_length: float
    sepal_width: float
    species: str

    # factory method constructor
    @classmethod
    def from_dataframe_row(cls, row: pd.Series) -> IrisEntryDataclass:
        return cls(
            sepal_length=row['sepal_length'],
            sepal_width=row['sepal_width'],
            species=row['species'],
        )

Using the factory method you can create a list of entry objects that contain the data associated with each row.

In [25]:
entries = list()
for ind, row in iris_df.iterrows():
    new_iris = IrisEntryDataclass.from_dataframe_row(row)
    entries.append(new_iris)

entries[:3]

[IrisEntryDataclass(sepal_length=5.1, sepal_width=3.5, species='setosa'),
 IrisEntryDataclass(sepal_length=4.9, sepal_width=3.0, species='setosa'),
 IrisEntryDataclass(sepal_length=4.7, sepal_width=3.2, species='setosa')]
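
To make the effect of `frozen=True` concrete, here is a minimal sketch. The class `FrozenIris` is a hypothetical stand-in for `IrisEntryDataclass` (it is not part of the notebook), used so the snippet runs without pandas. Assigning to a field raises `dataclasses.FrozenInstanceError`, and `dataclasses.replace` builds a modified copy instead.

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class FrozenIris:  # hypothetical stand-in for IrisEntryDataclass
    sepal_length: float
    sepal_width: float
    species: str

iris = FrozenIris(sepal_length=5.1, sepal_width=3.5, species="setosa")

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    iris.sepal_length = 6.0
    mutated = True
except dataclasses.FrozenInstanceError:
    mutated = False

# Instead of mutating, build a modified copy; the original is left untouched.
longer_iris = dataclasses.replace(iris, sepal_length=6.0)
print(mutated, iris.sepal_length, longer_iris.sepal_length)
```

Keep in mind that dataclasses do not check type hints at runtime, so a frozen instance can still hold a value of the wrong type if one is passed to the constructor.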

## Example 2: `attrs` Classes

Ideally you could impose stricter validation rules on the data inserted into these objects. The `attrs` project maintains a superset of the functionality in `dataclasses` (`attrs` actually predates `dataclasses` and was an inspiration for it) and provides convenient decorators and methods for type conversion and value checking. Note that below I pass a `converter` to each field and create validation functions using the `@species.validator`, `@sepal_length.validator`, and `@sepal_width.validator` decorators.

In [4]:
import attrs

@attrs.define(frozen=True, slots=True)
class IrisEntryAttrs:
    '''Represents a single iris.'''
    # converters coerce the inputs, so the stored measurements are floats
    sepal_length: float = attrs.field(converter=float)
    sepal_width: float = attrs.field(converter=float)
    species: str = attrs.field(converter=str)

    ######################### Factory Methods #########################
    @classmethod
    def from_dataframe_row(cls, row: pd.Series) -> IrisEntryAttrs:
        return cls(
            sepal_length=row['sepal_length'],
            sepal_width=row['sepal_width'],
            species=row['species'],
        )

    ######################### Validators #########################
    # the second argument is named `attribute` to avoid shadowing the `attr` package
    @species.validator
    def species_validator(self, attribute, value) -> None:
        if not len(value) > 0:
            raise ValueError(f'{attribute.name} cannot be empty')

    # one validator registered for both measurement fields
    @sepal_length.validator
    @sepal_width.validator
    def meas_validator(self, attribute, value) -> None:
        if not value > 0:
            raise ValueError(f'{attribute.name} was out of range')

    ######################### Methods #########################
    def sepal_area(self) -> float:
        return self.sepal_length * self.sepal_width

## Example 3: Overloading Builtin Data Structures
Following convention, one might be tempted to build a container object that holds the list of entries. In some select cases, though, it makes sense to create a container that directly inherits from the builtin `list` or, as shown here, `typing.List` (which is supposed to be more friendly for inheritance). In general inheritance is bad for your health, but here it can make things simpler. Our primary use is to introduce the factory method `from_dataframe`, which simply calls the `IrisEntryAttrs` factory method to parse each row of the dataframe separately. This way you can add operations like grouping, filtering, or even data type conversion as an extension of the list.

In [27]:
class IrisEntriesList(typing.List[IrisEntryAttrs]):

    ######################### Factory Methods #########################
    @classmethod
    def from_dataframe(cls, df: pd.DataFrame) -> IrisEntriesList:
        # parse each row with the entry's own factory method
        return cls(IrisEntryAttrs.from_dataframe_row(row) for _, row in df.iterrows())

    ######################### Grouping and Filtering #########################
    def group_by_species(self) -> typing.Dict[str, IrisEntriesList]:
        groups: typing.Dict[str, IrisEntriesList] = dict()
        for e in self:
            groups.setdefault(e.species, self.__class__())
            groups[e.species].append(e)
        return groups

    def filter_sepal_area(self, sepal_area: float) -> IrisEntriesList:
        # keep entries whose sepal area is at least `sepal_area`;
        # this relies on `IrisEntryAttrs.sepal_area`, which the dataclass version lacks
        return self.__class__(e for e in self if e.sepal_area() >= sepal_area)

entries = IrisEntriesList.from_dataframe(iris_df)
entries[:2]

[IrisEntryAttrs(sepal_length=5.1, sepal_width=3.5, species='setosa'),
 IrisEntryAttrs(sepal_length=4.9, sepal_width=3.0, species='setosa')]
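
The grouping and filtering methods are defined above but not called, so here is a self-contained sketch of how they might be used. `Entry` and `Entries` are hypothetical, trimmed-down stand-ins for the entry class and `IrisEntriesList`, built on the standard library only so the snippet runs without pandas or `attrs`; the three rows are made-up sample values.

```python
import dataclasses
import typing

@dataclasses.dataclass(frozen=True)
class Entry:  # hypothetical stand-in for the iris entry classes
    sepal_length: float
    sepal_width: float
    species: str

    def sepal_area(self) -> float:
        return self.sepal_length * self.sepal_width

class Entries(typing.List[Entry]):  # hypothetical, trimmed-down IrisEntriesList
    def group_by_species(self) -> typing.Dict[str, 'Entries']:
        groups: typing.Dict[str, 'Entries'] = {}
        for e in self:
            groups.setdefault(e.species, self.__class__()).append(e)
        return groups

    def filter_sepal_area(self, sepal_area: float) -> 'Entries':
        return self.__class__(e for e in self if e.sepal_area() >= sepal_area)

entries = Entries([
    Entry(5.1, 3.5, 'setosa'),
    Entry(4.9, 3.0, 'setosa'),
    Entry(7.0, 3.2, 'versicolor'),
])

groups = entries.group_by_species()
large = entries.filter_sepal_area(17.0)
print({species: len(group) for species, group in groups.items()})
print(type(large).__name__, len(large))

# Caveat: slicing a list subclass returns a plain list, not the subclass.
print(type(entries[:2]).__name__)
```

The methods return the subclass because they construct it explicitly with `self.__class__(...)`. Inherited operations such as slicing or `+` fall back to a plain `list`, which is one reason to use this pattern sparingly.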

## Misc examples from the article
Here I'm including some of the quick examples I showed in the article.

### Don't Use Dataframes
Dataframes are generally popular because they are easy to use and great for plotting and creating summary statistics. While I often use them to work with tabular data, I suggest that you avoid using them as the primary data structures in your pipelines for two reasons: (1) you do not have explicit knowledge about your data without introspecting, and the introspection may need to happen at multiple levels of your program; and (2) they are often the wrong tools for the job (performance-wise) - even though they may be fine for many tasks.

In [15]:
iris_df['species']
iris_df.species

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object

### Nested Iterables
In languages such as Python, you might prefer to structure your data with standard data structures like `list`s, `dict`s, and `set`s because they are simple and, in many cases, the right tools for the job (you can choose the optimal data structure for whatever operations you will perform, an approach often missed by dataframe users). However, they can be a lot to keep track of as the data takes on more complicated structures and you trace it through larger and larger call stacks. You can see the complexity in the type hints I provide for `group_by_species`, a function that simply groups objects by species. The function accepts a list of dictionaries mapping strings to floats or strings, and it returns a dictionary mapping strings to lists of dictionaries that map strings to floats or strings.

In [19]:
import json
json.dumps(iris_list[:5])

'[{"sepal_length": 5.1, "sepal_width": 3.5, "species": "setosa"}, {"sepal_length": 4.9, "sepal_width": 3.0, "species": "setosa"}, {"sepal_length": 4.7, "sepal_width": 3.2, "species": "setosa"}, {"sepal_length": 4.6, "sepal_width": 3.1, "species": "setosa"}, {"sepal_length": 5.0, "sepal_width": 3.6, "species": "setosa"}]'

In [8]:
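import statistics
# assumption: `irises` is the full dataset as records (iris_df above dropped petal_length)
irises = seaborn.load_dataset("iris").to_dict(orient='records')
statistics.mean([iris['petal_length'] for iris in irises])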

3.758

In [9]:
EntryList = typing.List[typing.Dict[str, typing.Union[float, str]]]
def group_by_species(irises: EntryList) -> typing.Dict[str, EntryList]:
    groups = dict()
    for iris in irises:
        groups.setdefault(iris['species'], list())
        groups[iris['species']].append(iris)
    return groups

groups = group_by_species(iris_list)
groups.keys()

dict_keys(['setosa', 'versicolor', 'virginica'])
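
For contrast with the nested type hints above, here is a sketch of the same grouping written against an explicit entry class. `Iris`, `IrisList`, and `group_irises_by_species` are hypothetical names introduced for illustration, and the three rows are made-up sample values. The hints now name the entry type directly instead of spelling out dictionaries of strings to floats or strings.

```python
import dataclasses
import typing

@dataclasses.dataclass(frozen=True)
class Iris:  # hypothetical entry class
    sepal_length: float
    sepal_width: float
    species: str

IrisList = typing.List[Iris]

def group_irises_by_species(irises: IrisList) -> typing.Dict[str, IrisList]:
    groups: typing.Dict[str, IrisList] = {}
    for iris in irises:
        # attribute access replaces the string key lookup iris['species']
        groups.setdefault(iris.species, []).append(iris)
    return groups

irises = [
    Iris(5.1, 3.5, 'setosa'),
    Iris(7.0, 3.2, 'versicolor'),
    Iris(4.9, 3.0, 'setosa'),
]
grouped = group_irises_by_species(irises)
print(list(grouped))
```

A misspelled field such as `iris.speces` is also something a static type checker can flag, whereas a misspelled dictionary key only fails when that line runs.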