# "Rediscovering" [structured](https://en.wikipedia.org/wiki/Structured_programming) programming 

... or how to organize your (Python) code and not lose your sanity

## Introduction

I am using the expression "structured programming" a bit loosely here, but I think it is not entirely irrelevant to the topic I wish to discuss. I would like to argue for a certain way of organising source code that I believe will make it more readable, more modular and more extensible. What I mean by these terms will hopefully be clear by the end.

As an example I will use a function (and its evolution) that I have written for plotting model variables (density, composition) from `hdf5` datafiles. The code is written in Python, but the concepts can be generalized to any programming language.

## A motivating example



In [5]:
import os.path as path
from glob import iglob

# (1)
def plot_figures(model, variable, step, overwrite=False):
    fpath = path.join(model.figures, variable.name)
    
    # globbing for datafiles and ordering them by filename (2)
    files = sorted(iglob(path.join(model.root, "luca*.gzip.h5")))
    
    for ii, datafile in enumerate(files):
        if step != -1 and ii % step:
            continue
        
        # output filename
        # (only `path` is imported above, so we use it instead of `os.path`)
        out = path.join(fpath, "luca%d.png" % ii)
        
        # (3)
        # skip the datafile unless the plot is outdated or overwrite is requested
        if not (newer_than(datafile, out) or overwrite):
            continue
        
        # (4)
        plot_variable(variable, datafile, out)

Explanations:
1. Here I assume that the arguments for the function stand for:
    - `model`: has fields:
        - `figures`: a path to the directory where figures are stored
        - `root`: path to the directory where datafiles can be found
    - `variable`: has fields:
        - `name`: the name of the variable we want to plot
2. `iglob(path.join(model.root, "luca*.gzip.h5"))` is similar to executing the bash
command `ls /path/to/model/luca*.gzip.h5` in a terminal.
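
The helper `newer_than` used at (3) is not shown in this notebook. A minimal sketch of what it might look like, assuming it compares modification times and treats a missing output file as outdated (both are my assumptions, not the original implementation):

```python
import os.path as path


def newer_than(source: str, target: str) -> bool:
    """Return True if `target` is missing or older than `source`."""
    # a missing output file always needs to be (re)generated
    if not path.exists(target):
        return True

    # compare the last modification times of the two files
    return path.getmtime(source) > path.getmtime(target)
```

With this definition a plot is regenerated only when its datafile has changed since the plot was last written.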

On a relatively unrelated note, I recommend checking out the `os.path` module of the Python standard library. It has many useful functions for manipulating filepaths and can replace many path-related shell utilities in a cross-platform way (i.e. it works on Windows, Linux and macOS).

If you are willing to sacrifice backward compatibility, the `pathlib` module (available since Python 3.4) is, in my opinion, a better replacement for the `os.path` functions.

You can do this:

In [83]:
from pathlib import Path

p = Path.cwd().absolute().glob("*.ipynb")
tuple(p)[:3]

(PosixPath('/home/istvan/packages/src/github.com/bozso/python_course/notebooks/en/file_io_exceptions.ipynb'),
 PosixPath('/home/istvan/packages/src/github.com/bozso/python_course/notebooks/en/object_oriented_programming.ipynb'),
 PosixPath('/home/istvan/packages/src/github.com/bozso/python_course/notebooks/en/structured_programming.ipynb'))

Instead of doing this:

In [89]:
import os

p = iglob(path.join(path.abspath(os.getcwd()), "*.ipynb"))
tuple(p)[:3]

('/home/istvan/packages/src/github.com/bozso/python_course/notebooks/en/file_io_exceptions.ipynb',
 '/home/istvan/packages/src/github.com/bozso/python_course/notebooks/en/object_oriented_programming.ipynb',
 '/home/istvan/packages/src/github.com/bozso/python_course/notebooks/en/structured_programming.ipynb')

Back to the plotting function. Suppose we later want to make the datafile name and the output extension configurable, and leave room for an extra check. The function grows:

In [10]:
def plot_figures_extra_check(model, variable, step, name, overwrite=False, ext="png"):
    fpath = path.join(model.figures, variable.name)
    
    # globbing for datafiles and ordering them by filename (2)
    files = sorted(iglob(path.join(model.root, "%s*.gzip.h5" % name)))
    
    for ii, datafile in enumerate(files):
        if step != -1 and ii % step:
            continue
        
        # add extra check...
        
        # output filename
        out = path.join(fpath, "%s%d.%s" % (name, ii, ext))
        
        # (3)
        if not (newer_than(datafile, out) or overwrite):
            continue
        
        # (4)
        plot_variable(variable, datafile, out)

## Interfaces and why you should care about them

Just a reminder of what our main function looks like at the moment:

In [91]:
def plot_figures_extra_check(model, variable, step, name, overwrite=False,
                             ext="png", check=None):
    fpath = path.join(model.figures, variable.name)
    
    # globbing for datafiles and ordering them by filename (2)
    files = sorted(iglob(path.join(model.root, "%s*.gzip.h5" % name)))
    
    for ii, datafile in enumerate(files):
        if step != -1 and ii % step:
            continue
        
        # add extra check...
        if check is not None and not check(datafile):
            continue
        
        # output filename
        out = path.join(fpath, "%s%d.%s" % (name, ii, ext))
        
        # (3)
        if not (newer_than(datafile, out) or overwrite):
            continue
        
        # (4)
        plot_variable(variable, datafile, out)

The first thing worth noticing is that we have a bunch of semi-related variables as function arguments. E.g. `model` is needed for globbing for the datafiles and building up the output filename; `step`, `check`, and `overwrite` are used for deciding whether to use or skip a given datafile.

One note regarding the number of arguments: it is too high. A common rule of thumb is to limit the number of arguments a function takes to two or three. Perhaps more importantly, as mentioned in the previous paragraph, we have "overlapping" arguments. By overlapping I mean that they relate to the same functionality.

So how can we improve the situation? First of all we should think about what our function does and how it achieves its intended goals.

We can distinguish several distinct steps:
1. Querying the datafiles (i.e. globbing for `*.h5` files in the model root directory).
2. Generating filepaths of the output files.
3. Filtering out datafiles we do not want to use for plotting. This can be separated into two steps, checking if the plotfile needs to be updated and other filtering.
4. Plotting the selected datafiles.

Each of these four steps can be represented by an *interface*. But what is an interface? We can think of it as a concept that describes what a variable we use in our program can do. Since Python does not enforce type annotations at runtime, there is no single definitive way of defining an interface, but I have come up with the following:

In [98]:
from typing import Tuple

class Globber(object):
    """
    An object that can query a list of paths.
    """
    def glob(self) -> Tuple[str, ...]:
        """
        Returns a tuple of strings that represent filepaths.
        """
        pass

Notice that **nowhere** in this class have I implemented **any** functionality. An interface, at least as I use them, is **not** supposed to contain **details of implementation**, it only **describes** what a variable, that implements said interface, can do.
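
As an aside, the standard library offers another way to write down such an interface: `typing.Protocol` (available since Python 3.8). The sketch below is an alternative to the plain class above, not the approach used in the rest of this notebook; `FixedPaths` is a hypothetical implementation made up for the demonstration.

```python
from typing import Protocol, Tuple, runtime_checkable


@runtime_checkable
class Globber(Protocol):
    """An object that can query a list of paths."""

    def glob(self) -> Tuple[str, ...]:
        """Returns a tuple of strings that represent filepaths."""
        ...


class FixedPaths:
    """Hypothetical implementation that returns a predefined list."""

    def __init__(self, *paths: str):
        self.paths = paths

    def glob(self) -> Tuple[str, ...]:
        return tuple(self.paths)


# FixedPaths never mentions Globber, yet it satisfies the protocol
print(isinstance(FixedPaths("a.h5", "b.h5"), Globber))  # prints True
```

The runtime check only looks for the presence of a `glob` method; verifying the signature is left to a static type checker.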

An interface is only half of the story. The other half is the actual class(es) that implement(s) it. Let's implement the Globber interface:

In [109]:
from typing import NamedTuple, Tuple

class GlobData(NamedTuple):
    # root directory where to look for datafiles
    root: str
    # name of the datafile pattern
    name: str
    # extension of datafiles, without the leading dot
    extension: str

    def glob(self) -> Tuple[str, ...]:
        return tuple(
            iglob(path.join(self.root, "%s*.%s" % (self.name, self.extension)))
        )

At first glance this does not help too much. We only "hid" the call to `iglob` inside a class. Fair enough, but I would encourage reading on to see what happens when we separate our other steps into classes.

Next we "declare" the interface for filtering files:

In [101]:
class Filterer(object):
    """
    An object that can select datapaths that we want to use.
    """
    def filter(self, datapath: str, out: str) -> bool:
        """Returns False if datafile needs to be skipped."""
        pass

For the sake of simplicity let's assume we just want to check whether the plotted image file needs to be updated, and implement the interface accordingly:

In [104]:
class NeedsUpdate(NamedTuple):
    overwrite: bool
    
    def filter(self, datapath: str, out: str) -> bool:
        # keep the datafile if its plot is outdated or overwrite is requested
        return newer_than(datapath, out) or self.overwrite

Finally we declare the interface for building paths for output image files.

In [100]:
class Transformer(object):
    """
    An object that generates the output path corresponding to a datafile.
    """
    def transform(self, filepath: str) -> str:
        pass

A possible implementation of it:

In [105]:
class Variable(NamedTuple):
    name: str

In [107]:
class OutName(NamedTuple):
    # the variable we want to plot
    var: Variable
    # path to the output directory
    figures: str
    # default value can be given
    extension: str = "png"
        
    def transform(self, datafile: str) -> str:
        # e.g. this will transform "model/luca001.gzip.h5" into
        # {figures}/{var.name}/luca001.png
        
        # get the filename without the extension
        name = path.basename(datafile).split(".")[0]
        
        return path.join(
            self.figures, self.var.name, "%s.%s" % (name, self.extension)
        )
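
The same transformation can also be sketched with `pathlib`, recommended earlier. This is a standalone variant for illustration only; the function name `out_path` and its argument list are my own choices, and `PurePosixPath` is used so that the result looks the same on every platform.

```python
from pathlib import PurePosixPath


def out_path(datafile: str, figures: str, var_name: str,
             extension: str = "png") -> PurePosixPath:
    # file name up to the first dot, e.g. "luca001.gzip.h5" -> "luca001"
    stem = PurePosixPath(datafile).name.split(".")[0]

    # the "/" operator joins path components
    return PurePosixPath(figures) / var_name / ("%s.%s" % (stem, extension))


print(out_path("model/luca001.gzip.h5", "/home/user/figures", "density"))
# prints /home/user/figures/density/luca001.png
```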

Just for the sake of completeness I also declare the plotter, but will not implement it.

In [66]:
class Plotter(object):
    def plot(self, data: str, out: str):
        pass

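After all this we can also create a class for managing plotting. Note that it reuses the name `Plotter`, so from here on the name refers to the managing class rather than to the interface declared above: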

In [125]:
class Plotter(NamedTuple):
    globber: Globber
    filterer: Filterer
    transformer: Transformer
    plotter: Plotter
    
    def make_plots(self):
        # get the list of datafiles
        datafiles = self.globber.glob()
        
        # create the list of output files
        outs = [self.transformer.transform(file) for file in datafiles]
        
        # filter the files we want to use
        files = [
            (data, out)
            for data, out in zip(datafiles, outs)
            if self.filterer.filter(data, out)
        ]
        
        # finally plot our data
        for data, out in files:
            self.plotter.plot(data, out)

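A note on comprehensions: the expressions in square brackets used in `make_plots` are list comprehensions, a compact way of building a list from a loop. `[f(x) for x in xs if cond(x)]` collects `f(x)` for every `x` in `xs` that satisfies `cond(x)`.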
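
The filtering comprehension in `make_plots` can be read as an ordinary loop. A small sketch with made-up file names and a stand-in `keep` function (both are assumptions for the demonstration):

```python
datafiles = ["luca000.gzip.h5", "luca001.gzip.h5", "luca002.gzip.h5"]
outs = ["luca000.png", "luca001.png", "luca002.png"]


def keep(data: str, out: str) -> bool:
    # stand-in filter: drop the second file
    return data != "luca001.gzip.h5"


# comprehension form, as used in make_plots
files = [(data, out) for data, out in zip(datafiles, outs) if keep(data, out)]

# the equivalent explicit loop
files_loop = []
for data, out in zip(datafiles, outs):
    if keep(data, out):
        files_loop.append((data, out))

print(files == files_loop)  # prints True
```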

I will show an example of how these new classes can be used:

In [114]:
class Model(NamedTuple):
    root: str
    figures: str

In [118]:
# mock variable and model
density = Variable(name="density")
model = Model(root="~", figures="/home/user/figures")

In [128]:
p = Plotter(
    # GlobData inserts the dot before the extension itself
    globber=GlobData(root=model.root, name="luca", extension="gzip.h5"),
    filterer=NeedsUpdate(overwrite=False),
    transformer=OutName(var=density, extension="png", figures=model.figures),
    # here we would pass the object that implements the Plotter interface
    plotter=None
)

# this will not work of course, it is just here for demonstration purposes
p.make_plots()
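
To see the whole pipeline run without real datafiles, every interface can be given a trivial stand-in. The sketch below is self-contained; all class names in it (`FixedPaths`, `AcceptAll`, `ToPng`, `RecordingPlotter`, `PlotManager`) are hypothetical, and `PlotManager` is a condensed copy of the managing class above under a different name.

```python
from typing import Any, List, NamedTuple, Tuple


class FixedPaths:
    """Stand-in globber: returns a predefined tuple of paths."""
    def __init__(self, *paths: str):
        self.paths = paths

    def glob(self) -> Tuple[str, ...]:
        return tuple(self.paths)


class AcceptAll:
    """Stand-in filterer: never skips a datafile."""
    def filter(self, datapath: str, out: str) -> bool:
        return True


class ToPng:
    """Stand-in transformer: swaps the ".gzip.h5" suffix for ".png"."""
    def transform(self, filepath: str) -> str:
        return filepath.replace(".gzip.h5", ".png")


class RecordingPlotter:
    """Stand-in plotter: remembers what it was asked to plot."""
    def __init__(self):
        self.calls: List[Tuple[str, str]] = []

    def plot(self, data: str, out: str):
        self.calls.append((data, out))


class PlotManager(NamedTuple):
    globber: Any
    filterer: Any
    transformer: Any
    plotter: Any

    def make_plots(self):
        datafiles = self.globber.glob()
        outs = [self.transformer.transform(f) for f in datafiles]
        for data, out in zip(datafiles, outs):
            if self.filterer.filter(data, out):
                self.plotter.plot(data, out)


recorder = RecordingPlotter()
manager = PlotManager(
    globber=FixedPaths("luca000.gzip.h5", "luca001.gzip.h5"),
    filterer=AcceptAll(),
    transformer=ToPng(),
    plotter=recorder,
)
manager.make_plots()
print(recorder.calls)
# prints [('luca000.gzip.h5', 'luca000.png'), ('luca001.gzip.h5', 'luca001.png')]
```

Stand-ins like these are also handy for testing the managing class without touching the filesystem.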

What advantages did we gain?

Now it is easier to distinguish the different parts of our plotting algorithm. We clearly separated different responsibilities into different classes and different variables. An added bonus is that these small classes, which implement the necessary interfaces, can be reused in other modules or even in different uses of `Plotter`.

Another advantage may not be obvious, but perhaps it is the most important one. We made our plotting mechanism much more flexible. Let me demonstrate this with an example.

Say we want to use only every fifth datafile for plotting. Now we do not have to copy and paste the entire `make_plots` function; we only have to create a new class that implements the `Filterer` interface and pass an object of that class to the `filterer` argument of `Plotter`. Let's see a simple solution:

In [135]:
class EveryNth(NamedTuple):
    skip: int
        
    def filter(self, datafile: str, out: str) -> bool:
        # assume extract_index is implemented somewhere and returns
        # the index of a datafile, e.g. in the case of "data029.gzip.h5"
        # it returns 29
        return not (extract_index(path.basename(datafile)) % self.skip)
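
The helper `extract_index` is only assumed to exist above. One possible sketch, assuming the index is the first run of digits in the file name (that naming convention is my assumption):

```python
import re


def extract_index(filename: str) -> int:
    """Return the first run of digits in `filename` as an integer."""
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError("no index found in %r" % filename)
    return int(match.group())


print(extract_index("data029.gzip.h5"))  # prints 29
```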

In [136]:
# we can reuse the Plotter object from before
# remember we can replace members individually using _replace
p_new = p._replace(filterer=EveryNth(skip=5))

# now we have a plotter that will plot every fifth datafile
p_new.make_plots()
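
Note that swapping the filterer also drops the update check. If both conditions are wanted, one option is a small combinator that itself implements the `Filterer` interface. The sketch below is only an illustration: `AllOf`, `StartsWith` and `EndsWith` are hypothetical names that do not appear in the original code.

```python
from typing import NamedTuple, Tuple


class AllOf(NamedTuple):
    # the filterers that all have to accept a datafile
    filterers: Tuple

    def filter(self, datafile: str, out: str) -> bool:
        return all(f.filter(datafile, out) for f in self.filterers)


# two trivial stand-in filterers for the demonstration
class StartsWith(NamedTuple):
    prefix: str

    def filter(self, datafile: str, out: str) -> bool:
        return datafile.startswith(self.prefix)


class EndsWith(NamedTuple):
    suffix: str

    def filter(self, datafile: str, out: str) -> bool:
        return datafile.endswith(self.suffix)


both = AllOf(filterers=(StartsWith("luca"), EndsWith(".h5")))
print(both.filter("luca001.gzip.h5", "luca001.png"))  # prints True
print(both.filter("data001.gzip.h5", "data001.png"))  # prints False
```

In the same spirit, something like `AllOf(filterers=(NeedsUpdate(overwrite=False), EveryNth(skip=5)))` could be passed as the `filterer`.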

1. We apply the Unix philosophy: design programs (or classes in our case) that do one thing and do it well. This helps the readers of our source code (most of the time that is us).