# Quickstart

Quickstart guide for `affe`.

# Preliminaries

In [1]:
# This is a code formatter; you can comment it out without losing functionality
%load_ext lab_black

## Imports

In [2]:
import affe
import numpy as np
import pandas as pd

In [3]:
from affe.execs import (
    CompositeExecutor,
    NativeExecutor,
    JoblibExecutor,
    GNUParallelExecutor,
)

In [4]:
from affe import Flow

In [5]:
from affe.tests import get_dummy_flow

# Basic Illustration: Flows saying _"hi"_

To illustrate, let us create three different workflows. Each of them says "hi" in its own signature way.

In [6]:
# Making a flow is very easy.
flows = [
    get_dummy_flow(message="hi" * (i + 1), content=dict(i=i * 10)) for i in range(3)
]

In [7]:
flow = flows[0]

In [8]:
flow.config

{'io': {'fs': {'root': '/Users/zissou/repos/affe',
   'cli': 'root',
   'data': 'root',
   'out': 'root',
   'scripts': 'root',
   'out.flow.config': 'out.flow',
   'out.flow.logs': 'out.flow',
   'out.flow.results': 'out.flow',
   'out.flow.models': 'out.flow',
   'out.flow.timings': 'out.flow',
   'out.flow.tmp': 'out.flow',
   'out.flow.flows': 'out.flow',
   'out.flow': 'out'}},
 'message': 'hi',
 'content': {'i': 0}}
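
The `fs` block above appears to describe a directory tree: every key except `root` maps to its parent key, and `root` maps to an absolute path. The sketch below shows one way such a mapping could be resolved into full paths. It is an interpretation of the printed config, not `affe`'s documented behaviour; the `resolve` helper is hypothetical, and so is the assumption that each directory is named after the last dot-separated part of its key.

```python
from pathlib import PurePosixPath

# Abbreviated copy of the `fs` block printed above.
fs = {
    "root": "/Users/zissou/repos/affe",
    "out": "root",
    "out.flow": "out",
    "out.flow.logs": "out.flow",
}


def resolve(key, fs):
    """Resolve a key to a path by walking up its chain of parents (hypothetical helper)."""
    if key == "root":
        return PurePosixPath(fs["root"])
    # Assumption: the directory name is the last dot-separated part of the key.
    return resolve(fs[key], fs) / key.split(".")[-1]


print(resolve("out.flow.logs", fs))
```

Under these assumptions, `out.flow.logs` resolves to a `out/flow/logs` directory below the root.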

## Flow Execution

Now you can print some hello worlds, embedded in a Flow object.

In [9]:
flow.run()
flow.run_with_log()

flows[1].run_with_log()

Hello world
2 secs passed
hi


'/Users/zissou/repos/affe/out/flow/logs/logfilehihi'

## Flow Scheduling

Scheduling is the execution of multiple flows, for instance via a tool like `joblib`.

In [10]:
e = NativeExecutor
c_jl = JoblibExecutor(flows, e, n_jobs=3)

c_jl.run()

[{'i': 0}, {'i': 10}, {'i': 20}]

# Manual Creation of Flows

The "hi"-flows defined above were nice because they illustrate, in the simplest way possible, what a flow is and how it can be used. In this section, we dive a bit deeper into how you can make a Flow yourself, from scratch.

## Your workflow

Typically, you start from a certain workflow. As illustrated above, a _workflow_ is a piece of work you care about, and you want to be able to execute it in a controlled, experiment-like fashion.

Here, we assume you are interested in the archetypal machine learning task of predicting the species of an Iris flower.

In [11]:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit classifier
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X_train, y_train)

# Predict and Evaluate
y_pred = clf.predict(X_test)

score = accuracy_score(y_test, y_pred, normalize=True)

score

1.0

## Make your _workflow_ into a _Flow_

Now that you know what you want to do, you want to obtain a flow that implements it. The advantage is that annoying things like

- logging
- timeouts
- execution
- scheduling

are all taken care of, as soon as you succeed. This means removing boilerplate, and using battle-tested code instead.

### Basic Example (passing a function as argument)

In its most basic form, this is really simple, as we can just pass in an arbitrary Python function _directly_. Consider this the _lazy_ way of doing things, which is supported.

The only assumption is that your `flow` function has one input, typically named `config`. For the time being, this is a fairly constant assumption across `affe`.
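
If your existing code takes several arguments, a thin adapter is enough to meet this single-input convention. The sketch below is a minimal illustration in plain Python; `train` and `train_flow` are hypothetical names, and nothing here depends on `affe` itself.

```python
def train(max_depth, sleep_seconds=0):
    """Stand-in for existing code that takes several arguments (hypothetical)."""
    return f"max_depth={max_depth}, sleep_seconds={sleep_seconds}"


def train_flow(config):
    """Adapter with a single `config` input, as described above."""
    return train(
        max_depth=config.get("max_depth"),
        sleep_seconds=config.get("sleep_seconds", 0),
    )


print(train_flow({"max_depth": 3}))
```

Every parameter the work needs then travels in one dictionary, which is also what makes a run easy to record and repeat.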

In [12]:
def hello_world(config):
    print("Hello World")
    return


f = Flow(flow=hello_world)

That is nice, but this quick-and-dirty approach fails when you try to run the flow through a more advanced executor, such as one with logging.

In [13]:
f.run_with_log()

'logfile'

If you check the logfile, you can get some information as to why this is happening. Essentially, a common problem with abstracted execution is that the code you wish to run needs to be persisted somewhere the executor can find it. This motivates why, at times, you will want to build your own custom `Flow` subclass, which is not plagued by such limitations.
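
The general problem can be illustrated without `affe`. Python's standard `pickle` module serializes functions by reference (module plus name), so a function that cannot be looked up by name cannot be shipped to another process that way. This sketch does not reproduce `affe`'s internals, which may work differently; it only shows why code that exists in one place only is hard to execute elsewhere. The helper names are hypothetical.

```python
import pickle


def make_local_function():
    # A function defined inside another function has no importable name,
    # much like ad-hoc code that only lives in an interactive session.
    def hello_world(config):
        print("Hello World")

    return hello_world


def can_be_pickled(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


print(can_be_pickled(len))                    # built-in, found by name
print(can_be_pickled(make_local_function()))  # local function, not found by name
```

Code that lives in an importable module does not have this problem, which is one reason to move a flow into a proper class in a codebase.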

# Your Flow as a Flow-Subclass

This could be considered the right way to do things in `affe`:

- Subclass the Flow class
- Add anything you like

## Implementation in Notebook

In [14]:
from affe import Flow
from time import sleep

In [15]:
class IrisFlow(Flow):
    def __init__(self, max_depth=None, sleep_seconds=0, **kwargs):
        """
        All the information you want to pass inside the flow function,
        you can embed in the config dictionary.
        """
        self.config = dict(max_depth=max_depth, sleep_seconds=sleep_seconds)
        super().__init__(config=self.config, **kwargs)
        return

    @staticmethod
    def imports():
        """For remote executions, you better specify your imports explicitly.

        Depending on the use-case, this is not necessary, but it will never hurt.
        """
        from sklearn import datasets
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score
        from time import sleep

        return

    def flow(self, config):
        """
        This function is basically a verbatim copy of your workflow above.

        Prerequisites:
            - This function has to be called flow
            - It expects one input: config

        The only design pattern to take into account is that you can assume one
        input only, which then by definition constitutes your "configuration" for your workflow.
        Whatever parameters you need, you can extract from this. This pattern is somewhat restrictive,
        but if you are implementing experiments, you probably should be this strict anyway; you're welcome.

        The other thing is the name of this function: it has to be "flow", in order for some of the
        executors to properly find it. Obviously, if your only use case is to run the flow function
        yourself, this does not matter at all. But in most cases it does, and again: adhering to this pattern
        will never hurt you, deviation could.
        """
        # Obtain configuration
        max_depth = config.get("max_depth", None)
        sleep_seconds = config.get("sleep_seconds", 0)

        print("I am about to execute the IRIS FLOW")
        print("BUT FIRST: I shall sleep {} seconds".format(sleep_seconds))
        sleep(sleep_seconds)
        print("I WOKE UP, gonna do my stuff now.")

        # Load data
        X, y = datasets.load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42
        )

        # Fit classifier
        clf = DecisionTreeClassifier(max_depth=max_depth)
        clf.fit(X_train, y_train)

        # Predict and Evaluate
        y_pred = clf.predict(X_test)

        score = accuracy_score(y_test, y_pred, normalize=True)

        msg = """
        I am DONE executing the IRIS FLOW
        """
        print(msg)
        return score

### Tryout

Now, we can verify how this thing works.

In [16]:
iris_flow_02 = IrisFlow(max_depth=1)
iris_flow_02.run()

I am about to execute the IRIS FLOW
BUT FIRST: I shall sleep 0 seconds
I WOKE UP, gonna do my stuff now.

        I am DONE executing the IRIS FLOW
        


0.7111111111111111

In [17]:
iris_flow_10 = IrisFlow(max_depth=10)
iris_flow_10.run()

I am about to execute the IRIS FLOW
BUT FIRST: I shall sleep 0 seconds
I WOKE UP, gonna do my stuff now.

        I am DONE executing the IRIS FLOW
        


1.0

## Implementation in Codebase

Alright, that looked pretty nice already. Now the question is: _what is in it for me?_
    
Well you get:
- logging
- timeouts
- boilerplate filesystem management
- fancy executors
- and so much more!

    
So let's dive into that.

However, the `IrisFlow` object does not exist outside of our Jupyter notebook, and that is unfortunately not OK for `affe` when running something in a subprocess/another shell, which is what you need to get these fancy functionalities.

So let us continue with a demonstration flow that learns a decision tree on the iris dataset (yes, exactly what we were already doing with our `IrisFlow`). You can check the source code to verify that it does exactly the same thing as the `IrisFlow` above, with the added benefit that `IrisDemo` actually exists on your Python path.

### Import

Let us import the `IrisDemo` object and demonstrate that it behaves exactly the same.

In [18]:
from affe.demo import IrisDemo

demoflow = IrisDemo(max_depth=3, log_filepath="logs/irisdemo")
demoflow.run()

I am about to execute the IRIS FLOW
BUT FIRST: I shall sleep 0 seconds
I WOKE UP, gonna do my stuff now.

        I am DONE executing the IRIS FLOW
        


1.0

### Logging

Depending on how you run the flow, a different executor is called in the backend. Some of those executors give you logging out of the box, if you do it right.

In our case, we need this one:
- `DTAIExperimenterProcessExecutor` which is used in the `run_with_log_via_shell` function

Additionally, by specifying the `log_filepath` parameter, we can give the logfile a custom name, which makes the demonstration easier to follow.

In [19]:
demoflow = IrisDemo(max_depth=3, log_filepath="logs/irisdemo")
demoflow.run_with_log_via_shell()

'logs/irisdemo'

### Timeouts

To see how the timeouts work, we can use the "sleep" functionality to force our iris flow to take a bit longer.

If we force the workflow (by sleeping) to take longer than the timeout, execution will abort.

In [20]:
# this will just work, because the run() method has no notion of timeout

iris_flow = IrisDemo(
    max_depth=10, sleep_seconds=5, log_filepath="logs/via-subprocess", timeout_s=3
)
iris_flow.run()

I am about to execute the IRIS FLOW
BUT FIRST: I shall sleep 5 seconds
I WOKE UP, gonna do my stuff now.

        I am DONE executing the IRIS FLOW
        


1.0

So in this case, nothing really happens. Things change, however, when executing through shell.

In [21]:
# timeout is higher than the actual execution time

iris_flow = IrisDemo(
    max_depth=10, sleep_seconds=2, log_filepath="logs/timeout-sufficient", timeout_s=10
)
iris_flow.run_with_log_via_shell()

'logs/timeout-sufficient'

In [22]:
# timeout lower than execution time

iris_flow = IrisDemo(
    max_depth=10,
    sleep_seconds=15,
    log_filepath="logs/timeout-insufficient",
    timeout_s=10,
)
iris_flow.run_with_log_via_shell()

'logs/timeout-insufficient'

You can check those logfiles yourself and see what happened. The second logfile will tell you that the flow was aborted because it hit its time limit, as it should.
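
The difference between the two situations can be reproduced with the standard library alone. The sketch below is not `affe`'s implementation; it only illustrates the general mechanism of enforcing a time limit from outside, by running the work in a child process that the parent can stop. The `run_with_timeout` helper is a hypothetical name.

```python
import subprocess
import sys


def run_with_timeout(sleep_seconds, timeout_s):
    """Run a sleeping child process; report whether it finished in time."""
    cmd = [sys.executable, "-c", f"import time; time.sleep({sleep_seconds})"]
    try:
        # subprocess.run stops the child and raises if the limit is exceeded.
        subprocess.run(cmd, timeout=timeout_s, check=True)
        return "finished"
    except subprocess.TimeoutExpired:
        return "aborted: time limit reached"


print(run_with_timeout(sleep_seconds=0, timeout_s=10))  # limit is sufficient
print(run_with_timeout(sleep_seconds=5, timeout_s=1))   # limit is too short
```

A plain in-process call has no such outer supervisor, which is consistent with `run()` ignoring `timeout_s` in the earlier cell.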

### Filesystem Management

Filesystem management is not included _by default_ in a Flow object, in order to keep things clean. However, there is another object, called `FlowOne`. It is still very much a bare-bones object: a subclass of Flow, with some minimal bookkeeping for a common experimental filesystem configuration baked in.

In that way, it becomes a very nice starting point for future extensions.




In [23]:
from affe.flow import FlowOne

In [26]:
def hello_world(config):
    print("Hello World")
    return


f = FlowOne(flow=hello_world)

In [27]:
f.root_dp

PosixPath('/Users/zissou/repos/affe')

In [25]:
f.run_with_log()

AssertionError: 

### "Deep-dive" into the Executors

One of `affe`'s key contributions is its strict separation between three related things: _definition_ of a workflow, _actual execution of a workflow_ and lastly _scheduling multiple workflows_. 

This may seem trivial at first, but it is what makes `affe` work in practice.
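
The three layers can be sketched in plain Python. This is a conceptual illustration only: `workflow`, `execute` and `schedule` are hypothetical names, and the standard library's thread pool stands in for tools like `joblib`; none of it is `affe`'s actual code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# 1. Definition: what the work is, as a function of its configuration.
def workflow(config):
    return config["i"] * 10


# 2. Execution: how a single workflow is run (here: a plain call, timed).
def execute(flow_fn, config):
    start = time.perf_counter()
    result = flow_fn(config)
    elapsed = time.perf_counter() - start
    print(f"ran config {config} in {elapsed:.4f}s")
    return result


# 3. Scheduling: how many executions are distributed over workers.
def schedule(flow_fn, configs, n_jobs=3):
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map returns results in the order of the inputs.
        return list(pool.map(lambda c: execute(flow_fn, c), configs))


results = schedule(workflow, [{"i": i} for i in range(3)])
print(results)
```

Because each layer only talks to the one below it, the way a workflow is executed or scheduled can be swapped without touching its definition.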