# Quickstart

A quickstart guide to `affe`.

# Preliminaries

In [1]:
# This is a code formatter; you can comment it out without losing functionality
%load_ext lab_black

## Imports

In [2]:
import affe
import numpy as np
import pandas as pd

In [3]:
from affe.execs import (
    CompositeExecutor,
    NativeExecutor,
    JoblibExecutor,
    GNUParallelExecutor,
)

In [4]:
from affe import Flow

In [5]:
from affe.tests import get_dummy_flow

# Basic Illustration: Flows saying _"hi"_

To illustrate, let us create three different workflows. Each of them says "hi" in its own signature way.

In [6]:
# Making a flow is very easy.
flows = [
    get_dummy_flow(message="hi" * (i + 1), content=dict(i=i * 10)) for i in range(3)
]

In [7]:
flow = flows[0]

In [8]:
flow.config

{'io': {'fs': {'root': '/Users/zissou/repos/affe',
   'cli': 'root',
   'data': 'root',
   'out': 'root',
   'scripts': 'root',
   'out.flow.config': 'out.flow',
   'out.flow.logs': 'out.flow',
   'out.flow.results': 'out.flow',
   'out.flow.models': 'out.flow',
   'out.flow.timings': 'out.flow',
   'out.flow.tmp': 'out.flow',
   'out.flow.flows': 'out.flow',
   'out.flow': 'out'}},
 'message': 'hi',
 'content': {'i': 0}}
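The `fs` entry reads like a flat child-to-parent mapping: every key except `root` names its parent directory. The sketch below shows one way such a mapping could be resolved into full paths. `resolve_path` is a hypothetical helper written for this guide, not part of `affe`, and the resolution rule is an assumption inferred from the log paths shown further down.

```python
from pathlib import PurePosixPath

# Shortened copy of the `fs` mapping printed above.
fs = {
    "root": "/Users/zissou/repos/affe",
    "out": "root",
    "out.flow": "out",
    "out.flow.logs": "out.flow",
}


def resolve_path(fs, key):
    """Resolve `key` by walking up its parents until `root` is reached (assumed rule)."""
    if key == "root":
        return PurePosixPath(fs["root"])
    parent = fs[key]
    leaf = key.split(".")[-1]  # "out.flow.logs" -> "logs"
    return resolve_path(fs, parent) / leaf


logs_dir = resolve_path(fs, "out.flow.logs")
print(logs_dir)
```

Under this assumed rule, `out.flow.logs` resolves to the `out/flow/logs` directory below the root.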

## Flow Execution

Now you can print some hello worlds, embedded in a Flow object.

In [9]:
flow.run()

Hello world
2 secs passed
hi


{'i': 0}

In [10]:
flow.run_with_log()

'/Users/zissou/repos/affe/out/flow/logs/logfilehi'

In [11]:
flows[1].run_with_log()

'/Users/zissou/repos/affe/out/flow/logs/logfilehihi'

## Flow Scheduling

Scheduling means executing multiple flows, for instance via a tool like `joblib`.

In [12]:
e = NativeExecutor
c_jl = JoblibExecutor(flows, e, n_jobs=3)

In [13]:
c_jl.run()

[{'i': 0}, {'i': 10}, {'i': 20}]
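To make the idea of scheduling concrete without relying on `affe` internals, here is a minimal sketch using only the standard library. `ToyFlow` and `run_all` are hypothetical stand-ins, and a thread pool replaces `joblib`; this is not how `JoblibExecutor` is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyFlow:
    """Hypothetical stand-in for a flow: holds a config and returns its content."""

    def __init__(self, message, content):
        self.config = dict(message=message, content=content)

    def run(self):
        return self.config["content"]


def run_all(flows, n_jobs=3):
    """Run every flow on a pool of workers; results keep the order of `flows`."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(lambda flow: flow.run(), flows))


toy_flows = [ToyFlow("hi" * (i + 1), dict(i=i * 10)) for i in range(3)]
results = run_all(toy_flows)
print(results)
```

The scheduler only decides how many flows run at once and in what order results are collected; it knows nothing about what an individual flow computes.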

# Manual Creation of Flows

The "hi"-flows defined above were nice because they illustrate in the simplest way possible what a flow is and how it can be used. In this section, we dive a bit deeper into how you can make a Flow yourself, from scratch.

## Your workflow

Typically, you start from a certain workflow. As illustrated above, a _workflow_ is a piece of work you care about, and you want to be able to execute it in a controlled, experiment-like fashion.

Here, we assume you are interested in the archetypal machine learning task of predicting the species of an Iris flower.

In [14]:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit classifier
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X_train, y_train)

# Predict and Evaluate
y_pred = clf.predict(X_test)

score = accuracy_score(y_test, y_pred, normalize=True)

score

1.0

## Make your _workflow_ into a _Flow_

Now that you know what you want to do, you want to obtain a flow that implements this. The advantage is that annoying things like

- logging
- timeouts
- execution
- scheduling

are all taken care of, as soon as you succeed. This means removing boilerplate, and using battle-tested code instead.

### Basic Example (passing a function as argument)

In its most basic form, this is really simple, as we can just pass in an arbitrary Python function _directly_. Consider this the _lazy_ way of doing things, which is supported.

The only assumption is that your `flow` function has one input, typically named `config`. For the time being, this is a fairly constant assumption across `affe`.
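If your existing function takes several arguments, a thin wrapper can adapt it to the single-`config` convention. The sketch below is plain Python and makes no `affe` calls; `area` and `area_flow` are hypothetical names used purely for illustration.

```python
def area(width, height=1.0):
    """An ordinary function with several parameters."""
    return width * height


def area_flow(config):
    """Adapter: pull every parameter out of the single `config` dictionary."""
    return area(width=config["width"], height=config.get("height", 1.0))


result = area_flow(dict(width=3.0, height=2.0))
print(result)
```

A function with this shape takes exactly one input, which is the convention described above.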

In [15]:
def hello_world(config):
    print("Hello World")
    return


f = Flow(flow=hello_world)

f.run()

Hello World


So that's nice and all, but this is quick and dirty, and it fails when you try to run it through a more advanced executor, such as one with logging.

In [16]:
f.run_with_log()

'logfile'

If you check the logfile, you can get some information as to why this is happening. Essentially, a common problem with abstracted execution is that the code you wish to run needs to be persisted somewhere the executor can find it. This motivates why, at times, you would want to build your own custom `Flow` subclass, which is not plagued by such limitations.

## IrisFlow as a Flow-Subclass

This is what we could consider the right way to do things in `affe`:

- Subclass the Flow class
- Add anything you like

### Implementation

In [17]:
from affe import Flow

In [18]:
class IrisFlow(Flow):
    def __init__(self, max_depth=None, **kwargs):
        """
        All the information you want to pass inside the flow function,
        you can embed in the config dictionary.
        """
        self.config = dict(max_depth=max_depth)
        super().__init__(config=self.config, **kwargs)
        return

    @staticmethod
    def imports():
        """For remote executions, you better specify your imports explicitly.

        Depending on the use-case, this is not necessary, but it will never hurt.
        """
        from sklearn import datasets
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score

        return

    def flow(self, config):
        """
        This function is basically a verbatim copy of your workflow above.

        Prerequisites:
            - This function has to be called flow
            - It expects one input: config

        The only design pattern to take into account is that you can assume one
        input only, which then by definition constitutes your "configuration" for your workflow.
        Whatever parameters you need, you can extract from this. This pattern is somewhat restrictive,
        but if you are implementing experiments, you probably should be this strict anyway; you're welcome.

        The other thing is the name of this function: it has to be "flow", in order for some of the
        executors to properly find it. Obviously, if your only use case is to run the flow function
        yourself, this does not matter at all. But in most cases it does, and again: adhering to this pattern
        will never hurt you, while deviating from it could.
        """
        # Obtain configuration
        max_depth = config.get("max_depth", None)

        print("I am about to execute the IRIS FLOW")

        # Load data
        X, y = datasets.load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42
        )

        # Fit classifier
        clf = DecisionTreeClassifier(max_depth=max_depth)
        clf.fit(X_train, y_train)

        # Predict and Evaluate
        y_pred = clf.predict(X_test)

        score = accuracy_score(y_test, y_pred, normalize=True)

        print("I am about DONE executing the IRIS FLOW")
        return score

### Tryout

Now, we can verify how this thing works.

In [19]:
iris_flow_01 = IrisFlow(max_depth=1)
iris_flow_01.run()

I am about to execute the IRIS FLOW
I am about DONE executing the IRIS FLOW


0.7111111111111111

In [20]:
iris_flow_10 = IrisFlow(max_depth=10)
iris_flow_10.run()

I am about to execute the IRIS FLOW
I am about DONE executing the IRIS FLOW


1.0

Alright, that looks pretty nice already. Now the question is: what does this get me for free?

Well, you get:
- logging
- timeouts
- fancy executors
- boilerplate filesystem management

So let's dive into that.

### Logging

Depending on how you run the flow, a different executor is called in the backend. Some of those executors give you logging out of the box, provided you use them correctly.

In our case, we need this one:
- `DTAIExperimenterProcessExecutor`, which is used in the `run_with_log_via_shell` function

Additionally, if we specify the `log_filepath` parameter, we can give the logfiles custom names, which makes the demonstration easier to follow.
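Conceptually, logging amounts to capturing whatever the flow prints and writing it to a file. The sketch below illustrates that idea with the standard library only. `run_with_log_sketch` is a hypothetical helper, not `affe`'s executor, and it runs in-process for simplicity.

```python
import contextlib
import tempfile
from pathlib import Path


def hello_world(config):
    print("Hello World")
    return config.get("i")


def run_with_log_sketch(flow_function, config, log_filepath):
    """Run `flow_function(config)` and redirect everything it prints to `log_filepath`."""
    with open(log_filepath, "w") as logfile, contextlib.redirect_stdout(logfile):
        result = flow_function(config)
    return result


with tempfile.TemporaryDirectory() as tmp_dir:
    log_filepath = Path(tmp_dir) / "logfile"
    result = run_with_log_sketch(hello_world, dict(i=0), log_filepath)
    log_text = log_filepath.read_text()

print(repr(log_text))
```

The printed message ends up in the logfile instead of the notebook output, while the return value of the flow function is still handed back to the caller.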

In [28]:
# let us try:
iris_flow = IrisFlow(max_depth=10, log_filepath="via-subprocess")
iris_flow.run_with_log_via_shell()

'via-subprocess'

Unfortunately, when executed directly from a Jupyter notebook, this still fails, because the subprocess does not recognize the `IrisFlow` object: it has no idea how to unpickle it.

However, if you define your flow in a place that is accessible to, for instance, the conda environment in which you run your experiments, you will be fine.
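The underlying reason is that `pickle` stores functions and classes by reference (module and name) rather than by value, so the process that loads the pickle must be able to import the same definition. Here is a small standard-library sketch of that behaviour, unrelated to `affe`'s own code; `make_flow_function` is a hypothetical helper.

```python
import json
import pickle

# An importable function is pickled as a reference: only its module and name are stored.
payload = pickle.dumps(json.dumps)
stored_by_reference = b"json" in payload and b"dumps" in payload


def make_flow_function():
    """Return a function that exists only inside this call, so it cannot be imported."""

    def flow(config):
        return config

    return flow


# A function that cannot be looked up by name cannot be pickled at all.
try:
    pickle.dumps(make_flow_function())
    local_function_picklable = True
except (pickle.PicklingError, AttributeError):
    local_function_picklable = False

print(stored_by_reference, local_function_picklable)
```

A class defined in a notebook cell lives in the notebook's `__main__` module, which a freshly started subprocess cannot import; moving the class into an installed package removes that obstacle.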

In [29]:
from affe.demo import DTDemo

In [30]:
f = DTDemo()

f.run()

{'fit_time_s': 0.0013339519500732422,
 'predict_time_s': 0.00022482872009277344,
 'score': 1.0}

In [33]:
f.run_via_shell_with_log_autonomous()

'/Users/zissou/repos/affe/note/out/GenericBenchmark/logs/logs-0000'

In [24]:
# this one works
iris_flow = IrisFlow(max_depth=10, log_filepath="via-shell")
iris_flow.run_with_log_via_shell_autonomous()

'via-shell'

In [27]:
iris_flow.get_shell_with_log_command()

"/opt/miniconda3/envs/affe/bin/python /Users/zissou/repos/affe/src/affe/cli/flow_cli_with_monitors.py -f 'flowfile.pkl' -l 'via-shell' -t 60"

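The command above ends with `-t 60`, which presumably is a timeout in seconds. As a rough illustration of how a shell-based executor can launch a Python process and enforce a timeout, here is a standard-library sketch. It does not call `affe`'s CLI; the inline `-c` script is a stand-in for the real flow script.

```python
import subprocess
import sys

# Stand-in for the flow script: a one-liner executed by the current interpreter.
command = [sys.executable, "-c", "print('I am about to execute the IRIS FLOW')"]

try:
    # `timeout` is in seconds; the child process is killed if it runs longer.
    completed = subprocess.run(command, capture_output=True, text=True, timeout=60)
    output = completed.stdout.strip()
    timed_out = False
except subprocess.TimeoutExpired:
    output = ""
    timed_out = True

print(output)
```

Because the flow runs in its own process, a runaway experiment can be stopped without taking down the notebook or the scheduler that launched it.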
### Deeper dive into the Executors

One of `affe`'s key contributions is its strict separation between three related things: the _definition_ of a workflow, the _actual execution_ of a workflow, and lastly the _scheduling_ of multiple workflows.

This may seem trivial at first, but it is what makes `affe` work in practice.
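A minimal sketch of what this separation can look like, using hypothetical toy classes (`ToyFlow`, `PlainExecutor`, `TimedExecutor`, `SequentialScheduler`) rather than `affe`'s real ones. The flow definition never changes; only the executor class handed to the scheduler does, loosely mirroring the `JoblibExecutor(flows, e, n_jobs=3)` call earlier in this guide.

```python
import time


class ToyFlow:
    """Definition: what should be computed."""

    def __init__(self, i):
        self.config = dict(i=i)

    def flow(self, config):
        return config["i"] * 10


class PlainExecutor:
    """Execution: how a single flow is run."""

    def __init__(self, flow):
        self.flow = flow

    def run(self):
        return self.flow.flow(self.flow.config)


class TimedExecutor(PlainExecutor):
    """Alternative execution strategy that also reports the elapsed time."""

    def run(self):
        start = time.perf_counter()
        result = super().run()
        return dict(result=result, seconds=time.perf_counter() - start)


class SequentialScheduler:
    """Scheduling: runs many flows, each through the given executor class."""

    def __init__(self, flows, executor_class):
        self.flows = flows
        self.executor_class = executor_class

    def run(self):
        return [self.executor_class(flow).run() for flow in self.flows]


toy_flows = [ToyFlow(i) for i in range(3)]
plain_results = SequentialScheduler(toy_flows, PlainExecutor).run()
timed_results = SequentialScheduler(toy_flows, TimedExecutor).run()
print(plain_results)
```

Swapping `PlainExecutor` for `TimedExecutor` changes how each flow is run without touching `ToyFlow` or the scheduler, which is the kind of independence the three-way split is meant to provide.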