# Introduction to Research Module

This tutorial introduces the `Research` functionality of `BatchFlow`.

The `Research` class allows you to easily evaluate multiple parallel experiments with different parameter combinations.
It includes instruments to
* describe complex domains that generate parameter configurations,
* create a flexible experiment plan as a sequence of callables, generators and pipelines,
* parallelize experiments across processes and GPUs,
* share heavy processing between experiments,
* save and load experiment results in a unified form.

## Imports and Utilities

We start with the imports used throughout the tutorial.


In [1]:
import sys
sys.path.append('../../..')

from batchflow.research import Research, Option, Domain, EC, O, EP

## The simplest example

In the simplest case, `Research` can run a single callable and save its result (output). Of course, such an example is meaningless, but we will complicate it gradually to demonstrate how to work with `Research`.

An experiment in a research is a sequence of method calls that can be chained.

In [2]:
def power(a=2, b=3):
    return a ** b

research = Research().add_callable(power, save_to='power')
research.run(dump_results=False)

100%|██████████| 1/1 [00:00<00:00,  5.35it/s]


<batchflow.research.research.Research at 0x7f099d97c080>

By default, `Research` creates a folder and stores its results there, but we specify `dump_results=False` to keep the results in RAM.

The results can be inspected in a special table even while the research is running. They are stored in `research.results`, which can be converted to a `pandas.DataFrame` via the `research.results.df` property:

In [3]:
research.results.df

Unnamed: 0,id,iteration,name,value
0,b8c9f03130430383,0,power,8


In that case, the experiment with the unique identifier `b8c9f03130430383` has one iteration, in which the function `power` is executed and its result is saved under the name `power` (why we need iterations is described below). The same callable with the same parameters can be added in several ways:

In [4]:
research = Research().add_callable(power, save_to='power', a=3, b=2)
research = Research().add_callable(power, save_to='power', args=[3, 2])
research = Research().power(3, 2, save_to='power')
research = Research().power(a=3, b=2, save_to='power')

research.run(dump_results=False)

research.results.df

100%|██████████| 1/1 [00:00<00:00,  4.02it/s]


Unnamed: 0,id,iteration,name,value
0,b8c9f03130430383,0,power,9


## Constructing domains

Let's demonstrate how to create more complex domains.

In [5]:
domain = Domain(a=[2, 3], b=[3, 2])

In that case we get all possible combinations of `a` and `b` (four configs in total).
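To make the combinatorics concrete, here is a plain-Python sketch of such a full product. It does not use BatchFlow: `cartesian_configs` is a hypothetical helper, and nothing here is meant to describe how `Domain` enumerates configs internally.

```python
from itertools import product

def cartesian_configs(**options):
    """Return every combination of the given option values as a list of dicts."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*options.values())]

configs = cartesian_configs(a=[2, 3], b=[3, 2])
for config in configs:
    print(config)
```

Two options with two values each give 2 × 2 = 4 configs; adding a third option with three values would give 12.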

In [6]:
research = Research(domain=domain).power(a=EC('a'), b=EC('b'), save_to='power')
research.run(dump_results=False)

100%|██████████| 4/4 [00:00<00:00,  7.03it/s]


<batchflow.research.research.Research at 0x7f0d08905208>

`EC` (short for "experiment config") is a named expression that refers to items of the config assigned to an experiment. In general, a named expression is a way to refer to objects that don't exist at the moment of definition. Thus, `EC('key')` stands for an experiment config item, and `EC()` without arguments stands for the entire experiment config.

The most common named expression is `E`, which gives access to the `Experiment` instance and thereby to all the attributes of the current experiment. For example, `EC()` is an alias for `E().config`.
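The idea of a named expression, a placeholder that is resolved only when the experiment actually runs, can be sketched in a few lines of plain Python. The names `ConfigItem` and `call_with_config` are hypothetical and exist only for this illustration; they are not part of BatchFlow.

```python
class ConfigItem:
    """Hypothetical placeholder that remembers a key and is resolved later."""

    def __init__(self, key=None):
        self.key = key

    def resolve(self, config):
        # Without a key the whole config is returned, similar in spirit to EC().
        return config if self.key is None else config[self.key]

def call_with_config(func, config, **kwargs):
    """Replace placeholders with values from `config`, then call `func`."""
    resolved = {
        name: value.resolve(config) if isinstance(value, ConfigItem) else value
        for name, value in kwargs.items()
    }
    return func(**resolved)

def power(a=2, b=3):
    return a ** b

# The placeholders are created before any config exists...
arguments = {'a': ConfigItem('a'), 'b': ConfigItem('b')}
# ...and are resolved only when a concrete config is supplied.
result = call_with_config(power, {'a': 3, 'b': 2}, **arguments)
print(result)  # 9
```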

In results we can find two new columns: `a` and `b` for values of parameters in config.

In [7]:
research.results.df

Unnamed: 0,id,a,b,iteration,name,value
0,984d62c930430383,2,3,0,power,8
1,bef0db4949534914,2,2,0,power,4
2,72b9965c05613302,3,3,0,power,27
3,5420406315176767,3,2,0,power,9


The same domain can be created in several other ways:

In [8]:
domain = Domain({'a': [2, 3], 'b': [3, 2]})
domain = Domain(a=[2, 3]) * Domain(b=[3, 2])
domain = Option('a', [2, 3]) * Option('b', [3, 2])

To concatenate domains, use `+`:

In [9]:
domain = Domain(a=[2, 3]) + Domain(b=[3, 2])

research = Research(domain=domain).power(kwargs=EC(), save_to='power')
research.run(dump_results=False)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  5.91it/s]


Unnamed: 0,id,a,b,iteration,name,value
0,71d4879f30430383,2.0,,0,power,8
1,e32d7dce49534914,3.0,,0,power,27
2,6bffa3ee05613302,,3.0,0,power,8
3,ff973b7615176767,,2.0,0,power,4


Here each config specifies only one parameter; for the other one, the default value from the function signature is used (`a=2`, `b=3`). That is why the unspecified parameter is empty in the table.
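The arithmetic behind that table can be reproduced without BatchFlow. In this sketch the list of dicts is a hand-written stand-in for the concatenated domain, not something produced by `Domain`:

```python
def power(a=2, b=3):
    return a ** b

# Concatenation: the config lists are joined, so each config sets one parameter.
configs = [{'a': 2}, {'a': 3}] + [{'b': 3}, {'b': 2}]

# Unpacking a partial config leaves the missing argument at its default.
values = [power(**config) for config in configs]
print(values)  # [8, 27, 8, 4]
```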

Besides, we can get the "scalar product" of single-parameter domains of the same length, which pairs their values element by element:

In [10]:
domain = Domain(a=[4, 5]) @ Domain(b=[1, 2])

research = Research(domain=domain).power(kwargs=EC(), save_to='power')
research.run(dump_results=False)

research.results.df

100%|██████████| 2/2 [00:00<00:00,  3.65it/s]


Unnamed: 0,id,a,b,iteration,name,value
0,d31cf35130430383,4,1,0,power,4
1,eab3a82849534914,5,2,0,power,25
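The element-wise pairing seen in this table behaves like Python's `zip`. A minimal sketch, assuming a hypothetical helper `paired_configs` (not a BatchFlow function) and assuming that mismatched lengths should be rejected:

```python
def paired_configs(**options):
    """Pair option values position by position; all lists must have equal length."""
    lengths = {len(values) for values in options.values()}
    if len(lengths) > 1:
        raise ValueError("all options must have the same number of values")
    return [dict(zip(options, values)) for values in zip(*options.values())]

configs = paired_configs(a=[4, 5], b=[1, 2])
print(configs)  # [{'a': 4, 'b': 1}, {'a': 5, 'b': 2}]
```

Compared with the full product, two options of length two yield two configs here instead of four.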


## Experiment description

Let's look at another toy experiment with two callables.

In [11]:
domain = Domain({'a': [2, 3], 'b': [3, 2]})

def inc(x):
    return x + 1

research = (Research(domain=domain)
            .power(a=EC('a'), b=EC('b'), save_to='power')
            .inc(O('power'), save_to='inc')
           )
research.run(dump_results=False)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  4.46it/s]


Unnamed: 0,id,a,b,iteration,name,value
0,984d62c930430383,2,3,0,power,8
1,984d62c930430383,2,3,0,inc,9
2,bef0db4949534914,2,2,0,power,4
3,bef0db4949534914,2,2,0,inc,5
4,72b9965c05613302,3,3,0,power,27
5,72b9965c05613302,3,3,0,inc,28
6,5420406315176767,3,2,0,power,9
7,5420406315176767,3,2,0,inc,10


We use the `O` named expression to pass the output of the `power` function to the `inc` function.

There are several rows for one iteration of each experiment because we now have two callables and therefore two names in the `name` column. We can aggregate the results:

In [12]:
research.results.to_df(pivot=True)

Unnamed: 0,id,a,b,iteration,power,inc
0,984d62c930430383,2,3,0,8,9
1,bef0db4949534914,2,2,0,4,5
2,72b9965c05613302,3,3,0,27,28
3,5420406315176767,3,2,0,9,10


Now we have a separate column for each variable in the results instead of `name` and `value`.
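The reshaping from the long layout (`name`, `value`) to the wide layout can be sketched in plain Python. The rows and the ids `exp0` and `exp1` below are made up for illustration, and `pivot` is a hypothetical helper rather than the code behind `to_df(pivot=True)`:

```python
long_rows = [
    {'id': 'exp0', 'iteration': 0, 'name': 'power', 'value': 8},
    {'id': 'exp0', 'iteration': 0, 'name': 'inc', 'value': 9},
    {'id': 'exp1', 'iteration': 0, 'name': 'power', 'value': 4},
    {'id': 'exp1', 'iteration': 0, 'name': 'inc', 'value': 5},
]

def pivot(rows):
    """Merge rows that share (id, iteration) into one record per pair."""
    wide = {}
    for row in rows:
        key = (row['id'], row['iteration'])
        record = wide.setdefault(key, {'id': row['id'], 'iteration': row['iteration']})
        record[row['name']] = row['value']  # each name becomes its own column
    return list(wide.values())

wide_rows = pivot(long_rows)
for record in wide_rows:
    print(record)
```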

## Generators in research

In addition to callables, we can add generators into `Research`. Now it will become clear why one experiment can have several iterations.

In [13]:
def inc(x):
    for i in range(2):
        yield x + i

domain = Domain(a=[2, 3], b=[3, 2])
        
research = (Research(domain=domain)
            .add_callable(power, a=EC('a'), b=EC('b'))
            .add_generator(inc, save_to='inc', x=O('power'))
           )
research.run(dump_results=False)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  4.11it/s]


Unnamed: 0,id,a,b,iteration,name,value
0,984d62c930430383,2,3,0,inc,8
1,984d62c930430383,2,3,1,inc,9
2,bef0db4949534914,2,2,0,inc,4
3,bef0db4949534914,2,2,1,inc,5
4,72b9965c05613302,3,3,0,inc,27
5,72b9965c05613302,3,3,1,inc,28
6,5420406315176767,3,2,0,inc,9
7,5420406315176767,3,2,1,inc,10


Here the increment differs between iterations of an experiment. The number of iterations for each experiment is specified in the `run` call. By default, it is `1` if the research contains only callables and `None` if it has generators. `None` is interpreted as infinity: the experiment continues until a generator in the research is exhausted.
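The stopping rule can be illustrated with an ordinary loop. This is a simplified sketch of the behaviour described above, not BatchFlow's scheduler; `run_experiment` is a hypothetical function:

```python
from itertools import count

def inc(x):
    for i in range(2):
        yield x + i

def run_experiment(generator, n_iters=None):
    """Collect one value per iteration; n_iters=None means 'until exhausted'."""
    results = []
    iterations = count() if n_iters is None else range(n_iters)
    for iteration in iterations:
        try:
            value = next(generator)
        except StopIteration:
            break  # the generator is exhausted, so the experiment ends
        results.append((iteration, value))
    return results

print(run_experiment(inc(8)))             # [(0, 8), (1, 9)]
print(run_experiment(inc(8), n_iters=1))  # [(0, 8)]
```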

We can define the same experiment plan in `Research` in the following way:

In [14]:
research = (Research(domain=domain)
            .power(a=EC('a'), b=EC('b'))
            .inc(x=O('power'), save_to='inc', mode='generator')
           )

research.run(dump_results=False)

research.results.df

100%|██████████| 4/4 [00:01<00:00,  2.55it/s]


Unnamed: 0,id,a,b,iteration,name,value
0,984d62c930430383,2,3,0,inc,8
1,984d62c930430383,2,3,1,inc,9
2,bef0db4949534914,2,2,0,inc,4
3,bef0db4949534914,2,2,1,inc,5
4,72b9965c05613302,3,3,0,inc,27
5,72b9965c05613302,3,3,1,inc,28
6,5420406315176767,3,2,0,inc,9
7,5420406315176767,3,2,1,inc,10


The number of callables and generators in `Research` is not limited.

In [15]:
research = (Research(domain=domain)
            .power(a=EC('a'), b=EC('b'))
            .inc(O('power'), save_to='inc', mode='generator')
            .add_callable('root', power, a=O('inc'), b=0.5, save_to='root', when='last')
           )
research.run(dump_results=False, finalize=True)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  5.04it/s]


Unnamed: 0,id,a,b,iteration,name,value
0,984d62c930430383,2,3,0,inc,8.0
1,984d62c930430383,2,3,1,inc,9.0
2,984d62c930430383,2,3,2,inc,9.0
3,984d62c930430383,2,3,2,root,3.0
4,bef0db4949534914,2,2,0,inc,4.0
5,bef0db4949534914,2,2,1,inc,5.0
6,bef0db4949534914,2,2,2,inc,5.0
7,bef0db4949534914,2,2,2,root,2.236068
8,72b9965c05613302,3,3,0,inc,27.0
9,72b9965c05613302,3,3,1,inc,28.0


Here we set `finalize=True` because the generator is exhausted at `iteration=2`; without that flag the experiment would stop there and `root` would not be executed.
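A rough plain-Python sketch of this idea follows. It only mirrors the behaviour described in the text (a final step that runs after the generator is exhausted) and makes no claim about how `finalize` or `when='last'` are implemented; `run` and its arguments are hypothetical.

```python
def inc(x):
    for i in range(2):
        yield x + i

def run(generator, last_unit, finalize=False):
    """Record generator values; optionally run `last_unit` after exhaustion."""
    results = []
    last_value = None
    for iteration, last_value in enumerate(generator):
        results.append((iteration, 'inc', last_value))
    if finalize and results:
        # One extra pass after exhaustion, using the last stored output.
        results.append((len(results), 'root', last_unit(last_value)))
    return results

print(run(inc(8), lambda value: value ** 0.5, finalize=True))
# [(0, 'inc', 8), (1, 'inc', 9), (2, 'root', 3.0)]
print(run(inc(8), lambda value: value ** 0.5, finalize=False))
# [(0, 'inc', 8), (1, 'inc', 9)]
```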

## Class instances in Research

To define more complex experiments with interactions between units, we can add instances of classes to experiments. Each instance is initialized with the config at the first iteration of the experiment, and its attributes can be used as callables and generators.

In [16]:
class MyCalc:
    def __init__(self, b):
        self.b = b
    
    def power(self, a):
        return a ** self.b

research = (Research(domain=domain)
            .add_instance('calc', MyCalc, b=EC('b'))
            .add_callable('calc.power', a=EC('a'))
            .add_generator(inc, x=O('calc.power'), save_to='inc')
           )

research.run(n_iters=2, dump_results=False)

research.results.df

100%|██████████| 8/8 [00:00<00:00,  9.69it/s]


Unnamed: 0,id,a,b,iteration,name,value
0,984d62c930430383,2,3,0,inc,8
1,984d62c930430383,2,3,1,inc,9
2,bef0db4949534914,2,2,0,inc,4
3,bef0db4949534914,2,2,1,inc,5
4,72b9965c05613302,3,3,0,inc,27
5,72b9965c05613302,3,3,1,inc,28
6,5420406315176767,3,2,0,inc,9
7,5420406315176767,3,2,1,inc,10
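A dotted name such as `'calc.power'` can be thought of as "look up the instance, then its attribute". The sketch below shows that lookup with `getattr`; `resolve_unit` and `namespace` are hypothetical names chosen for this illustration and do not describe BatchFlow internals.

```python
class MyCalc:
    def __init__(self, b):
        self.b = b

    def power(self, a):
        return a ** self.b

def resolve_unit(namespace, dotted_name):
    """Look up 'instance.method' in a dict of named instances."""
    instance_name, method_name = dotted_name.split('.')
    return getattr(namespace[instance_name], method_name)

config = {'a': 2, 'b': 3}
# The instance is built once from the config and then reused.
namespace = {'calc': MyCalc(b=config['b'])}
unit = resolve_unit(namespace, 'calc.power')
print(unit(a=config['a']))  # 8
```

Because the instance persists between calls, it can carry state (here `b`) that several units share.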


## Parallel executions

Experiments can be executed in parallel.

In [17]:
def heavy_callable():
    n_steps = int(2 * 10e7)  # 10e7 == 1e8, so the loop runs 200,000,000 times
    for _ in range(n_steps):
        pass

Now let's execute it with default parameters.

In [18]:
%%timeit

research = Research(n_reps=2).add_callable(heavy_callable)
research.run(dump_results=False, bar=False)

7.36 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


To execute two experiments in parallel, let's set `workers=2`.

In [19]:
%%timeit

research = Research(n_reps=2).add_callable(heavy_callable)
research.run(dump_results=False, workers=2, bar=False)

3.98 s ± 42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Now two parallel workers run the experiments within the research, and the total execution time is almost halved (3.98 s versus 7.36 s).
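The general pattern, independent experiments handed to a pool of workers, can be sketched with the standard library. Note the assumptions: this sketch uses threads and `time.sleep` as a stand-in for work purely to stay short, whereas a CPU-bound loop like `heavy_callable` would need separate processes to speed up; `slow_experiment` and `run_all` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_experiment(index):
    time.sleep(0.05)  # stand-in for real work; sleeping does not block other threads
    return index * index

def run_all(indices, workers=1):
    """Run experiments on a pool of `workers` threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_experiment, indices))

start = time.perf_counter()
sequential = run_all([0, 1], workers=1)
middle = time.perf_counter()
parallel = run_all([0, 1], workers=2)
end = time.perf_counter()

print(sequential, parallel)
print(f"1 worker: {middle - start:.2f} s, 2 workers: {end - middle:.2f} s")
```

The results are identical in both cases; only the wall-clock time changes with the number of workers.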

## Branches

If there is a common part in experiments with different configurations, then it can be taken out into a separate unit and evaluated once for several experiments.

In [20]:
import numpy as np
from batchflow.research import E

def load_data(random):
    return random.normal(size=10)

def mean(array, use_numpy=True):
    if use_numpy:
        return f"{np.mean(array):.02f} (with numpy)"
    return f"{sum(array) / len(array):.02f} (without numpy)"

domain = Domain({'use_numpy': [False, True]})

research = (Research(domain=domain)
            .add_callable(load_data, random=E().random)
            .add_callable(mean, array=O('load_data'), use_numpy=EC('use_numpy'),
                          save_to='stat')
           )

research.run(n_iters=2, dump_results=False)

100%|██████████| 4/4 [00:00<00:00,  6.45it/s]


<batchflow.research.research.Research at 0x7f0d08778908>

The resulting dataframe will be the following:

In [21]:
research.results.df

Unnamed: 0,id,use_numpy,iteration,name,value
0,7635c7e430430383,False,0,stat,0.03 (without numpy)
1,7635c7e430430383,False,1,stat,0.45 (without numpy)
2,8e95395749534914,True,0,stat,-0.04 (with numpy)
3,8e95395749534914,True,1,stat,0.33 (with numpy)


As we can see, the stats in the `value` column differ between experiments. Now let's add `root=True` to the `load_data` callable and `branches=2` to `run`:

In [22]:
research = (Research(domain=domain)
            .add_callable(load_data, random=E().random, root=True)
            .add_callable(mean, array=O('load_data'), use_numpy=EC('use_numpy'),
                          save_to='stat')
           )

research.remove(ask=False)
research.run(n_iters=2, branches=2)

100%|██████████| 4/4 [00:00<00:00,  4.31it/s]


<batchflow.research.research.Research at 0x7f0d08801048>

In that case `load_data` is executed once for both experiments, and its output is then used by the `mean` units of the experiments (branches), which are executed in parallel threads.

In [23]:
research.results.df

Unnamed: 0,id,use_numpy,iteration,name,value
0,7635c7e430430383,False,0,stat,0.03 (without numpy)
1,7635c7e430430383,False,1,stat,0.45 (without numpy)
2,8e95395749534914,True,0,stat,0.03 (with numpy)
3,8e95395749534914,True,1,stat,0.45 (with numpy)
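Both configs now report the same numbers at each iteration because they consume the same loaded data. The sharing pattern can be sketched in plain Python; this sketch runs the branches sequentially rather than in threads, uses `statistics.fmean` in place of numpy, and all names in it are hypothetical.

```python
import random
from statistics import fmean

load_calls = 0

def load_data(rng):
    """Shared 'root' step: draw ten normal samples and count invocations."""
    global load_calls
    load_calls += 1
    return [rng.gauss(0, 1) for _ in range(10)]

def mean(array, use_statistics):
    if use_statistics:
        return fmean(array), "with statistics"
    return sum(array) / len(array), "without statistics"

rng = random.Random(0)
branch_configs = [{'use_statistics': False}, {'use_statistics': True}]

results = []
for iteration in range(2):
    data = load_data(rng)          # evaluated once per iteration
    for config in branch_configs:  # every branch reuses the same `data`
        value, label = mean(data, **config)
        results.append((iteration, label, value))

for iteration, label, value in results:
    print(iteration, f"{value:.2f} ({label})")
print("load_data calls:", load_calls)
```

With two iterations and two branches there are four results but only two calls to `load_data`, which is the saving that a shared root unit is meant to provide.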
