# Introduction to Research Module

This tutorial introduces the `Research` functionality of `BatchFlow`.

The [`Research`](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.Research) class allows you to easily evaluate multiple parallel experiments with different parameter combinations.
It includes instruments to
* describe complex domains to generate parameter configurations,
* create a flexible experiment plan as a sequence of callables, generators and pipelines,
* parallelize experiments by processes and GPUs,
* share heavy processing between experiments,
* save and load results of experiments in a unified form.

We start with some useful imports:


In [1]:
import sys
sys.path.append('../../..')

from batchflow.research import Research, Option, Domain, EC, O, EP

## The simplest example

In the simplest case, `Research` can run [one callable](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.Research.add_callable) and save its results (output). Of course, such an example is meaningless, but we will complicate it gradually to demonstrate how to work with `Research`.

An experiment in a research is described by a sequence of method calls that can be chained.

In [2]:
def power(a=2, b=3):
    return a ** b

research = Research().add_callable(power, save_to='power')
research.run(dump_results=False)

100%|██████████| 1/1 [00:00<00:00,  5.01it/s]


<batchflow.research.research.Research at 0x7f213630d3c8>

By default, `Research` creates a folder and stores its results there, but we specify `dump_results=False` to keep the results in RAM (see [tutorial 3](https://github.com/analysiscenter/batchflow/blob/research/examples/tutorials/research/03_results_processing.ipynb) for more details).

The results can be seen in a special table even during the research execution (see `detach` parameter of [`run`](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.Research.run)). They are stored in `research.results` which can be transformed to `pandas.DataFrame` by calling `research.results.df` property:

In [3]:
research.results.df

Unnamed: 0,id,iteration,power
0,b8c9f03194150301,0,8


In that case the experiment consists of one iteration: the function `power` is executed and its result is saved under the name `power` (why iterations are needed is described below). The `id` column holds the unique experiment id. A callable with given parameters can be added in several equivalent ways:

In [4]:
research = Research().add_callable(power, save_to='power', a=3, b=2)
research = Research().add_callable(power, save_to='power', args=[3, 2])
research = Research().power(3, 2, save_to='power')
research = Research().power(a=3, b=2, save_to='power')

research.run(dump_results=False)

research.results.df

100%|██████████| 1/1 [00:00<00:00,  4.28it/s]


Unnamed: 0,id,iteration,power
0,b8c9f03126059800,0,9


Now we get a new id for the experiment, but with the same first 8 characters. The first 8 characters are a hash of the experiment config (here the config is empty); the rest are randomly sampled digits. To make all random generators in `Research` reproducible, define the `seed` parameter of `run`.
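To make the structure of such an id concrete, here is a minimal sketch of an identifier built from a config hash plus random digits. The helper `make_experiment_id`, the use of MD5 over a JSON dump, and the `rng` argument are all assumptions made for illustration; the hashing scheme that `Research` actually uses is not specified in this tutorial.

```python
import hashlib
import json
import random


def make_experiment_id(config, rng):
    """Hypothetical helper: 8 hex characters derived from the config + 8 random digits."""
    # Serialize with sorted keys so that equal configs always hash identically.
    payload = json.dumps(config, sort_keys=True).encode()
    prefix = hashlib.md5(payload).hexdigest()[:8]
    suffix = "".join(str(rng.randrange(10)) for _ in range(8))
    return prefix + suffix


rng = random.Random(42)  # a fixed seed makes the random suffix reproducible
first = make_experiment_id({}, rng)
second = make_experiment_id({}, rng)

# The same (empty) config always produces the same 8-character prefix.
print(first[:8] == second[:8])  # True
```

Under this model, two runs with the same config share a prefix, and fixing the seed of the random generator also fixes the suffix.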

## Constructing domains

Let's demonstrate how to create domains of parameters. [`Domain`](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.Domain) is a special class that describes the parameter configurations to try in an experiment.

In [5]:
domain = Domain(a=[2, 3], b=[3, 2])

In that case we get all possible combinations of `a` and `b` (four configs in total).

In [6]:
research = Research(domain=domain).power(a=EC('a'), b=EC('b'), save_to='power')
research.run(dump_results=False)

100%|██████████| 4/4 [00:00<00:00,  7.87it/s]


<batchflow.research.research.Research at 0x7f24a1336908>

[`EC`](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.EC) (an abbreviation of "experiment config") is a named expression that refers to items of the config assigned to the experiment. In general, a named expression is a way to refer to objects that don't exist at the moment of definition. Thus, `EC('key')` stands for an experiment config item, and `EC()` without arguments stands for the entire experiment config.

The most common named expression is [`E`](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.E), which gives access to the [`Experiment`](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.Experiment) instance and thereby to all the attributes of the current experiment. For example, `EC()` is an alias for `E().config`.
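The idea of deferred resolution can be sketched in plain Python. The class `ConfigRef` and the function `call_with_config` below are hypothetical stand-ins written for this illustration only; they are not how `BatchFlow` implements named expressions.

```python
class ConfigRef:
    """Hypothetical stand-in for a named expression such as EC('a')."""

    def __init__(self, key=None):
        self.key = key

    def resolve(self, config):
        # No key means "the whole config", mirroring EC() without arguments.
        return config if self.key is None else config[self.key]


def call_with_config(func, config, **kwargs):
    """Replace every ConfigRef in kwargs by its value, then call func."""
    resolved = {
        name: value.resolve(config) if isinstance(value, ConfigRef) else value
        for name, value in kwargs.items()
    }
    return func(**resolved)


def power(a=2, b=3):
    return a ** b


# The placeholders are created before any config exists...
plan = dict(a=ConfigRef('a'), b=ConfigRef('b'))

# ...and are resolved only when a concrete config is assigned.
print(call_with_config(power, {'a': 3, 'b': 2}, **plan))  # 9
```

The same `plan` can be reused with any number of configs, which is what makes a single experiment description applicable to a whole domain.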

In the results we can find two new columns, `a` and `b`, holding the parameter values from the config.

In [7]:
research.results.df

Unnamed: 0,id,a,b,iteration,power
0,984d62c960813263,2,3,0,8
1,bef0db4945397615,2,2,0,4
2,72b9965c20726203,3,3,0,27
3,5420406311691788,3,2,0,9


The same domain can be created in several other ways:

In [8]:
domain = Domain({'a': [2, 3], 'b': [3, 2]})
domain = Domain(a=[2, 3]) * Domain(b=[3, 2])
domain = Option('a', [2, 3]) * Option('b', [3, 2])
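As a plain-Python analogy (not the `Domain` implementation), multiplying domains corresponds to a Cartesian product of the parameter values. The sketch below uses only `itertools.product`; the variable names are illustrative.

```python
from itertools import product

params = {'a': [2, 3], 'b': [3, 2]}

# Cartesian product: every value of `a` is paired with every value of `b`.
configs = [dict(zip(params, values)) for values in product(*params.values())]

for config in configs:
    print(config)
# {'a': 2, 'b': 3}
# {'a': 2, 'b': 2}
# {'a': 3, 'b': 3}
# {'a': 3, 'b': 2}
```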

To concatenate domains, use `+`:

In [9]:
domain = Domain(a=[2, 3]) + Domain(b=[3, 2])

research = Research(domain=domain).power(kwargs=EC(), save_to='power')
research.run(dump_results=False)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  6.20it/s]


Unnamed: 0,id,a,b,iteration,power
0,71d4879f84218796,2.0,,0,8
1,e32d7dce40441344,3.0,,0,27
2,6bffa3ee43197887,,3.0,0,8
3,ff973b7627424687,,2.0,0,4


Here each config specifies only one parameter; the other one takes its default value from the function signature (`a=2`, `b=3`).
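The effect of partial configs can be reproduced without `Research` at all. In this sketch the list `configs` is a hand-written analogue of the concatenated domain, not something produced by `Domain` itself.

```python
def power(a=2, b=3):
    return a ** b


# Concatenation keeps the configs of each operand as they are,
# so every config defines only one of the two parameters.
configs = [{'a': value} for value in [2, 3]] + [{'b': value} for value in [3, 2]]

# Unpacking a partial config leaves the missing argument at its default.
results = [power(**config) for config in configs]
print(results)  # [8, 27, 8, 4]
```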

We can also take a "scalar product" (`@`) of single-parameter domains of the same length, which pairs their values element-wise:

In [10]:
domain = Domain(a=[4, 5]) @ Domain(b=[1, 2])

research = Research(domain=domain).power(kwargs=EC(), save_to='power')
research.run(dump_results=False)

research.results.df

100%|██████████| 2/2 [00:20<00:00, 10.22s/it]


Unnamed: 0,id,a,b,iteration,power
0,d31cf35105232907,4,1,0,4
1,eab3a82883452834,5,2,0,25
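In plain Python, this element-wise pairing behaves like `zip`. The sketch below is only an analogy for the table above and does not use `Domain`.

```python
def power(a=2, b=3):
    return a ** b


a_values = [4, 5]
b_values = [1, 2]

# Element-wise pairing: the i-th value of `a` goes with the i-th value of `b`.
configs = [{'a': a, 'b': b} for a, b in zip(a_values, b_values)]

print(configs)                                  # [{'a': 4, 'b': 1}, {'a': 5, 'b': 2}]
print([power(**config) for config in configs])  # [4, 25]
```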


## Experiment description

Let's look at another toy experiment with two callables.

In [11]:
domain = Domain({'a': [2, 3], 'b': [3, 2]})

def inc(x):
    return x + 1

research = (Research(domain=domain)
            .power(a=EC('a'), b=EC('b'), save_to='power')
            .inc(O('power'), save_to='inc')
           )
research.run(dump_results=False)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  5.91it/s]


Unnamed: 0,id,a,b,iteration,power,inc
0,984d62c990944608,2,3,0,8,9
1,bef0db4928276293,2,2,0,4,5
2,72b9965c51868163,3,3,0,27,28
3,5420406322352592,3,2,0,9,10


We use the [`O`](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.O) named expression to pass the output of the `power` function to the `inc` function.

Each saved value has its own column, but the [`to_df`](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.ResearchResults.to_df) method with `pivot=False` puts all saved values into two columns: `'name'` and `'value'`. `to_df` with default parameters returns the same result as the `df` property.

In [12]:
research.results.to_df(pivot=False)

Unnamed: 0,id,a,b,iteration,name,value
0,984d62c990944608,2,3,0,power,8
1,984d62c990944608,2,3,0,inc,9
2,bef0db4928276293,2,2,0,power,4
3,bef0db4928276293,2,2,0,inc,5
4,72b9965c51868163,3,3,0,power,27
5,72b9965c51868163,3,3,0,inc,28
6,5420406322352592,3,2,0,power,9
7,5420406322352592,3,2,0,inc,10
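The relation between the two layouts can be sketched with plain dictionaries. This is an illustration of the wide-to-long reshaping only; the variable names are made up and the code does not call `to_df`.

```python
# Wide rows: one column per saved value (as in `results.df`).
wide_rows = [
    {'id': '984d62c990944608', 'a': 2, 'b': 3, 'iteration': 0, 'power': 8, 'inc': 9},
    {'id': 'bef0db4928276293', 'a': 2, 'b': 2, 'iteration': 0, 'power': 4, 'inc': 5},
]
saved_names = ['power', 'inc']

# Long rows: one row per saved value, described by 'name' and 'value'.
long_rows = []
for row in wide_rows:
    shared = {key: val for key, val in row.items() if key not in saved_names}
    for name in saved_names:
        long_rows.append({**shared, 'name': name, 'value': row[name]})

for row in long_rows:
    print(row['id'], row['name'], row['value'])
# 984d62c990944608 power 8
# 984d62c990944608 inc 9
# bef0db4928276293 power 4
# bef0db4928276293 inc 5
```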


## Generators in research

In addition to callables, we can add [generators](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.Research.add_generator) to `Research`. Now it will become clear why one experiment can have several iterations.

In [13]:
def inc(x):
    for i in range(2):
        yield x + i

domain = Domain(a=[2, 3], b=[3, 2])
        
research = (Research(domain=domain)
            .add_callable(power, a=EC('a'), b=EC('b'))
            .add_generator(inc, save_to='inc', x=O('power'))
           )
research.run(dump_results=False, finalize=False)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  5.04it/s]


Unnamed: 0,id,a,b,iteration,inc
0,984d62c958486105,2,3,0,8
1,984d62c958486105,2,3,1,9
2,984d62c958486105,2,3,2,9
3,bef0db4906006827,2,2,0,4
4,bef0db4906006827,2,2,1,5
5,bef0db4906006827,2,2,2,5
6,72b9965c54259108,3,3,0,27
7,72b9965c54259108,3,3,1,28
8,72b9965c54259108,3,3,2,28
9,5420406302541977,3,2,0,9


Here the generator produces a different increment on each iteration of the experiment (`finalize` is described below). The number of iterations for each experiment is specified in the `run` call. By default, it is equal to `1` if the research contains only callables, and to `None` if it has generators. `None` is interpreted as infinity: the experiment continues until a generator in the research is exhausted.
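The stopping rule can be modelled with a small loop. The function `run_experiment` is a hypothetical simplification written for this tutorial: it only shows how an unlimited number of iterations ends with generator exhaustion, and it does not model how `Research` records the iteration on which exhaustion is detected.

```python
def power(a=2, b=3):
    return a ** b


def inc(x):
    for i in range(2):
        yield x + i


def run_experiment(a, b, n_iters=None):
    """Hypothetical loop: n_iters=None means 'run until the generator is exhausted'."""
    generator = None
    saved = []
    iteration = 0
    while n_iters is None or iteration < n_iters:
        value = power(a, b)                # a callable is simply called on every iteration
        if generator is None:
            generator = inc(value)         # a generator is created once...
        try:
            saved.append(next(generator))  # ...and advanced by one step per iteration
        except StopIteration:
            break                          # exhaustion ends the experiment
        iteration += 1
    return saved


print(run_experiment(2, 3))             # [8, 9]
print(run_experiment(2, 3, n_iters=1))  # [8]
```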

We can define the same experiment plan in `Research` in the following way:

In [14]:
research = (Research(domain=domain)
            .power(a=EC('a'), b=EC('b'))
            .inc(x=O('power'), save_to='inc', mode='generator')
           )

research.run(dump_results=False, finalize=False)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  4.59it/s]


Unnamed: 0,id,a,b,iteration,inc
0,984d62c969238196,2,3,0,8
1,984d62c969238196,2,3,1,9
2,984d62c969238196,2,3,2,9
3,bef0db4919111069,2,2,0,4
4,bef0db4919111069,2,2,1,5
5,bef0db4919111069,2,2,2,5
6,72b9965c26421910,3,3,0,27
7,72b9965c26421910,3,3,1,28
8,72b9965c26421910,3,3,2,28
9,5420406324096615,3,2,0,9


The number of callables and generators in `Research` is not limited.

In [15]:
research = (Research(domain=domain)
            .power(a=EC('a'), b=EC('b'))
            .inc(O('power'), save_to='inc', mode='generator')
            .add_callable('root', power, a=O('inc'), b=0.5, save_to='root', when='last')
           )
research.run(dump_results=False)

research.results.df

100%|██████████| 4/4 [00:00<00:00,  4.08it/s]


Unnamed: 0,id,a,b,iteration,inc,root
0,984d62c902904128,2,3,0,8,
1,984d62c902904128,2,3,1,9,
2,984d62c902904128,2,3,2,9,3.0
3,bef0db4948895701,2,2,0,4,
4,bef0db4948895701,2,2,1,5,
5,bef0db4948895701,2,2,2,5,2.236068
6,72b9965c73107538,3,3,0,27,
7,72b9965c73107538,3,3,1,28,
8,72b9965c73107538,3,3,2,28,5.291503
9,5420406374200572,3,2,0,9,


Here we use the default `finalize=True` because the generator is exhausted at the second iteration, and without that flag the experiment would stop there, so `root` would not be executed. With `finalize=False` we get the following results:

In [16]:
research = (Research(domain=domain)
            .power(a=EC('a'), b=EC('b'))
            .inc(O('power'), save_to='inc', mode='generator')
            .add_callable('root', power, a=O('inc'), b=0.5, save_to='root', when='last')
           )
research.run(dump_results=False, finalize=False)

research.results.df

100%|██████████| 4/4 [00:01<00:00,  3.70it/s]


Unnamed: 0,id,a,b,iteration,inc,root
0,984d62c951302923,2,3,0,8,
1,984d62c951302923,2,3,1,9,
2,984d62c951302923,2,3,2,9,
3,bef0db4966178835,2,2,0,4,
4,bef0db4966178835,2,2,1,5,
5,bef0db4966178835,2,2,2,5,
6,72b9965c56010251,3,3,0,27,
7,72b9965c56010251,3,3,1,28,
8,72b9965c56010251,3,3,2,28,
9,5420406329582695,3,2,0,9,


## Instances of some class in Research

To define more complex experiments with interactions between units, we can add [instances](https://analysiscenter.github.io/batchflow/api/batchflow.research.html#batchflow.research.Research.add_instance) of classes to experiments. An instance is initialized with the config at the first iteration of the experiment, and its attributes can be used as callables and generators or in named expressions. To add an attribute as an executable unit, use `{instance_name}.{attr}`.

In [17]:
class MyCalc:
    def __init__(self, b):
        self.b = b
    
    def power(self, a):
        return a ** self.b

research = (Research(domain=domain)
            .add_instance('calc', MyCalc, b=EC('b'))
            .add_callable('calc.power', a=EC('a'))
            .add_generator(inc, x=O('calc.power'), save_to='inc')
           )

research.run(n_iters=2, dump_results=False, finalize=False)

research.results.df

100%|██████████| 8/8 [00:01<00:00,  6.66it/s]


Unnamed: 0,id,a,b,iteration,inc
0,984d62c988160318,2,3,0,8
1,984d62c988160318,2,3,1,9
2,bef0db4986238007,2,2,0,4
3,bef0db4986238007,2,2,1,5
4,72b9965c31236688,3,3,0,27
5,72b9965c31236688,3,3,1,28
6,5420406359434719,3,2,0,9
7,5420406359434719,3,2,1,10


## Parallel executions

Experiments can be executed in parallel. Here is an example of heavy callable:

In [18]:
def heavy_callable():
    n_steps = int(2 * 10e7)  # 200 million empty iterations
    for _ in range(n_steps):
        pass

In [19]:
%%timeit

heavy_callable()

3.57 s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Now let's execute it with default parameters.

In [20]:
%%timeit

research = Research(n_reps=2).heavy_callable()
research.run(dump_results=False, bar=False)



7.78 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


To execute the two experiments in parallel, set `workers=2`.

In [21]:
%%timeit

research = Research(n_reps=2).heavy_callable()
research.run(dump_results=False, workers=2, bar=False)

3.95 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Now two parallel workers run the experiments within the research, and the total execution time is almost halved (3.95 s instead of 7.78 s).
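The same effect can be observed with the standard library alone. The sketch below uses `concurrent.futures.ProcessPoolExecutor` as an analogy; whether `Research` distributes experiments in this particular way is an assumption, and `busy_loop` and `run_experiments` are illustrative names.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def busy_loop(n):
    """CPU-bound stand-in for a heavy experiment."""
    count = 0
    for _ in range(n):
        count += 1
    return count


def run_experiments(n_experiments, n, workers):
    """Run identical experiments sequentially or in a pool of processes."""
    if workers == 1:
        return [busy_loop(n) for _ in range(n_experiments)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(busy_loop, [n] * n_experiments))


if __name__ == "__main__":
    for workers in (1, 2):
        start = time.perf_counter()
        run_experiments(n_experiments=2, n=5_000_000, workers=workers)
        print(f"workers={workers}: {time.perf_counter() - start:.2f} s")
```

Separate processes are used here because a pure-Python loop is CPU-bound, so threads alone would not shorten it. The actual speed-up depends on the number of available cores and on process start-up overhead.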

## Branches

If experiments with different configurations share a common part, it can be moved into a separate unit and evaluated once for several experiments. To make your research reproducible, use the `random` attribute of the `Experiment` as a [random generator](https://analysiscenter.github.io/batchflow/api/batchflow.utils_random.html). If the `seed` of `Research` is fixed, all values sampled this way will be generated identically on every run.

In [22]:
import numpy as np
from batchflow.research import E

def load_data(random):
    return random.normal(size=10)

def mean(array, use_numpy=True):
    if use_numpy:
        return f"{np.mean(array):.02f} (with numpy)"
    return f"{sum(array) / len(array):.02f} (without numpy)"

domain = Domain({'use_numpy': [False, True]})

research = (Research(domain=domain)
            .add_callable(load_data, random=E().random)
            .add_callable(mean, array=O('load_data'), use_numpy=EC('use_numpy'),
                          save_to='stat')
           )

research.run(n_iters=2, dump_results=False)

100%|██████████| 4/4 [00:00<00:00,  7.79it/s]


<batchflow.research.research.Research at 0x7f24a095d2e8>

The resulting dataframe will be the following:

In [23]:
research.results.df

Unnamed: 0,id,use_numpy,iteration,stat
0,7635c7e449664249,False,0,-0.48 (without numpy)
1,7635c7e449664249,False,1,-0.44 (without numpy)
2,8e95395759804202,True,0,0.02 (with numpy)
3,8e95395759804202,True,1,0.08 (with numpy)


As we can see, the values in the `stat` column differ between experiments, because each experiment loads its own data.

![Title](img/without_branches.png)

Now let's add `root=True` to the `load_data` callable and `branches=2` to `run`:

In [24]:
research = (Research(domain=domain)
            .add_callable(load_data, random=E().random, root=True)
            .add_callable(mean, array=O('load_data'), use_numpy=EC('use_numpy'),
                          save_to='stat')
           )

research.run(n_iters=2, branches=2, dump_results=False)

100%|██████████| 4/4 [00:00<00:00,  8.12it/s]


<batchflow.research.research.Research at 0x7f24a095dc88>

In that case `load_data` is executed once for two experiments, and its output is then used by the `mean` units of the experiments (branches), which are executed in parallel threads.

In [25]:
research.results.df

Unnamed: 0,id,use_numpy,iteration,stat
0,7635c7e450348798,False,0,-0.04 (without numpy)
1,7635c7e450348798,False,1,-0.16 (without numpy)
2,8e95395768289627,True,0,-0.04 (with numpy)
3,8e95395768289627,True,1,-0.16 (with numpy)


![Title](img/with_branches.png)

Now the experiments work with the same data, so the results are the same for both experiments. Note that a root unit is shared by several branches with different configs. That is why it is very important that root units do not use anything that depends on the config!
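The difference between separate and shared loading can be sketched with the standard library. Everything below is an illustrative analogy: `random.Random` stands in for the experiment's random generator, `statistics.fmean` stands in for the numpy-based mean, and the parameter `use_statistics` is a made-up counterpart of `use_numpy`.

```python
import random
import statistics


def load_data(rng, size=10):
    """Hypothetical root unit: it must not depend on any experiment config."""
    return [rng.gauss(0, 1) for _ in range(size)]


def mean(array, use_statistics=True):
    if use_statistics:
        return statistics.fmean(array)
    return sum(array) / len(array)


configs = [{'use_statistics': False}, {'use_statistics': True}]

# Without a shared root: every experiment draws its own data.
rng = random.Random(0)
separate = [mean(load_data(rng), **config) for config in configs]

# With a shared root: the data is loaded once and reused by every branch.
rng = random.Random(0)
shared_data = load_data(rng)
shared = [mean(shared_data, **config) for config in configs]

# Both branches saw the same data, so the two means agree up to rounding.
print(abs(shared[0] - shared[1]) < 1e-12)  # True
```

In the first variant the two means come from different samples, so they are not expected to coincide; in the second variant only the way of computing the mean differs.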

## Summary

Here we described how to
* add callables and generators into `Research`,
* get results,
* define parameters domain,
* run experiments in parallel,
* make some callables common for several experiments.

The next tutorials add more details and further explore the capabilities of `Research`.