# Research results processing

`Research` provides a dedicated tool for collecting and processing results. This tutorial shows, step by step, how to work with it.

In [1]:
import sys

sys.path.append('../../../..')

from batchflow.research import Alias, Domain, Research, ResearchResults, EC, O

Let's define a simple research with a generator, a callable, and two parameters in the domain. Some domain values are defined with aliases to demonstrate the alias-related features.

In [2]:
def f():
    return (i for i in range(2))

def g(c, d, i):
    return c * i + d

domain = Domain(c=[Alias(1, 'one'), Alias(2, 'two'), 3], d=[1, 2])

research = (Research(domain=domain)
            .f(mode='generator', save_to='f')
            .g(c=EC('c'), d=EC('d'), i=O('f'), save_to='res'))

research.run(dump_results=False)

100%|██████████| 6/6 [00:00<00:00,  9.44it/s]


<batchflow.research.research.Research at 0x7fe7f0e2b6a0>

`Research` stores results in an instance of the `ResearchResults` class:

In [3]:
type(research.results)

batchflow.research.results.ResearchResults

It has an inner structure that stores the results:

In [4]:
research.results.results.items()

[('16158ce667808836',
  OrderedDict([('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
               ('res', OrderedDict([(0, 1), (1, 2), (2, 2)]))])),
 ('86cc322770109780',
  OrderedDict([('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
               ('res', OrderedDict([(0, 2), (1, 3), (2, 3)]))])),
 ('4472f3a916195885',
  OrderedDict([('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
               ('res', OrderedDict([(0, 1), (1, 3), (2, 3)]))])),
 ('59a0b17c75089686',
  OrderedDict([('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
               ('res', OrderedDict([(0, 2), (1, 4), (2, 4)]))])),
 ('98773a5859049302',
  OrderedDict([('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
               ('res', OrderedDict([(0, 1), (1, 4), (2, 4)]))])),
 ('973188cd12100969',
  OrderedDict([('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
               ('res', OrderedDict([(0, 2), (1, 5), (2, 5)]))]))]

...but as a user you don't need to think about it. As shown in previous tutorials, use the `df` property to transform the results into a `pandas.DataFrame`:

In [5]:
research.results.df

Unnamed: 0,id,c,d,iteration,f,res
0,16158ce667808836,1,1,0,0,1
1,16158ce667808836,1,1,1,1,2
2,16158ce667808836,1,1,2,1,2
3,86cc322770109780,1,2,0,0,2
4,86cc322770109780,1,2,1,1,3
5,86cc322770109780,1,2,2,1,3
6,4472f3a916195885,2,1,0,0,1
7,4472f3a916195885,2,1,1,1,3
8,4472f3a916195885,2,1,2,1,3
9,59a0b17c75089686,2,2,0,0,2


`'id'` is the id of the experiment, columns `'c'` and `'d'` hold the parameters from the config, and `'iteration'` is the iteration of the experiment. The other columns store saved values.
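
To make the link between the nested storage and the table concrete, here is a minimal pure-Python sketch that flattens a structure of the same shape into one row per experiment and iteration. It is an illustration only, not how `ResearchResults` is implemented; the `flatten` helper and the `configs` lookup are hypothetical names introduced for this example.

```python
from collections import OrderedDict

# Toy copy of the nested layout: experiment id -> variable name -> iteration -> value.
nested = OrderedDict([
    ('16158ce667808836', OrderedDict([
        ('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
        ('res', OrderedDict([(0, 1), (1, 2), (2, 2)])),
    ])),
])

# Hypothetical lookup of parameter values by experiment id (an assumption for this sketch).
configs = {'16158ce667808836': {'c': 1, 'd': 1}}


def flatten(nested, configs):
    """Return one row (a dict) per (experiment, iteration) pair."""
    rows = []
    for exp_id, variables in nested.items():
        # Collect every iteration that appears in at least one variable.
        iterations = sorted({it for values in variables.values() for it in values})
        for it in iterations:
            row = {'id': exp_id, **configs[exp_id], 'iteration': it}
            for name, values in variables.items():
                row[name] = values.get(it)  # None if a variable was not saved at this iteration
            rows.append(row)
    return rows


rows = flatten(nested, configs)
for row in rows:
    print(row)
```

Each printed dict corresponds to one line of the dataframe: the id, the config parameters, the iteration, and then one key per saved variable.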

The output of the `df` property is the same as that of the `to_df` method:

In [6]:
research.results.to_df()

Unnamed: 0,id,c,d,iteration,f,res
0,16158ce667808836,1,1,0,0,1
1,16158ce667808836,1,1,1,1,2
2,16158ce667808836,1,1,2,1,2
3,86cc322770109780,1,2,0,0,2
4,86cc322770109780,1,2,1,1,3
5,86cc322770109780,1,2,2,1,3
6,4472f3a916195885,2,1,0,0,1
7,4472f3a916195885,2,1,1,1,3
8,4472f3a916195885,2,1,2,1,3
9,59a0b17c75089686,2,2,0,0,2


## Parameters of `to_df`

### `pivot`

By default, `pivot=True` and the resulting dataframe has a separate column for each saved value (in our case, `'f'` and `'res'`). To get two columns instead, one for the variable name and one for its value, set `pivot=False`:

In [7]:
research.results.to_df(pivot=False)

Unnamed: 0,id,c,d,iteration,name,value
0,16158ce667808836,1,1,0,f,0
1,16158ce667808836,1,1,1,f,1
2,16158ce667808836,1,1,2,f,1
3,16158ce667808836,1,1,0,res,1
4,16158ce667808836,1,1,1,res,2
5,16158ce667808836,1,1,2,res,2
6,86cc322770109780,1,2,0,f,0
7,86cc322770109780,1,2,1,f,1
8,86cc322770109780,1,2,2,f,1
9,86cc322770109780,1,2,0,res,2
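
The relation between the two layouts can be sketched in plain Python: every wide row is split into one long row per saved variable. This is a hedged illustration of the idea, not the library's code; `to_long` and its `index` argument are hypothetical, and the row order it produces is not meant to reproduce the order in the table above.

```python
def to_long(rows, index=('id', 'c', 'd', 'iteration')):
    """Turn wide rows (one key per variable) into long rows with 'name' and 'value' keys."""
    long_rows = []
    for row in rows:
        base = {key: row[key] for key in index}
        for name, value in row.items():
            if name not in index:
                # Every non-index key is treated as a saved variable.
                long_rows.append({**base, 'name': name, 'value': value})
    return long_rows


wide = [
    {'id': '16158ce667808836', 'c': 1, 'd': 1, 'iteration': 0, 'f': 0, 'res': 1},
    {'id': '16158ce667808836', 'c': 1, 'd': 1, 'iteration': 1, 'f': 1, 'res': 2},
]

long_rows = to_long(wide)
for row in long_rows:
    print(row)
```

The long layout is convenient when the variable name itself should act as a grouping key, for example when plotting several saved values with one call.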


### `include_config`

The dataframe includes the experiment id, so the experiment config can be recovered from it. You may therefore sometimes want to drop the parameter values from the results:

In [8]:
research.results.to_df(include_config=False)

Unnamed: 0,id,iteration,f,res
0,16158ce667808836,0,0,1
1,16158ce667808836,1,1,2
2,16158ce667808836,2,1,2
0,86cc322770109780,0,0,2
1,86cc322770109780,1,1,3
2,86cc322770109780,2,1,3
0,4472f3a916195885,0,0,1
1,4472f3a916195885,1,1,3
2,4472f3a916195885,2,1,3
0,59a0b17c75089686,0,0,2


### `concat_config`

You can also create a single column for the whole config. It contains the concatenated config keys and values (their aliases, where defined).

In [9]:
research.results.to_df(concat_config=True)

Unnamed: 0,id,config,iteration,f,res
0,16158ce667808836,c_one-d_1,0,0,1
1,16158ce667808836,c_one-d_1,1,1,2
2,16158ce667808836,c_one-d_1,2,1,2
3,86cc322770109780,c_one-d_2,0,0,2
4,86cc322770109780,c_one-d_2,1,1,3
5,86cc322770109780,c_one-d_2,2,1,3
6,4472f3a916195885,c_two-d_1,0,0,1
7,4472f3a916195885,c_two-d_1,1,1,3
8,4472f3a916195885,c_two-d_1,2,1,3
9,59a0b17c75089686,c_two-d_2,0,0,2


To keep the separate columns for each parameter, use `drop_columns=False`. Using `drop_columns` with `concat_config=False` doesn't make sense.

In [10]:
research.results.to_df(concat_config=True, drop_columns=False)

Unnamed: 0,id,config,c,d,iteration,f,res
0,16158ce667808836,c_one-d_1,1,1,0,0,1
1,16158ce667808836,c_one-d_1,1,1,1,1,2
2,16158ce667808836,c_one-d_1,1,1,2,1,2
3,86cc322770109780,c_one-d_2,1,2,0,0,2
4,86cc322770109780,c_one-d_2,1,2,1,1,3
5,86cc322770109780,c_one-d_2,1,2,2,1,3
6,4472f3a916195885,c_two-d_1,2,1,0,0,1
7,4472f3a916195885,c_two-d_1,2,1,1,1,3
8,4472f3a916195885,c_two-d_1,2,1,2,1,3
9,59a0b17c75089686,c_two-d_2,2,2,0,0,2
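
The format of the `config` column can be imitated with a few lines of Python. The sketch below assumes, based only on the strings visible in the tables above, that each parameter is rendered as `key_alias`, that parameters are joined with `-`, and that values without an alias fall back to their string form. The `concat_config` function and the `aliases` mapping are hypothetical and are not part of `batchflow`.

```python
def concat_config(config, aliases=None):
    """Build a 'key_alias-key_alias' string for a config dict."""
    aliases = aliases or {}
    parts = []
    for key, value in config.items():
        # Use the alias if one is registered for this (key, value) pair,
        # otherwise fall back to the plain string form of the value.
        label = aliases.get((key, value), str(value))
        parts.append(f'{key}_{label}')
    return '-'.join(parts)


# Aliases as defined in the tutorial's domain: 1 -> 'one', 2 -> 'two' for parameter 'c'.
aliases = {('c', 1): 'one', ('c', 2): 'two'}

first = concat_config({'c': 1, 'd': 1}, aliases)
second = concat_config({'c': 2, 'd': 2}, aliases)
print(first)
print(second)
```

A single string column like this is handy as a legend label or as a grouping key when the number of parameters is large.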


### `use_alias`

By default, `use_alias=False` and the columns for config parameters contain the actual values. To show their aliases instead, set `use_alias=True`:

In [11]:
research.results.to_df(use_alias=True)

Unnamed: 0,id,c,d,iteration,f,res
0,16158ce667808836,one,1,0,0,1
1,16158ce667808836,one,1,1,1,2
2,16158ce667808836,one,1,2,1,2
3,86cc322770109780,one,2,0,0,2
4,86cc322770109780,one,2,1,1,3
5,86cc322770109780,one,2,2,1,3
6,4472f3a916195885,two,1,0,0,1
7,4472f3a916195885,two,1,1,1,3
8,4472f3a916195885,two,1,2,1,3
9,59a0b17c75089686,two,2,0,0,2


### `remove_auxilary`

Results also have additional columns for parameters that were not defined in the domain: `'repetition'`, `'updates'` and `'device'`. By default, they are dropped from the dataframe because they often don't vary or are of no interest to the user. If you need them, set `remove_auxilary=False`:

In [12]:
research.results.to_df(remove_auxilary=False)

Unnamed: 0,id,c,d,repetition,updates,device,iteration,f,res
0,16158ce667808836,1,1,0,0,,0,0,1
1,16158ce667808836,1,1,0,0,,1,1,2
2,16158ce667808836,1,1,0,0,,2,1,2
3,86cc322770109780,1,2,0,0,,0,0,2
4,86cc322770109780,1,2,0,0,,1,1,3
5,86cc322770109780,1,2,0,0,,2,1,3
6,4472f3a916195885,2,1,0,0,,0,0,1
7,4472f3a916195885,2,1,0,0,,1,1,3
8,4472f3a916195885,2,1,0,0,,2,1,3
9,59a0b17c75089686,2,2,0,0,,0,0,2


They are treated as separate parameters, so they are not concatenated into the config when `concat_config=True`:

In [13]:
research.results.to_df(remove_auxilary=False, concat_config=True)

Unnamed: 0,id,config,repetition,updates,device,iteration,f,res
0,16158ce667808836,c_one-d_1,0,0,,0,0,1
1,16158ce667808836,c_one-d_1,0,0,,1,1,2
2,16158ce667808836,c_one-d_1,0,0,,2,1,2
3,86cc322770109780,c_one-d_2,0,0,,0,0,2
4,86cc322770109780,c_one-d_2,0,0,,1,1,3
5,86cc322770109780,c_one-d_2,0,0,,2,1,3
6,4472f3a916195885,c_two-d_1,0,0,,0,0,1
7,4472f3a916195885,c_two-d_1,0,0,,1,1,3
8,4472f3a916195885,c_two-d_1,0,0,,2,1,3
9,59a0b17c75089686,c_two-d_2,0,0,,0,0,2


## Loading of results

In [14]:
research = (Research(domain=domain)
            .f(mode='generator', save_to='f')
            .g(c=EC('c'), d=EC('d'), i=O('f'), save_to='res'))

research.run(dump_results=True)

100%|██████████| 6/6 [00:00<00:00,  7.41it/s]


<batchflow.research.research.Research at 0x7feb4a022eb8>

When research is executed with `dump_results=True`, experiment results are kept in RAM only while the experiment is running (if `dump` is not used). Compared to the previous research, the inner storage of `research.results` doesn't hold any variable values.

In [15]:
research.results.results.items()

[('16158ce673079978', OrderedDict()),
 ('86cc322725919769', OrderedDict()),
 ('4472f3a930003004', OrderedDict()),
 ('59a0b17c38718849', OrderedDict()),
 ('98773a5857513592', OrderedDict()),
 ('973188cd38339321', OrderedDict())]

But all values were dumped into the `research` folder. All of them can be loaded with the same `to_df` method or `df` property:

In [16]:
research.results.df

Unnamed: 0,id,c,d,iteration,f,res
0,16158ce673079978,1,1,0,0,1
1,16158ce673079978,1,1,1,1,2
2,16158ce673079978,1,1,2,1,2
3,86cc322725919769,1,2,0,0,2
4,86cc322725919769,1,2,1,1,3
5,86cc322725919769,1,2,2,1,3
6,4472f3a930003004,2,1,0,0,1
7,4472f3a930003004,2,1,1,1,3
8,4472f3a930003004,2,1,2,1,3
9,59a0b17c38718849,2,2,0,0,2


Under the hood, `to_df` calls the `load` method to load all saved values.

## Load results

You can load results directly from the research folder.

In [17]:
results = ResearchResults('research')
results.df

Unnamed: 0,id,c,d,iteration,f,res
0,98773a5857513592,3,1,0,0,1
1,98773a5857513592,3,1,1,1,4
2,98773a5857513592,3,1,2,1,4
3,4472f3a930003004,2,1,0,0,1
4,4472f3a930003004,2,1,1,1,3
5,4472f3a930003004,2,1,2,1,3
6,86cc322725919769,1,2,0,0,2
7,86cc322725919769,1,2,1,1,3
8,86cc322725919769,1,2,2,1,3
9,59a0b17c38718849,2,2,0,0,2


## Filtering

In the simplest cases, you can transform all the results into a `pandas.DataFrame` and use the full `pandas` functionality to process it. But sometimes you will have a huge research with heavy results. For example, results can store not only numeric and string values but also arrays, serialized models and so on. In that case, it is useful to load only the necessary elements of the results.

`load` and `to_df` can filter results at loading time. For example, to load results for experiments with `c=1`, just pass it as a keyword argument.

In [18]:
results = ResearchResults('research')
results.to_df(c=1)

Unnamed: 0,id,c,d,iteration,f,res
0,86cc322725919769,1,2,0,0,2
1,86cc322725919769,1,2,1,1,3
2,86cc322725919769,1,2,2,1,3
3,16158ce673079978,1,1,0,0,1
4,16158ce673079978,1,1,1,1,2
5,16158ce673079978,1,1,2,1,2


To be sure that we don't load any other experiment results, let's check what was actually loaded:

In [19]:
results.results.items()

[('86cc322725919769',
  OrderedDict([('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
               ('res', OrderedDict([(0, 2), (1, 3), (2, 3)]))])),
 ('16158ce673079978',
  OrderedDict([('f', OrderedDict([(0, 0), (1, 1), (2, 1)])),
               ('res', OrderedDict([(0, 1), (1, 2), (2, 2)]))]))]
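
The benefit of filtering at loading time is that experiments are selected by their config before any stored values are read. A minimal sketch of that selection step is shown below. It assumes the configs are available as a plain dict keyed by experiment id; `select_experiments` is a hypothetical helper, not a `batchflow` function.

```python
def select_experiments(configs, **conditions):
    """Return ids of experiments whose config matches every keyword condition."""
    return [
        exp_id for exp_id, config in configs.items()
        if all(config.get(key) == value for key, value in conditions.items())
    ]


# Experiment ids and parameter values as they appear in the tables of this tutorial.
configs = {
    '16158ce673079978': {'c': 1, 'd': 1},
    '86cc322725919769': {'c': 1, 'd': 2},
    '4472f3a930003004': {'c': 2, 'd': 1},
    '59a0b17c38718849': {'c': 2, 'd': 2},
    '98773a5857513592': {'c': 3, 'd': 1},
    '973188cd38339321': {'c': 3, 'd': 2},
}

only_c1 = select_experiments(configs, c=1)
c1_d2 = select_experiments(configs, c=1, d=2)
print(only_c1)
print(c1_d2)
```

Only the selected ids would then be passed to whatever routine reads the heavy values, so unrelated experiments never reach memory.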

There are several other ways to load results. For example, by specifying the config (or a part of it) to load:

In [20]:
results.to_df(config={'c': 1, 'd': 2})

Unnamed: 0,id,c,d,iteration,f,res
0,86cc322725919769,1,2,0,0,2
1,86cc322725919769,1,2,1,1,3
2,86cc322725919769,1,2,2,1,3


We can also filter by `alias` value. Note that each call of `to_df` or `load` reloads the data.

In [21]:
results.to_df(alias={'c': 'two'})

Unnamed: 0,id,c,d,iteration,f,res
0,4472f3a930003004,2,1,0,0,1
1,4472f3a930003004,2,1,1,1,3
2,4472f3a930003004,2,1,2,1,3
3,59a0b17c38718849,2,2,0,0,2
4,59a0b17c38718849,2,2,1,1,4
5,59a0b17c38718849,2,2,2,1,4


You can also use a domain to slice the results:

In [22]:
results.to_df(domain=Domain(c=[Alias(1, 'one'), 2]))

Unnamed: 0,id,c,d,iteration,f,res
0,4472f3a930003004,2,1,0,0,1
1,4472f3a930003004,2,1,1,1,3
2,4472f3a930003004,2,1,2,1,3
3,86cc322725919769,1,2,0,0,2
4,86cc322725919769,1,2,1,1,3
5,86cc322725919769,1,2,2,1,3
6,59a0b17c38718849,2,2,0,0,2
7,59a0b17c38718849,2,2,1,1,4
8,59a0b17c38718849,2,2,2,1,4
9,16158ce673079978,1,1,0,0,1
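
One way to think about slicing by a domain is as matching against every combination of the listed values, with unlisted parameters (here `d`) left unconstrained. The sketch below assumes a domain behaves like a Cartesian product of its option lists, which is consistent with the six experiments produced by three values of `c` and two values of `d` earlier in this tutorial. The `expand` and `select_by_domain` helpers are hypothetical and use plain dicts instead of `Domain` objects.

```python
from itertools import product


def expand(options):
    """Yield every combination of the listed parameter values as a partial config."""
    keys = list(options)
    for values in product(*(options[key] for key in keys)):
        yield dict(zip(keys, values))


def select_by_domain(configs, options):
    """Return ids of experiments that match at least one partial config."""
    partials = list(expand(options))
    return [
        exp_id for exp_id, config in configs.items()
        if any(all(config.get(k) == v for k, v in partial.items()) for partial in partials)
    ]


configs = {
    '16158ce673079978': {'c': 1, 'd': 1},
    '86cc322725919769': {'c': 1, 'd': 2},
    '4472f3a930003004': {'c': 2, 'd': 1},
    '59a0b17c38718849': {'c': 2, 'd': 2},
    '98773a5857513592': {'c': 3, 'd': 1},
    '973188cd38339321': {'c': 3, 'd': 2},
}

selected = select_by_domain(configs, {'c': [1, 2]})
print(selected)
```

The four selected ids are the same experiments that appear in the table above; the two experiments with `c=3` are left out.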


In addition to filtering by parameter values, filtering by iterations, experiment id and variable name is provided:

In [23]:
results.to_df(iterations=[0, 2])

Unnamed: 0,id,c,d,iteration,f,res
0,98773a5857513592,3,1,0,0,1
1,98773a5857513592,3,1,2,1,4
2,4472f3a930003004,2,1,0,0,1
3,4472f3a930003004,2,1,2,1,3
4,86cc322725919769,1,2,0,0,2
5,86cc322725919769,1,2,2,1,3
6,59a0b17c38718849,2,2,0,0,2
7,59a0b17c38718849,2,2,2,1,4
8,973188cd38339321,3,2,0,0,2
9,973188cd38339321,3,2,2,1,5


In [24]:
results.to_df(experiment_id='59a0b17c38718849')

In [25]:
results.to_df(name='f')

Unnamed: 0,id,c,d,iteration,f
0,98773a5857513592,3,1,0,0
1,98773a5857513592,3,1,1,1
2,98773a5857513592,3,1,2,1
3,4472f3a930003004,2,1,0,0
4,4472f3a930003004,2,1,1,1
5,4472f3a930003004,2,1,2,1
6,86cc322725919769,1,2,0,0
7,86cc322725919769,1,2,1,1
8,86cc322725919769,1,2,2,1
9,59a0b17c38718849,2,2,0,0


Of course, you can use several parameters of `to_df` at the same time:

In [26]:
results.to_df(name='f', experiment_id='98773a5857513592', iterations=0)

Unnamed: 0,id,c,d,iteration,f
0,98773a5857513592,3,1,0,0
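
The combined effect of these three filters can be sketched on the nested structure shown earlier in this tutorial. The code below is a hedged pure-Python illustration, not the library's implementation. `filter_results` is a hypothetical helper; it accepts either a single value or a list for each filter, mirroring the calls `iterations=[0, 2]` and `iterations=0` used above.

```python
def _as_set(value):
    """Normalize None, a scalar or a collection into None or a set."""
    if value is None:
        return None
    if isinstance(value, (list, tuple, set)):
        return set(value)
    return {value}


def filter_results(nested, experiment_id=None, name=None, iterations=None):
    """Keep only the requested experiments, variable names and iterations."""
    ids, names, its = _as_set(experiment_id), _as_set(name), _as_set(iterations)
    filtered = {}
    for exp_id, variables in nested.items():
        if ids is not None and exp_id not in ids:
            continue
        kept = {}
        for var, values in variables.items():
            if names is not None and var not in names:
                continue
            kept[var] = {it: v for it, v in values.items() if its is None or it in its}
        filtered[exp_id] = kept
    return filtered


nested = {
    '98773a5857513592': {'f': {0: 0, 1: 1, 2: 1}, 'res': {0: 1, 1: 4, 2: 4}},
    '59a0b17c38718849': {'f': {0: 0, 1: 1, 2: 1}, 'res': {0: 2, 1: 4, 2: 4}},
}

subset = filter_results(nested, name='f', experiment_id='98773a5857513592', iterations=0)
print(subset)
```

The single remaining value, `f = 0` at iteration 0 for experiment `98773a5857513592`, corresponds to the one-row table above.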
