# Research

The `Research` class is intended for running the same pipelines multiple times with different parameters in order to collect statistics.

Let's compare the performance of `VGG7` and `VGG16` on the `MNIST` dataset with different layouts of convolutional blocks. For each combination of layout and model class, we train the model for 1000 iterations and repeat that process 10 times.

In [1]:
import sys
import dill
import tensorflow as tf

import matplotlib.pyplot as plt
%matplotlib inline

sys.path.append("../..")
from dataset import Pipeline, B, C, V
from dataset.opensets import MNIST
from dataset.models.tf import VGG7, VGG16
from dataset.research import Research, Option


In [2]:
BATCH_SIZE=64

Define the model config. Every parameter that we want to vary is declared as ``C('parameter_name')``. In our case it is `'body/block/layout'`.

In [3]:
model_config={
    'session/config': tf.ConfigProto(allow_soft_placement=True),
    'inputs': dict(images={'shape': (28, 28, 1)},
                   labels={'classes': 10, 'transform': 'ohe', 'name': 'targets'}),
    'input_block/inputs': 'images',
    'body/block/layout': C('layout'),
    'output/ops': 'accuracy',
    'device': C('device')  # a technical parameter for TFModel
}
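
To make the role of `C('layout')` more concrete, here is a minimal sketch of how a named placeholder could be substituted once a concrete set of parameters is chosen. `Placeholder` and `resolve` are hypothetical names used only for illustration; this is an assumption about the general idea, not the way `dataset` implements `C`.

```python
class Placeholder:
    """Hypothetical stand-in for C('name'): it remembers only the parameter name."""
    def __init__(self, name):
        self.name = name


def resolve(template, params):
    """Return a copy of `template` with every Placeholder replaced by params[name]."""
    if isinstance(template, Placeholder):
        return params[template.name]
    if isinstance(template, dict):
        return {key: resolve(value, params) for key, value in template.items()}
    return template


# A reduced config: one varied entry and one fixed entry.
template = {
    'body/block/layout': Placeholder('layout'),
    'output/ops': 'accuracy',
}

resolved = resolve(template, {'layout': 'cna'})
print(resolved)
```

The template itself is left untouched, so the same config can be resolved again with another layout.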

Define data loading as a separate pipeline with a lazy run (`lazy=True`), so that nothing is executed yet. This step is optional.

In [4]:
mnist = MNIST()
train_root = mnist.train.p.run(BATCH_SIZE, shuffle=True, n_epochs=None, lazy=True)
test_root = mnist.test.p.run(BATCH_SIZE, shuffle=True, n_epochs=1, lazy=True)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Extracting /tmp/train-images-idx3-ubyte.gz
Extracting /tmp/train-labels-idx1-ubyte.gz
Extracting /tmp/t10k-images-idx3-ubyte.gz
Extracting /tmp/t10k-labels-idx1-ubyte.gz

Now define the main parts of the pipelines, where the model is varied:

In [5]:
train_template = (Pipeline()
            .init_variable('loss', init_on_each_run=list)
            .init_variable('accuracy', init_on_each_run=list)
            .init_model('dynamic', C('model'), 'conv', config=model_config)
            .to_array()
            .train_model('conv', 
                         fetches=['loss', 'output_accuracy'], 
                         feed_dict={'images': B('images'), 'labels': B('labels')},
                         save_to=[V('loss'), V('accuracy')], mode='a')
)

In [6]:
test_template = (Pipeline()
            .init_variable('accuracy', init_on_each_run=list)
            .import_model('conv', C('import_from'))
            .to_array()
            .predict_model('conv', 
                         fetches=['output_accuracy'], 
                         feed_dict={'images': B('images'), 'labels': B('labels')},
                         save_to=[V('accuracy')], mode='a')
)

In [7]:
train_ppl = train_root + train_template
test_ppl = test_root + test_template

Create an instance of the `Research` class and add the train pipeline:

In [8]:
research = Research()
research.pipeline(train_ppl, variables='loss', name='train')

<dataset.research.research.Research at 0x7f6f559575c0>

The ``name`` parameter sets the pipeline name inside ``research``. At each iteration the pipeline is executed with ``.next_batch()`` and all ``variables`` from the pipeline are saved, so those variables should be written with ``mode='w'``.
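
The difference between saving modes can be sketched in plain Python. The `save` helper below is hypothetical and only mimics the assumed semantics (`'w'` overwrites the variable, `'a'` appends to it); it is not the library's code, and the loss values are made up.

```python
def save(variables, name, value, mode='w'):
    """Toy model of saving a fetched value into a pipeline variable."""
    if mode == 'w':
        variables[name] = value                        # keep only the latest value
    elif mode == 'a':
        variables.setdefault(name, []).append(value)   # accumulate all values
    else:
        raise ValueError("unknown mode: %r" % mode)


overwrite, accumulate = {}, {}
for loss in [0.9, 0.5, 0.3]:                           # made-up loss values
    save(overwrite, 'loss', loss, mode='w')
    save(accumulate, 'loss', loss, mode='a')

print(overwrite['loss'])    # 0.3
print(accumulate['loss'])   # [0.9, 0.5, 0.3]
```

Under this assumption, a variable written with `'w'` holds exactly one value per iteration, which is what gets recorded when the variable is read after every `.next_batch()`.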

Parameter combinations are defined with `Option` objects: each `Option` pairs a parameter name with a list of possible values.
Multiply the options to create a grid of all combinations:

In [9]:
grid = Option('layout', ['cna', 'can']) * Option('model', [VGG7, VGG16])
research.grid(grid)

<dataset.research.research.Research at 0x7f6f559575c0>

You can get all variants of config:

In [10]:
configs = list(grid.gen_configs())
configs

[ConfigAlias({'layout': 'cna', 'model': 'VGG7'}),
 ConfigAlias({'layout': 'cna', 'model': 'VGG16'}),
 ConfigAlias({'layout': 'can', 'model': 'VGG7'}),
 ConfigAlias({'layout': 'can', 'model': 'VGG16'})]

Each element is a `ConfigAlias`. It holds a `Config` dict with the actual parameter values and a dict with aliases (readable names) for those values:

In [11]:
configs[0].config(), configs[0].alias()

(Config({'layout': 'cna', 'model': <class 'dataset.models.tf.vgg.VGG7'>}),
 {'layout': 'cna', 'model': 'VGG7'})

Note that in the config `model` is a class, while in the alias it is a `str`.
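
The expansion of a grid into configs and aliases can be approximated with `itertools.product`. This is only a sketch of the idea: `expand` and `alias` are hypothetical helpers, and the two empty classes stand in for the real models.

```python
from itertools import product


class VGG7:    # placeholder class standing in for the real model
    pass


class VGG16:   # placeholder class standing in for the real model
    pass


def alias(value):
    """Readable name for a value: the class name for classes, str() otherwise."""
    return value.__name__ if isinstance(value, type) else str(value)


def expand(options):
    """Yield (config, alias) pairs for every combination of option values."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        config = dict(zip(names, values))
        yield config, {name: alias(value) for name, value in config.items()}


options = {'layout': ['cna', 'can'], 'model': [VGG7, VGG16]}
pairs = list(expand(options))
for config, config_alias in pairs:
    print(config_alias)
```

Each pair keeps the class object in the config, so it can be instantiated, and a string in the alias, so it can be printed or used in a folder name.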

In [12]:
research.pipeline(test_ppl, variables='accuracy', name='test', run=True, execute_for=100, import_from='train')

<dataset.research.research.Research at 0x7f6f559575c0>

This pipeline will be executed with ``.run()`` every 100 iterations because of the parameters ``run=True`` and ``execute_for=100``. The value ``import_from='train'`` is passed to ``C('import_from')``, so the model is imported from the pipeline named ``train``. The pipeline variable ``accuracy`` will be saved after each execution. To record the mean accuracy on the test dataset, you can define a function:


In [13]:
def get_accuracy(iteration, experiment, pipeline):
    import numpy as np
    pipeline = experiment[pipeline].pipeline
    acc = pipeline.get_variable('accuracy')
    return np.mean(acc)

research.function(get_accuracy, returns='accuracy', name='test_accuracy', execute_for=100, pipeline='test')

<dataset.research.research.Research at 0x7f6f559575c0>

The function receives the iteration, the experiment and keyword arguments (in this case `pipeline='test'`).
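
One caveat about averaging per-batch accuracies, as `get_accuracy` does: if the last batch of an epoch is smaller than the others, the plain mean weights it as heavily as a full batch. The sketch below uses made-up numbers and a hypothetical `weighted_accuracy` helper to show the difference; whether the last batch is actually smaller depends on how the pipeline handles the remainder of the dataset.

```python
def weighted_accuracy(accuracies, batch_sizes):
    """Dataset-level accuracy from per-batch accuracies and batch sizes."""
    correct = sum(acc * size for acc, size in zip(accuracies, batch_sizes))
    return correct / sum(batch_sizes)


# Made-up example: two full batches of 64 items and a final batch of 16.
accuracies = [1.0, 1.0, 0.5]
batch_sizes = [64, 64, 16]

plain = sum(accuracies) / len(accuracies)
weighted = weighted_accuracy(accuracies, batch_sizes)
print(plain, weighted)
```

With many batches the two values are close, so the plain mean is usually a reasonable approximation.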

An experiment is an `OrderedDict` of all pipelines and functions that were added to `Research` and are running in the current job. Each key is the name of an `ExecutableUnit` (the class that wraps both functions and pipelines) and each value is the `ExecutableUnit` itself.
Every `ExecutableUnit` has the following attributes:

    function : callable
        is None if ExecutableUnit is a pipeline
    pipeline : Pipeline
        is None if ExecutableUnit is a function
    root_pipeline : Pipeline
        is None if ExecutableUnit is a function or pipeline is not divided into root and branch
    result : dict
        current results of the ExecutableUnit. Keys are names of variables (for pipeline)
        or returns (for function), values are lists of variable values
    path : str
        path to the folder where results will be dumped
    exec_for : int, list of ints or None
    dump_for : int, list of ints or None
    to_run : bool
    variables : list
        variables (for pipeline) or returns (for function)
    on_root : bool
    args : list
    kwargs : dict
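
How an experiment of units could be driven is sketched below. `Unit` and `run_experiment` are hypothetical, heavily simplified stand-ins for `ExecutableUnit` and the research loop; in particular, the rule that `execute_for=k` fires after every k-th iteration (counting from 0) is an assumption.

```python
from collections import OrderedDict


class Unit:
    """Hypothetical, simplified stand-in for an ExecutableUnit wrapping a function."""
    def __init__(self, function, returns, execute_for=None, **kwargs):
        self.function = function
        self.returns = returns
        self.execute_for = execute_for
        self.kwargs = kwargs
        self.result = {returns: []}

    def must_run(self, iteration):
        # Assumption: None means "every iteration"; an int k means
        # "after every k-th iteration", with iterations counted from 0.
        return self.execute_for is None or (iteration + 1) % self.execute_for == 0


def run_experiment(experiment, n_iters):
    """Call every unit that is due, passing the iteration, the experiment and kwargs."""
    for iteration in range(n_iters):
        for unit in experiment.values():
            if unit.must_run(iteration):
                value = unit.function(iteration, experiment, **unit.kwargs)
                unit.result[unit.returns].append(value)


def count_steps(iteration, experiment):
    return iteration


def report(iteration, experiment, source):
    # Reads the results collected so far by another unit of the same experiment.
    return len(experiment[source].result['step'])


experiment = OrderedDict()
experiment['train'] = Unit(count_steps, returns='step')
experiment['test'] = Unit(report, returns='seen', execute_for=100, source='train')

run_experiment(experiment, n_iters=300)
print(experiment['test'].result)   # {'seen': [100, 200, 300]}
```

The second unit looks up the first one by name through the experiment, which mirrors how `get_accuracy` reaches the `test` pipeline via `experiment[pipeline]`.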


Now you can start the research:

In [14]:
# research.run(n_reps=10, n_iters=1000, name='my_research', progress_bar=True)

If you have several GPUs, you can run the research much faster: pass `workers=4` and the list of available devices `gpu=[0, 1, 2, 3]`. In that case 4 jobs run in parallel:

In [15]:
# research.run(n_reps=10, n_iters=1000, workers=4, gpu=[0,1,2,3], name='my_research', progress_bar=True)
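
One way a list of devices could be shared among workers is an even, contiguous split. The helper `split_devices` below is hypothetical and shows only that idea; the way `Research` actually assigns devices may differ.

```python
def split_devices(devices, workers):
    """Split `devices` into `workers` contiguous groups of near-equal size."""
    base, extra = divmod(len(devices), workers)
    groups, start = [], 0
    for index in range(workers):
        size = base + (1 if index < extra else 0)   # first `extra` groups get one more
        groups.append(devices[start:start + size])
        start += size
    return groups


print(split_devices([0, 1, 2, 3], workers=4))   # [[0], [1], [2], [3]]
print(split_devices([0, 1, 2, 3], workers=2))   # [[0, 1], [2, 3]]
```

With four workers every worker gets its own GPU; with two workers each one would have two devices available.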

Moreover, if data loading is heavy, you can do it once and share the result among several pipelines with models. In that case, split each pipeline into a root and a branch and define the research in the following way:

In [16]:
research = (Research()
    .pipeline(root_pipeline=train_root, branch_pipeline=train_template, variables='loss', name='train')
    .pipeline(root_pipeline=test_root, branch_pipeline=test_template, variables='accuracy', name='test', run=True, execute_for=100, import_from='train')
    .grid(grid)
    .function(get_accuracy, returns='accuracy', name='test_accuracy', execute_for=100, pipeline='test')
)

To parallelize across branches, we added `device` to `model_config`.

Now you can define the number of branches in each worker:

In [None]:
research.run(n_reps=10, n_iters=1000, workers=2, branches=2, gpu=[0,1,2,3], name='my_research', progress_bar=True)

Research my_research is starting...


  0%|          | 0/20000 [00:00<?, ?it/s]

Distributor has 20 jobs with 1000 iterations. Totally: 20000
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

 10%|█         | 2000/20000 [01:40<15:04, 19.89it/s]

 10%|█         | 2077/20000 [01:56<16:48, 17.77it/s]

 20%|██        | 4000/20000 [03:03<12:15, 21.76it/s]

 20%|██        | 4099/20000 [03:20<12:58, 20.41it/s]

 30%|███       | 6000/20000 [04:26<10:22, 22.48it/s]

 30%|███       | 6079/20000 [04:43<10:49, 21.42it/s]

 35%|███▌      | 7008/20000 [05:17<09:47, 22.10it/s]
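
The job count reported in the log is consistent with a simple calculation, sketched here under the assumption that every job runs `branches` experiments side by side (one experiment being one config in one repetition).

```python
from math import ceil

n_configs = 2 * 2        # two layouts times two model classes
n_reps = 10
n_iters = 1000
branches = 2

n_experiments = n_configs * n_reps
# Assumption: each job runs `branches` experiments side by side.
n_jobs = ceil(n_experiments / branches)
total_iterations = n_jobs * n_iters

print(n_jobs, total_iterations)   # 20 20000
```

This matches the "20 jobs with 1000 iterations" and the total of 20000 shown by the progress bar.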

In [None]:
All results will be saved into `{research_name}/results/{config_alias}/{repetition_index}/{unitname_iteration}` as a dict pickled with dill wi