# Profiling BatchFlow code

A profile is a set of statistics that describes how often and for how long various parts of the program executed.

This notebook shows how to profile two parts of BatchFlow: pipelines and models.

In [1]:
import sys

sys.path.append("../../..")
from batchflow import B, V, W
from batchflow.opensets import MNIST
from batchflow.models.torch import ResNet18

In [2]:
dataset = MNIST()

To collect information about model training times (on both CPU and GPU), set the `profile` option in the model configuration to `True`:

In [3]:
model_config = {
    'inputs/labels/classes': 10,
    'loss': 'ce',
    'profile': True,  # collect model profiling information
    'device': 'cpu',
    'amp': False
}

In [4]:
pipeline = (dataset.train.p
                .init_variable('loss_history', [])
                .to_array(channels='first', dtype='float32')
                .multiply(multiplier=1/255., preserve_type=False)
                .init_model('resnet', ResNet18, 'dynamic',
                            config=model_config)
                .train_model('resnet',
                             B.images, B.labels,
                             fetches='loss',
                             save_to=V('loss_history', mode='a'))
                )

To gather statistics about how long each action takes, pass the `profile` argument to the `run` call:

In [5]:
BATCH_SIZE = 64
N_ITERS = 50

pipeline.run(BATCH_SIZE, n_iters=N_ITERS, bar=True, profile='d')

100%|███████████████████████████████████████████| 50/50 [00:11<00:00,  4.27it/s]


<batchflow.batchflow.pipeline.Pipeline at 0x7f3bced2c2e8>

# Pipeline profiling

First of all, every `Pipeline` instance has an `elapsed_time` attribute: it stores the total time spent running the pipeline (even if it was run multiple times):

In [6]:
pipeline.elapsed_time

AttributeError: elapsed_time not found in class Pipeline

Note that the `elapsed_time` attribute is created whether or not `profile` is set to `True`.

After running with profiling enabled, the pipeline has a `profile_info` attribute: this `DataFrame` holds the collected information:

In [7]:
pipeline.profile_info.head()

| action | id | iter | total_time | pipeline_time | ncalls | tottime | cumtime | start_time | batch_id | eval_time |
|---|---|---|---|---|---|---|---|---|---|---|
| to_array #0 | `<method 'join' of 'bytes' objects>::/opt/conda/lib/python3.6/site-packages/PIL/Image.py::710::tobytes` | 1 | 0.009355 | 0.009332 | 64 | 9e-06 | 9e-06 | 1626968000.0 | 139894849693624 | 2.5e-05 |
| to_array #0 | `<built-in method sys.getrecursionlimit>::/opt/conda/lib/python3.6/inspect.py::485::unwrap` | 1 | 0.009355 | 0.009332 | 1 | 1e-06 | 1e-06 | 1626968000.0 | 139894849693624 | 2.5e-05 |
| to_array #0 | `<built-in method time.time>::../../../batchflow/batchflow/pipeline.py::795::_exec_one_action` | 1 | 0.009355 | 0.009332 | 2 | 1e-06 | 1e-06 | 1626968000.0 | 139894849693624 | 2.5e-05 |
| to_array #0 | `<method 'match' of '_sre.SRE_Pattern' objects>::/opt/conda/lib/python3.6/inspect.py::760::findsource` | 1 | 0.009355 | 0.009332 | 1 | 4e-06 | 4e-06 | 1626968000.0 | 139894849693624 | 2.5e-05 |
| to_array #0 | `<method 'match' of '_sre.SRE_Pattern' objects>::/opt/conda/lib/python3.6/tokenize.py::492::_tokenize` | 1 | 0.009355 | 0.009332 | 66 | 0.000129 | 0.000129 | 1626968000.0 | 139894849693624 | 2.5e-05 |


Note that the table contains detailed information about the exact methods called inside each action. That is a lot of data, and it can give us a precise understanding of which parts of the code are the bottlenecks.
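Each `id` in the output above is a single `::`-separated string, which is awkward to filter on. The sketch below splits a couple of such strings (copied from the table) into separate columns with pandas. The toy `DataFrame` and the new column labels are assumptions made for illustration; with a real run one would start from `pipeline.profile_info` instead.

```python
import pandas as pd

# Two ids copied from the output above; the frame itself is a toy stand-in
# for the real profiling table.
ids = [
    "<method 'join' of 'bytes' objects>::/opt/conda/lib/python3.6/site-packages/PIL/Image.py::710::tobytes",
    "<built-in method time.time>::../../../batchflow/batchflow/pipeline.py::795::_exec_one_action",
]
frame = pd.DataFrame({"id": ids, "tottime": [9e-06, 1e-06]})

# Split from the right so that only the last three separators are used.
# The column labels are chosen here for illustration only.
parts = frame["id"].str.rsplit("::", n=3, expand=True)
parts.columns = ["method_name", "file_name", "line_number", "callee"]
parts["line_number"] = parts["line_number"].astype(int)

result = pd.concat([parts, frame[["tottime"]]], axis=1)
print(result[["method_name", "line_number", "callee", "tottime"]])
```

Once the fields are separate columns, selecting all rows that come from one file or one function becomes an ordinary boolean filter.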

Columns of the `profile_info`:
- `action`, `iter`, `batch_id` and `start_time` are self-explanatory
- `id` identifies the exact method in great detail: it is a concatenation of `method_name`, `file_name`, `line_number` and `callee`
- `total_time` is the time taken by an action
- `pipeline_time` is `total_time` plus time of processing the profiling table at each iteration
- `tottime` is the time taken by a method inside an action
- `cumtime` is the time taken by a method and all of the methods that are called inside this method
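The difference between `tottime` and `cumtime` is easiest to see on a tiny example. The sketch below uses Python's standard `cProfile` and `pstats` modules, which report statistics under the same names; the functions `inner` and `outer` are made up for illustration, and nothing here is specific to BatchFlow.

```python
import cProfile
import pstats


def inner():
    # Does the actual work, so almost all of the time is its own (tottime)
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def outer():
    # Does almost nothing itself; its time is spent inside inner()
    return inner()


profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (file, line, function name) to
# (primitive calls, total calls, tottime, cumtime, callers)
timings = {key[2]: value for key, value in stats.stats.items()}
_, ncalls, outer_tottime, outer_cumtime, _ = timings["outer"]
inner_tottime = timings["inner"][2]

print(f"outer: ncalls={ncalls}, tottime={outer_tottime:.6f}, cumtime={outer_cumtime:.6f}")
print(f"inner: tottime={inner_tottime:.6f}")
```

Here `outer` should show a small `tottime` but a `cumtime` that includes everything spent in `inner`, which is why sorting by `tottime` points at the code that actually does the work.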


More often than not, though, we don't need such granularity. The pipeline method `show_profile_info` makes some handy aggregations:

**Note:** by default, results are sorted by `total_time` or `tottime`, depending on the level of detail.

In [8]:
# timings for each action
pipeline.show_profile_info(per_iter=False, detailed=False)

| action | total_time (sum) | total_time (mean) | total_time (max) | pipeline_time (sum) | pipeline_time (mean) | pipeline_time (max) |
|---|---|---|---|---|---|---|
| train_model #2 | 8.140517 | 0.16281 | 0.70974 | 8.136595 | 0.162732 | 0.707148 |
| to_array #0 | 0.463756 | 0.009275 | 0.016101 | 0.462688 | 0.009254 | 0.01608 |
| multiply #1 | 0.337302 | 0.006746 | 0.009966 | 0.336401 | 0.006728 | 0.009932 |
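An aggregation of this shape can be reproduced with a plain pandas `groupby`. The following is a minimal sketch on made-up per-iteration timings, not the library's actual implementation of `show_profile_info`; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical per-iteration timings: one row per (iteration, action)
toy = pd.DataFrame({
    "iter": [1, 1, 2, 2],
    "action": ["to_array #0", "train_model #2", "to_array #0", "train_model #2"],
    "total_time": [0.009, 0.710, 0.007, 0.139],
})

# Collapse iterations: sum, mean and max of the time per action,
# with the most expensive action first
summary = (
    toy.groupby("action")["total_time"]
       .agg(["sum", "mean", "max"])
       .sort_values("sum", ascending=False)
)
print(summary)
```

Comparing `mean` with `max` in such a summary is a quick way to spot one-off slow iterations, such as a first training step that is much slower than the rest.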


In [9]:
# for each action, show the 2 most frequently called methods, sorted by maximum `ncalls`
pipeline.show_profile_info(per_iter=False, detailed=True, sortby=('ncalls', 'max'), limit=2)

| action | id | ncalls (sum) | ncalls (mean) | ncalls (max) | tottime (sum) | tottime (mean) | tottime (max) | cumtime (sum) | cumtime (mean) | cumtime (max) |
|---|---|---|---|---|---|---|---|---|---|---|
| multiply #1 | `<built-in method builtins.isinstance>::../../../batchflow/batchflow/decorators.py::303::_make_args` | 9600 | 192.0 | 192 | 0.001557 | 3.1e-05 | 5.3e-05 | 0.001557 | 3.1e-05 | 5.3e-05 |
| multiply #1 | `<built-in method numpy.core._multiarray_umath.implement_array_function>::<__array_function__ internals>::2::concatenate` | 6500 | 130.0 | 130 | 0.015555 | 0.000311 | 0.000448 | 0.015555 | 0.000311 | 0.000448 |
| to_array #0 | `<built-in method builtins.isinstance>::../../../batchflow/batchflow/decorators.py::303::_make_args` | 9600 | 192.0 | 192 | 0.00197 | 3.9e-05 | 9.2e-05 | 0.00197 | 3.9e-05 | 9.2e-05 |
| to_array #0 | `<built-in method numpy.core._multiarray_umath.implement_array_function>::<__array_function__ internals>::2::concatenate` | 6500 | 130.0 | 130 | 0.02043 | 0.000409 | 0.00088 | 0.02043 | 0.000409 | 0.00088 |
| train_model #2 | `<method 'get' of 'dict' objects>::/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py::1133::__setattr__` | 45216 | 904.32 | 10083 | 0.00914 | 0.000183 | 0.001274 | 0.00914 | 0.000183 | 0.001274 |
| train_model #2 | `<built-in method builtins.isinstance>::/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py::1133::__setattr__` | 31335 | 626.7 | 6933 | 0.00668 | 0.000134 | 0.001382 | 0.00668 | 0.000134 | 0.001382 |


In [10]:
# timings for each action at each iteration
pipeline.show_profile_info(per_iter=True, detailed=False)

| iter | action | total_time | pipeline_time | batch_id |
|---|---|---|---|---|
| 1 | train_model #2 | 0.709740 | 0.707148 | 1.398948e+14 |
| 1 | to_array #0 | 0.009355 | 0.009332 | 1.398948e+14 |
| 1 | multiply #1 | 0.007039 | 0.007026 | 1.398948e+14 |
| 2 | train_model #2 | 0.139050 | 0.139026 | 1.398948e+14 |
| 2 | to_array #0 | 0.007055 | 0.007037 | 1.398948e+14 |
| ... | ... | ... | ... | ... |
| 49 | to_array #0 | 0.008777 | 0.008757 | 1.398948e+14 |
| 49 | multiply #1 | 0.006446 | 0.006428 | 1.398948e+14 |
| 50 | train_model #2 | 0.147496 | 0.147471 | 1.398948e+14 |
| 50 | to_array #0 | 0.009047 | 0.009030 | 1.398948e+14 |


In [11]:
# for each iteration and each action, show the 3 slowest methods, sorted by `tottime`
pipeline.show_profile_info(per_iter=True, detailed=True, sortby='tottime', limit=3)

| iter | action | id | ncalls | tottime | cumtime |
|---|---|---|---|---|---|
| 1 | multiply #1 | `<method 'spawn' of 'numpy.random.bit_generator.SeedSequence' objects>::../../../batchflow/batchflow/utils_random.py::83::spawn_seed_sequence` | 64 | 0.001124 | 0.001447 |
| 1 | multiply #1 | `inner::../../../batchflow/batchflow/utils_random.py::7::make_rng` | 64 | 0.000866 | 0.001536 |
| 1 | multiply #1 | `multiply::../../../batchflow/batchflow/batch.py::587::_apply_once` | 64 | 0.000419 | 0.000608 |
| 1 | to_array #0 | `<method 'spawn' of 'numpy.random.bit_generator.SeedSequence' objects>::../../../batchflow/batchflow/utils_random.py::83::spawn_seed_sequence` | 64 | 0.001297 | 0.001847 |
| 1 | to_array #0 | `inner::../../../batchflow/batchflow/utils_random.py::7::make_rng` | 64 | 0.000951 | 0.001727 |
| ... | ... | ... | ... | ... | ... |
| 50 | to_array #0 | `inner::../../../batchflow/batchflow/utils_random.py::7::make_rng` | 64 | 0.000962 | 0.001654 |
| 50 | to_array #0 | `<method 'astype' of 'numpy.ndarray' objects>::../../../batchflow/batchflow/batch.py::954::to_array` | 64 | 0.000642 | 0.000642 |
| 50 | train_model #2 | `<method 'run_backward' of 'torch._C._EngineBase' objects>::/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py::68::backward` | 1 | 0.052696 | 0.052696 |
| 50 | train_model #2 | `adam::/opt/conda/lib/python3.6/site-packages/torch/optim/adam.py::55::step` | 1 | 0.024636 | 0.037825 |


# Model profiling

In [None]:
model = pipeline.m('resnet')

There is an `info` property that, unsurprisingly, shows a lot of interesting details about the model itself or the training process:

In [None]:
model.info

As with the pipeline, there is a `profile_info` attribute, as well as a `show_profile_info` method. The information collected depends on the type of device used (`CPU` or `GPU`).

In [None]:
# one row for every operation inside model; limit at 5 rows
model.show_profile_info(per_iter=False, limit=5)

In [None]:
# for each iteration show 3 of the slowest operations
model.show_profile_info(per_iter=True, limit=3)