# Profiling BatchFlow code

A profile is a set of statistics that describes how often and for how long various parts of the program executed.

This notebook shows how to profile two parts of BatchFlow: pipelines and models.

In [1]:
import sys

sys.path.append("../../..")
from batchflow import B, V, W
from batchflow.opensets import MNIST
from batchflow.models.torch import ResNet18

In [2]:
dataset = MNIST()

To collect information about model training times (on both CPU and GPU), set the `profile` option in the model configuration to `True`:

In [3]:
model_config = {
    'inputs/labels/classes': 10,
    'loss': 'ce',
    'profile': True,
}

In [4]:
pipeline = (dataset.train.p
                .init_variable('loss_history', [])
                .to_array(channels='first', dtype='float32')
                .multiply(multiplier=1/255., preserve_type=False)
                .init_model('resnet', ResNet18, 'dynamic',
                            config=model_config)
                .train_model('resnet',
                             B.images, B.labels,
                             fetches='loss',
                             save_to=V('loss_history', mode='a'))
                )

To gather statistics about how long each action takes, set `profile` to `True` in the `run` call:

In [5]:
BATCH_SIZE = 64
N_ITERS = 50

pipeline.run(BATCH_SIZE, n_iters=N_ITERS, bar=True, profile=True,
             bar_desc=W(V('loss_history')[-1].format('Loss is {:7.7}')))

Loss is 0.1426592: 100%|██████████| 50/50 [01:22<00:00,  1.29s/it] 


<batchflow.batchflow.pipeline.Pipeline at 0x7f52f4521048>

# Pipeline profiling

First of all, every instance of `Pipeline` has an `elapsed_time` attribute: it stores the total time spent running the pipeline (even if it was used multiple times):

In [6]:
pipeline.elapsed_time

82.73438310623169

Note that the `elapsed_time` attribute is created whether or not `profile` is set to `True`.
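
The distinction between the always-available total and the opt-in per-action timings can be illustrated with a small stand-in. This is only a sketch and not BatchFlow's implementation: `TinyPipeline`, its `profile_rows` list and the two toy actions are hypothetical names, and the sketch assumes that the total accumulates across runs, which is how the description of `elapsed_time` above reads.

```python
import time


class TinyPipeline:
    """Hypothetical stand-in for a pipeline; not BatchFlow's implementation."""

    def __init__(self, actions):
        self.actions = actions      # list of (name, callable) pairs
        self.elapsed_time = 0.0     # accumulated over every call to `run`
        self.profile_rows = []      # filled only when profile=True

    def run(self, data, n_iters, profile=False):
        run_start = time.perf_counter()
        for iteration in range(1, n_iters + 1):
            batch = data
            for name, action in self.actions:
                action_start = time.perf_counter()
                batch = action(batch)
                if profile:
                    total_time = time.perf_counter() - action_start
                    self.profile_rows.append((iteration, name, total_time))
        # The total is tracked regardless of the `profile` flag
        self.elapsed_time += time.perf_counter() - run_start
        return self


tiny = TinyPipeline([
    ('scale', lambda batch: [x / 255 for x in batch]),
    ('shift', lambda batch: [x - 0.5 for x in batch]),
])

tiny.run(list(range(1000)), n_iters=3)                # no profiling
time_after_first_run = tiny.elapsed_time
rows_after_first_run = len(tiny.profile_rows)

tiny.run(list(range(1000)), n_iters=3, profile=True)  # with profiling
print(rows_after_first_run, len(tiny.profile_rows))
```

After the first run only the total exists; the second run adds one row per action per iteration and increases the accumulated total.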

After a run with `profile=True`, the pipeline has a `profile_info` attribute: this `DataFrame` holds the collected information:

In [7]:
pipeline.profile_info.head()

action,id,iter,total_time,pipeline_time,ncalls,tottime,cumtime,batch_id,start_time
to_array #0,<built-in method _abc._abc_instancecheck>::/home/tsimfer/anaconda3/lib/python3.7/abc.py::137::__instancecheck__,1,0.081367,0.0711,64,0.000166,0.001459,139994263066552,1582270000.0
to_array #0,<built-in method _abc._abc_subclasscheck>::/home/tsimfer/anaconda3/lib/python3.7/abc.py::141::__subclasscheck__,1,0.081367,0.0711,57,0.001096,0.001288,139994263066552,1582270000.0
to_array #0,<built-in method _operator.index>::/home/tsimfer/anaconda3/lib/python3.7/site-packages/numpy/core/numeric.py::1557::normalize_axis_tuple,1,0.081367,0.0711,128,3.7e-05,3.7e-05,139994263066552,1582270000.0
to_array #0,<built-in method posix.sched_getaffinity>::../../../batchflow/batchflow/decorators.py::18::_workers_count,1,0.081367,0.0711,1,4.8e-05,4.8e-05,139994263066552,1582270000.0
to_array #0,<built-in method builtins.any>::../../../batchflow/batchflow/decorators.py::86::any_action_failed,1,0.081367,0.0711,1,3e-06,2.6e-05,139994263066552,1582270000.0


Note that the table contains detailed information about the exact methods called inside each action. This is a lot of data, and it gives a precise picture of which parts of the code are the bottlenecks.

Columns of `profile_info`:
- `action`, `iter`, `batch_id` and `start_time` are self-explanatory
- `id` identifies the exact method in detail: it is a concatenation of `method_name`, `file_name`, `line_number` and `callee`
- `total_time` is the time taken by an action
- `pipeline_time` is `total_time` plus the time spent processing the profiling table at each iteration (in the outputs shown in this notebook, however, `pipeline_time` is slightly smaller than `total_time`)
- `ncalls` is the number of times a method was called inside the action
- `tottime` is the time taken by a method inside the action, excluding the methods it calls
- `cumtime` is the time taken by a method together with all of the methods called inside it
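
The difference between `tottime` and `cumtime` is easiest to see on a toy call chain. The sketch below uses the standard-library `cProfile` and `pstats` modules directly. It does not use BatchFlow, and the functions `inner` and `outer` are made up for illustration. The `stats` attribute of `pstats.Stats` is read directly here, as a convenient way to get at the raw numbers.

```python
import cProfile
import pstats


def inner(n):
    # All the arithmetic happens here
    total = 0
    for i in range(n):
        total += i * i
    return total


def outer():
    # Does almost nothing itself; delegates the work to `inner`
    results = []
    for _ in range(5):
        results.append(inner(50_000))
    return results


profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

# pstats keeps one entry per function, keyed by (file_name, line_number, name);
# each value is (primitive calls, ncalls, tottime, cumtime, callers)
raw_stats = pstats.Stats(profiler).stats
by_name = {key[2]: value for key, value in raw_stats.items()}

_, inner_ncalls, inner_tottime, inner_cumtime, _ = by_name['inner']
_, outer_ncalls, outer_tottime, outer_cumtime, _ = by_name['outer']

print(f"inner: ncalls={inner_ncalls} tottime={inner_tottime:.4f} cumtime={inner_cumtime:.4f}")
print(f"outer: ncalls={outer_ncalls} tottime={outer_tottime:.4f} cumtime={outer_cumtime:.4f}")
```

Here `outer` has a small `tottime` but a large `cumtime`, because nearly all of its time is spent inside `inner`. When hunting for bottlenecks, a large `tottime` points at the method doing the work, while a large `cumtime` points at the call path leading to it.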


More often than not, though, we don't need such granularity. The pipeline method `show_profile_info` provides some handy aggregations:

**Note:** by default, results are sorted by `total_time` or `tottime`, depending on the level of detail.

In [8]:
# timings for each action
pipeline.show_profile_info(per_iter=False, detailed=False)

,total_time,total_time,total_time,pipeline_time,pipeline_time,pipeline_time
action,sum,mean,max,sum,mean,max
train_model #2,75.821759,1.516435,3.217391,75.503571,1.510071,3.205411
to_array #0,2.835696,0.056714,0.106486,2.546398,0.050928,0.097141
multiply #1,1.69274,0.033855,0.06945,1.509747,0.030195,0.060036


In [9]:
# for each action show the 2 most frequently called methods, ranked by maximum `ncalls`
pipeline.show_profile_info(per_iter=False, detailed=True, sortby=('ncalls', 'max'), limit=2)

,,ncalls,ncalls,ncalls,tottime,tottime,tottime,cumtime,cumtime,cumtime
action,id,sum,mean,max,sum,mean,max,sum,mean,max
multiply #1,<built-in method builtins.isinstance>::../../../batchflow/batchflow/components.py::105::find_in_index,409600,8192.0,8192,0.050672,0.001013,0.001615,0.050672,0.001013,0.001615
multiply #1,<built-in method numpy.where>::../../../batchflow/batchflow/components.py::105::find_in_index,204800,4096.0,4096,0.111507,0.00223,0.003821,0.111507,0.00223,0.003821
to_array #0,<built-in method builtins.isinstance>::../../../batchflow/batchflow/components.py::105::find_in_index,409600,8192.0,8192,0.069341,0.001387,0.002048,0.069341,0.001387,0.002048
to_array #0,<built-in method numpy.where>::../../../batchflow/batchflow/components.py::105::find_in_index,204800,4096.0,4096,0.167002,0.00334,0.009804,0.167002,0.00334,0.009804
train_model #2,<method 'append' of 'list' objects>::/home/tsimfer/anaconda3/lib/python3.7/site-packages/torch/autograd/profiler.py::641::parse_cpu_trace,1082936,21658.72,21686,0.151943,0.003039,0.005326,0.151943,0.003039,0.005326
train_model #2,<lambda>::~::0::<method 'sort' of 'list' objects>,541468,10829.36,10843,0.091438,0.001829,0.004184,0.091438,0.001829,0.004184


In [10]:
# timings for each action for each iter
pipeline.show_profile_info(per_iter=True, detailed=False)

iter,action,total_time,pipeline_time,batch_id
1,train_model #2,3.217391,3.205411,139994263066555
1,to_array #0,0.081367,0.071100,139994263066552
1,multiply #1,0.024790,0.021719,139994263066552
2,train_model #2,1.652845,1.647868,139994266422688
2,to_array #0,0.043760,0.039731,139994266422688
2,multiply #1,0.029897,0.024791,139994266422688
3,train_model #2,1.298541,1.293026,139994121185768
3,to_array #0,0.045227,0.040213,139994121185768
3,multiply #1,0.033502,0.030644,139994121185768
4,train_model #2,1.454606,1.450196,139994266422688
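
The per-iteration rows above relate to the per-action summary in a simple way: group by action, then take the sum, mean and maximum. The following is a minimal standard-library sketch of that aggregation, not the code `show_profile_info` actually runs. The `total_time` values are copied from the first three iterations shown above.

```python
from collections import defaultdict

# (iter, action, total_time) rows copied from the per-iteration output above
rows = [
    (1, 'train_model #2', 3.217391), (1, 'to_array #0', 0.081367), (1, 'multiply #1', 0.024790),
    (2, 'train_model #2', 1.652845), (2, 'to_array #0', 0.043760), (2, 'multiply #1', 0.029897),
    (3, 'train_model #2', 1.298541), (3, 'to_array #0', 0.045227), (3, 'multiply #1', 0.033502),
]

# Collect the timings of each action across iterations
times_by_action = defaultdict(list)
for _, action, total_time in rows:
    times_by_action[action].append(total_time)

summary = {
    action: {'sum': sum(times), 'mean': sum(times) / len(times), 'max': max(times)}
    for action, times in times_by_action.items()
}

# Sort actions by summed time, slowest first
ordered = sorted(summary.items(), key=lambda item: item[1]['sum'], reverse=True)
for action, stats in ordered:
    print(f"{action:<16} sum={stats['sum']:.6f} mean={stats['mean']:.6f} max={stats['max']:.6f}")
```

Only three of the fifty iterations are included, so the sums and means differ from the full per-action table. The `train_model #2` maximum of 3.217391 does match it, since that value comes from the first iteration.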


In [11]:
# for each iter and each action show the 3 slowest methods, based on `tottime`
pipeline.show_profile_info(per_iter=True, detailed=True, sortby='tottime', limit=3)

iter,action,id,ncalls,tottime,cumtime
1,multiply #1,find_in_index::../../../batchflow/batchflow/components.py::120::<listcomp>,4096,0.010762,0.013369
1,multiply #1,<listcomp>::../../../batchflow/batchflow/components.py::113::get_pos,64,0.002169,0.015538
1,multiply #1,_get::../../../batchflow/batchflow/components.py::145::get,64,0.002132,0.017950
1,to_array #0,find_in_index::../../../batchflow/batchflow/components.py::120::<listcomp>,4096,0.031077,0.037401
1,to_array #0,<listcomp>::../../../batchflow/batchflow/components.py::113::get_pos,64,0.006795,0.044196
1,to_array #0,<built-in method numpy.where>::../../../batchflow/batchflow/components.py::105::find_in_index,4096,0.004464,0.004464
1,train_model #2,<method 'run_backward' of 'torch._C._EngineBase' objects>::/home/tsimfer/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py::44::backward,1,1.213085,1.213085
1,train_model #2,<built-in method conv2d>::/home/tsimfer/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py::334::conv2d_forward,143,1.080935,1.080935
1,train_model #2,<built-in method batch_norm>::/home/tsimfer/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py::1643::batch_norm,143,0.241610,0.241610
2,multiply #1,find_in_index::../../../batchflow/batchflow/components.py::120::<listcomp>,4096,0.011132,0.013721
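
The choice of sort column matters: ranking by `ncalls` surfaces the most frequently called methods, while ranking by `tottime` surfaces the slowest ones. Below is a rough sketch of a "top N rows per action" selection, using the iteration 1 rows from the output above with method ids shortened to their first component. The helper `top_methods` and the `COLUMNS` mapping are hypothetical and only mimic what the `sortby` and `limit` arguments appear to do.

```python
from collections import defaultdict

# (action, method, ncalls, tottime) for iteration 1, taken from the output above;
# method ids are shortened to their first component for readability
rows = [
    ('multiply #1', '<listcomp>', 64, 0.002169),
    ('multiply #1', 'find_in_index', 4096, 0.010762),
    ('multiply #1', '_get', 64, 0.002132),
    ('to_array #0', '<built-in method numpy.where>', 4096, 0.004464),
    ('to_array #0', 'find_in_index', 4096, 0.031077),
    ('to_array #0', '<listcomp>', 64, 0.006795),
    ('train_model #2', '<built-in method batch_norm>', 143, 0.241610),
    ('train_model #2', '<built-in method conv2d>', 143, 1.080935),
    ('train_model #2', "<method 'run_backward' of 'torch._C._EngineBase' objects>", 1, 1.213085),
]

COLUMNS = {'ncalls': 2, 'tottime': 3}  # position of each sortable column in a row


def top_methods(rows, sortby, limit):
    """Keep the `limit` rows with the largest `sortby` value within each action."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)
    column = COLUMNS[sortby]
    return {
        action: sorted(group, key=lambda row: row[column], reverse=True)[:limit]
        for action, group in grouped.items()
    }


slowest = top_methods(rows, sortby='tottime', limit=1)
most_called = top_methods(rows, sortby='ncalls', limit=1)

for action in slowest:
    print(action, '| slowest:', slowest[action][0][1], '| most called:', most_called[action][0][1])
```

For `train_model #2` the two rankings disagree: `run_backward` is called once but takes the most time, whereas the most frequently called methods are much cheaper.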


# Model profiling

In [12]:
model = pipeline.m('resnet')

There is an `info` property that, unsurprisingly, shows a lot of interesting details about the model itself and the training process:

In [13]:
model.info


##### Config:
{'benchmark': True,
 'body': {'encoder': {'blocks': {'base': <class 'batchflow.models.torch.blocks.ResBlock'>,
                                 'bottleneck': False,
                                 'downsample': [False, True, True, True],
                                 'filters': [64, 128, 256, 512],
                                 'layout': 'cnacn',
                                 'n_reps': [2, 2, 2, 2],
                                 'se': False},
                      'downsample': {'layout': 'p',
                                     'pool_size': 2,
                                     'pool_strides': 2},
                      'num_stages': 4,
                      'order': ['skip', 'block']}},
 'common': {'data_format': 'channels_first'},
 'decay': None,
 'device': None,
 'head': {'classes': 10,
          'dropout_rate': 0.4,
          'filters': 10,
          'layout': 'Vdf',
          'target_shape': (64,),
          'units': 10},
 'initial_block': {'filters'

As with the pipeline, there is a `profile_info` attribute, as well as a `show_profile_info` method. The timings are reported per device type (`CPU` or `GPU`); the run above used the CPU, so the columns below carry the `CPU_` prefix.

In [14]:
# one row for every operation inside model; limit at 5 rows
model.show_profile_info(per_iter=False, limit=5)

,ncalls,ncalls,ncalls,CPU_tottime,CPU_tottime,CPU_tottime,CPU_cumtime,CPU_cumtime,CPU_cumtime,CPU_tottime_avg,CPU_tottime_avg,CPU_tottime_avg
name,sum,mean,max,sum,mean,max,sum,mean,max,sum,mean,max
mkldnn_convolution_backward,1000,20.0,20,38.598603,0.771972,1.102136,38.598603,0.771972,1.102136,1.92993,0.038599,0.055107
mkldnn_convolution,1000,20.0,20,6.414861,0.128297,0.219037,6.414861,0.128297,0.219037,0.320743,0.006415,0.010952
sqrt,3038,62.0,62,3.609079,0.073655,0.117845,3.609079,0.073655,0.117845,0.058211,0.001188,0.001901
native_batch_norm_backward,1000,20.0,20,2.508889,0.050178,0.078185,2.508889,0.050178,0.078185,0.125444,0.002509,0.003909
add_,10114,202.28,206,1.842055,0.036841,0.05612,1.842055,0.036841,0.05612,0.008953,0.000179,0.000272


In [15]:
# for each iteration show 3 of the slowest operations
model.show_profile_info(per_iter=True, limit=3)

iter,name,ncalls,CPU_tottime,CPU_cumtime,CPU_tottime_avg
0,mkldnn_convolution_backward,20,1.102136,1.102136,0.055107
0,mkldnn_convolution,20,0.116687,0.116687,0.005834
0,native_batch_norm_backward,20,0.073042,0.073042,0.003652
1,mkldnn_convolution_backward,20,0.910873,0.910873,0.045544
1,mkldnn_convolution,20,0.109726,0.109726,0.005486
1,sqrt,62,0.061738,0.061738,0.000996
2,mkldnn_convolution_backward,20,0.674379,0.674379,0.033719
2,mkldnn_convolution,20,0.142355,0.142355,0.007118
2,sqrt,62,0.070446,0.070446,0.001136
3,mkldnn_convolution_backward,20,0.741461,0.741461,0.037073
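
The column names suggest that `CPU_tottime_avg` is the per-call average, that is `CPU_tottime` divided by `ncalls`. The notebook does not state this explicitly, so the snippet below is just a quick arithmetic check of that assumption against the first six rows of the output above.

```python
# (iter, name, ncalls, CPU_tottime, CPU_tottime_avg) copied from the output above
rows = [
    (0, 'mkldnn_convolution_backward', 20, 1.102136, 0.055107),
    (0, 'mkldnn_convolution', 20, 0.116687, 0.005834),
    (0, 'native_batch_norm_backward', 20, 0.073042, 0.003652),
    (1, 'mkldnn_convolution_backward', 20, 0.910873, 0.045544),
    (1, 'mkldnn_convolution', 20, 0.109726, 0.005486),
    (1, 'sqrt', 62, 0.061738, 0.000996),
]

TOLERANCE = 1e-6  # the table is printed with six decimal places

mismatches = []
for iteration, name, ncalls, cpu_tottime, reported_avg in rows:
    computed_avg = cpu_tottime / ncalls
    if abs(computed_avg - reported_avg) > TOLERANCE:
        mismatches.append((iteration, name, computed_avg, reported_avg))

print(mismatches or "every reported average equals CPU_tottime / ncalls")
```

The check passes for these rows, so an operation with a modest per-call average can still dominate the total if it is called often enough.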
