# [cknowledge.org/ai](http://cknowledge.org/ai): Crowdsourcing benchmarking and optimisation of AI

A suite of open-source tools for [collecting knowledge on optimising AI](http://bit.ly/hipeac49-ckdl):
* [Android app](https://play.google.com/store/apps/details?id=openscience.crowdsource.video.experiments&hl=en_GB)
* [Desktop app](https://github.com/dividiti/ck-crowdsource-dnn-optimization)
* [CK-Caffe](https://github.com/dividiti/ck-caffe)
* [CK-Caffe2](https://github.com/ctuning/ck-caffe2)
* [CK-TensorFlow](https://github.com/ctuning/ck-tensorflow)
* [CK-TensorRT](https://github.com/dividiti/ck-tensorrt)
* [CK-KaNN](https://github.com/dividiti/ck-kann)
* etc.

# [PUBLIC] Benchmarking Caffe on Firefly-RK3399

In [None]:
platform_id = 'Rockchip RK3399 Firefly Board (Linux Opensource)\x00'
platform_name = 'Firefly-RK3399'
platform_tag = 'firefly-rk3399'
platform_os = 'linux'
repo_uoa = 'ck-caffe-firefly-rk3399'

In [None]:
pretty_print_models = {
    'bvlc-alexnet':'AlexNet',
    'deepscale-squeezenet-1.1':'SqueezeNet',
    'bvlc-googlenet':'GoogleNet'
}

speedup_sort_models = [
    'AlexNet',
    'SqueezeNet',
    'GoogleNet'
]

In [None]:
if platform_os == 'linux':
    # All OpenCL BLAS libs and libDNN with OpenCL BLAS libs.
    opencl_libs = [ 
        # OpenCL BLAS libs.
        'opencl-clblas',
        'opencl-clblast',
        'opencl-clblast-tune',
        'opencl-viennacl',
        # libDNN with OpenCL BLAS libs.
        'opencl-libdnn-clblas',
        'opencl-libdnn-clblast',
        'opencl-libdnn-clblast-tune',
        'opencl-libdnn-viennacl'
    ]
    # libDNN with CLBlast-tune is typically the fastest, so the others can be omitted.
    libdnn_libs_to_drop = [
        'opencl-libdnn-clblas',
        'opencl-libdnn-clblast',
        'opencl-libdnn-viennacl'
    ]
    # Used by plot_max_num_images_per_second().
    pretty_print_libs = {
        'opencl-libdnn-clblast-tune': '[GPU] libDNN',
        'opencl-clblas':              '[GPU] clBLAS',
        'opencl-viennacl':            '[GPU] ViennaCL',
        'opencl-clblast':             '[GPU] CLBlast',
        'opencl-clblast-tune':        '[GPU] CLBlast (dv/dt)',
        'cpu':                        '[CPU] OpenBLAS'
    }
    speedup_sort_libs = [
        '[GPU] libDNN',
        '[GPU] clBLAS',
        '[GPU] ViennaCL',
        '[GPU] CLBlast',
        '[GPU] CLBlast (dv/dt)',
        '[CPU] OpenBLAS'
    ]    
else: # 'android'
    # As above but no clBLAS-based libs.
    opencl_libs = [
        # OpenCL BLAS libs.
        'opencl-clblast',
        'opencl-clblast-tune',
        'opencl-viennacl',
        # libDNN with OpenCL BLAS libs.
        'opencl-libdnn-clblast',
        'opencl-libdnn-clblast-tune',
        'opencl-libdnn-viennacl'
    ]
    # As above but no clBLAS-based libDNN.
    libdnn_libs_to_drop = [
        'opencl-libdnn-clblast',
        'opencl-libdnn-viennacl'
    ]
    # Used by plot_max_num_images_per_second().
    pretty_print_libs = {
        'opencl-libdnn-clblast-tune': '[GPU] libDNN',
        'opencl-viennacl':            '[GPU] ViennaCL',
        'opencl-clblast':             '[GPU] CLBlast',
        'opencl-clblast-tune':        '[GPU] CLBlast (dv/dt)',
        'cpu':                        '[CPU] OpenBLAS'
    }
    speedup_sort_libs = [
        '[GPU] libDNN',
        '[GPU] ViennaCL',
        '[GPU] CLBlast',
        '[GPU] CLBlast (dv/dt)',
        '[CPU] OpenBLAS'
    ]
# OpenBLAS.
cpu_libs = ['cpu']
all_libs = cpu_libs + opencl_libs

## Table of Contents

1. [Overview](#overview)
1. [See the code](#code) [for developers]
1. [Get the data](#data) [for developers]
1. [See the tables](#tables)
  1. [All data](#df_all)
  1. [All execution time data](#df_time)
  1. [Mean execution time per batch](#df_mean_time_per_batch)
  1. [Mean execution time per image](#df_mean_time_per_image)
  1. [Best mean execution time per image](#df_best_mean_time_per_image)
1. [See the graphs - grouped by models](#plot_models)
  1. [All libs](#plot_models_all)
  1. [GPU libs](#plot_models_gpu)
  1. [CLBlast-based libs](#plot_models_clblast)
  1. [libDNN-based libs](#plot_models_libdnn)
1. [See the graphs - grouped by libs](#plot_libs)
  1. [All models](#plot_libs_all)
  1. [All models, GPU libs](#plot_libs_gpu)
  1. [Models with AlexNet-level accuracy](#plot_libs_alexnet)
  1. [Models with AlexNet-level accuracy, CLBlast-based libs](#plot_libs_alexnet_clblast)
  1. [Models with AlexNet-level accuracy, libDNN-based libs](#plot_libs_alexnet_libdnn)
  1. [Models with AlexNet-level accuracy, CPU lib](#plot_libs_alexnet_cpu)
1. [See the graphs - per layer execution time profiling](#plot_per_layer)
1. [See the graphs - the ideal adaptive solution](#plot_ideal)
  1. [Using all libs for adaptation](#plot_ideal_all)
  1. [Using OpenBLAS and libDNN for adaptation](#plot_ideal_cpu_libdnn)
  1. [Using OpenBLAS and CLBlast for adaptation](#plot_ideal_cpu_clblast)
  1. [Using OpenBLAS and ViennaCL for adaptation](#plot_ideal_cpu_viennacl)
1. [See the memory consumption graph](#plot_memory)
  1. [Balance memory consumption and execution time per image](#balance_memory_time)
1. [Compare AlexNet and SqueezeNet 1.1](#alexnet_vs_squeezenet)
  1. [Memory consumption](#alexnet_vs_squeezenet_memory)
  1. [Execution time](#alexnet_vs_squeezenet_time)
1. [Summary](#summary)
  1. [Absolute performance](#performance_absolute)
  1. [Relative performance](#performance_relative)

<a id="overview"></a>
## Overview

This Jupyter Notebook studies the performance (execution time, memory consumption):
- on the [Firefly-RK3399](http://en.t-firefly.com/en/firenow/Firefly_RK3399/) development board:
  - Chip:
    - [Rockchip RK3399](http://rockchip.wikidot.com/rk3399)
  - CPU ("big"):
    - ARM&reg; Cortex&reg;-A72 architecture
    - Max clock 1800 MHz
    - 2 cores
  - CPU ("LITTLE"):
    - ARM&reg; Cortex&reg;-A53 architecture
    - Max clock 1416 MHz
    - 4 cores
  - GPU:
    - ARM&reg; Mali&trade;-T860 architecture
    - Max clock 800 MHz
    - 4 cores
    - OpenCL driver:
```
$ ck run program:tool-print-opencl-devices | grep "version:"
OpenCL 1.2 v1.r13p0-00rel0-git(a4271c9).1ad3fc5b648bfed782705ad469a9797a
```

  - RAM:
    - Samsung dual-channel DDR3
    - 4 GB
  - BSP:
    - [Firefly-rk3399_xubuntu1604_201704100952_Beta.7z](http://bbs.t-firefly.com/forum.php?mod=viewthread&tid=1876&extra=page%3D1)
```
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
$ uname -a
Linux firefly 4.4.52 #9 SMP Fri Mar 24 10:53:56 HKT 2017 aarch64 aarch64 aarch64 GNU/Linux
```

- of the following **Caffe models**:

```
$ ck show env --tags=caffemodel
Name:                                                      Version:                 Tags:

Caffe model (net and weights) (deepscale, squeezenet, 1.1) 1.1      32bits,bvlc,caffe,caffemodel,deepscale,host-os-linux-32,net,squeezenet,target-os-linux-32,v1,v1.1,weights

Caffe model (net and weights) (bvlc, googlenet)            trunk    32bits,bvlc,caffe,caffemodel,googlenet,host-os-linux-32,mirror,net,target-os-linux-32,v0,weights

Caffe model (net and weights) (bvlc, alexnet)              trunk    32bits,alexnet,bvlc,caffe,caffemodel,host-os-linux-32,mirror,net,target-os-linux-32,v0,weights
```

- with the following **Caffe libraries**:

```
$ ck show env --tags=lib,caffe
Name:                                             Version:       Tags:

BVLC Caffe framework (opencl,clblast,tune) trunk-880111e 32bits,bvlc,caffe,host-os-linux-32,lib,target-os-linux-32,v0,v0.0,vclblast,vopencl,vtune

BVLC Caffe framework (opencl,clblast)      trunk-880111e 32bits,bvlc,caffe,host-os-linux-32,lib,target-os-linux-32,v0,v0.0,vclblast,vopencl

BVLC Caffe framework (cpu)                 trunk-eeebdab 32bits,bvlc,caffe,host-os-linux-32,lib,target-os-linux-32,v0,v0.0,vcpu,vmaster

...
```

- built with the following **BLAS libraries**:

```
$ ck show env --tags=lib,blas
Name:            Version:            Tags:

clBLAS library   2.10-d16f7b3        32bits,blas,clblas,host-os-linux-32,lib,opencl-blas,target-os-linux-32,v2,v2.10,v2.10.0

ViennaCL library dvdt-master-5494bef 32bits,blas,host-os-linux-32,lib,opencl-blas,src,target-os-linux-32,v0,v0.0,v0.0.0,vdvdt,viennacl,vsrc

ViennaCL library master-5494bef      32bits,blas,host-os-linux-32,lib,opencl-blas,target-os-linux-32,v0,v0.0,vcpu,viennacl,vmaster

ViennaCL library master-5494bef      32bits,blas,host-os-linux-32,lib,opencl-blas,src,target-os-linux-32,v0,v0.0,viennacl,vmaster,vsrc

OpenBLAS library 0.2.19-85636ff      32bits,blas,cblas,host-os-linux-32,lib,openblas,target-os-linux-32,v0,v0.2,v0.2.19,v0.2.19.0

CLBlast library  development-3ec14df 32bits,blas,clblast,host-os-linux-32,lib,no-tune,opencl-blas,target-os-linux-32,v0,v0.0,vdev,vmaster

CLBlast library  development-c9f39ed 32bits,blas,clblast,clblast-tune,host-os-linux-32,lib,opencl-blas,target-os-linux-32,tune,v0,v0.0,vdev,vmaster

```

- with the **batch size** varying from 2 to 16 with step 2.

<a id="code"></a>
## Data wrangling code

**NB:** Please ignore this section if you are not interested in re-running or modifying this notebook.

### Includes

#### Standard

In [None]:
import os
import sys
import json
import re

#### Scientific

If some of the scientific packages are missing, please install them using:
```
# pip install jupyter pandas numpy matplotlib
```

In [None]:
import IPython as ip
import pandas as pd
import numpy as np
import matplotlib as mp

In [None]:
print ('IPython version: %s' % ip.__version__)
print ('Pandas version: %s' % pd.__version__)
print ('NumPy version: %s' % np.__version__)
print ('Matplotlib version: %s' % mp.__version__)

In [None]:
from IPython.display import Image, display
def display_in_full(df):
    pd.options.display.max_columns = len(df.columns)
    pd.options.display.max_rows = len(df.index)
    display(df)

In [None]:
# import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

In [None]:
default_title = platform_name
default_ylabel = 'Execution time (ms)'
default_colormap = cm.autumn
default_figsize = [16, 8]
default_dpi = 200
default_fontsize = 16

In [None]:
if mp.__version__[0]=='2': mp.style.use('classic')
mp.rcParams['figure.figsize'] = default_figsize
mp.rcParams['figure.dpi'] = default_dpi
mp.rcParams['font.size'] = default_fontsize
mp.rcParams['legend.fontsize'] = 'medium'

#### Collective Knowledge

If CK is not installed, please install it using:
```
# pip install ck
```

In [None]:
import ck.kernel as ck
print ('CK version: %s' % ck.__version__)

### Access the experimental data

In [None]:
def get_experimental_results(repo_uoa='local', default_tag='explore-batch-size-libs-models', time_ms='time_fw_ms'):
    module_uoa = 'experiment'
    r = ck.access({'action':'search', 'repo_uoa':repo_uoa, 'module_uoa':module_uoa, 'tags':default_tag})
    if r['return']>0:
        print ("Error: %s" % r['error'])
        exit(1)
    experiments = r['lst']
    
    dfs = []
    for experiment in experiments:
        data_uoa = experiment['data_uoa']
        r = ck.access({'action':'list_points', 'repo_uoa':repo_uoa, 'module_uoa':module_uoa, 'data_uoa':data_uoa})
        if r['return']>0:
            print ("Error: %s" % r['error'])
            exit(1)

        # Get (lib_tag, model_tag) from a list of tags in r['dict']['tags'].
        # This list includes a couple of irrelevant tags (e.g. cmd key), a model tag and a lib tag.
        # NB: Since it's typically easier to list all model tags than all lib tags, only the list
        # of models is specified explicitly.
        tags = r['dict']['tags']
        cmd_key = 'default'
        irrelevant_tags = [ cmd_key, default_tag, platform_tag ]
        model_tags = [ 'bvlc-alexnet', 'bvlc-googlenet', 'deepscale-squeezenet-1.1' ]
        lib_model_tags = [ tag for tag in tags if tag not in irrelevant_tags ]
        model_tags = [ tag for tag in lib_model_tags if tag in model_tags ]
        lib_tags = [ tag for tag in lib_model_tags if tag not in model_tags ]
        # print (tags, model_tags, lib_tags)
        if len(lib_tags)==1 and len(model_tags)==1:
            (lib, model) = (lib_tags[0], model_tags[0])
        else:
            continue
        
        for point in r['points']:
            with open(os.path.join(r['path'], 'ckp-%s.0001.json' % point)) as point_file:
                point_data_raw = json.load(point_file)
            characteristics_list = point_data_raw['characteristics_list']
            num_repetitions = len(characteristics_list)
            platform = point_data_raw['features']['platform']['platform']['model']
            batch_size = np.int64(point_data_raw['choices']['env'].get('CK_CAFFE_BATCH_SIZE',-1))
            # Obtain column data.
            data = [
                {
                    # features
                    'platform' : platform,
                    # choices
                    'lib' : lib,
                    'model' : model,
                    'batch_size' : batch_size,
                    # statistical repetition
                    'repetition_id': repetition_id,
                    # runtime characteristics
                    'time (ms)'   : characteristics['run'].get(time_ms, 1e+9), # 1e+9 is like "positive infinity"
                    'memory (MB)' : characteristics['run'].get('memory_mbytes', -1),
                    'success?'    : characteristics['run'].get('run_success', 'n/a'),
                    'per layer info' : characteristics['run'].get('per_layer_info', []),
                }
                for (repetition_id, characteristics) in zip(range(num_repetitions), characteristics_list) 
            ]
            # Deal with missing column data (resulting from failed runs).
            if len(data)==1:                
                print ('[Warning] Missing data for: lib=%s, model=%s, batch_size=%d' % (lib, model, batch_size))
                data = data * num_repetitions
            # Construct a DataFrame.
            df = pd.DataFrame(data)
            # Set columns and index names.
            df.columns.name = 'run characteristic'
            df.index.name = 'index'
            df = df.set_index([ 'platform', 'lib', 'model', 'batch_size', 'repetition_id' ])
            # Append to the list of similarly constructed DataFrames.
            dfs.append(df)
    # Concatenate all constructed DataFrames (i.e. stack on top of each other).    
    result = pd.concat(dfs)
    return result.sort_index(level=result.index.names)
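
The tag-splitting step inside `get_experimental_results()` can be illustrated in isolation. The sketch below is a simplified restatement rather than the notebook's own code path: the helper name `split_tags` and the example tag list are hypothetical, and it assumes that each experiment entry carries exactly one lib tag and one model tag alongside the irrelevant ones.

```python
def split_tags(tags, known_model_tags, irrelevant_tags):
    """Return (lib, model) if exactly one lib tag and one model tag are present, else None."""
    relevant = [tag for tag in tags if tag not in irrelevant_tags]
    models = [tag for tag in relevant if tag in known_model_tags]
    libs = [tag for tag in relevant if tag not in known_model_tags]
    if len(libs) == 1 and len(models) == 1:
        return (libs[0], models[0])
    return None

known_model_tags = ['bvlc-alexnet', 'bvlc-googlenet', 'deepscale-squeezenet-1.1']
irrelevant_tags = ['default', 'explore-batch-size-libs-models', 'firefly-rk3399']

# Hypothetical tag list, made up for illustration only.
example_tags = ['explore-batch-size-libs-models', 'firefly-rk3399', 'default',
                'opencl-clblast-tune', 'bvlc-alexnet']
print(split_tags(example_tags, known_model_tags, irrelevant_tags))
```

Entries for which `split_tags` returns `None` correspond to the experiments that the loop above skips with `continue`.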

### Plot execution time per image or memory consumption

In [None]:
def plot(mean, std, title=default_title, ylabel=default_ylabel, rot=0, ymax=0):
    ymax = mean.max().max() if ymax==0 else ymax
    ax = mean.plot(kind='bar', yerr=std, grid=True, legend=True, rot=rot, ylim=[0,ymax*1.05],
                   fontsize=default_fontsize, figsize=default_figsize, colormap=default_colormap)
    ax.set_title(title, fontsize=default_fontsize)
    ax.set_xlabel(mean.index.name, fontsize=default_fontsize)
    ax.set_ylabel(ylabel, fontsize=default_fontsize)
    return ax

### Plot maximum number of images per second (FPS)

In [None]:
def plot_max_num_images_per_second(df_mean_time_per_image, libs_to_drop=libdnn_libs_to_drop, rot=0, fontsize=None):
    min_time_per_image = df_mean_time_per_image.min(axis=1).unstack('lib')
    images_per_second = 1e+3 / min_time_per_image
    max_num_images_per_second = images_per_second \
        .drop(libs_to_drop, axis=1) \
        .rename(columns=pretty_print_libs, index=pretty_print_models) \
        .reindex(columns=speedup_sort_libs, index=speedup_sort_models)
    ax = max_num_images_per_second \
        .plot(kind='bar', rot=rot, width=0.95, grid=True, legend=True,
              fontsize=default_fontsize, figsize=default_figsize, colormap=default_colormap)
    ax.set_title(default_title, fontsize=default_fontsize)
    ax.set_xlabel(max_num_images_per_second.index.name, fontsize=default_fontsize)
    ax.set_ylabel('Images/second (with the best even batch size from 2 to 16)', fontsize=default_fontsize)
    ax.legend(loc='upper left');
    for patch in ax.patches:
        text = '{0:.2f}'.format(patch.get_height())[0:4]
        ax.annotate(text, (patch.get_x()*1.00, patch.get_height()*1.01), fontsize=fontsize)
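
The throughput computed by `plot_max_num_images_per_second()` is the reciprocal of the best time per image, scaled from milliseconds to seconds. The following sketch reproduces just that step on a toy table. All numbers and the `toy_*` names are invented for illustration and are not measurements from the board.

```python
import pandas as pd

# Hypothetical mean execution time per image (ms); NOT measured data.
# Rows: (model, lib); columns: batch size.
index = pd.MultiIndex.from_tuples(
    [('bvlc-alexnet', 'cpu'),
     ('bvlc-alexnet', 'opencl-clblast-tune'),
     ('deepscale-squeezenet-1.1', 'cpu'),
     ('deepscale-squeezenet-1.1', 'opencl-clblast-tune')],
    names=['model', 'lib'])
toy_time_per_image = pd.DataFrame(
    {2: [400.0, 250.0, 200.0, 125.0], 4: [500.0, 200.0, 250.0, 100.0]},
    index=index)
toy_time_per_image.columns.name = 'batch_size'

# Best (minimum) time per image over all batch sizes, with libs as columns.
toy_min_time = toy_time_per_image.min(axis=1).unstack('lib')
# 1000 ms per second divided by ms per image gives images per second.
toy_images_per_second = 1e+3 / toy_min_time
print(toy_images_per_second)
```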

### Plot the speedup over a given baseline

In [None]:
def plot_speedup_over_baseline(df_mean_time_per_image, baseline='opencl-clblast-tune', 
                               libs_to_drop=libdnn_libs_to_drop, rot=0, fontsize=None):
    speedup_over_baseline = df_mean_time_per_image.min(axis=1).unstack('model').loc[baseline] / \
                            df_mean_time_per_image.min(axis=1).unstack('model')
    speedup_over_baseline = speedup_over_baseline.T \
        .drop(libs_to_drop, axis=1) \
        .rename(index=pretty_print_models, columns=pretty_print_libs) \
        .reindex(columns=speedup_sort_libs)
    ax = speedup_over_baseline \
        .plot(kind='bar', rot=rot, width=0.95, grid=True, legend=True,
              fontsize=default_fontsize, figsize=default_figsize, colormap=default_colormap)
    ax.set_title(default_title, fontsize=default_fontsize)
    ax.set_xlabel(speedup_over_baseline.index.name, fontsize=default_fontsize)
    ax.set_ylabel('Speedup over the given baseline (%s)' % pretty_print_libs[baseline], fontsize=default_fontsize)
    for patch in ax.patches:
        text = '{0:.2f}'.format(patch.get_height())[0:4]
        ax.annotate(text, (patch.get_x()*1.00, patch.get_height()*1.01), fontsize=fontsize)
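
The speedup is the baseline's best time per image divided by each library's best time per image, so values above 1 mean "faster than the baseline". A minimal sketch of that division on invented numbers (the `toy_*` names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical mean execution time per image (ms); NOT measured data.
index = pd.MultiIndex.from_tuples(
    [('bvlc-alexnet', 'cpu'),
     ('bvlc-alexnet', 'opencl-clblast-tune'),
     ('deepscale-squeezenet-1.1', 'cpu'),
     ('deepscale-squeezenet-1.1', 'opencl-clblast-tune')],
    names=['model', 'lib'])
toy_time_per_image = pd.DataFrame(
    {2: [400.0, 250.0, 200.0, 125.0], 4: [500.0, 200.0, 250.0, 100.0]},
    index=index)
toy_time_per_image.columns.name = 'batch_size'

baseline = 'opencl-clblast-tune'
# Best time per image over all batch sizes; rows: lib, columns: model.
toy_best = toy_time_per_image.min(axis=1).unstack('model')
# The baseline row (a Series indexed by model) is aligned with the model columns.
toy_speedup = toy_best.loc[baseline] / toy_best
print(toy_speedup)
```

In this toy table the CPU row comes out at 0.5 for both models, i.e. half the speed of the baseline, and the baseline row is 1.0 by construction.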

### Plot execution time per image per layer

In [None]:
# This transformation is time-consuming, so call it only once and reuse the result for multiple plots.
def get_per_layer_info(df_all):
    df_per_layer_info = df_all['per layer info']
    row_dfs = []
    for (row_info, row_id) in zip(df_per_layer_info, range(len(df_per_layer_info))):
        # Skip constructing a DataFrame when no layer info is available.
        if not row_info: continue
        # Augment each layer info with the row index: (platform, lib, model, batch_size, repetition_id).
        for layer_info in row_info:
            layer_info.update({ k : v for k, v in zip(df_per_layer_info.index.names, df_per_layer_info.index[row_id]) })
        # Construct a DataFrame and move the row index to where it belongs.
        row_df = pd.DataFrame(data=row_info).set_index(df_per_layer_info.index.names)
        row_dfs.append(row_df)
    return pd.concat(row_dfs)

In [None]:
def plot_time_per_image_per_layer(df_per_layer_info, model, libs, batch_sizes,
                                  direction=['forward'], lower=0.0, upper=1.0, ymax=0, rot=90):
    df_time_per_batch = df_per_layer_info.loc[model, libs, batch_sizes] \
        .set_index(['direction', 'label'], append=True) \
        .reorder_levels(['direction', 'label', 'model', 'lib', 'batch_size', 'repetition_id']) \
        .loc[direction] \
        .reorder_levels(['label', 'model', 'lib', 'batch_size', 'repetition_id', 'direction']) \
        .groupby(level=['label', 'model', 'lib', 'batch_size', 'repetition_id']).sum() \
        ['time_ms']
    df_time_per_image = df_time_per_batch.unstack('batch_size') / batch_sizes
    df = df_time_per_image.unstack(['lib', 'model'])
    df = df.reorder_levels(['model', 'lib', 'batch_size'], axis=1)
    mean = df.groupby(level='label').mean()
    std = df.groupby(level='label').std()
    select = (lower*mean.sum() <= mean).any(axis=1) & (mean <= upper*mean.sum()).any(axis=1)
    ymax = mean[select].max().max() if ymax==0 else ymax
    ax = plot(mean=mean[select], std=std[select], ylabel='Execution time per image per layer (ms)', ymax=ymax, rot=rot)
    ax.set_xlabel('Layer', fontsize=default_fontsize)
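
The `lower` and `upper` parameters filter layers by their share of the total execution time. The sketch below shows the same kind of mask on a single hypothetical column. The layer names, timings and the explicit `ge`/`le` calls are illustrative assumptions; the function above uses the equivalent comparison operators.

```python
import pandas as pd

# Hypothetical mean time per image per layer (ms); NOT measured data.
toy_mean = pd.DataFrame(
    {'batch 2': [50.0, 30.0, 15.0, 5.0]},
    index=pd.Index(['conv1', 'conv2', 'fc6', 'relu1'], name='label'))

lower, upper = 0.10, 1.0
toy_total = toy_mean.sum()  # total time per column

# Keep a layer if, in at least one column, it takes at least `lower` of the total
# and, in at least one column, at most `upper` of the total.
toy_select = toy_mean.ge(lower * toy_total, axis='columns').any(axis=1) \
           & toy_mean.le(upper * toy_total, axis='columns').any(axis=1)
toy_selected_layers = toy_mean[toy_select].index.tolist()
print(toy_selected_layers)
```

With these numbers only `relu1` (5% of the total) falls below the 10% threshold and is dropped.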

### Plot the ideal adaptive solution

In [None]:
# The ideal adaptive solution for each layer selects the best performing library from the 'libs_for_adaptation' list.
def get_ideal_adaptive_solution(df_per_layer_info, libs_for_adaptation, direction=['forward']):
    df_for_adaptation = df_per_layer_info \
        .set_index(['direction', 'label'], append=True) \
        .reorder_levels(['direction', 'lib', 'model', 'label', 'batch_size', 'repetition_id']) \
        .loc[direction] \
        .sort_index() \
        .reorder_levels(['lib', 'model', 'label', 'batch_size', 'repetition_id', 'direction']) \
        .loc[libs_for_adaptation] \
        .reorder_levels(['model', 'label', 'lib', 'batch_size', 'repetition_id', 'direction']) \
        ['time_ms']
    # With every step, reduce the rightmost dimension until the min time per model is reached.
    df_cum_time_per_repetition = df_for_adaptation.groupby(level=df_for_adaptation.index.names[:-1]).sum()
    df_min_time_per_repetition = df_cum_time_per_repetition.groupby(level=df_cum_time_per_repetition.index.names[:-1]).min()
    batch_sizes = df_per_layer_info.index.levels[df_per_layer_info.index.names.index('batch_size')].values
    df_min_time_per_batch = df_min_time_per_repetition.unstack('batch_size') / batch_sizes
    df_min_time_per_image = df_min_time_per_batch.min(axis=1)
    df_min_time_per_layer = df_min_time_per_image.groupby(level=df_min_time_per_image.index.names[:-1]).min()
    #df_min_time_per_model = df_min_time_per_layer.groupby(level=df_min_time_per_layer.index.names[:-1]).sum()
    # Transform to get the models in the index and the libs in the columns.
    df_min_time_per_layer_idx = df_min_time_per_image.groupby(level=df_min_time_per_image.index.names[:-1]).idxmin()
    df_ideal = df_min_time_per_image[df_min_time_per_layer_idx] \
        .reorder_levels(['model', 'lib', 'label']) \
        .groupby(level=['model', 'lib']).sum() \
        .unstack('lib')
    # Sort in the order of increasing time per model.
    df_ideal_sorted = df_ideal.loc[df_ideal.sum(axis=1).sort_values(ascending=True).index]
    return df_ideal_sorted

In [None]:
def plot_ideal_adaptive_solution(df_ideal, df_real, tag=''):
    figsize=[15, 3]
    if not tag=='': figsize=[10, 2] # good for dumping png (e.g. 3 graphs fit well onto a slide).
    for model in df_ideal.index:
        df_data = {}; df_data['adaptive'] = df_ideal.loc[model]
        for lib in df_ideal.columns:
            df_data[lib] = pd.Series(index=df_ideal.columns)
            df_data[lib][lib] = df_real.loc[model, lib]
        df = pd.DataFrame(df_data).T \
            .rename(
                columns=pretty_print_libs,
                index={ 
                    key : pretty_print_libs[key].replace('[CPU]', 'Only').replace('[GPU]', 'Only')
                    for key in pretty_print_libs.keys()
                })
        ax = df.loc[df.sum(axis=1).sort_values(ascending=True).index] \
            .plot(kind='barh', stacked=True, width=0.9, grid=True, legend=True,
                  fontsize=default_fontsize, figsize=figsize, colormap=cm.summer_r)
            #.legend(loc='lower right')
        ax.set_title('%s - execution time per image (ms)' % model, fontsize=default_fontsize)
        if not tag=='': ax.get_figure().savefig('%s.%s.png' % (tag, model))

### Plot execution time per image and memory consumption

In [None]:
def plot_time_per_image_and_memory_consumption(df_all, model, lib):
    df = df_all[['time (ms)', 'memory (MB)']] \
        .groupby(level=df_all.index.names[:-1]).mean() \
        .loc[lib, model]
    df['time per image (ms)'] = df['time (ms)'].divide(df.index, axis=0)
    df['memory per image (MB)'] = df['memory (MB)'].divide(df.index, axis=0)
    df = df.drop('time (ms)', axis=1).sort_index()
    ax = df.plot(secondary_y=['memory (MB)', 'memory per image (MB)'], mark_right=False, grid=True,
                 figsize=[12, 8], fontsize=default_fontsize, colormap=cm.winter)
    ax.set_title('%s w/ %s' % (model, lib), fontsize=default_fontsize)
    ax.set_xlabel(df.index.name, fontsize=default_fontsize)
    ax.set_ylabel('execution time (ms)', fontsize=default_fontsize); ax.legend(loc='center left'); ax.set_ylim(0)
    ax.right_ax.set_ylabel('memory consumption (MB)', fontsize=default_fontsize); ax.right_ax.legend(loc='center right')

<a id="data"></a>
## Get the experimental data

**NB:** Please ignore this section if you are not interested in re-running or modifying this notebook. 

The Caffe experimental data was collected on the experimental platform (after installing all Caffe libraries and models of interest) as follows:
```
$ cd `ck find ck-caffe:script:explore-batch-size-libs-models`
$ python explore-batch-size-libs-models-benchmark.py
```

It can be downloaded and registered with CK as follows:
```
$ ck add repo:ck-caffe-firefly-rk3399 --url=https://github.com/dividiti/ck-caffe-firefly-rk3399 --quiet
```

<a id="tables"></a>
## Tables

<a id="df_all"></a>
### All data

In [None]:
df_all = get_experimental_results(repo_uoa)
display_in_full(df_all)

<a id="df_time"></a>
### All execution time data indexed by repetitions

In [None]:
df_time = df_all['time (ms)'].unstack(df_all.index.names[:-1])
display_in_full(df_time)

<a id="df_mean_time_per_batch"></a>
### Mean execution time per batch

In [None]:
df_mean_time_per_batch = df_all['time (ms)'] \
    .unstack('batch_size') \
    .reorder_levels(['platform', 'model', 'lib', 'repetition_id']) \
    .groupby(['platform', 'model', 'lib']).mean() \
    .sort_index()
display_in_full(df_mean_time_per_batch)

<a id="df_mean_time_per_image"></a>
### Mean execution time per image

In [None]:
batch_sizes = df_mean_time_per_batch.columns.tolist()
df_mean_time_per_image = df_mean_time_per_batch / batch_sizes
display_in_full(df_mean_time_per_image)
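
Dividing a DataFrame by a list whose length equals the number of columns applies the division column by column, which is how the per-batch times become per-image times here. A small sketch with invented numbers (the `toy_*` names and values are hypothetical):

```python
import pandas as pd

# Hypothetical mean execution time per batch (ms); NOT measured data.
# Rows: lib; columns: batch size.
toy_time_per_batch = pd.DataFrame(
    {2: [100.0, 60.0], 4: [160.0, 100.0], 8: [240.0, 160.0]},
    index=pd.Index(['cpu', 'opencl-clblast-tune'], name='lib'))
toy_time_per_batch.columns.name = 'batch_size'

toy_batch_sizes = toy_time_per_batch.columns.tolist()  # [2, 4, 8]
# Each column is divided by its own batch size.
toy_time_per_image = toy_time_per_batch / toy_batch_sizes
print(toy_time_per_image)
```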

<a id="df_best_mean_time_per_image"></a>
### Best mean execution time per image

In [None]:
df_mean_time_per_image = df_mean_time_per_image.loc[platform_id]

In [None]:
plot_max_num_images_per_second(df_mean_time_per_image, libs_to_drop=[], fontsize=18)

In [None]:
df_mean_time_per_image.min(axis=1)

In [None]:
df_mean_time_per_image.idxmin(axis=1)

In [None]:
df_time = df_time[platform_id].reorder_levels(['model', 'lib', 'batch_size'], axis=1)
display_in_full(df_time)

In [None]:
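# Tile batch_sizes across all (model, lib) column groups; '//' keeps the repeat count an int under Python 3.
df_time_per_image = df_time / (batch_sizes * (len(df_time.columns) // len(batch_sizes)))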
display_in_full(df_time_per_image)

In [None]:
df_min_time_per_image_index = pd.DataFrame(df_mean_time_per_image.idxmin(axis=1)).set_index(0, append=True).index.values
df_min_time_per_image_index

In [None]:
df_model_lib = df_time_per_image[df_min_time_per_image_index] \
    .stack(['model', 'lib']) \
    .sum(axis=1)

<a id="plot_models"></a>
## Plot by Caffe models

In [None]:
df_model_lib_mean = df_model_lib.groupby(level=['model', 'lib']).mean()
df_model_lib_std  = df_model_lib.groupby(level=['model', 'lib']).std()

<a id="plot_models_all"></a>
### All libs

In [None]:
mean = df_model_lib_mean.unstack('lib')
std  = df_model_lib_std.unstack('lib')
plot(mean, std)

<a id="plot_models_gpu"></a>
### Only GPU libs

In [None]:
mean = df_model_lib_mean.unstack('lib').drop('cpu', axis=1, errors='ignore')
std  = df_model_lib_std.unstack('lib').drop('cpu', axis=1, errors='ignore')
plot(mean, std)

<a id="plot_models_clblast"></a>
### Only CLBlast-based libs (default vs tuned)

In [None]:
clblast_libs = [ 'opencl-clblast', 'opencl-clblast-tune' ]

In [None]:
df_model_clblast_libs = df_model_lib \
    .reorder_levels(['lib', 'model','repetition_id']) \
    .sort_index() \
    .loc[clblast_libs]
mean = df_model_clblast_libs.groupby(level=['lib', 'model']).mean().unstack('lib')
std = df_model_clblast_libs.groupby(level=['lib', 'model']).std().unstack('lib')
plot(mean, std)

In [None]:
mean.div(mean['opencl-clblast-tune'], axis='index')

<a id="plot_models_libdnn"></a>
### Only libDNN-based libs

In [None]:
libdnn_libs = libdnn_libs_to_drop + [ 'opencl-libdnn-clblast-tune' ]
libdnn_libs

In [None]:
df_model_libdnn_libs = df_model_lib \
    .reorder_levels(['lib', 'model','repetition_id']) \
    .sort_index() \
    .loc[libdnn_libs]
mean = df_model_libdnn_libs.groupby(level=['lib', 'model']).mean().unstack('lib')
std = df_model_libdnn_libs.groupby(level=['lib', 'model']).std().unstack('lib')
plot(mean, std)

In [None]:
mean.div(mean['opencl-libdnn-clblast-tune'], axis='index')

<a id="plot_libs"></a>
## Plot by Caffe libs

<a id="plot_libs_all"></a>
### All models, all libs

In [None]:
mean = df_model_lib_mean.unstack('model')
std  = df_model_lib_std.unstack('model')
plot(mean, std, rot=30)

<a id="plot_libs_gpu"></a>
### All models, only GPU libs

In [None]:
mean = df_model_lib_mean.unstack('model').drop('cpu', axis=0, errors='ignore')
std  = df_model_lib_std.unstack('model').drop('cpu', axis=0, errors='ignore')
plot(mean, std, rot=30)

<a id="plot_libs_alexnet"></a>
### Only models with AlexNet-level accuracy

In [None]:
alexnet_level_accuracy = [ 'bvlc-alexnet', 'deepscale-squeezenet-1.1' ]

In [None]:
mean = df_model_lib_mean[alexnet_level_accuracy].unstack('model')
std  = df_model_lib_std[alexnet_level_accuracy].unstack('model')
plot(mean, std, rot=30)

In [None]:
pd.options.display.max_rows=mean.index.size
mean['deepscale-squeezenet-1.1'] / mean['bvlc-alexnet']

<a id="plot_libs_alexnet_clblast"></a>
### Only models with AlexNet-level accuracy, only CLBlast-based libs

In [None]:
clblast_libs

In [None]:
mean = df_model_lib_mean[alexnet_level_accuracy].unstack('model').loc[clblast_libs]
std  = df_model_lib_std[alexnet_level_accuracy].unstack('model').loc[clblast_libs]
plot(mean, std)

In [None]:
mean['deepscale-squeezenet-1.1'] / mean['bvlc-alexnet']

<a id="plot_libs_alexnet_libdnn"></a>
### Only models with AlexNet-level accuracy, only libDNN libs

In [None]:
libdnn_libs

In [None]:
mean = df_model_lib_mean[alexnet_level_accuracy].unstack('model').loc[libdnn_libs]
std  = df_model_lib_std[alexnet_level_accuracy].unstack('model').loc[libdnn_libs]
plot(mean, std)

In [None]:
mean['deepscale-squeezenet-1.1'] / mean['bvlc-alexnet']

<a id="plot_libs_alexnet_cpu"></a>
### Only models with AlexNet-level accuracy, only CPU lib

In [None]:
mean = df_model_lib_mean[alexnet_level_accuracy].unstack('model').loc[['cpu']]
std  = df_model_lib_std[alexnet_level_accuracy].unstack('model').loc[['cpu']]
plot(mean, std)

In [None]:
mean['deepscale-squeezenet-1.1'] / mean['bvlc-alexnet']

<a id="plot_per_layer"></a>
## Plot execution time per image per layer

In [None]:
df_per_layer_info = get_per_layer_info(df_all)

In [None]:
# Transform to keep compatibility. Swap 'model' and 'lib'. Remove 'platform'. Sort index.
df_per_layer_info = df_per_layer_info \
    .reorder_levels(['platform', 'model', 'lib', 'batch_size', 'repetition_id']) \
    .loc[platform_id] \
    .sort_index()

In [None]:
# Plot for a list of batch sizes.
plot_time_per_image_per_layer(
    df_per_layer_info, model='bvlc-alexnet', libs='opencl-clblast-tune', batch_sizes=[2, 8, 16]
)

In [None]:
# Plot for a list of batch sizes. Only plot layers that consume at least 10% of the total execution time.
plot_time_per_image_per_layer(
    df_per_layer_info, model='bvlc-alexnet', libs='opencl-clblast-tune', batch_sizes=[8, 16], lower=0.10, rot=0)

In [None]:
# Plot for a list of libs. Only plot layers that consume between 5% and 10% of the total execution time.
plot_time_per_image_per_layer(
    df_per_layer_info, model='deepscale-squeezenet-1.1',
    libs=['opencl-clblast','opencl-clblast-tune'], batch_sizes=2, lower=0.05, upper=0.10, rot=10)

<a id="plot_ideal"></a>
## Plot the ideal adaptive solution

<a id="plot_ideal_all"></a>
### Using all reasonable libs for adaptation

In [None]:
libs_for_adaptation = df_per_layer_info.index.get_level_values('lib') \
    .drop_duplicates() \
    .drop(libdnn_libs_to_drop, errors='ignore') \
    .values
libs_for_adaptation

Each row specifies an ideal adaptive solution for a model. Each column specifies the execution time (in ms per image) that the ideal adaptive solution would cumulatively spend using a particular library.
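
The idea can be sketched on a toy per-layer table for a single model: pick the fastest library for each layer, then add up the chosen times per library. The layer names, timings and `toy_*` names below are invented for illustration. The real `get_ideal_adaptive_solution()` additionally minimises over repetitions and batch sizes.

```python
import pandas as pd

# Hypothetical time per image per layer (ms) for one model; NOT measured data.
# Rows: layers; columns: libs.
toy_per_layer = pd.DataFrame(
    {'cpu': [4.0, 1.0, 3.0], 'opencl-clblast-tune': [2.0, 5.0, 1.0]},
    index=pd.Index(['conv1', 'fc6', 'conv2'], name='label'))

toy_best_lib = toy_per_layer.idxmin(axis=1)   # fastest lib for each layer
toy_best_time = toy_per_layer.min(axis=1)     # time spent in that lib
# Cumulative time the adaptive solution spends in each lib.
toy_ideal = toy_best_time.groupby(toy_best_lib).sum()
# Ratio of the adaptive total to the best single-lib total.
toy_ratio = toy_ideal.sum() / toy_per_layer.sum().min()
print(toy_ideal)
print(toy_ratio)
```

Here the adaptive solution spends 1 ms in the CPU lib and 3 ms in the GPU lib, for 4 ms in total, whereas either lib used alone needs 8 ms, so the ratio is 0.5.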

In [None]:
df_ideal_all = get_ideal_adaptive_solution(df_per_layer_info, libs_for_adaptation)
display_in_full(df_ideal_all)

In [None]:
plot_ideal_adaptive_solution(df_ideal_all, df_model_lib_mean)

In [None]:
# By how much is the execution time reduced compared to the best non-adaptive solution?
df_best_lib = df_model_lib_mean.reorder_levels(['lib', 'model']).unstack('lib')
df_ideal_all.sum(axis=1) / df_best_lib.min(axis=1)

<a id="plot_ideal_cpu_libdnn"></a>
### Using OpenBLAS and libDNN for adaptation

In [None]:
libs_for_adaptation = ['cpu', 'opencl-libdnn-clblast-tune']
df_ideal_cpu_libdnn = get_ideal_adaptive_solution(df_per_layer_info, libs_for_adaptation)
display_in_full(df_ideal_cpu_libdnn)

In [None]:
plot_ideal_adaptive_solution(df_ideal_cpu_libdnn, df_model_lib_mean)

In [None]:
# By how much is the execution time reduced compared to the best non-adaptive solution?
df_best_lib = df_model_lib_mean.reorder_levels(['lib', 'model']).sort_index()[libs_for_adaptation].unstack('lib')
df_ideal_cpu_libdnn.sum(axis=1) / df_best_lib.min(axis=1)

In [None]:
# By how much is the execution time increased compared to the best adaptive solution?
df_ideal_cpu_libdnn.sum(axis=1) / df_ideal_all.sum(axis=1)

<a id="plot_ideal_cpu_clblast"></a>
### Using OpenBLAS and CLBlast for adaptation

In [None]:
libs_for_adaptation = ['cpu', 'opencl-clblast-tune']
df_ideal_cpu_clblast = get_ideal_adaptive_solution(df_per_layer_info, libs_for_adaptation)
display_in_full(df_ideal_cpu_clblast)

In [None]:
plot_ideal_adaptive_solution(df_ideal_cpu_clblast, df_model_lib_mean)

In [None]:
# By how much is the execution time reduced compared to the best non-adaptive solution?
df_best_lib = df_model_lib_mean.reorder_levels(['lib', 'model']).sort_index()[libs_for_adaptation].unstack('lib')
df_ideal_cpu_clblast.sum(axis=1) / df_best_lib.min(axis=1)

In [None]:
# By how much is the execution time increased compared to the best adaptive solution?
df_ideal_cpu_clblast.sum(axis=1) / df_ideal_all.sum(axis=1)

<a id="plot_ideal_cpu_viennacl"></a>
### Using OpenBLAS and ViennaCL for adaptation

In [None]:
libs_for_adaptation = ['cpu', 'opencl-viennacl']
df_ideal_cpu_viennacl = get_ideal_adaptive_solution(df_per_layer_info, libs_for_adaptation)
display_in_full(df_ideal_cpu_viennacl)

In [None]:
plot_ideal_adaptive_solution(df_ideal_cpu_viennacl, df_model_lib_mean)

In [None]:
# By how much is the execution time reduced compared to the best non-adaptive solution?
df_best_lib = df_model_lib_mean.reorder_levels(['lib', 'model']).sort_index()[libs_for_adaptation].unstack('lib')
df_ideal_cpu_viennacl.sum(axis=1) / df_best_lib.min(axis=1)

In [None]:
# By how much is the execution time increased compared to the best adaptive solution?
df_ideal_cpu_viennacl.sum(axis=1) / df_ideal_all.sum(axis=1)

<a id="plot_memory"></a>
## Plot memory consumption

In [None]:
df_memory = df_all['memory (MB)'].loc[platform_id]
# Batch size of 2; repetition 0 (should always be available).
df_memory = df_memory.unstack(['model','lib']).loc[2].loc[0].unstack('lib')
plot(mean=df_memory, std=pd.DataFrame(), ylabel='Memory size (MB)')

<a id="balance_memory_time"></a>
### Balance memory consumption and execution time per image

The above, however, does not tell the full story. The memory consumption, as reported by Caffe, increases linearly with the batch size. In other words, the memory consumption per image is constant. (Note that extra memory may be required e.g. for GPU buffers in host memory.)

The execution time per image, however, decreases asymptotically as the batch size grows. Since minimizing the execution time should almost always be balanced against minimizing the memory consumption, we should select the smallest batch size that results in "good enough" performance.
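
One possible way to make "good enough" concrete is to take the smallest batch size whose time per image lies within a chosen tolerance of the best observed time. The sketch below assumes made-up measurements (memory growing linearly with the batch size, time per image flattening out); the helper `pick_batch_size` and the 5% default tolerance are hypothetical choices, not part of this notebook.

```python
# Hypothetical measurements: batch size -> (time per image in ms, memory in MB).
# Memory grows linearly with the batch size; time per image flattens out.
measurements = {
    2:  (100.0,  40.0),
    4:  (80.0,   80.0),
    8:  (72.0,  160.0),
    16: (70.0,  320.0),
}

def pick_batch_size(measurements, tolerance=0.05):
    """Return the smallest batch size whose time per image is within
    `tolerance` (as a fraction) of the best observed time per image."""
    best_time = min(time for time, _ in measurements.values())
    for batch_size in sorted(measurements):
        time, _ = measurements[batch_size]
        if time <= best_time * (1.0 + tolerance):
            return batch_size

batch_size = pick_batch_size(measurements)
time_ms, memory_mb = measurements[batch_size]
print(batch_size, time_ms, memory_mb)
```

In this made-up example a batch size of 8 is selected: it is within 5% of the best time per image while needing half the memory of a batch size of 16.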

We give several examples below. Note that the execution time per batch is omitted to make the execution time per image more pronounced.

In [None]:
plot_time_per_image_and_memory_consumption(df_all.loc[platform_id], model='bvlc-alexnet', lib='opencl-clblast-tune')

In [None]:
plot_time_per_image_and_memory_consumption(df_all.loc[platform_id], model='deepscale-squeezenet-1.1', lib='opencl-viennacl')

<a id="alexnet_vs_squeezenet"></a>
## Compare AlexNet and SqueezeNet

<a id="alexnet_vs_squeezenet_memory"></a>
### Memory consumption

In [None]:
# SqueezeNet consumes ~4 times more activation memory than AlexNet.
df_memory.loc[['bvlc-alexnet', 'deepscale-squeezenet-1.1']].iloc[1] / \
df_memory.loc[['bvlc-alexnet', 'deepscale-squeezenet-1.1']].iloc[0]

<a id="alexnet_vs_squeezenet_time"></a>
### Execution time

In [None]:
mean = df_model_lib_mean[['bvlc-alexnet', 'deepscale-squeezenet-1.1']].unstack('lib')
std  = df_model_lib_std[['bvlc-alexnet', 'deepscale-squeezenet-1.1']].unstack('lib')
plot(mean, std)

In [None]:
# Ratio of SqueezeNet's execution time to AlexNet's, per lib.
df_model_lib_mean[['bvlc-alexnet', 'deepscale-squeezenet-1.1']].unstack('lib').iloc[1] / \
df_model_lib_mean[['bvlc-alexnet', 'deepscale-squeezenet-1.1']].unstack('lib').iloc[0]

<a id="summary"></a>
## Summary

<a id="performance_absolute"></a>
### What's the absolute performance?

In [None]:
plot_max_num_images_per_second(
    df_mean_time_per_image,
    fontsize=18
)

<a id="performance_relative"></a>
### What's the relative performance?

#### ... over OpenBLAS?

In [None]:
plot_speedup_over_baseline(
    df_mean_time_per_image,
    baseline='cpu',
    fontsize=18
)

#### ... over CLBlast (dv/dt)?

In [None]:
plot_speedup_over_baseline(
    df_mean_time_per_image,
    baseline='opencl-clblast-tune',
    fontsize=18
)

In [None]:
plot_speedup_over_baseline(
    df_mean_time_per_image.loc[['bvlc-alexnet', 'deepscale-squeezenet-1.1']],
    baseline='opencl-clblast-tune',
    fontsize=26
)