# Evaluating OpenCL Caffe caching mechanisms

## Table of Contents

1. [Overview](#overview)
1. [Data wrangling code](#code) [for developers]
1. [Mali-T628](#mali_t628)
  1. [Original caching mechanism](#mali_t628_original)
  1. [Proposed caching mechanism](#mali_t628_proposed)
  1. [Compare the proposed mechanism vs the original mechanism](#mali_t628_compare)
1. [GTX 1080](#gtx_1080)
  1. [Original caching mechanism](#gtx_1080_original)
  1. [Proposed caching mechanism](#gtx_1080_proposed)
  1. [Compare the proposed mechanism vs the original mechanism](#gtx_1080_compare)

<a id="overview"></a>
## Overview

This Jupyter notebook studies the performance (speed) of OpenCL API build and compile calls for the Caffe framework built with the ViennaCL library using two mechanisms:
- [`original`] the current ViennaCL caching mechanism, in which the program binary is cached after the `clBuildProgram()` call but before the `clCreateKernelsInProgram()` call;
- [`proposed`] the new caching mechanism proposed by [dividiti](http://dividiti.com), in which the program binary is cached after both the `clBuildProgram()` and `clCreateKernelsInProgram()` calls;

on two experimental platforms:
- [[Mali-T628](#mali_t628)] ARM Mali-T628 GPU in the Odroid-XU3 development platform (GPU driver v12.0);
- [[GTX 1080](#gtx_1080)] NVIDIA GTX 1080 GPU installed in an HP 640 workstation (GPU driver v375.26).

Our experiments show that ARM's OpenCL implementation only compiles kernels when the user invokes the `clCreateKernelsInProgram()` API call. Therefore, on the Mali-T628 platform, the original mechanism is ineffective, whereas the proposed mechanism accelerates OpenCL Caffe initialisation by over 50 times on subsequent invocations.

Our experiments also show that NVIDIA's OpenCL implementation compiles kernels when the user invokes the `clBuildProgram()` API call; therefore, on the GTX 1080 platform, both the original and the proposed mechanisms perform similarly.
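
The difference between the two mechanisms can be illustrated with a toy model. The sketch below is not OpenCL code: `LazyProgram`, `first_run` and `warm_run` are hypothetical names, and the model simply assumes a driver that defers all compilation to kernel creation, as observed on the Mali-T628 platform.

```python
# Toy model (not real OpenCL): a "lazy" driver that defers kernel compilation
# from the build call to the kernel-creation call.

class LazyProgram:
    """Hypothetical stand-in for an OpenCL program on a lazily compiling driver."""

    def __init__(self, binary_has_kernels=False):
        # True if the program was created from a binary with compiled kernels.
        self.compiled = binary_has_kernels
        self.expensive_compilations = 0

    def build(self):
        # Models clBuildProgram(): no kernels are compiled at this point.
        pass

    def create_kernels(self):
        # Models clCreateKernelsInProgram(): compiles only if still needed.
        if not self.compiled:
            self.expensive_compilations += 1
            self.compiled = True


def first_run(cache_after_kernel_creation):
    """Populate the cache; return whether the cached binary has compiled kernels."""
    program = LazyProgram()
    program.build()
    if cache_after_kernel_creation:   # 'proposed': cache after kernel creation
        program.create_kernels()
    return program.compiled           # 'original': cached right after build


def warm_run(cached_binary_has_kernels):
    """Reuse the cache; return the number of expensive compilations still needed."""
    program = LazyProgram(binary_has_kernels=cached_binary_has_kernels)
    program.build()
    program.create_kernels()
    return program.expensive_compilations


original_warm = warm_run(first_run(cache_after_kernel_creation=False))
proposed_warm = warm_run(first_run(cache_after_kernel_creation=True))
print('original:', original_warm, 'proposed:', proposed_warm)
```

In this model, a warm run with the original mechanism still pays for one compilation, while a warm run with the proposed mechanism pays for none. On a driver that compiles eagerly in the build call, both variants would cache a complete binary.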

### Common experimental setup

The experiments were performed using the [Collective Knowledge](http://cknowledge.org) framework for reproducible and collaborative R&D, with the following preparatory steps on each platform.

For simplicity, only one instance of Caffe, one instance of a Caffe model and one instance of [dividiti's OpenCL profiler](https://github.com/dividiti/dvdt-prof) were registered on the platform at a time, e.g.:
```
$ ck show env --tags=caffemodel
Env UID:         Target OS: Bits: Name:                                         Version: Tags:
e8811419b9c1149c   linux-32    32 Caffe model (net and weights) (bvlc, alexnet) trunk    32bits,alexnet,bvlc,caffe,caffemodel,host-os-linux-32,mirror,net,target-os-linux-32,v0,weights
$ ck show env --tags=lib,caffe     
Env UID:         Target OS: Bits: Name:                                  Version:       Tags:
53834f239eff6c18   linux-32    32 BVLC Caffe framework (opencl,viennacl) master-69f35c5 32bits,bvlc,caffe,host-os-linux-32,lib,target-os-linux-32,v0,v0.0,vmaster,vopencl
$ ck show env --tags=dvdt,prof
Env UID:         Target OS: Bits: Name:                                  Version: Tags:
2ea881239f688658   linux-32    32 dividiti's OpenCL API profiler (cjson) trunk    32bits,cjson,dividiti,dvdt,host-os-linux-32,opencl,prof,profiler,target-os-linux-32,tool,v0,vtrunk
```

To start with, clean previous installations and experiments, and install ViennaCL source-only packages:
1. Clean all previous OpenCL Caffe caching mechanism experiments:
```
$ ck rm local:experiment:original*cache* --force
$ ck rm local:experiment:proposed*cache* --force
```

1. Clean all previous installations of Caffe with ViennaCL and ViennaCL sources:
```
$ ck clean env --tags=lib,caffe,vviennacl
$ ck clean env --tags=lib,viennacl,vsrc
```

1. Install [ViennaCL's master](https://github.com/viennacl/viennacl-dev) (**original**) and [dividiti's fork](https://github.com/dividiti/viennacl-dev) (**proposed**):
```
$ ck install package:lib-viennacl-master-src
$ ck install package:lib-viennacl-dvdt-src
```

First, perform the **original** set of experiments:
1. Install Caffe with ViennaCL, while selecting [ViennaCL's master](https://github.com/viennacl/viennacl-dev):
  1. On the GTX 1080 platform:
```
$ ck install package:lib-caffe-bvlc-opencl-viennacl-universal
``` 
Also, on this platform, `ck-caffe:program:caffe` sometimes fails after reporting the benchmarking results; to perform the requested number of repetitions even if some of them fail, change `"ignore_return_code":"no"` to `"ignore_return_code":"yes"` (for the `time_gpu` command in `program/caffe/.cm/meta.json`).

  1. On the Mali-T628 platform:
```
$ ck install package:lib-caffe-bvlc-opencl-viennacl-universal \
  --env.DISABLE_DOUBLE_SUPPORT=ON \
  --env.DISABLE_DEVICE_HOST_UNIFIED_MEMORY=ON \
  --env.CK_HOST_CPU_NUMBER_OF_PROCESSORS=3
```
1. Run the experiments as detailed below ([Mali-T628](#mali_t628_original) or [GTX 1080](#gtx_1080_original)).

1. Remove Caffe with ViennaCL (***necessary*** before performing the **proposed** set of experiments):
```
$ ck clean env --tags=lib,caffe,vviennacl
```

Then, perform the **proposed** set of experiments:
1. Install Caffe with ViennaCL, while selecting [dividiti's fork](https://github.com/dividiti/viennacl-dev), e.g.:
```
$ ck install package:lib-caffe-bvlc-opencl-viennacl-universal
```

1. Run the experiments as detailed below ([Mali-T628](#mali_t628_proposed) or [GTX 1080](#gtx_1080_proposed)).
1. Remove Caffe with ViennaCL (***optional***):
```
$ ck clean env --tags=lib,caffe,vviennacl
```
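
The `ignore_return_code` change mentioned above for the GTX 1080 platform can also be scripted. The sketch below is only an illustration: the helper name `ignore_return_code` is hypothetical, and it assumes that the `time_gpu` command is stored under `run_cmds` in `program/caffe/.cm/meta.json`, which should be checked against the actual file before use.

```python
import json

def ignore_return_code(meta, cmd_key='time_gpu'):
    """Return a copy of `meta` with ignore_return_code set to 'yes' for `cmd_key`.

    Assumes the command lives under meta['run_cmds'][cmd_key] (not verified here).
    """
    updated = json.loads(json.dumps(meta))  # deep copy via a JSON round trip
    updated['run_cmds'][cmd_key]['ignore_return_code'] = 'yes'
    return updated

# Hypothetical fragment standing in for program/caffe/.cm/meta.json.
meta = {'run_cmds': {'time_gpu': {'ignore_return_code': 'no'}}}
patched = ignore_return_code(meta)
print(json.dumps(patched, indent=2))
```

To apply this to the real file, one would load it with `json.load()`, pass the dictionary through the helper and write the result back with `json.dump()`.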

<a id="code"></a>
## Data wrangling code

**NB:** Please ignore this section if you are not interested in re-running or modifying this notebook.

### Includes

#### Standard

In [None]:
import os
import sys
import json
import re

#### Date util 

In [None]:
import dateutil.parser

#### Scientific

If some of the scientific packages are missing, please install them using:
```
# pip install jupyter pandas numpy matplotlib
```

In [None]:
import IPython as ip
import pandas as pd
import numpy as np
import matplotlib as mp

In [None]:
print ('IPython version: %s' % ip.__version__)
print ('Pandas version: %s' % pd.__version__)
print ('NumPy version: %s' % np.__version__)
print ('Matplotlib version: %s' % mp.__version__)

In [None]:
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

In [None]:
from IPython.display import Image
from IPython.core.display import HTML

#### Collective Knowledge

If CK is not installed, please install it using:
```
# pip install ck
```

In [None]:
import ck.kernel as ck
print ('CK version: %s' % ck.__version__)

### Create, build and compile OpenCL API calls

In [None]:
# OpenCL API calls to create program.
create_calls = [ 'clCreateProgramWithSource', 'clCreateProgramWithBinary' ]

In [None]:
# OpenCL API calls to build program.
build_calls = [ 'clBuildProgram' ]

In [None]:
# OpenCL API calls to compile kernels.
compile_calls = [ 'clCreateKernel', 'clCreateKernelsInProgram' ]

In [None]:
# All OpenCL API calls to create program, build program and compile kernels.
create_build_compile_calls = create_calls + build_calls + compile_calls

### Calculate time elapsed between two ISO timestamps

In [None]:
# Return the difference between the end and start timestamps in seconds.
def ts_delta_s(ts_end, ts_start):
    delta = dateutil.parser.parse(ts_end) - dateutil.parser.parse(ts_start)
    delta_s = delta.total_seconds()
    return delta_s

# Return the difference between the end and start timestamps in milliseconds.
def ts_delta_ms(ts_end, ts_start):
    delta_s = ts_delta_s(ts_end, ts_start)
    delta_ms = delta_s * 1e3
    return delta_ms
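
As a quick sanity check of the filtering and timing logic, the sketch below applies the same idea to a hand-written trace. The trace entries and timestamps are hypothetical, and the standard library's `datetime.fromisoformat()` is used in place of `dateutil` to keep the example dependency-free; it assumes timestamps in the plain ISO format shown.

```python
from datetime import datetime

# Hypothetical trace entries: each call has a name and ISO start/end timestamps.
trace = [
    {'call': 'clBuildProgram',
     'timestamp': {'start': '2017-01-01T12:00:00.000000', 'end': '2017-01-01T12:00:00.250000'}},
    {'call': 'clEnqueueNDRangeKernel',
     'timestamp': {'start': '2017-01-01T12:00:01.000000', 'end': '2017-01-01T12:00:01.001000'}},
    {'call': 'clCreateKernelsInProgram',
     'timestamp': {'start': '2017-01-01T12:00:02.000000', 'end': '2017-01-01T12:00:03.500000'}},
]
calls_of_interest = ['clCreateProgramWithSource', 'clCreateProgramWithBinary',
                     'clBuildProgram', 'clCreateKernel', 'clCreateKernelsInProgram']

def delta_ms(ts_end, ts_start):
    # Difference between two ISO timestamps in milliseconds.
    delta = datetime.fromisoformat(ts_end) - datetime.fromisoformat(ts_start)
    return delta.total_seconds() * 1e3

# Keep only the create/build/compile calls and compute their durations.
durations = [
    (call['call'], delta_ms(call['timestamp']['end'], call['timestamp']['start']))
    for call in trace if call['call'] in calls_of_interest
]
print(durations)
```

The kernel enqueue call is dropped by the filter, leaving one duration for the build call and one for the kernel creation call.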

### Access the results

In [None]:
def get_results(repo_uoa, common_tags):
    module_uoa = 'experiment'
    r = ck.access({'action':'search', 'repo_uoa':repo_uoa, 'module_uoa':module_uoa, 'tags':common_tags})
    if r['return']>0:
        print ("Error: %s" % r['error'])
        exit(1)
    experiments = r['lst']
    
    experiment_dfs = []
    for experiment in experiments:
        data_uoa = experiment['data_uoa']
        r = ck.access({'action':'list_points', 'repo_uoa':repo_uoa, 'module_uoa':module_uoa, 'data_uoa':data_uoa})
        if r['return']>0:
            print ("Error: %s" % r['error'])
            exit(1)

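        # Compare against the list of common tags (not substrings of the comma-separated string).
        unique_tags = ','.join([ tag for tag in r['dict']['tags'] if tag not in common_tags.split(',') ])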
        point_dfs = []
        for point in r['points']:
            point_file_path = os.path.join(r['path'], 'ckp-%s.0001.json' % point)
            with open(point_file_path) as point_file:
                print ('Reading: %s...' % point_file_path)
                point_data_raw = json.load(point_file)
            # Traces for all repetitions of this point.
            trace_list = [
                characteristics['run'].get('dvdt_prof',[]) for characteristics in point_data_raw['characteristics_list']
            ]                
            # All OpenCL API calls to create program, build program and compile kernels.
            create_build_compile_dfs = []
            for trace in trace_list:
                # Only include the first repetition of the 'cache-cold' experiments
                # (as the subsequent ones are in fact 'cache-warm').
                if (unique_tags=='cache-cold' or unique_tags=='cuda-cache-cold') and create_build_compile_dfs: continue
                create_build_compile_trace = [
                    { 'call' : call['call'], 'time_ms': ts_delta_ms(call['timestamp']['end'], call['timestamp']['start']) }
                    for call in trace if call['call'] in create_build_compile_calls 
                ]
                create_build_compile_df = pd.DataFrame(create_build_compile_trace).set_index(['call'], append=True)
                create_build_compile_df.index.names = [ 'id', 'call' ]
                create_build_compile_dfs.append(create_build_compile_df)
            # Aggregate all calls.
            point_df = pd.concat(create_build_compile_dfs, axis=1)
            point_dfs.append(point_df)
        # Aggregate all points.
        experiment_df = pd.concat(point_dfs)
        experiment_df.columns = [ [unique_tags]*len(experiment_df.columns), range(len(experiment_df.columns)) ]
        experiment_df.columns.names = [ 'experiment', 'repetition' ]
        experiment_dfs.append(experiment_df)
    # Aggregate all experiments.
    result_df = pd.concat(experiment_dfs, axis=1)
    # Convert to preferred format (unify clCreateProgram* calls, move experiments into the row index, keep repetitions as columns).
    result_df = result_df \
        .rename(index={'clCreateProgramWithBinary':'clCreateProgram*', 'clCreateProgramWithSource':'clCreateProgram*'}) \
        .stack('experiment')
    return result_df
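
The shape of the table returned by `get_results()` may be easier to see on a small synthetic example. The sketch below uses two hypothetical repetitions of a single experiment with made-up times, and builds the column index with `pd.MultiIndex.from_product()` for clarity rather than reproducing the function line by line.

```python
import pandas as pd

# Two hypothetical repetitions of the same experiment (times in ms).
repetitions = [
    [{'call': 'clBuildProgram', 'time_ms': 250.0},
     {'call': 'clCreateKernelsInProgram', 'time_ms': 1500.0}],
    [{'call': 'clBuildProgram', 'time_ms': 260.0},
     {'call': 'clCreateKernelsInProgram', 'time_ms': 1480.0}],
]

frames = []
for trace in repetitions:
    # Index each call by its position in the trace ('id') and its name ('call').
    df = pd.DataFrame(trace).set_index(['call'], append=True)
    df.index.names = ['id', 'call']
    frames.append(df)

# One column per repetition, labelled with the experiment tag.
table = pd.concat(frames, axis=1)
table.columns = pd.MultiIndex.from_product(
    [['cache-warm'], range(len(frames))], names=['experiment', 'repetition'])

# Move the experiment tag from the columns into the row index.
table = table.stack('experiment')
print(table)
```

Each row is then identified by `(id, call, experiment)`, and each column holds one repetition, which is the layout that the plotting and comparison functions in this notebook operate on.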

### Show the results

In [None]:
def show_results(results):
    pd.options.display.max_columns = len(results.columns)
    pd.options.display.max_rows = len(results.index)
    return results

### Plot the results

In [None]:
def plot_results(results, title='Execution time (ms)', rot=0):
    mean = results.mean(axis=1).unstack('experiment')
    std  = results.std(axis=1).unstack('experiment')
    ymax = mean.max().max()
    mean.plot(yerr=std, kind='bar', title=title,
        rot=rot, figsize=[16, 8], ylim=[0,ymax*1.05],
        grid=True, legend=True, colormap=cm.autumn
    )
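
The bar heights and error bars produced by `plot_results()` are the mean and the sample standard deviation over repetitions. A minimal illustration with the standard library, using made-up times for a single call:

```python
import statistics

# Hypothetical times (ms) of one call over three repetitions.
times_ms = [250.0, 260.0, 255.0]

mean_ms = statistics.mean(times_ms)
std_ms = statistics.stdev(times_ms)  # sample standard deviation (divides by n - 1)
print('%.1f +/- %.1f ms' % (mean_ms, std_ms))
```

With a single repetition, as for the 'cache-cold' experiments, the sample standard deviation is undefined, so no error bar should be expected for those bars.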

### Compare the results

In [None]:
def compare_results(original, proposed, experiment, call):
    def cumulative_per_experiment_per_call(results, experiment, call):
        return results \
            .reorder_levels(['experiment', 'call', 'id']) \
            .loc[experiment] \
            .loc[call] \
            .mean(axis=1).sum()
    
    original_per_experiment_per_call = cumulative_per_experiment_per_call(original, experiment, call)
    proposed_per_experiment_per_call = cumulative_per_experiment_per_call(proposed, experiment, call)
    print ('[%s] all %s() calls w/ original: %.1f (ms)' % (experiment, call, original_per_experiment_per_call))
    print ('[%s] all %s() calls w/ proposed: %.1f (ms)' % (experiment, call, proposed_per_experiment_per_call))
    
    proposed_vs_original_pc = \
        100.0 * (proposed_per_experiment_per_call-original_per_experiment_per_call) / original_per_experiment_per_call
    print ('[%s] all %s() calls (proposed-original)/original: %.1f%%' % (experiment, call, proposed_vs_original_pc))
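
To relate the percentage printed by `compare_results()` to a speed-up factor, the following sketch works through the arithmetic. The input values are hypothetical and chosen only to make the numbers round; they are not measurements, and the helper names are illustrative.

```python
def relative_change_pc(original_ms, proposed_ms):
    """(proposed - original) / original as a percentage (negative means faster)."""
    return 100.0 * (proposed_ms - original_ms) / original_ms

def speedup(original_ms, proposed_ms):
    """How many times faster the proposed mechanism is than the original."""
    return original_ms / proposed_ms

original_ms, proposed_ms = 5000.0, 100.0  # hypothetical, not measured
change_pc = relative_change_pc(original_ms, proposed_ms)
factor = speedup(original_ms, proposed_ms)
print('%.1f%% change, %.0fx speed-up' % (change_pc, factor))
```

A relative change of -98% therefore corresponds to a 50-fold speed-up, and any change below -98% to a speed-up of more than 50 times.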

<a id="mali_t628"></a>
## Mali-T628

<a id="mali_t628_original"></a>
### Mali-T628 - original caching mechanism

#### Experimental setup

The experimental data were collected on the Mali-T628 experimental platform as follows:
```
$ export VIENNACL_CACHE_DIR=/tmp/viennacl-cache/ && rm -rf $VIENNACL_CACHE_DIR && mkdir $VIENNACL_CACHE_DIR
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=original-cache-none \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 \
  --tags=caffe,opencl,build,compile,original,mali-t628,cache-none
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=original-cache-cold \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=1 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,original,mali-t628,cache-cold
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=original-cache-warm \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,original,mali-t628,cache-warm
``` 

The experimental data were archived as follows:
```
$ ck zip local:experiment:original-cache* \
  --archive_name=ck-caffe-opencl-build-compile-original-mali-t628.zip
```

The resulting archive was copied to another machine and prepared for analysis as follows:
```
$ ck add repo:ck-caffe-opencl-build-compile-original-mali-t628 \
  --zip=ck-caffe-opencl-build-compile-original-mali-t628.zip --quiet
```

#### Experimental analysis

In [None]:
repo_uoa = 'ck-caffe-opencl-build-compile-original-mali-t628'
common_tags = 'caffe,opencl,build,compile,original,mali-t628'
mali_t628_original = get_results(repo_uoa, common_tags)

In [None]:
show_results(mali_t628_original)

In [None]:
plot_results(mali_t628_original, rot=90)

The `clCreateProgram*()` calls only take considerable time for the 'cache-warm' experiment (i.e. `clCreateProgramWithBinary()`), while the `clBuildProgram()` calls only take considerable time for the 'cache-none' and 'cache-cold' experiments.

In [None]:
# Select the build calls only (DataFrame.ix was removed in pandas 1.0).
plot_results(mali_t628_original[
    mali_t628_original.index.get_level_values('call').isin(build_calls)
])

The execution time of the `clCreateKernelsInProgram()` calls, however, is practically the same whether the original ViennaCL caching mechanism is used or not, which suggests that the original mechanism is ineffective on this platform.

In [None]:
# Select the compile calls only (DataFrame.ix was removed in pandas 1.0).
plot_results(mali_t628_original[
    mali_t628_original.index.get_level_values('call').isin(compile_calls)
])

<a id="mali_t628_proposed"></a>
### Mali-T628 - proposed caching mechanism

#### Experimental setup

The experimental data were collected on the Mali-T628 experimental platform as follows:
```
$ export VIENNACL_CACHE_DIR=/tmp/viennacl-cache/ && rm -rf $VIENNACL_CACHE_DIR && mkdir $VIENNACL_CACHE_DIR
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=proposed-cache-none \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 \
  --tags=caffe,opencl,build,compile,proposed,mali-t628,cache-none
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=proposed-cache-cold \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=1 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,proposed,mali-t628,cache-cold
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=proposed-cache-warm \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,proposed,mali-t628,cache-warm
``` 

The experimental data were archived as follows:
```
$ ck zip local:experiment:proposed-cache* \
  --archive_name=ck-caffe-opencl-build-compile-proposed-mali-t628.zip
```

The resulting archive was copied to another machine and prepared for analysis as follows:
```
$ ck add repo:ck-caffe-opencl-build-compile-proposed-mali-t628 \
  --zip=ck-caffe-opencl-build-compile-proposed-mali-t628.zip --quiet
```

#### Experimental analysis

In [None]:
repo_uoa = 'ck-caffe-opencl-build-compile-proposed-mali-t628'
common_tags = 'caffe,opencl,build,compile,proposed,mali-t628'
mali_t628_proposed = get_results(repo_uoa, common_tags)

In [None]:
show_results(mali_t628_proposed)

In [None]:
plot_results(mali_t628_proposed, rot=90)

In [None]:
# Select the build calls only (DataFrame.ix was removed in pandas 1.0).
plot_results(mali_t628_proposed[
    mali_t628_proposed.index.get_level_values('call').isin(build_calls)
])

In [None]:
# Select the compile calls only (DataFrame.ix was removed in pandas 1.0).
plot_results(mali_t628_proposed[
    mali_t628_proposed.index.get_level_values('call').isin(compile_calls)
])

<a id="mali_t628_compare"></a>
### Mali-T628 - compare the original mechanism vs the proposed mechanism

In [None]:
compare_results(mali_t628_original, mali_t628_proposed, 'cache-warm', 'clCreateKernelsInProgram')

In [None]:
compare_results(mali_t628_original, mali_t628_proposed, 'cache-warm', 'clBuildProgram')

<a id="gtx_1080"></a>
## GTX 1080

<a id="gtx_1080_original"></a>
### GTX 1080 - original caching mechanism

#### Experimental setup

The experimental data were collected on the GTX 1080 experimental platform as follows:
```
$ export CUDA_CACHE_DIR=$HOME/.nv/ComputeCache/ && rm -rf $CUDA_CACHE_DIR
$ export VIENNACL_CACHE_DIR=/tmp/viennacl-cache/ && rm -rf $VIENNACL_CACHE_DIR && mkdir $VIENNACL_CACHE_DIR
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=original-cuda-cache-cold \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=1 \
  --env.CK_CAFFE_BATCH_SIZE=1 \
  --tags=caffe,opencl,build,compile,original,gtx-1080,cuda-cache-cold
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=original-cuda-cache-warm \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 \
  --tags=caffe,opencl,build,compile,original,gtx-1080,cuda-cache-warm
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=original-cache-none \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.CUDA_CACHE_DISABLE=1 \
  --tags=caffe,opencl,build,compile,original,gtx-1080,cache-none
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=original-cache-cold \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=1 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.CUDA_CACHE_DISABLE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,original,gtx-1080,cache-cold
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=original-cache-warm \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.CUDA_CACHE_DISABLE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,original,gtx-1080,cache-warm
```

The experimental data were archived as follows:
```
$ ck zip local:experiment:original*cache* \
  --archive_name=ck-caffe-opencl-build-compile-original-gtx-1080.zip
```

The resulting archive was copied to another machine and prepared for analysis as follows:
```
$ ck add repo:ck-caffe-opencl-build-compile-original-gtx-1080 \
  --zip=ck-caffe-opencl-build-compile-original-gtx-1080.zip --quiet
```

#### Experimental analysis

In [None]:
repo_uoa = 'ck-caffe-opencl-build-compile-original-gtx-1080'
common_tags = 'caffe,opencl,build,compile,original,gtx-1080'
gtx_1080_original = get_results(repo_uoa, common_tags)

In [None]:
show_results(gtx_1080_original)

In [None]:
plot_results(gtx_1080_original, rot=90)

In [None]:
# Select the build calls only (DataFrame.ix was removed in pandas 1.0).
plot_results(gtx_1080_original[
    gtx_1080_original.index.get_level_values('call').isin(build_calls)
])

In [None]:
# Select the compile calls only (DataFrame.ix was removed in pandas 1.0).
plot_results(gtx_1080_original[
    gtx_1080_original.index.get_level_values('call').isin(compile_calls)
])

<a id="gtx_1080_proposed"></a>
### GTX 1080 - proposed caching mechanism

#### Experimental setup

The experimental data were collected on the GTX 1080 experimental platform as follows:
```
$ export CUDA_CACHE_DIR=$HOME/.nv/ComputeCache/ && rm -rf $CUDA_CACHE_DIR
$ export VIENNACL_CACHE_DIR=/tmp/viennacl-cache/ && rm -rf $VIENNACL_CACHE_DIR && mkdir $VIENNACL_CACHE_DIR
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=proposed-cuda-cache-cold \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=1 \
  --env.CK_CAFFE_BATCH_SIZE=1 \
  --tags=caffe,opencl,build,compile,proposed,gtx-1080,cuda-cache-cold
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=proposed-cuda-cache-warm \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 \
  --tags=caffe,opencl,build,compile,proposed,gtx-1080,cuda-cache-warm
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=proposed-cache-none \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.CUDA_CACHE_DISABLE=1 \
  --tags=caffe,opencl,build,compile,proposed,gtx-1080,cache-none
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=proposed-cache-cold \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=1 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.CUDA_CACHE_DISABLE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,proposed,gtx-1080,cache-cold
$ ck benchmark program:caffe \
  --record --record_failed \
  --record_repo=local --record_uoa=proposed-cache-warm \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.CUDA_CACHE_DISABLE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,proposed,gtx-1080,cache-warm
```

The experimental data were archived as follows:
```
$ ck zip local:experiment:proposed*cache* \
  --archive_name=ck-caffe-opencl-build-compile-proposed-gtx-1080.zip
```

The resulting archive was copied to another machine and prepared for analysis as follows:
```
$ ck add repo:ck-caffe-opencl-build-compile-proposed-gtx-1080 \
  --zip=ck-caffe-opencl-build-compile-proposed-gtx-1080.zip --quiet
```

#### Experimental analysis

In [None]:
repo_uoa = 'ck-caffe-opencl-build-compile-proposed-gtx-1080'
common_tags = 'caffe,opencl,build,compile,proposed,gtx-1080'
gtx_1080_proposed = get_results(repo_uoa, common_tags)

In [None]:
show_results(gtx_1080_proposed)

In [None]:
plot_results(gtx_1080_proposed, rot=90)

In [None]:
# Select the build calls only (DataFrame.ix was removed in pandas 1.0).
plot_results(gtx_1080_proposed[
    gtx_1080_proposed.index.get_level_values('call').isin(build_calls)
])

In [None]:
# Select the compile calls only (DataFrame.ix was removed in pandas 1.0).
plot_results(gtx_1080_proposed[
    gtx_1080_proposed.index.get_level_values('call').isin(compile_calls)
])

<a id="gtx_1080_compare"></a>
### GTX 1080 - compare the original mechanism vs the proposed mechanism

In [None]:
compare_results(gtx_1080_original, gtx_1080_proposed, 'cache-warm', 'clBuildProgram')

In [None]:
compare_results(gtx_1080_original, gtx_1080_proposed, 'cache-warm', 'clCreateKernelsInProgram')