# Evaluating OpenCL Caffe caching mechanisms

## Table of Contents

1. [Overview](#overview)
1. [Get the data](#data) [for developers]
1. [See the code](#code) [for developers]
1. [See the tables](#tables)
1. [See the graphs](#graphs)
   1. [All program create, program build and kernel compile OpenCL API calls](#graphs_all_calls)
   1. [Only kernel compile OpenCL API calls](#graphs_compile_calls)

<a id="data"></a>
## Get the experimental data

**NB:** Please ignore this section if you are not interested in re-running or modifying this notebook. 

The experimental data was collected on the Odroid-XU3 experimental platform and archived as follows:
```
$ export VIENNACL_CACHE_DIR=/tmp/viennacl-cache/
$ rm -rf $VIENNACL_CACHE_DIR
$ mkdir $VIENNACL_CACHE_DIR
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=caffe-opencl-build-compile-original-cache-none \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 \
  --tags=caffe,opencl,build,compile,original,cache-none
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=caffe-opencl-build-compile-original-cache-cold \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,original,cache-cold
$ ck benchmark program:caffe \
  --record --record_repo=local --record_uoa=caffe-opencl-build-compile-original-cache-warm \
  --dvdt_prof --skip_stat_analysis \
  --cmd_key=time_gpu --cpu_freq=max --repetitions=3 \
  --env.CK_CAFFE_BATCH_SIZE=1 --env.VIENNACL_CACHE_PATH=$VIENNACL_CACHE_DIR \
  --tags=caffe,opencl,build,compile,original,cache-warm
$ ck zip local:experiment:caffe-opencl-build-compile-original-cache* \
  --archive_name=caffe-opencl-build-compile-original-odroid-xu3.zip
```

The resulting archive was copied to another machine and extracted for analysis as follows:
```
$ ck add repo:caffe-opencl-build-compile-original-odroid-xu3 \
  --zip=caffe-opencl-build-compile-original-odroid-xu3.zip --quiet
```

<a id="code"></a>
## Data wrangling code

**NB:** Please ignore this section if you are not interested in re-running or modifying this notebook.

### Includes

#### Standard

In [None]:
import os
import sys
import json
import re

#### Date util 

In [None]:
import dateutil.parser

#### Scientific

If some of the scientific packages are missing, please install them using:
```
# pip install jupyter pandas numpy matplotlib
```

In [None]:
import IPython as ip
import pandas as pd
import numpy as np
import matplotlib as mp

In [None]:
print ('IPython version: %s' % ip.__version__)
print ('Pandas version: %s' % pd.__version__)
print ('NumPy version: %s' % np.__version__)
print ('Matplotlib version: %s' % mp.__version__)

In [None]:
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

In [None]:
from IPython.display import Image, HTML

#### Collective Knowledge

If CK is not installed, please install it using:
```
# pip install ck
```

In [None]:
import ck.kernel as ck
print ('CK version: %s' % ck.__version__)

### Create, build and compile OpenCL API calls

In [None]:
# All OpenCL API calls to create program, build program and compile kernels.
create_build_compile_calls = [
    'clCreateProgramWithSource',
    'clCreateProgramWithBinary',
    'clBuildProgram',
    'clCreateKernel',
    'clCreateKernelsInProgram'
]

### Calculate time elapsed between two ISO timestamps

In [None]:
# Return the difference between the end and start timestamps in seconds.
def ts_delta_s(ts_end, ts_start):
    delta = dateutil.parser.parse(ts_end) - dateutil.parser.parse(ts_start)
    delta_s = delta.total_seconds()
    return delta_s

# Return the difference between the end and start timestamps in milliseconds.
def ts_delta_ms(ts_end, ts_start):
    delta_s = ts_delta_s(ts_end, ts_start)
    delta_ms = delta_s * 1e3
    return delta_ms
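
As a quick check of the helpers above, the following sketch computes the same difference using only the standard library. It assumes timestamps of the form `YYYY-MM-DDTHH:MM:SS.ffffff`; the function name and the two sample timestamps are made up for illustration, and `datetime.fromisoformat` stands in for `dateutil.parser.parse`, which accepts a wider range of formats.

```python
from datetime import datetime

# Standard-library stand-in for ts_delta_ms() (hypothetical helper name).
def ts_delta_ms_stdlib(ts_end, ts_start):
    delta = datetime.fromisoformat(ts_end) - datetime.fromisoformat(ts_start)
    return delta.total_seconds() * 1e3

# Hypothetical timestamps, 1.5 seconds apart.
sample_start = '2016-10-21T12:00:00.250000'
sample_end = '2016-10-21T12:00:01.750000'
sample_delta_ms = ts_delta_ms_stdlib(sample_end, sample_start)
print('Elapsed: %.1f ms' % sample_delta_ms)
```

Note the argument order: the end timestamp comes first, so a positive result means the call finished after it started.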

### Access the experimental data

In [None]:
def get_experimental_results(repo_uoa, common_tags):
    module_uoa = 'experiment'
    r = ck.access({'action':'search', 'repo_uoa':repo_uoa, 'module_uoa':module_uoa, 'tags':common_tags})
    if r['return']>0:
        # Fail with an exception rather than exiting the interpreter.
        raise RuntimeError("CK error: %s" % r['error'])
    experiments = r['lst']
    
    experiment_dfs = []
    for experiment in experiments:
        data_uoa = experiment['data_uoa']
        r = ck.access({'action':'list_points', 'repo_uoa':repo_uoa, 'module_uoa':module_uoa, 'data_uoa':data_uoa})
        if r['return']>0:
            # Fail with an exception rather than exiting the interpreter.
            raise RuntimeError("CK error: %s" % r['error'])

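        # Compare against the list of common tags (not the raw string) to avoid substring matches.
        unique_tags = ','.join([ tag for tag in r['dict']['tags'] if tag not in common_tags.split(',') ])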
        point_dfs = []
        for point in r['points']:
            with open(os.path.join(r['path'], 'ckp-%s.0001.json' % point)) as point_file:
                point_data_raw = json.load(point_file)
            # Traces for all repetitions of this point.
            trace_list = [
                characteristics['run'].get('dvdt_prof',[]) for characteristics in point_data_raw['characteristics_list']
            ]                
            # All OpenCL API calls to create program, build program and compile kernels.
            create_build_compile_dfs = []
            for trace in trace_list:
                # Only include the first repetition of the 'cache-cold' experiment 
                # (as the subsequent ones are in fact 'cache-warm').
                if unique_tags=='cache-cold' and create_build_compile_dfs: continue
                create_build_compile_trace = [
                    { 'call' : call['call'], 'time_ms': ts_delta_ms(call['timestamp']['end'], call['timestamp']['start']) }
                    for call in trace if call['call'] in create_build_compile_calls 
                ]
                create_build_compile_df = pd.DataFrame(create_build_compile_trace).set_index(['call'], append=True)
                create_build_compile_dfs.append(create_build_compile_df)
            # Aggregate all calls.
            point_df = pd.concat(create_build_compile_dfs, axis=1)
            point_dfs.append(point_df)
        # Aggregate all points.
        experiment_df = pd.concat(point_dfs)
        experiment_df.columns = [ [unique_tags]*len(experiment_df.columns), range(len(experiment_df.columns)) ]
        experiment_dfs.append(experiment_df)
    # Aggregate all experiments.
    result_df = pd.concat(experiment_dfs, axis=1)
    # Convert to preferred format.
    result_df = result_df.rename(
        index={'clCreateProgramWithBinary':'clCreateProgram*', 'clCreateProgramWithSource':'clCreateProgram*'}
    )
    result_df = result_df.stack(0)
    return result_df
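
The core of the function above is the per-trace filtering step and the rule of keeping only the first 'cache-cold' repetition. The sketch below isolates both using only the standard library. The helper names (`elapsed_ms`, `select_traces`) and the sample trace are hypothetical; only the dictionary keys (`call`, `timestamp`, `start`, `end`) mirror what the code above reads.

```python
from datetime import datetime

# Same call names as in create_build_compile_calls.
calls_of_interest = [
    'clCreateProgramWithSource',
    'clCreateProgramWithBinary',
    'clBuildProgram',
    'clCreateKernel',
    'clCreateKernelsInProgram'
]

# Standard-library stand-in for ts_delta_ms().
def elapsed_ms(ts_end, ts_start):
    delta = datetime.fromisoformat(ts_end) - datetime.fromisoformat(ts_start)
    return delta.total_seconds() * 1e3

# For 'cache-cold', only the first repetition starts with an empty cache.
def select_traces(unique_tags, trace_list):
    return trace_list[:1] if unique_tags == 'cache-cold' else trace_list

# Hypothetical trace with made-up timestamps.
sample_trace = [
    {'call': 'clCreateProgramWithSource',
     'timestamp': {'start': '2016-10-21T12:00:00.000000', 'end': '2016-10-21T12:00:00.002000'}},
    {'call': 'clEnqueueNDRangeKernel',
     'timestamp': {'start': '2016-10-21T12:00:00.003000', 'end': '2016-10-21T12:00:00.004000'}},
    {'call': 'clBuildProgram',
     'timestamp': {'start': '2016-10-21T12:00:00.010000', 'end': '2016-10-21T12:00:00.510000'}},
]

# Keep only the calls of interest and record their duration.
filtered = [
    {'call': c['call'], 'time_ms': elapsed_ms(c['timestamp']['end'], c['timestamp']['start'])}
    for c in sample_trace if c['call'] in calls_of_interest
]

for row in filtered:
    print('%-28s %8.1f ms' % (row['call'], row['time_ms']))
```

Here the `clEnqueueNDRangeKernel` entry is dropped because it is not one of the create, build or compile calls.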

### Plot the execution time

In [None]:
def plot(mean, std, title='Execution time (ms)', ymax=0, rot=0):
    ymax = mean.max().max() if ymax==0 else ymax
    mean.plot(yerr=std, ylim=[0,ymax*1.05], title=title,
        kind='bar', rot=rot, figsize=[16, 8], grid=True, legend=True, colormap=cm.autumn)

<a id="tables"></a>
## Tables

In [None]:
repo_uoa = 'caffe-opencl-build-compile-original-odroid-xu3'
common_tags = 'caffe,opencl,build,compile,original'
result = get_experimental_results(repo_uoa, common_tags)

In [None]:
pd.options.display.max_columns = len(result.columns)
pd.options.display.max_rows = len(result.index)
result

<a id="graphs"></a>
## Graphs

<a id="graphs_all_calls"></a>
### All program create, program build and kernel compile OpenCL API calls

In [None]:
mean = result.mean(axis=1).unstack()
std = result.std(axis=1).unstack()
plot(mean, std, rot=90)

The `clCreateProgram*()` calls only take considerable time for the 'cache-warm' experiment (i.e. `clCreateProgramWithBinary()`), while the `clBuildProgram()` calls only take considerable time for the 'cache-none' and 'cache-cold' experiments.

The execution time of the `clCreateKernelsInProgram()` calls, however, is practically the same with or without the original ViennaCL caching mechanism, which suggests that the mechanism is simply ineffective on this platform.
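
The bar heights and error bars in the graph above are the mean and standard deviation taken across the repetition columns of each row (`axis=1`). The sketch below reproduces that arithmetic for a single row with the standard library; the three timings are hypothetical values chosen for easy checking, not measurements.

```python
import statistics

# Hypothetical execution times (ms) of one call over three repetitions.
sample_times_ms = [10.0, 12.0, 14.0]

sample_mean = statistics.mean(sample_times_ms)
# Sample standard deviation (divides by n-1), matching the pandas default of ddof=1.
sample_std = statistics.stdev(sample_times_ms)
print('%.1f +/- %.1f ms' % (sample_mean, sample_std))
```

For 'cache-cold', where only the first repetition is kept, a sample standard deviation cannot be computed from a single value; pandas reports it as NaN, so presumably no error bar is drawn for those bars.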

<a id="graphs_compile_calls"></a>
### Only kernel compile OpenCL API calls

In [None]:
# Use .loc for label-based selection (.ix is no longer available in recent pandas versions).
plot(mean.swaplevel().loc['clCreateKernelsInProgram'], std.swaplevel().loc['clCreateKernelsInProgram'])