Autotuning example: HOG



Please do not forget to check the Getting Started Guides to understand CK concepts!


Example of autotuning, performance modeling and run-time adaptation (predictive scheduling) of the OpenCL HOG program from the CARP project

We expect that you have already followed the previous section of the Getting Started Guide, so you are familiar with the main CK concepts and have installed CK on your machine.

In this part, we will demonstrate how to reproduce performance/energy autotuning, modeling and predictive CPU/GPU scheduling experiments using a compute-intensive image processing program. The program computes Histogram of Oriented Gradients (HOG), a feature descriptor frequently used in computer vision for object detection.

This program was extracted by Realeyes from their flagship emotion estimation application, as part of the EU FP7 CARP project. We imported the HOG program into Collective Knowledge to make it easy to compile, run and tune across multiple platforms (and eventually convert it into a self-tuning adaptive library as we describe in our long-term vision paper).

The program takes one JPEG image as input and computes a feature descriptor, first on the CPU and then on the GPU (using OpenCL). The program compares output from the CPU and the GPU for correctness. Finally, the program reports the wall-clock execution time on the CPU, the wall-clock execution time on the GPU (including memory copies) and the kernel execution time on the GPU (obtained via the OpenCL profiling API).

Obtaining repository with shared artifacts

You can obtain the shared components from GitHub (source code, data sets, packages) via

 $ ck pull repo:reproduce-carp-project

You can check that you now have the HOG program via:

 $ ck list program:*hog*
or
 $ ck search program --tags=hog

You should see two entries:

  • realeyes-hog-opencl - sequential CPU version
  • realeyes-hog-opencl-tbb - parallel CPU version (uses Intel TBB)

Note that HOG requires a C++11 compatible compiler (e.g. GCC 4.9). Please see our public notes via

 $ ck wiki program:realeyes-hog-opencl

Installing software dependencies

We will perform further experiments with realeyes-hog-opencl-tbb. This program depends on the following 4 libraries:

  • OpenCL - your GPU vendor's OpenCL library.
  • TBB - Intel's Threading Building Blocks.
  • OpenCV - popular open-source image/video processing library.
  • xOpenME - our runtime library to instrument applications and tools, and expose their various run-time properties to the outside world via JSON or to tune their internal heuristics.

We provide CK packages for all of the above dependencies except OpenCL. We expect that OpenCL is already installed on your target machine, so you only have to register its path in CK via a soft module. For this purpose, find the closest CK software description for your OpenCL library via:

 $ ck list soft:lib.opencl.*

You can register a generic Linux OpenCL library (libOpenCL.so) in CK via:

 $ ck setup soft:lib.opencl.linux

If you use Mali GPUs, you can set up a related OpenCL version for Linux (for example, for Chromebooks) via

 $ ck setup soft:lib.opencl.mali
or for Android via
 $ ck setup soft:lib.opencl.mali --target_os=android19-arm

You will be asked a few questions, including the installation path for this library excluding /lib (i.e. enter /usr if it is installed in /usr/lib).

To test whether the OpenCL library is installed correctly, compile and run a simple OpenCL program that prints some device info:

 $ ck compile program:tool-print-opencl-devices
 $ ck run program:tool-print-opencl-devices

If you need to run an OpenCL program as root, add --sudo and enter your root password when prompted:

 $ ck run program:tool-print-opencl-devices --sudo

You can also install TBB and OpenCV from shared CK packages:

 $ ck install package:lib-tbb43-20150424oss-src
 $ ck install package:lib-opencv-2.4.11-opencl-tbb-src-linux

In both cases, installation scripts will download sources from the TBB and OpenCV websites, and will rebuild them.

Due to some complexities of OpenCV installation, we strongly suggest an alternative solution: installing OpenCV as a binary package using standard OS procedures (for example, via sudo apt-get install libopencv-dev on Ubuntu/Debian, or by downloading and installing the OpenCV .exe package on Windows) and then registering it in CK, similarly to OpenCL, via

 $ ck setup soft:lib.opencv

If you still wish to recompile OpenCV, we provide separate packages for Windows and Android. On Windows, it is possible to install it via:

 $ ck install package:lib-opencv-2.4.11-opencl-tbb-src-win
while for Android on Windows via:
 $ ck install package:lib-opencv-2.4.11-opencl-tbb-src-android target_os=android19-arm
or for Android on Linux via:
 $ ck install package:lib-opencv-2.4.11-opencl-tbb-src-android-on-linux target_os=android19-arm

Note that, in the future, it should be possible to improve the installation procedures to have only one installation script per package for all hosts and targets - this could be a possible GSoC project.

As for our xOpenME runtime library, which exposes various internal application parameters to the outside world, it is not strictly necessary to install it here since it will be installed during the first compilation of a program that uses it. However, just in case, here is how to install this package explicitly:

 $ ck install package:lib-rtl-xopenme

Installing hardware- and OS-dependent scripts

Some low-level hardware functionality, such as setting CPU and GPU frequencies, obtaining hardware counters, or monitoring the energy consumed by various parts of the hardware, may require OS-specific tools and scripts.

We have started collecting and unifying such scripts in the platform.init container in the ck-autotuning repository. You can check the available entries via

 $ ck list platform.init

Currently, we provide support for the following platforms:

  • chromebook-ubuntu
  • generic-android
  • generic-linux
  • generic-odroid

Before starting experiments, you need to copy (and possibly customize) the scripts from the directory of the closest matching platform to any directory in your PATH. For example, you can create a new directory $HOME/bin, copy all the scripts there, and add it to your PATH:

 $ export PATH=$HOME/bin:$PATH
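
For example, on a generic Linux host the copying itself may look as follows (a sketch: ck find prints the path to a CK entry, and you should adjust the entry name to match your platform):

 $ mkdir -p $HOME/bin
 $ cp `ck find platform.init:generic-linux`/* $HOME/bin
 $ chmod u+x $HOME/bin/ck-*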

Now you can try several scripts (they may require your sudo password):

 $ ck-print-cpu-freq
 $ ck-print-gpu-freq 

Then, you can try to set your system to the maximum frequency:

 $ ck-set-performance
 $ ck-print-cpu-freq
 $ ck-print-gpu-freq

Next, you can try to set your system to power-saving mode:

 $ ck-set-powersave
 $ ck-print-cpu-freq
 $ ck-print-gpu-freq

If something does not work, you may customize your local scripts or even share versions for your platform.

Note that if you use an Android-based target, you should copy these scripts to /data/local/tmp.

Also note that the Odroid platform has scripts that can measure voltage, current, power and energy for the GPU, the memory system, and the A7 and A15 cores, as well as processor temperature and fan speed.

Compiling and running HOG via program pipeline

If all the above steps succeeded, it is now possible to compile and run HOG via the program pipeline.

First, let's try to run HOG while setting CPU and GPU frequency to max:

 $ ck run pipeline:program program_uoa=realeyes-hog-opencl-tbb --speed --cpu_freq=max --gpu_freq=max dataset_uoa=3fe364dc4d218734 --repetitions=3 --save_to_file=$PWD/tmp-output-max.json

Here, $PWD will be substituted by the Linux bash with the current directory (otherwise, the output will be recorded in the tmp directory of the compiled program). On Windows, you may want to use %CD% instead.

This should set the CPU/GPU frequencies to their maximum, select one of the available datasets, run the program 3 times, perform basic statistical analysis of the variation, and save the workflow (pipeline) state into tmp-output-max.json.
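
As a quick sanity check, you can inspect the recorded state programmatically; a minimal Python sketch, assuming the run above succeeded (these flat keys are described in more detail at the end of this page):

 import json

 with open('tmp-output-max.json') as f:
     state = json.load(f)

 # Mean wall-clock CPU execution time from the statistical analysis.
 print(state['last_stat_analysis']['dict_flat'].get(
     '##characteristics#run#dim_cpu#mean'))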

It is also possible to run the same experiment but with the CPU and GPU frequencies set to their minimum values, i.e.:

 $ ck run pipeline:program program_uoa=realeyes-hog-opencl-tbb --speed --cpu_freq=min --gpu_freq=min dataset_uoa=3fe364dc4d218734 --repetitions=3 --save_to_file=$PWD/tmp-output-min.json

Note that on an Odroid board it is possible to add energy measurements to the characteristics simply via the extra flag --energy - one sub-case of CK-powered universal and multi-objective autotuning, such as balancing performance, energy, speed and code size.

To record the pipeline output to an experiment entry with some tags:

 $ ck run pipeline:program ... --record --record_uoa=my_experiment --tags=autotuning,hog

If you start the CK web front-end, you can then view experimental results via the universal experiment viewer (with sortable tables and cross-linked CK info):

 $ ck start web
 $ firefox http://localhost:3344/?wcid=experiment:

As usual, we collect various notes about successful or problematic compilation and execution on the wiki via:

 $ ck wiki program:realeyes-hog-opencl

Autotuning HOG via CK

Next, we will demonstrate how to automatically explore various choices via the autotune function of the pipeline module.

We shared the following exploration/autotuning examples that can be found via ck list demo:*-hog*:

  • explore-hog-cpu-gpu-freq - exploring HOG CPU/GPU characteristics (execution time, energy, compilation time, etc) vs varying CPU/GPU frequency
  • explore-hog-datasets - exploring HOG CPU/GPU behavior for different data sets (you can check shared data sets for HOG via ck list reproduce-carp-project:dataset:)
  • explore-hog-lws - autotuning OpenCL local worksize
  • explore-hog-compilers - exploring HOG CPU/GPU behavior across all compilers registered in CK (LLVM, GCC, Intel, PGI, etc.)
  • explore-hog-image-bs-freq - exploring CPU/GPU behavior vs randomly varying data set, CPU/GPU frequency and BLOCK SIZE. This example is also used for adaptive scheduling via predictive modeling and active learning.

For each exploration scenario, you need to prepare a program pipeline that will be used for autotuning. For simplicity, we prepared scripts to help you with these repetitive tasks: just execute _clean_program_pipeline.bat and then _setup_program_pipeline.bat (as shown below). This will create the program pipeline JSON input file _setup_program_pipeline_tmp.json.
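
For example, for the frequency exploration scenario on a Linux host (a sketch: ck find prints the path to the demo entry):

 $ cd `ck find demo:explore-hog-cpu-gpu-freq`
 $ ./_clean_program_pipeline.bat
 $ ./_setup_program_pipeline.bat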

At any time, you can view experimental results via:

 $ ck start web
 $ firefox http://localhost:3344/?wcid=experiment:

CPU/GPU frequency exploration

In this scenario, we would like to analyze HOG behavior under different CPU and GPU frequencies. Since we do not need to recompile the code for each exploration step, we add the following parameters when preparing the program pipeline:

 {
  "compile_only_once":"yes",
  "no_state_check":"yes",
  "compiler_vars":{"HOG4X4":""}
 }

Note that we added detection of frequency changes to the program pipeline during autotuning in order to correctly attribute speedups. However, in this exploration scenario (where we change frequencies deliberately), we use "no_state_check":"yes" to skip this check.

We also added the compiler_vars key to the pipeline to specify which HOG algorithm to use during exploration. Currently, this HOG implementation has 3 different image-processing algorithms (using different grids for image processing). It is possible to select one of the following in the above example:

  • {"HOG4X4":""} or
  • {"HOG2X2":""} or
  • {"HOG1X1":""}

We added a script autotune_program_pipeline_freq.bat to perform random exploration of frequencies. It is possible to customize this exploration via JSON file autotune_program_pipeline_freq.json:

 {
  "choices_order":[
    ["##cpu_freq"],
    ["##gpu_freq"]
  ],
  "choices_selection": [
    {"type":"random", "start":300000, "stop":1900000, "step":100000, "default":1900000},
    {"type":"random", "choice":[100000000,177000000,266000000,350000000,420000000,480000000,543000000,600000000], "default":600000000}
  ],
  "seed":12345,
  "iterations":1000,
  "repetitions":3,
  "record":"yes",
  "record_uoa":"demo-hog-cpu-gpu-freq-random",
  "features_keys_to_process":["##choices#*"],
  "record_params": {"search_point_by_features":"yes"}
 }

Here, we tell our universal explorer to tune 2 dimensions via choices_order, and we define random exploration with the given ranges for each of these dimensions via choices_selection.

Note that we also provide an IPython Notebook to run the same experiments in interactive mode:

 $ ipython notebook start_pipeline_hog.ipynb

Experimental results are continuously accumulated in the experiment:demo-hog-cpu-gpu-freq-random entry in a local CK repository.

We also prepared 5 scripts to visualize results:

  • plot_3d_scatter.bat
  • plot_3d_trisurf.bat
  • plot_heat_map.bat
  • reproduce_density_graph_all.bat
  • reproduce_hist_graph_all.bat

Each script invokes:

 $ ck plot graph: @input.json

Here is a sample input to plot a heat map (similar to graphs from this paper):

 {
  "experiment_module_uoa":"experiment",
  "data_uoa_list":["demo-hog-cpu-gpu-freq-random"],
  "flat_keys_list":[
    "##choices#cpu_freq#min",
    "##choices#gpu_freq#min",
    "##characteristics#run#derived_cpu_time_div_by_gpu_with_mem_time#min"
  ],
  "plot_type":"mpl_2d_heatmap",
  "display_x_error_bar":"no",
  "display_y_error_bar":"no",
  "title":"Powered by Collective Knowledge",
  "axis_x_desc":"CPU frequency",
  "axis_y_desc":"GPU frequency",
  "axis_z_desc":"CPU / GPU with mem",
  "plot_grid":"no",
  "mpl_image_size_x":"12",
  "mpl_image_size_y":"6",
  "mpl_image_dpi":"100",
  "point_style":{"0":{"elinewidth":"0", "marker":"s", "size":400, "colorbar_orietation":"horizontal", "colorbar_label":"test"}}
 }

Note that you can substitute the flat dimensions from the experimental results in flat_keys_list with any available ones (just look at the points in experiment:demo-hog-cpu-gpu-freq-random). It is also possible to do this via interactive graphs that can be integrated with interactive papers, as described in other sections of this guide.

Dataset exploration

An example of exploring HOG behavior across multiple shared data sets is available in demo:explore-hog-datasets. It can be useful for finding unexpected behavior and representative data sets for a given program, in order to reduce tuning time and enable adaptive applications (see 1, 2).

You can start the exploration via the explore_datasets.bat script, which takes explore_datasets.json as customizable input:

 {
  "choices_order":[
    ["##dataset_uoa"]
  ],
  "choices_selection": [
    {"type":"loop", "default":"3fe364dc4d218734"}
  ],
  "pipeline_update":{
     "cpu_freq":"max",
     "gpu_freq":"max",
     "compiler_vars": {
       "HOG4X4":"",
       "BLOCK_SIZE":64
     }
   },
  "seed":12345,
  "iterations":1000,
  "repetitions":1,
  "record":"yes",
  "record_uoa":"explore-hog-4x4-datasets",
  "features_keys_to_process":["##choices#*"],
  "record_params": {
    "search_point_by_features":"yes"
  }
 }

Here, we set the CPU and GPU to their maximum frequencies, select the HOG4X4 algorithm, fix BLOCK_SIZE to 64, and explore HOG behavior across all available data sets.

Experimental results are aggregated in experiment:explore-hog-4x4-datasets and can be visualized via 2 shared scripts:

  • plot_heat_map.bat
  • plot_with_variation.bat

OpenCL local worksize autotuning

In this autotuning example, we exhaustively search for the OpenCL local worksize (LWS) parameters that minimize execution time (it could also be energy or any other exposed characteristic supported by the platform) for a fixed algorithm and data set.

First, we run and record HOG behavior for the default LWS using the autotune_program_pipeline_base_best.bat script.

Then, we can perform autotuning via autotune_program_pipeline_lws.bat which takes the following JSON input:

 {
  "choices_order":[
    ["##compiler_vars#LWS_X"],
    ["##compiler_vars#LWS_Y"],
    ["##compiler_vars#LWS_Z"]
  ],
  "choices_selection": [
    {"type":"loop", "choice":[2,4,8,16,24,32,48,64], "default":2},
    {"type":"loop", "choice":[1,2,4,8], "default":1},
    {"type":"loop", "choice":[2,4,8,16,32,48,64,96,128], "default":2}
  ],
 "pipeline_update":{
  "cpu_freq":"max",
  "gpu_freq":"max",
  "compiler_vars": {
    "HOG2X2":"",
    "BLOCK_SIZE":64
  }
 },
  "seed":12345,
  "iterations":1000,
  "repetitions":3,
  "record":"yes",
  "record_uoa":"autotune-demo-hog-lws-4x4-bs64-gcc-loop",
  "features_keys_to_process":["##choices#*"],
  "record_params": {
    "search_point_by_features":"yes"
  }
 }

It is then possible to visualize the results using the plot_with_variation.bat script and to customize the visualization via plot_with_variation.json.

Exploration of various compilers

Whenever a user has several compilers or compiler versions registered in CK, it is possible to explore program behavior across all of them (useful for automatically detecting compiler optimization heuristic regressions across all shared benchmarks and data sets).

We prepared a sample script in demo:explore-hog-compilers which can be executed via explore_compilers.bat with the following JSON input file explore_compilers.json:

 {
  "choices_order":[
    ["##compiler_env_uoa"]
  ],
  "choices_selection": [
    {"type":"loop"}
  ],
  "pipeline_update":{
     "cpu_freq":"max",
     "gpu_freq":"max",
     "compiler_vars": {
       "HOG4X4":"",
       "BLOCK_SIZE":64
     }
   },
  "seed":12345,
  "iterations":1000,
  "repetitions":2,
  "record":"yes",
  "record_uoa":"explore-hog-compilers",
  "features_keys_to_process":["##choices#*"],
  "record_params": {
    "search_point_by_features":"yes"
  }
 }

Our autotuner will query CK to detect environments for compilers via:

 $ ck show env --tags=compiler,lang-cpp

and then will iterate over all of them while compiling, running HOG and saving results in experiment:explore-hog-compilers.

Exploring all choices and building predictive model for adaptive scheduling

In this shared example, we attempt to reproduce our work on predictive scheduling for heterogeneous architectures via CK.

You may start continuous exploration via explore.bat, which takes explore.json as input:

 {
  "choices_order":[
    ["##dataset_uoa"],
    ["##compiler_vars#BLOCK_SIZE"],
    ["##cpu_freq"],
    ["##gpu_freq"]
  ],
  "choices_selection": [
    {"type":"random-with-next", "default":"image-jpeg-carp-eso1113a"},
    {"type":"random-with-next", "choice":[1,4,8,16,32,48,64,80,96,112,128], "default":64},
    {"type":"random-with-next", "start":300000, "stop":1900000, "step":100000, "default":1900000},
    {"type":"random-with-next", "choice":[100000000,177000000,266000000,350000000,420000000,480000000,543000000,600000000], "default":600000000}
  ],
  "pipeline_update":{
   "compiler_vars": {
     "HOG4X4":""
   },
   "no_state_check":"yes"
  },
  "seed":12345,
  "iterations":1000,
  "repetitions":3,
  "record":"yes",
  "record_uoa":"explore-hog-4x4-all-random",
  "features_keys_to_process":["##choices#*"],
  "record_params": {
    "search_point_by_features":"yes"
  }
 }

During each exploration step, all choices are randomly selected, the code is recompiled and executed, and the results are aggregated in experiment:explore-hog-4x4-all-random.

At any time, it is possible to build a decision tree via CK that correlates features (choices, run-time information such as data set features including height and width exposed via the OpenME interface, etc.) with our derived characteristic ##characteristics#run#derived_gpu_only_is_much_better_cpu#min.

We prefer decision trees to other models (such as deep neural networks) because our aim is not only to show that it is possible to predict something (a pitfall of many papers on using machine learning in computer engineering), but also to understand and improve the correlations between features and results, fix unexpected behavior (by finding new features or changing models), and thus help software/hardware developers improve their technology.

In the future, we plan to automatically integrate such compact decision trees with any given library (such as HOG) to enable adaptive libraries that are continuously updated whenever more knowledge is collected from other users (as described in this paper).

We prepared several scripts to build such decision trees with scikit-learn using different maximum depths (to trade off precision against the complexity and cost of the decision tree):

  • model-sklearn-dtc-build.bat - depth 2
  • model-sklearn-dtc-build-ds3.bat - depth 3
  • model-sklearn-dtc-build-ds4.bat - depth 4

These scripts call the model module from the ck-analytics repository via:

 $ ck build model: @model-input.json @model-sklearn-dtc.json

where model-sklearn-dtc.json selects and parameterizes the decision tree as follows:

 {
  "model_module_uoa":"model.sklearn",
  "model_name":"dtc",
  "model_file":"model-sklearn-dtc",
  "model_params":{"max_depth":2}
 }

and model-input.json selects the inputs (features) and outputs (characteristics) from experiment:explore-hog-4x4-all-random to be correlated, as follows:

 {
  "data_uoa":"explore-hog-4x4-all-random",
  "features_flat_keys_ext":"#min",
  "features_flat_keys_list":[
    "##choices#compiler_vars#BLOCK_SIZE",
    "##choices#cpu_freq",
    "##choices#gpu_freq",
    "##characteristics#run#run_time_state#lws_x", 
    "##characteristics#run#run_time_state#lws_y", 
    "##characteristics#run#run_time_state#lws_z", 
    "##features#dataset#height",
    "##features#dataset#width",
    "##features#dataset#total_size"
   ],
   "features_flat_keys_desc": {
    "##choices#compiler_vars#BLOCK_SIZE":{"name":"block size"},
    "##choices#cpu_freq":{"name":"CPU frequency"},
    "##choices#gpu_freq":{"name":"GPU frequency"},
    "##characteristics#run#run_time_state#lws_x":{"name":"LWS X"}, 
    "##characteristics#run#run_time_state#lws_y":{"name":"LWS Y"}, 
    "##characteristics#run#run_time_state#lws_z":{"name":"LWS Z"}, 
    "##features#dataset#height":{"name":"image height"},
    "##features#dataset#width":{"name":"image width"},
    "##features#dataset#total_size":{"name":"image size"}
  },
  "characteristics_flat_keys_list":[
    "##characteristics#run#derived_gpu_only_is_much_better_cpu#min"
  ],
  "remove_points_with_none":"yes"
 }

Here, the inputs (features) are defined as flat keys via features_flat_keys_list, and the outputs (characteristics) are defined as flat keys via characteristics_flat_keys_list. Note that we can also provide user-friendly descriptions of features via features_flat_keys_desc to make the resulting decision tree easier to interpret.
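
For reference, here is a conceptual Python sketch of what such a decision-tree build boils down to (this is not the actual model.sklearn code from ck-analytics; the toy data stands in for points extracted from experiment:explore-hog-4x4-all-random):

 from sklearn import tree

 feature_names = ['block size', 'CPU frequency', 'GPU frequency',
                  'image height', 'image width', 'image size']

 # Toy stand-in for experiment points: each row is one exploration step;
 # y is ##characteristics#run#derived_gpu_only_is_much_better_cpu#min.
 X = [[64, 1900000, 600000000, 1080, 1920, 2073600],
      [16,  300000, 100000000,  240,  320,   76800],
      [64, 1900000, 100000000,  480,  640,  307200]]
 y = [1, 0, 0]

 clf = tree.DecisionTreeClassifier(max_depth=2)  # cf. "model_params":{"max_depth":2}
 clf.fit(X, y)

 # Rough equivalent of the generated model-sklearn-dtc.model.dot file.
 with open('model-sklearn-dtc.model.dot', 'w') as f:
     tree.export_graphviz(clf, out_file=f, feature_names=feature_names)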

After executing such a script, a number of auxiliary files model-sklearn-dtc* will be created:

  • model-sklearn-dtc.model.obj - Python object with the decision tree (can be shared and validated by other users)
  • model-sklearn-dtc.model.pdf - decision tree in PDF
  • model-sklearn-dtc.model.dot - decision tree in dot format
  • model-sklearn-dtc.model.decision_tree.json - decision tree in JSON format
  • model-sklearn-dtc.model.ft.txt - description of all features
  • model-sklearn-dtc.model.inp.char.json - list of all characteristics in numerical format
  • model-sklearn-dtc.model.inp.ft.json - list of all features in numerical format

Here is an example of such a decision tree with depth 4 after 100 iterations when exploring HOG on a Samsung Chromebook 2:

[Image: decision tree of depth 4 (100 iterations, Samsung Chromebook 2)]

We plan to use these files to eventually reproduce our past prototype of predictive scheduling via active learning, by continuously updating the decision tree and integrating it with the HOG code - we are simply limited in time and resources, so any help is appreciated.

Finally, it is possible to validate the predictions using the model-sklearn-dtc-validate.bat script:

 $ ck validate model: @model-input.json  @model-sklearn-dtc.json > model-sklearn-dtc-validate.txt

The output will show the model RMSE, the prediction rate, and all mispredictions - this information can be used to improve the model, find missing features, increase the depth, etc. Interestingly, we may use the CK autotuner to optimize the model itself, i.e. its size and speed versus RMSE (future work). We envision that such compact, shared models will be actively used to pack large experimental data sets and keep only unexpected behavior (see our papers 1, 2 for further details), thus solving our own "big data" problem.
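
For intuition, the validation essentially compares model predictions against the recorded outcomes; a toy Python sketch (not the actual ck-analytics code; the values are invented):

 import math

 y_true = [1, 0, 0, 1]   # recorded characteristic per experiment point
 y_pred = [1, 0, 1, 1]   # model predictions for the same points

 rate = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
 rmse = math.sqrt(sum((t - p)**2 for t, p in zip(y_true, y_pred)) / len(y_true))
 print('prediction rate:', rate, 'RMSE:', rmse)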

Exposing run-time parameters via xOpenME interface

As you may have noticed in the above example, we need to expose various run-time features (for example, of data sets) to correlate them with choices and characteristics.

We developed a simple xOpenME library that can record such parameters to a JSON file. Here is an example of xOpenME instrumentation in test_hog.cpp:

 #ifdef XOPENME
 #include <xopenme.h>
 #endif

 int main(...) {
  #ifdef XOPENME
    xopenme_init(1,16); /* how many clocks (1) and parameters (16) to allocate */
    xopenme_clock_start(0);
  #endif
  ...
  time_hog(...);
  #ifdef XOPENME
    xopenme_clock_end(0);
    xopenme_dump_state(); /* dump state to "tmp-ck-timer.json", processed by the CK pipeline */
    xopenme_finish();
  #endif
  ...
 }

 void time_hog(...) {
  ...
  #ifdef XOPENME
    xopenme_add_var_i(0, (char*) "  \"input_size_x\":%u", cpu_gray.rows);
    xopenme_add_var_i(1, (char*) "  \"input_size_y\":%u", cpu_gray.cols);
  #endif
  ...
 }

Whenever HOG is executed, the run-time state is dumped to tmp-ck-timer.json and later embedded into characteristics in program pipeline output.

Here are all functions available in xOpenME (you can find it via ck find package:lib-rtl-xopenme) to record integer, float, double and string parameters:

 extern void xopenme_add_var_i(int var, char* desc, int svar);
 extern void xopenme_add_var_f(int var, char* desc, float svar);
 extern void xopenme_add_var_d(int var, char* desc, double svar);
 extern void xopenme_add_var_s(int var, char* desc, void* svar);

Note that we also have a function to dump a run-time array to a file:

 extern void xopenme_dump_memory(char* name, void* array, long size);

It makes it possible to dump and validate run-time state at various checkpoints, or to help extract kernels together with their run-time state. Such kernels can be replayed with various datasets and used for program or compiler crowd-tuning (particularly using spare Android-based mobile devices), as we show in Ref1, Ref2 and Collective Mind Node.
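
For instance, a call could look like this (the buffer name and size variables are hypothetical, shown only to illustrate the signature):

 #ifdef XOPENME
   /* Dump the computed HOG descriptor so it can be validated or replayed later. */
   xopenme_dump_memory((char*) "tmp-hog-descriptor.bin", descriptor, descriptor_size_in_bytes);
 #endif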

Converting HOG into an adaptive library (ongoing)

As described in this paper, we would like to gradually convert all time-consuming libraries into adaptive ones connected to CK.

For this purpose, we added "hog.adapt.cpp" which has the following adaptation function:

 int adapt_hog(bool* output, int* input, float* input_f)

This function takes 2 feature vectors that are prepared just before running the CPU or GPU HOG version:

  • input[] – integer vector with features used to prepare the output (i.e. to decide which algorithm to run):
    [0] - image width
    [1] - image height
    [2] - image width*height
    [3] - hardware species (future work - all hardware should get a representative CK label, passed via the CK_CPU_HARDWARE_SPECIES environment variable)
  • input_f[] – float vector with features:
    [0] - CPU frequency (passed via the CK_CPU_FREQUENCY environment variable)
    [1] - GPU frequency (passed via the CK_GPU_FREQUENCY environment variable)

Then, the function fills in the output vector to decide which code to run:

  • output[] - boolean vector specifying which HOG version to run:
    [0] - CPU
    [1] - OpenCL
    [2] - PENCIL
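
For illustration, a hand-written adapt_hog might look like the following sketch (hypothetical: the real function is meant to be generated automatically from the decision tree, and these thresholds are invented):

 int adapt_hog(bool* output, int* input, float* input_f)
 {
   /* Default: no version selected until a decision is made. */
   output[0] = false; /* CPU */
   output[1] = false; /* OpenCL */
   output[2] = false; /* PENCIL (not used in this sketch) */

   /* Invented rule: prefer the GPU for large images at high GPU frequency. */
   if (input[2] > 640*480 && input_f[1] >= 480000000.0f)
     output[1] = true;  /* run the OpenCL version */
   else
     output[0] = true;  /* run the CPU version */

   return 0;
 }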

The features are prepared in "test_hog.cpp":

 if ((getenv("CK_ADAPT_HOG")!=NULL) && (atoi(getenv("CK_ADAPT_HOG"))==1))
 {
    const auto adapt_tree_start = std::chrono::high_resolution_clock::now();
    adapt_features[0]=cpu_gray.rows;
    adapt_features[1]=cpu_gray.cols;
    adapt_features[2]=cpu_gray.rows*cpu_gray.cols;
    adapt_features[3]=0;
    if (getenv("CK_CPU_HARDWARE_SPECIES")!=NULL) adapt_features[3]=atof(getenv("CK_CPU_HARDWARE_SPECIES"));
    adapt_features_f[0]=0;
    if (getenv("CK_CPU_FREQUENCY")!=NULL) adapt_features_f[0]=atof(getenv("CK_CPU_FREQUENCY"));
    adapt_features_f[1]=0;
    if (getenv("CK_GPU_FREQUENCY")!=NULL) adapt_features_f[1]=atof(getenv("CK_GPU_FREQUENCY"));
    adapt_hog(adapt_run, adapt_features, adapt_features_f);
    const auto adapt_tree_end = std::chrono::high_resolution_clock::now();
    elapsed_adapt_tree = adapt_tree_end - adapt_tree_start;
 }

As you can see, you need to set the CK_ADAPT_HOG environment variable to 1 before running the program pipeline in order to use this adaptation emulation.

We also added recording of 2 additional run-time state variables into JSON via OpenME:

  • "adapt_time" – time of the CPU or GPU function (depending on which was executed). Note that we do not record the time spent preparing/compiling the OpenCL code – we expect that it is done once and its cost is amortized across iterations.
  • "adapt_tree_time" – time of the decision-making function, including feature selection.

Currently, the decision-making time is very small, so it may not be measured reliably; it is possible either to place a loop around this function or to profile this code using --gprof when executing the pipeline.

Our current work is to automatically convert the decision tree from the previous subsection into this adapt_hog function. We will also implement this function as a dynamic plugin, to be able to rebuild/improve decision trees, as more knowledge is collected for such adaptive libraries, without recompiling the original code - useful for binary distributions. We also plan to add decision tree/predictive model tuning in terms of size, feature extraction time, decision-making time and prediction rate. This could be a possible internship project.

Autotuning HOG via external scripts

Some of our colleagues have integrated the CK program pipeline with their own autotuning frameworks. Next, we will show how it is possible to explore various design and optimization choices of the program pipeline and monitor characteristics via JSON input and output files.

Before starting exploration of any parameters of a given program on a given machine, you need to set up a program pipeline (resolve all dependencies for the machine, find available choices such as data sets, compilers and their flags, etc.) and then save it into tmp-pipeline-input.json.

You can do the following:

 $ ck pipeline program:realeyes-hog-opencl-tbb --prepare --speed --save_to_file=tmp-pipeline-input.json

Normally, it will ask you some questions about data sets, the command line, etc., but you can simply choose the defaults, i.e. just press Enter for each question. Instead of actually compiling and running the program, it will just record all the info and possible choices needed to compile and run the program on the given machine into tmp-pipeline-input.json.

It is now possible to run this pipeline (for testing) with defaults via:

 $ ck run pipeline:program @tmp-pipeline-input.json --save_to_file=$PWD/tmp-pipeline-output.json

CK should compile and run HOG and save the output (including characteristics) to tmp-pipeline-output.json.

So, if you have your own exploration tool, you just need to pre-load this pipeline, change various parameters (see below), run the pipeline (experiment workflow), read the results from tmp-pipeline-output.json (see below), and then select the next exploration point (random, genetic, probabilistic, machine-learning based, etc.).

Let us have a closer look at tmp-pipeline-input.json - this is the pipeline input dictionary. Though, technically speaking, it is possible to change anything in this input file, we are trying to unify the tuning process and move tunable choices to input["choices"].

For example, to explore HOG behavior under different CPU and GPU frequencies, it is possible to set:

  • input["choices"]["cpu_freq"] = CPU frequency as an integer (on a Samsung Chromebook 2 it can be a loop: start 300000, stop 1900000, step 100000)
  • input["choices"]["gpu_freq"] = GPU frequency as an integer (on a Samsung Chromebook 2 it can be a choice from the list [100000000,177000000,266000000,350000000,420000000,480000000,543000000,600000000])
  • input["choices"]["dataset_uoa"] = UID or alias (UOA) of a data set – you can find the list of all available CK data sets for HOG (detected by tags) in input["choices_desc"]["##dataset_uoa"]["choices"]

It is also possible to tune the GPU code via the BLOCK_SIZE parameter (exposed via the HOG meta-description):

  • input["choices"]["compiler_vars"]["BLOCK_SIZE"] = integer (can be a loop: start 8, stop 128, step 8)

Also, it is possible to check various algorithmic implementations of HOG - our shared version has 3 different algorithms implemented (using different grids for image processing). You can select them via:

  • input["choices"]["compiler_vars"]["HOG1X1"]="YES" or
  • input["choices"]["compiler_vars"]["HOG2X2"]="YES" or
  • input["choices"]["compiler_vars"]["HOG4X4"]="YES"

Let's say that, after changing the choices and running the pipeline, you read tmp-pipeline-output.json into an output dictionary.

The first thing to check is whether the pipeline executed successfully or failed. This can be checked via output["last_iteration_output"]["fail"] - a string ("yes"/"no"). If this value is "yes", the pipeline failed; in that case, you can find out why via the string output["last_iteration_output"]["fail_reason"].

If the pipeline succeeded, you can check the following characteristics in the output (for example, those needed for predictive CPU/GPU scheduling via active learning, as described in our old paper that we are gradually re-implementing via CK and the OpenME plugin interface):

  • output["last_stat_analysis"]["dict_flat"]["##characteristics#run#dim_cpu#min"] - min CPU code execution time as float
  • output["last_stat_analysis"]["dict_flat"]["##characteristics#run#dim_cpu#max"] - max CPU code execution time as float
  • output["last_stat_analysis"]["dict_flat"]["##characteristics#run#dim_cpu#mean"] - mean CPU code execution time as float
  • output["last_stat_analysis"]["dict_flat"]["##characteristics#run#dim_gpu_only#mean"] - mean GPU code execution time (without memory transfer time)
  • output["last_stat_analysis"]["dict_flat"]["##characteristics#run#dim_gpu_with_mem#mean"] - mean GPU code execution time (with memory transfer time)

There can be extra values from the statistical analysis (such as multiple expected values) - however, this requires extra scientific Python packages to be installed, such as scipy.

Note that at the end of HOG execution, we run an extra Python script to produce derived characteristics. For adaptive scheduling, we produce the following:

  • output["last_stat_analysis"]["dict_flat"]["##characteristics#run#derived_cpu_time_div_by_gpu_with_mem_time#mean"] - CPU code time / GPU code time with memory transfers
  • output["last_stat_analysis"]["dict_flat"]["##characteristics#run#derived_gpu_with_mem_is_much_better_cpu#min"] - true if (CPU time / GPU time with memory transfers) > 1.07, i.e. it is better to run HOG on the GPU than on the CPU
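
Putting it all together, here is a minimal sketch of such an external exploration loop in Python (illustrative only: it assumes the pipeline was prepared as above and uses the flat keys described in this section):

 import json
 import os
 import random
 import subprocess

 # Pre-load the prepared pipeline (created via 'ck pipeline program:... --prepare').
 with open('tmp-pipeline-input.json') as f:
     pipeline = json.load(f)

 out_file = os.path.join(os.getcwd(), 'tmp-pipeline-output.json')

 for step in range(10):
     # Select the next exploration point (here: random CPU/GPU frequencies).
     pipeline['choices']['cpu_freq'] = random.randrange(300000, 2000000, 100000)
     pipeline['choices']['gpu_freq'] = random.choice(
         [100000000, 177000000, 266000000, 350000000,
          420000000, 480000000, 543000000, 600000000])

     with open('tmp-pipeline-input.json', 'w') as f:
         json.dump(pipeline, f, indent=2)

     # Run the experiment workflow and save its state.
     subprocess.check_call(['ck', 'run', 'pipeline:program',
                            '@tmp-pipeline-input.json',
                            '--save_to_file=' + out_file])

     with open(out_file) as f:
         output = json.load(f)

     # Check whether the pipeline failed before reading characteristics.
     if output.get('last_iteration_output', {}).get('fail') == 'yes':
         print('pipeline failed:',
               output['last_iteration_output'].get('fail_reason'))
         continue

     flat = output['last_stat_analysis']['dict_flat']
     print('mean CPU time:',
           flat.get('##characteristics#run#dim_cpu#mean'))
     print('mean GPU time (with mem):',
           flat.get('##characteristics#run#dim_gpu_with_mem#mean'))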

Questions and comments

You are welcome to get in touch with the CK community if you have questions or comments!
