Compiler autotuning



Introducing portable compiler autotuning to students


Introduction

News: please check our latest interactive report on compiler autotuning and machine learning, supported by the Raspberry Pi Foundation: http://cKnowledge.org/rpi-crowd-tuning.

Many software developers still believe that to get the best possible code out of their compiler they simply need to specify the "best optimization" flag (e.g. "-O3" or "-Os"). Unfortunately, compilers are very complex pieces of software with too many available optimizations that are tuned on only a very limited set of programs, inputs (benchmarks) and platforms. Consequently, it is often possible to automatically find a combination of compiler flags that makes the compiler generate code that is much faster and considerably smaller than with the "best optimization" flag ( 1 , 2 ).

At ARM TechCon'16 and CGO'17 we had very interesting discussions with Aaron Smith (Microsoft), Michel Steuwer (U.Edinburgh) and Eben Upton (Raspberry Pi foundation) about many difficulties when introducing students to compiler autotuning techniques. Furthermore, students really need simple hands-on experience of performing compiler flag exploration on their computers/gadgets with the latest OS, compilers, applications and datasets, visualizing empirical results, realizing the complexity of the design and optimization spaces, and finding optimal solutions that balance code size, execution time and other important characteristics on a Pareto frontier!

However, in spite of autotuning being a buzzword for a couple of decades, there is still a lack of simple "one-click" autotuning tools even for this "classical" case of autotuning (let alone more exciting cases such as algorithmic tuning of BLAS/DNN libraries and models).

The problem is actually more fundamental than it may seem at first glance! Implementing portable and customizable autotuning scenarios is rather challenging due to the rapidly evolving software and hardware stack, coupled with highly stochastic behavior (variation and non-determinism). We have come to realize this through our own research into applying machine learning to compilation (which recently received a CGO Test of Time award), and eventually decided to tackle this problem in a completely different way - by enlisting the help of the community, or crowdsourcing in modern parlance!

To this end, we have created Collective Knowledge (CK) ( 3 ), a generic, portable and customizable open-source workflow framework for computer systems research written in Python. CK allows the community to share and cross-link various artifacts (such as programs, kernels, datasets, tools, scripts, experimental results, predictive models) as reusable components with a unique ID, unified JSON API and JSON meta information. Such artifacts can be easily packed together and shared via any private or public repositories (e.g. GitHub, GitLab, Bitbucket).

CK also allows researchers to implement portable, customizable and reusable experimental workflows from shared artifacts akin to playing with LEGO(R) bricks. Such workflows can automatically adapt to the changing software/hardware stack. For this purpose, CK uses a simple package manager for automatically detecting or installing multiple versions of required software (compilers, libraries, benchmarks, datasets) across diverse hardware (CPUs, GPUs, DSPs) and operating systems (Linux, Windows, MacOS and Android).

As a result, autotuning becomes a relatively straightforward and high-level workflow for exploring compiler flags via a unified JSON API, while CK takes care of the low-level compilation and execution of shared programs on a user's machine, as well as the statistical analysis of experimental results and data mining.
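For illustration, the same unified JSON API can be driven directly from Python once CK is installed (see below). The following minimal sketch lists the shared programs; note that the exact structure of the returned dictionary (e.g. the 'lst' key) may differ slightly between CK versions:

import ck.kernel as ck

# Equivalent of "ck list program" on the command line: every CK action takes
# and returns a plain Python dictionary that maps directly to JSON.
r = ck.access({'action': 'list', 'module_uoa': 'program'})
if r['return'] > 0:
    raise RuntimeError(r.get('error', 'CK call failed'))

for entry in r.get('lst', []):
    print(entry.get('data_uoa', ''))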

Using Collective Knowledge Framework for compiling, running and autotuning shared programs

Installing CK

The minimal installation requires:

  • Python 2.7 or 3.3+ (this requirement is mainly due to unit-tests; CK should work with earlier versions too);
  • Python pip packaging system (if you would like to install CK via pip)
  • Git command line client.

On Ubuntu, you can install these dependencies via:

 $ apt-get install -y python python-pip git

(Here and in what follows, you may need to prefix your commands with "sudo".)

On Windows, you can download and install these tools (Python with pip, and Git) from their official web sites.

You can install the latest stable CK release via pip with sudo:

$ pip install ck

However, if you do not have root access, it is simple to install CK locally:

$ git clone https://github.com/ctuning/ck.git ck-master

and then set up the paths to the ck front-end:

- On Linux:

$ export PATH=$PWD/ck-master/bin:$PATH
$ export PYTHONPATH=$PWD/ck-master:$PYTHONPATH

- On Windows:

$ set PATH={CURRENT PATH}\ck-master\bin;%PATH%
$ set PYTHONPATH={CURRENT PATH}\ck-master;%PYTHONPATH%
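
Either way, you can quickly check that the ck front-end is working (the reported version depends on the release you installed):

$ ck version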

Installing compilers

Ubuntu

 $ apt install build-essential make cmake gcc clang

Windows

Download and install the free Visual Studio Community Edition (we have tested 2015 and 2017). Do not forget to select the "Desktop development with C++" option to install the C++ compiler and the other command-line tools needed to use MSVC and LLVM.

MacOS

 $ xcode-select --install

Android

Download and install the following tools for your host (Linux, Windows, MacOS):

  • Android NDK (just unzip it to CK-TOOLS in your home directory)
  • Java Runtime (install it and add it to the PATH environment variable)
  • Android SDK (just unzip it to CK-TOOLS in your home directory; then run android/android.bat, update the SDK and install the Android SDK platform tools; add the platform-tools path containing adb to the PATH environment variable)

Installing CK repositories for autotuning

You can install all CK repositories required for autotuning in one go as follows:

$ ck pull repo:ck-crowdtuning

CK will then automatically install all the dependent repositories required for autotuning:

  • ck-env - portable package and environment manager
  • ck-autotuning - customizable, multi-objective and multi-dimensional autotuning
  • ck-analytics - statistical analysis and predictive analytics
  • ck-web - visualization of experimental results and interactive articles
  • ctuning-programs - shared benchmarks and kernels
  • ctuning-datasets-min - shared datasets for the above programs
  • ck-crowdtuning - crowd-benchmarking and crowd-tuning scenarios for shared workloads
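
Once the pull completes, you can check which CK repositories are now registered on your machine (the exact list depends on what else you have installed):

$ ck list repo | sort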

Compiling and running shared programs via CK

You can view shared programs and datasets compatible with the CK autotuning workflows as follows:

$ ck list program | sort
$ ck list dataset | sort

You can now compile any given program, for example, cbench-automotive-susan, using any installed compiler on your host machine as follows:

$ ck compile program:cbench-automotive-susan --speed

NB: The integrated package and environment manager will be invoked to detect all available versions of installed compilers (LLVM, GCC, ICC, etc.) and register their details with CK in the "local" repository. (CK allows multiple versions of software to easily co-exist on the same machine.) You can view all the detected compilers via:

$ ck show env --tags=compiler

If only one compiler is detected, CK will compile this program using it. If more than one compiler is detected, CK will prompt you to select which one to use. If no compilers are detected, CK will attempt to install one on your machine using shared packages. You can see the available compiler packages as follows:

$ ck search package --tags=compiler

If no suitable compiler packages are available (i.e. not yet shared by the community), you can manually install the required compiler, such as LLVM, and request CK to automatically detect and register it as follows:

$ ck detect soft:compiler.llvm

Once compilation succeeds, you can run the program as follows:

$ ck run program:cbench-automotive-susan

Note that during the very first run of any program on your target platform, CK will ask you to select the closest platform description in CK in order to reuse various tools and scripts that improve reproducibility (such as fixing or changing the CPU and GPU frequency) or enable SW/HW co-design.

CK will then prompt you to select a suitable dataset (if multiple are found by tags) and command line option (if multiple options are described in the program's metadata).

You can find the program binary and output inside the tmp sub-directory of the CK program entry which can be found as follows:

$ ck find program:cbench-automotive-susan

You can also view the program output in the JSON format (including multiple characteristics and properties collected during the execution) by running the program as follows:

$ ck run program:cbench-automotive-susan --out=json
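
The same compile-and-run steps can also be driven programmatically via CK's Python API. Here is a minimal sketch; the mapping of command-line flags such as --speed to dictionary keys, and the exact structure of the returned dictionary, may vary between CK versions, so the example simply prints the top-level keys of the result:

import json
import ck.kernel as ck

def ck_call(request):
    # CK reports errors via the returned dictionary rather than exceptions.
    r = ck.access(request)
    if r['return'] > 0:
        raise RuntimeError(r.get('error', 'CK call failed'))
    return r

# Mirror "ck compile program:cbench-automotive-susan --speed" and
# "ck run program:cbench-automotive-susan" from the command line.
ck_call({'action': 'compile', 'module_uoa': 'program',
         'data_uoa': 'cbench-automotive-susan', 'speed': 'yes'})
r = ck_call({'action': 'run', 'module_uoa': 'program',
             'data_uoa': 'cbench-automotive-susan'})

# Inspect what the workflow returned (characteristics, state, misc info, ...).
print(json.dumps(sorted(r.keys()), indent=2))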

Autotuning LLVM flags

Multi-objective and multi-dimensional autotuning is implemented in CK as an extensible workflow (pipeline) ( 4 ). Researchers assemble such pipelines by plugging in their programs, datasets, compilers, libraries and runtimes using flexible tags and unique identifiers.

CK stores experimental results as free form files (such as logs and traces) and JSON meta information. Later, experimental results can be analyzed, visualized or included in interactive reports and articles via CK extensions (e.g. see http://cknowledge.org/repo/web.php?wcid=1e348bd6ab43ce8a:b0779e2a64c22907).

For example, you can autotune LLVM compiler flags for the cbench-automotive-susan program by invoking the following CK command:

$ ck autotune program:cbench-automotive-susan --llvm --record_uoa=my-first-ck-experiment

You will be prompted to select one of the shared autotuning scenarios for LLVM such as random exploration of compiler flags or search for the best compiler flag combination to minimize execution time, code size, compilation time, energy and other characteristics on a Pareto frontier. Let's first try the scenario "explore LLVM compiler flags (3b94ae3c43fc89ad)".

When prompted to select a command line option and a data set to be used during autotuning, select the following:

  • edges ($#BIN_FILE#$ $#dataset_path#$$#dataset_filename#$ tmp-output.tmp -e)
  • image-pgm-0001 (b2130844c38e4a56)

CK will then attempt to detect your compiler version, find the closest compiler CK entry with optimization flags (see ck list compiler), and perform 30 autotuning iterations (empirical experiments) by compiling the given program with a random selection of compiler flags (with a 0.5% probability of selecting any given flag), executing it and measuring various dynamic characteristics. Each experiment will be repeated 4 times to measure statistical variation. You can change the number of autotuning iterations and statistical repetitions using the `--iterations` and `--repetitions` options respectively, e.g.:

$ ck autotune program:cbench-automotive-susan --llvm --iterations=100 --repetitions=10
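
To make the random exploration above more concrete, here is a small standalone Python sketch of the general idea (it is not the CK implementation, and the flag list is purely hypothetical): each flag is switched on independently with a small probability, and each resulting combination would then be compiled and executed several times to estimate variation.

import random

# Hypothetical flag list for illustration; in CK the real list comes from the
# compiler description entries (see "ck list compiler") and depends on the
# detected compiler version.
FLAGS = ['-ffast-math', '-funroll-loops', '-fomit-frame-pointer',
         '-finline-functions', '-fstrict-aliasing']

def random_flag_combination(base='-O3', prob=0.005):
    # Select each flag independently with the given probability (0.5% here).
    return ' '.join([base] + [f for f in FLAGS if random.random() < prob])

def autotune(iterations=30, repetitions=4):
    for i in range(iterations):
        flags = random_flag_combination()
        # In the real workflow, CK would compile the program with these flags
        # and run it 'repetitions' times to measure statistical variation.
        print('iteration %2d: %s' % (i + 1, flags))

random.seed(12345)  # make the exploration reproducible
autotune()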

By default, the CK autotuning workflow will monitor any CPU/GPU frequency changes during the execution to improve reproducibility of results and may invalidate experiments on very unstable systems. For compiler flag autotuning, however, you may wish to turn off this check by using the `--no_state_check` option.

You can view other available autotuning options using:

$ ck autotune program --help
$ ck crowdsource program.optimization --help

Once autotuning succeeds, you should see the following message (the reported unique IDs will be different on your machine):

Note that you can:
* replay above experiments via "ck replay experiment:44553ab72b1b25a1 (--point={above solution UID})"
* plot non-interactive graph for above experiments via "ck plot graph:44553ab72b1b25a1"
* view these experiments in a browser via "ck browse experiment:44553ab72b1b25a1"

This means that CK has recorded all the autotuning results in a newly created CK entry experiment:44553ab72b1b25a1 for further analysis and visualization. However, since you also specified "my-first-ck-experiment" as an alias, you can find where this entry is stored in your "local" CK repository using:

$ ck find experiment:my-first-ck-experiment

Autotuning GCC flags

The CK workflow and the portable package manager help abstract autotuning away from the underlying compiler and other software. Therefore, targeting another compiler is straightforward - you just need to specify the --gcc flag to let CK automatically detect all installed GCC compilers and tune their flags on your machine:

$ ck autotune program:cbench-automotive-susan --gcc

Visualizing autotuning results

Now you can visualize the autotuning results as follows:

$ ck plot graph:my-first-ck-experiment

NB: Just make sure you have all the extra Python dependencies installed (matplotlib, scipy, numpy, scikit-learn). On Windows, we recommend using Anaconda Python. On Ubuntu, you can install these dependencies using:

$ sudo apt-get install python-numpy python-scipy python-matplotlib python-pandas

You should normally see a graph that looks like this:

The Y axis represents the program execution time (with statistical variation), while the X axis represents the program binary size.

NB: The graph above was produced from autotuning experiments on a random node in the Microsoft Azure cloud with Ubuntu 16.04.2 LTS and clang 3.8.
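
To illustrate how the Pareto frontier mentioned earlier is obtained from such results, here is a small self-contained sketch (not the CK implementation) that keeps only those (binary size, execution time) points for which no other distinct point is at least as good in both dimensions:

def pareto_frontier(points):
    # Each point is a (binary_size, execution_time) tuple; smaller is better
    # in both dimensions. A point is kept if no other point dominates it.
    frontier = []
    for size, time in points:
        dominated = any(s <= size and t <= time and (s, t) != (size, time)
                        for s, t in points)
        if not dominated:
            frontier.append((size, time))
    return sorted(frontier)

# Toy data in the spirit of the graph above (size in bytes, time in seconds).
results = [(42000, 1.20), (39000, 1.35), (45000, 1.10),
           (39500, 1.25), (41000, 1.15)]
print(pareto_frontier(results))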

You can also view the autotuning results in a web browser as follows:

$ ck browse experiment:my-first-ck-experiment

Execution time and code size for -O3 are highlighted in red. You can see that it is possible to get more than 10% improvement in execution time and code size simply by exploring random combinations of compiler flags. This approach can also be used to test compilers and automatically find bugs. Though such improvements may not sound impressive at first glance, they are critical for embedded devices with limited resources such as mobile phones, IoT devices and the Raspberry Pi (for example, to optimize the Chrome/Firefox browsers or deep learning), and for data centers to save energy and money on workloads running 24/7/365. Furthermore, you can see live results, with speedups of 2-10x for some commonly used kernels across diverse hardware, from collaborative autotuning in the CK live repository.

Packing and sharing experimental results

Since CK repositories use the native file system and all entries have unique IDs, it is easy to exchange experiments within a workgroup or with the community. For example, a student may pack experimental results simply as follows:

$ ck zip experiment:my-first-ck-experiment

and then send the created ckr.zip file to his or her adviser, preserve it in a Digital Library, or add it to a personal web page. Other users may then easily unpack these experiments on their machines as follows:

$ ck unzip repo --zip=ckr.zip

This, in turn, allows them to visualize and analyze these results on their own machines as described above.

Replicating and reproducing experiments

Though CK repositories can be packed inside Docker images to replicate exactly the same experiment, computer systems researchers are usually more interested in validating research techniques on the latest platforms with the latest environments, libraries, tools and compilers. The portable CK package manager helps reproduce empirical experiments while adapting to the underlying software and hardware.

It is possible to replay a given autotuning experiment as follows:

$ ck replay experiment:my-first-ck-experiment

You will see a list of available "points" (autotuning iterations) in this CK entry with their unique IDs. You can then replay a specific iteration as follows:

$ ck replay experiment:my-first-ck-experiment --point={available point UID}

When replayed on the same user machine or via a Docker image, the experimental setup will be the same. However, when replayed on another user's machine, CK will pick up the available software (compilers, libraries, tools) and will report any unexpected behavior. It is even possible to replay Linux experiments on Windows machines and vice versa. dividiti uses this mechanism to share optimized code with colleagues and customers, or to report bugs and unexpected behavior to hardware and software vendors for further improvement. The cTuning foundation also uses this approach to share and validate experiments with students and interns.

Interestingly, one of the tenets of research - reproducibility - follows naturally in CK as a side effect of the community running such shared workflows and artifacts across diverse hardware, reporting unexpected behavior, and collaboratively improving workflows (e.g. fixing CPU/GPU frequencies), improving the statistical analysis of empirical results, detecting the Pareto-optimal frontier, and applying complexity reduction ( 1 , 2 ).

As a result, the CK workflow for compiling and running a given program has been considerably extended over the past two years and now looks as follows:

This approach now helps describe submitted artifacts and workflows at CGO, PPoPP, PACT, SC and other conferences.

Renaming or removing entries

Note that you can also rename or remove your local entry with experiments (for example, to re-run autotuning experiments) as follows:

$ ck ren experiment:my-first-ck-experiment experiment:my-first-ck-experiment.arc
$ ck rm experiment:my-first-ck-experiment

Cross-autotuning for Android platforms

CK allows you to easily compile shared programs for Android devices connected to the host machine via the adb tool by adding the command-line option --target_os=android21-arm64 to all the above examples. This will let you compile and run programs via CK on any Android device with API level 21 and an ARM64-based processor, e.g.:

$ ck compile program:cbench-automotive-susan --speed --target_os=android21-arm64
$ ck run program:cbench-automotive-susan --target_os=android21-arm64
$ ck autotune program:cbench-automotive-susan --target_os=android21-arm64
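
Before doing so, you may want to check that your device is actually visible to adb (assuming the Android SDK platform-tools directory is on your PATH):

$ adb devices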

You can view other available Android targets using the following CK command:

$ ck list os | sort
$ ck list os:android* | sort

You can see an example of CK-powered interactive graphs with GCC and LLVM autotuning results on an ARM-based Samsung Chromebook and an x86-based Windows Lenovo laptop here: http://cknowledge.org/repo/web.php?wcid=1e348bd6ab43ce8a:d6e73da144db3899

Autotuning large applications

Though it is possible to autotune even large applications, it may simply be too time-consuming to perform even a few iterations. For such cases we originally developed a method to statically clone hot-spots with different optimizations and then evaluate them at run-time during stable execution phases ( 7 ). This approach even enabled dynamic adaptation for statically compiled programs. Furthermore, it motivated us to extract the most time-consuming kernels with representative data sets from various realistic applications and collaboratively optimize them outside the application ( 5 ). You can find such "computational species" in the CK format here: https://github.com/ctuning/ctuning-programs .

NB: It is important to extract coarse-grain kernels to capture cache effects!

The CK approach allows the community to gradually expose more and more optimizations at different levels (algorithms, models, data sets, MPI, networks, schedulers, source-to-source transformations, compiler flags, compiler passes, instructions, frequency, hardware configurations, etc.) via a JSON API for SW/HW autotuning and co-design - do not hesitate to use and extend the CK autotuning workflows for your own research projects ( 1 ).

Customizing autotuning workflow

Based on user feedback, we also provide functionality to simplify customized autotuning. You can now describe your own autotuning scenario in a simple JSON file, perform autotuning, record results in the repository, and plot and reproduce them with just a few simple CK commands.

For example, you can explore some combinations of compiler flags for a given program by creating the following JSON file my-autotuning.json:

{
  "experiment_1_pipeline_update": {
    "choices_order": [
      [
        "##compiler_flags#base_opt"
      ]
    ],
    "choices_selection": [
      {
        "notags": "",
        "choice": ["-Os","-O0","-O1","-O2","-O3"],
        "default": "-O3",
        "type": "loop"
      }
    ]
  },

  "repetitions": 1, 
  "seed": 12345, 
  "iterations": -1,
  "sleep":0
}

and then invoking the following CK command (on Linux, MacOS or Windows):

 $ ck autotune program:cbench-automotive-susan @my-autotuning.json --new --skip_collaborative --scenario=experiment.tune.compiler.flags --extra_tags=explore

Furthermore, you can use the CK customized autotuner on top of your own script (which may compile and run applications or do something else). You can find such an example in the following CK entry:

 $ ck find demo:customized-autotuning-via-external-cmd

You can find other demos of customized autotuning, such as OpenMP thread tuning or batch size tuning in DNN engines (Caffe, TensorFlow), in the following repositories:

 $ ck pull repo --url=https://github.com/dividiti/ck-caffe
 $ ck pull repo:ck-caffe2
 $ ck pull repo:ck-tensorflow

 $ ck list script:explore-batch-size-unified-and-customized --all

You can also check the shared and unified auto/crowd-tuning scenarios for tuning OpenCL-based BLAS libraries such as CLBlast, OpenBLAS thread counts, batch sizes, etc. by invoking

 $ ck autotune program

and selecting the appropriate autotuning scenario.

Enabling open computer systems' research

Participating in collaborative LLVM optimization and testing (crowd-tuning)

A decade ago, Grigori Fursin's research on crowdsourcing machine-learning-based performance analysis and optimization of realistic workloads across diverse hardware provided by volunteers nearly stalled ( 2 ). It was often simply impossible to reproduce empirical performance results collected from multiple users and to validate predictive models across a continuously changing software and hardware stack. Worse, the lack of diverse and representative benchmarks and data sets in our community severely limited the usefulness of such models.

All these problems motivated me and my colleagues to develop the Collective Knowledge concept and workflow framework with a portable package/environment manager and an integrated JSON web service. It is no surprise that the first collaborative scenario we implemented was crowdsourcing LLVM and GCC flag tuning via a public repository of knowledge (cKnowledge.org/repo) ( 5 , 3 ) - you just need to select the appropriate experimental workflow from the top menu, such as "auto/crowd-tune LLVM compiler flags (minimize execution time)".

You can easily participate in LLVM crowd-tuning, validate optimizations already shared by the community, and continuously improve them on your own machine using the following command:

$ ck crowdtune program --llvm --quiet

We now use this collaborative approach to crowdsource optimization of realistic workloads across diverse platforms, from mobile and IoT devices to data centers. For example, the community already helps us crowd-tune LLVM and GCC in the Azure cloud or using numerous Android devices (mobile phones and tablets) provided by volunteers, similar to SETI@HOME, via our small Android application.

Collaboratively optimizing deep learning via CK

Deep learning is the hottest research topic today thanks to many artificial intelligence and computer vision applications. Still, designing and optimizing deep learning applications is a dark art, mainly because of a lack of effective mechanisms for knowledge sharing: in spite of millions of individuals active in this area, the community is not necessarily becoming any wiser! This is because, like the blind men touching the elephant in the famous parable, individual observations do not easily add up to a coherent and comprehensive view of deep learning.

That is why we are developing a set of open-source tools and a public repository of knowledge based on Collective Knowledge, leveraging the capabilities of billions of available mobile and IoT devices to collaboratively benchmark and optimize deep learning (algorithms, libraries, models, data sets) to meet the performance, energy, memory consumption, prediction accuracy and cost requirements of a wide range of applications, for deployment on a wide range of form factors – from IoT to self-driving cars.

The first such tool is CK-Caffe, developed in collaboration with General Motors and other partners (github.com/dividiti/ck-caffe). CK-Caffe leverages the key capabilities of CK to crowd-source experimentation and perform multi-objective DNN/BLAS autotuning across diverse platforms, trained models, optimization options, and so on; exchange experimental data in a flexible JSON-based format; and apply leading-edge predictive analytics to extract valuable insights from the experimental data ( 1 , 8 ).

Another such tool is our engaging Android DNN optimization app, letting users apply different engines (trained models, math libraries, etc.) to classify objects in images and report misclassifications along with the correct category and even possible bugs. At the same time, various characteristics (execution time, memory usage, energy, accuracy) are aggregated in an open repository and can be viewed by model, platform, processor and so on.

We hope that CK will help students, scientists and engineers avoid wasting time on reinventing their own ad-hoc setups and autotuning tools, and focus on knowledge sharing when optimizing realistic workloads and deep learning – from low-level building blocks such as kernels and libraries to high-level blocks such as layer and network designs.

Enabling open research and building upon others' artifacts

As you may now see, the Collective Knowledge approach supports our long-term vision for open, collaborative and reproducible computer systems research ( 6 , 2 ).

We are glad to see that researchers have started sharing their customizable and portable experimental setups in the CK format for Artifact Evaluation at premier ACM conferences on parallel programming, architecture and code generation (CGO, PPoPP, PACT and SC). For example, the distinguished artifact winner at CGO'17 was shared using the Collective Knowledge framework! In contrast with Docker images, it is now possible to run this CK workflow with the latest software, reuse and customize individual artifacts, and easily build upon them simply by pulling the repository as follows:

$ ck pull repo --url=https://github.com/SamAinsworth/reproduce-cgo2017-paper

The community will then be able to run LLVM on ARM-based Linux platforms (such as Odroid and Raspberry Pi), build plugins for LLVM, help authors validate their research techniques on different machines with different environments, and report back any unexpected behavior for further improvements. See the PDF snapshot of the CK interactive report, the paper with the AE appendix and CK workflow, and the GitHub sources of this artifact for further details.

Feedback

Collective Knowledge is an ongoing and heavily evolving open project - if you have any questions or encounter problems, feel free to open tickets in the related CK GitHub repositories or get in touch with the Collective Knowledge community!

Acknowledgments

Next steps

We plan to provide further tutorials on CK-powered customizable autotuning for OpenCL and CUDA-based applications:

  • customized OpenCL autotuners (parameters and algorithms)
  • customized CUDA autotuners (parameters and algorithms)
  • Lift compiler tuning
  • Customized autotuning of deep learning engines and models (Caffe, TensorFlow, libDNN) to optimize them at all levels (algorithms, models, data sets, libraries, compilers, run-time, hardware) in terms of execution time, accuracy, energy, memory usage and other costs across diverse SW/HW stacks from supercomputers to constrained IoT devices.

Please, stay tuned ;) !

References

  1. "Collective Mind: Towards practical and collaborative auto-tuning", Journal of Scientific Programming 22 (4), 2014
  2. "Collective Tuning Initiative: automating and accelerating development and optimization of computing systems", GCC Developers Summit, Montreal, Canada, 2009
  3. "Collective Knowledge: towards R&D sustainability", Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2016
  4. CK repository with customizable workflow for multi-objective and multi-dimensional autotuning: https://github.com/ctuning/ck-autotuning
  5. "Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science", 18th International Workshop on Compilers for Parallel Computing (CPC'15), London, UK, 2014
  6. "Community-driven reviewing and validation of publications", TRUST@PLDI, Edinburgh, UK, 2014
  7. "A practical method for quickly evaluating program optimizations", HiPEAC, Barcelona, Spain, 2005
  8. "Optimizing Convolutional Neural Networks on Embedded Platforms with OpenCL", Proceedings of the 4th International Workshop on OpenCL (IWOCL), Vienna, Austria, 2016