Artifact sharing

Grigori Fursin edited this page Aug 12, 2018 · 23 revisions


This page is outdated.
Introduction

Since our first collaborative research projects started in 1998, we have been struggling to share, port, reproduce and adapt numerous and ever-changing code, data and experimental results between researchers, workgroups and the community (see our report). Worse, most of the code and data become outdated and unusable shortly after a project ends.

Even after initiating and leading artifact evaluation at multiple leading conferences including PPoPP, PACT and CGO, we have noticed that researchers rarely share artifacts in a reusable and customizable format, most often simply due to a lack of time.

Though artifact evaluation helps validate some research results, it usually does not help the community extend them and build upon them.

CK is intended to address some of these issues as described in the Getting Started Guide part 1 and part 2. Please read them first!

Basically, CK is a small, portable and customizable framework that helps users quickly develop and share their artifacts as reusable Python components with a JSON API; organize code and data with distributed UIDs; quickly prototype experimental workflows (such as multi-objective autotuning); automate, crowdsource and reproduce experiments; unify predictive analytics (scikit-learn, R, DNN); and even enable interactive articles (see the CK-powered live repository with all the results from our past computer systems research here). We hope it will considerably simplify Artifact Evaluation and enable truly collaborative and reproducible experimentation!

We use CK in our everyday research and in multiple collaborative projects, and we publish reproducible and reusable papers and reports. You can find an example of our interactive report here. Please check out all available CK-based repositories and modules which you can reuse for your own experiments.

CK is a quickly evolving project, so if you would like to join or have questions, do not hesitate to get in touch with the CK community via this open mailing list. You may also find many public resources and tools related to Open Science here.

Next, we describe how to use CK to share your own benchmarks, data sets, tools, experimental results, models, scripts and other artifacts; extend existing experimental setups; and assemble new experimental workflows as reusable CK components with a simple JSON API and meta-description, via GitHub (or any other Git service), BitTorrent, or a standard zip archive.

Preparing CK repository

If you plan to use an existing CK repository to add or share your artifacts, you can skip this section. Otherwise, if you want to share your artifacts with a few colleagues within a private workgroup or with the whole community, you have two relatively simple options:

  • create a local private CK repository and manually share it at any time as a standard zip (or other) archive
  • create a shared CK repository which is synchronized via GitHub, Bitbucket or any similar Git service. Note that it is possible to extend CK via plugins to support other sharing methods such as SVN, CVS, rsync, Google Drive and others - we can add them if there is enough interest from the community, and your contributions to CK are also very welcome.

First, create an empty repository on GitHub (or a similar service) and note its clone URL (https:// or git://) - we refer to it below as <new repo URL>. For example, the URL of our ctuning-programs repository with shared benchmarks on GitHub is https://github.com/ctuning/ctuning-programs.git .

Next, you can pull this repository into a new local CK repository with an arbitrary alias (say my_shared_repo) via

 $ ck pull repo --url=<new repo URL>

Note that if you want this repository to be automatically synchronized with GitHub after each CK operation (add, rm, rename, update, ...), you should add --sync to the above command, i.e.:

 $ ck pull repo --url=<new repo URL> --sync

Otherwise, you will need to manually sync artifacts via git add ... or git rm ....

When ready, you can commit and push all updates to GitHub via

$ ck push repo:my_shared_repo

If automatic pushing to GitHub fails, you can find the path to this repository via

$ ck where repo:my_shared_repo

and then manually add, commit and push changes to GitHub.

Now you can use the newly created repository to store, share and cross-link all your artifacts. Furthermore, your colleagues (or an Artifact Evaluation Committee) can obtain this repository in the same way as above via:

 $ ck pull repo --url=<new repo URL>

Note that if the created GitHub repository is private, CK will ask for a username and password to access it (via the local Git client).

Adding new artifacts in the CK format

We expect that users have read this section describing CK repository structure and basic functions.

Basically, any data in CK has an associated module serving as a lightweight Python container with a simple JSON API to access this data. CK modules are themselves stored inside a given repository, thus allowing users to share new artifacts together with their API as reusable components.

Adding data using existing CK modules

It is possible to see existing modules via

$ ck list module

Note that more CK modules will appear when installing various shared CK repositories, such as ck-env to support multiple versions of tools and libraries, ck-analytics for collaborative experimentation and predictive analytics, and ck-autotuning for universal multi-objective exploration of multi-dimensional design and optimization spaces (choices):

$ ck pull repo:ck-autotuning

Note that when pulling the ck-autotuning repository, CK will automatically pull all its dependencies, i.e. the ck-env and ck-analytics repositories.

You can find the list of shared CK repositories here (and you may add your own one there too). You can also find a list of all available shared CK modules and their functions here, or as an automatically generated and updated table here.

Our idea is that, rather than always using their own ad-hoc scripts, descriptions and directories, researchers and engineers will start using and collaboratively improving shared containers with UIDs to abstract, interconnect and share their data in a common, simple and extensible format.

For example, you can add a new data set to your newly created local repository, using the dataset module (container) from the shared ck-autotuning repository, simply via

 $ ck add my_repo:dataset:some-user-friendly-alias @@dict

CK will then ask you to enter a JSON description for this entry, which can contain anything of interest for your research or follow the common conventions described in the help of this module.
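As an illustration, such a meta description might look as follows (the keys below are purely hypothetical; check the module's help for the actual conventions):

```json
{
  "dataset_files": ["image-1024x768.pgm"],
  "tags": ["dataset", "image", "pgm"]
}
```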

You can see detailed examples to add and share your own data sets and benchmarks (for research in computer engineering) in CK format here.

Note that your entry will be assigned a permanent UID, which you should use when referencing such entries (while you can use some-user-friendly-alias for your own convenience). You can find the UID of your newly created data set via:

 $ ck info dataset:some-user-friendly-alias
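Conceptually, each entry can be referenced either by its permanent UID or by its optional alias (CK calls this combined reference a UOA). The resolution logic can be sketched as follows - a toy illustration only, not CK's actual implementation, and the UID shown is made up:

```python
# Toy index: a permanent UID maps to an entry; an optional
# human-friendly alias points at the same UID.
entries = {
    "b2130844c38e4a56": {"alias": "some-user-friendly-alias", "meta": {}},
}
aliases = {e["alias"]: uid for uid, e in entries.items() if e.get("alias")}

def resolve(uoa):
    """Accept either a UID or an alias and return (uid, entry)."""
    uid = aliases.get(uoa, uoa)  # translate alias -> UID, or pass UID through
    return uid, entries[uid]

# Both references lead to the same entry:
u1, _ = resolve("some-user-friendly-alias")
u2, _ = resolve("b2130844c38e4a56")
```

Since aliases may be renamed or clash between repositories, the permanent UID is the stable reference for sharing.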

Furthermore, when using modules from other shared repositories, do not forget to add them as dependencies of your local repository. You can do so by interactively updating your repository via

$ ck update repo:my_repo

In this case, if you share such a repository and someone else pulls it (for example, during Artifact Evaluation), all dependent repositories will also be installed automatically.

You may also check these examples to add your own data sets and benchmarks in the existing CK format so that they can take advantage of shared autotuning and crowd-tuning scenarios.

Adding new modules to abstract new data

If you did not find a suitable module in shared repositories, you can easily add your own module to group and keep your data, while taking advantage of the unified CK JSON API, the ability to search and cross-reference such data locally, and easy sharing with the community (since it will have its own UID) so that it can be gradually validated, reused and extended:

 $ ck add my_repo:module:my_module

You may substitute my_module with your own user-friendly alias. If such an alias already exists, CK will warn you. On success, CK will create a module container in the my_repo repository (with an assigned UID) which you can use immediately!

Now you can add and list a new data entry using this module simply via

 $ ck add my_repo:my_module:my_new_data
 $ ck list my_module --all
 $ ck find my_module:my_new_data

You should now see this entry of the module my_module in the my_repo repository, including the local physical path where it was created.

You can now simply add your files and directories to this path, and describe this collection of files interactively in the entry's JSON meta description via

 $ ck update my_module:my_new_data @@dict

You can later check this meta description via

 $ ck load my_module:my_new_data --min

Adding new actions to modules

The above module can be used not only to abstract groups of data (with common CK functions such as add, update, rename, copy, move, rm, load and search), but also to implement user actions related to this data or to abstract (proxy) external tools.

Adding a new action to a module is also very simple:

 $ ck add_action module:my_module --func=my_action

CK will ask you a few questions and, voilà, this action is created as a dummy function ready to be customized. You can now test it via:

 $ ck my_action my_module

You can also extend it by editing the Python function my_action in the file module.py in the related module entry, found via

 $ ck find module:my_module

For example, you can add the following Python code inside this function to load the meta description of the entry my_data:

    r=ck.access({'action':'load',
                 'module_uoa':work['self_module_uid'],
                 'data_uoa':'my_data'})
    if r['return']>0: return r

    meta=r['dict'] # meta of the entry
    path=r['path'] # physical path to the entry
    uoa=r['uoa']   # UID of the entry or alias if exists
    uid=r['uid']   # only UID of the entry

Here, ck is the CK kernel with various productivity functions, while access is the only (!) unified entry point to all CK functionality, with JSON as both input and output.

You can find the JSON API of the CK load function via

 $ ck get_api --func=load

Note that when executing an action of a given module, CK converts the command line

 $ ck my_action my_module --param1=value1 --param2 param3=value3 @input.json ...

into the following Python dictionary i before calling the ck.access function:

 i={
     "action":"my_action",
     "module_uoa":"my_module",
     "param1":"value1",
     "param2":"yes",
     "param3":"value3",

     # ... keys merged from input.json ...
 }
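The conversion above can be sketched in a few lines of self-contained Python. This is a simplified illustration, not CK's actual parser (for instance, file reading for @input.json is replaced by a pre-loaded dictionary):

```python
def convert_cmd(args, json_files=None):
    """Toy sketch of CK's command-line-to-dict conversion.

    args: tokens after 'ck'; json_files: optional mapping of file
    names to already-loaded dicts (stands in for reading @file.json).
    """
    i = {"action": args[0], "module_uoa": args[1]}
    for a in args[2:]:
        if a.startswith("@"):                  # @input.json -> merge its keys
            i.update((json_files or {}).get(a[1:], {}))
        else:
            k = a[2:] if a.startswith("--") else a
            if "=" in k:
                key, value = k.split("=", 1)   # --param1=value1 / param3=value3
                i[key] = value
            else:
                i[k] = "yes"                   # bare --param2 flag
    return i

i = convert_cmd(["my_action", "my_module", "--param1=value1", "--param2",
                 "param3=value3", "@input.json"],
                json_files={"input.json": {"extra_key": "extra_value"}})
```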

Also note that, in general, you should not use aliases inside modules (e.g. my_data) but rather their UIDs, to avoid alias clashes and incompatibility issues when sharing your artifacts (code and data).

At the same time, you can use the above action to abstract some of your tools and scripts via the CK API. Just convert the input dictionary into a command line and call your tool as follows:

   import os

   # Assemble a command line from the action's input dictionary
   tool_name=i['tool_name']
   tool_cmd=i['tool_cmd']

   rc=os.system(tool_name+' '+tool_cmd)
   if rc>0: return {'return':1, 'error':'tool execution failed'}

   return {'return':0}

For example, you can use it to list files in the root Linux directory via:

 $ ck my_action my_module --tool_name=ls --tool_cmd=/

or on Windows:

 $ ck my_action my_module --tool_name=dir --tool_cmd=c:

or check default gcc version:

 $ ck my_action my_module --tool_name=gcc --tool_cmd=--version

and so on.

Another useful example is to list all available local entries for some module:

 r=ck.access({'action':'list', 'module_uoa':'my_module'})
 if r['return']>0: return r

 lst=r['lst'] # list of entries in a special format

 import json
 print(json.dumps(lst, indent=2)) # print lst in a user-readable format
                                  # to understand its structure

Note that we expect the community to share information about new modules and actions, or ask questions, via various public forums including our mailing lists.
Prototyping research techniques as workflows from shared components

The above simple organization of code and data in CK with a unified API turned out to be powerful enough to gradually assemble and share complex research scenarios and experimental workflows, LEGO™-style, as conceptually shown in the following figures:

There are two simple but not CK-reusable ways to create such pipelines or workflows:

  • as OS scripts (Linux, Windows, MacOS, etc.) that take a JSON file as input and chain several modules together, recording the output of one module to JSON and using it as the input of the next, via
 ck {action} {module_uoa} @input.json --out=json_file --out_file=output.json
 ck {action1} {module_uoa1} @output.json --out=json_file --out_file=output1.json
 ...
  • by invoking modules from any language, either via a standard system call or preferably via the small OpenME interface to access CK from C/C++/Fortran/Java/PHP or from existing tools like GCC and LLVM.

However, we suggest two other CK-reusable ways to prototype (and possibly share) research ideas:

  • installing CK as a standard Python package from the CK root directory (unless already installed) via
 $ python setup.py install

and then writing your own Python scripts that import ck.kernel and access all CK functionality, as shown above, as follows:

 import ck.kernel as ck

 r=ck.load_json_file({'json_file':'input.json'})
 if r['return']>0: ck.err(r)

 input_meta=r['dict']

 r=ck.access({'action':'list', 'module_uoa':'test'})
 if r['return']>0: ck.err(r)

 input_meta['list']=r['lst']

 r=ck.save_json_to_file({'json_file':'output.json', 'dict':input_meta})
 if r['return']>0: ck.err(r)
  • adding a standard CK module inside your new repository as described above. We suggest using the pipeline action name to make users aware that it implements an experimental workflow:
 $ ck add_action my_module --func=pipeline

Unifying JSON input, output and experiments

The above format makes it easy to extend and gradually unify inputs and outputs. It also makes it possible to replay experiments with a given input and compare changes in the output.
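Conceptually, comparing a recorded output with the output of a replayed run can be as simple as diffing two JSON dictionaries. The sketch below is illustrative only (the key names and values are made up; CK's experiment module provides richer statistical comparison):

```python
def diff_outputs(old, new):
    """Return {key: (old_value, new_value)} for every key whose value
    differs between the recorded output and the replayed output."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k))
            for k in keys if old.get(k) != new.get(k)}

# Hypothetical recorded output of the original run vs. the replayed run:
recorded = {"execution_time": 1.52, "binary_size": 40960, "return_code": 0}
replayed = {"execution_time": 1.49, "binary_size": 40960, "return_code": 0}

changes = diff_outputs(recorded, replayed)
```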

For example, the program module from the ck-autotuning repository has a pipeline action that takes a JSON input and produces a JSON output while implementing our program compilation and execution workflow (see papers 1 and 2).

We also agreed with the community to use the keys choices, characteristics, features and state in the input and output dictionaries, which can unify various autotuning and machine learning techniques as conceptually shown below.
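For instance, a unified pipeline input/output using these four keys might be structured as follows (all values below are hypothetical illustrations):

```json
{
  "choices":         { "compiler_flags": "-O3 -funroll-loops" },
  "features":        { "platform": "my-platform" },
  "characteristics": { "execution_time": 1.52, "binary_size": 40960 },
  "state":           { "cpu_frequency_mhz": 2000 }
}
```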

Furthermore, we added an experiment module to the shared ck-analytics repository to provide a universal container for experimental results.

It is possible to record new experiments to a new entry my_experiment as follows:

  r=ck.access({'action':'add',
               'module_uoa':'experiment',
               'data_uoa':'my_experiment',
               'repo_uoa':'my_repo',
               'dict': {
                 'meta': { # any coarse-grain meta of the experiment to find this entry
                   'platform':'my-platform',
                   ...},
                 'tags':['my exciting experiment', 'reproducible', 'changing the world'], # any tags to find this entry
                 'characteristics':{...},
                 'choices':{...},
                 'features':{...}
               }
              })
  if r['return']>0: return r

Using this simple format allows you to take advantage of other modules already shared in CK to apply statistical analysis and predictive analytics, crowdsource experiments, plot graphs, visualize tables in HTML, prepare web widgets with interactive graphs (for interactive papers), replay shared experiments, and so on. Please see the first examples in this Getting Started Guide for more details.

Also, please check out the following examples to add your own data sets, benchmarks, scripts, etc. in the CK format, so that you can take advantage of the above autotuning and predictive analytics techniques.

Supporting different platforms and environments

One of the major problems we faced when sharing experimental setups with our collaborators is that they often use considerably different operating systems (various flavors of Linux, Windows, Android).

Since CK is mostly a unifying JSON API proxy between the user's code and native OS programs, users still need to have all the necessary native tools installed.

It is already possible to share experimental setups with CK installed as a virtual machine image (for example, via Docker or VirtualBox).

However, since we are doing computer systems R&D, where software and even hardware change every day and become outdated very quickly, we often want to rebuild and rerun experimental setups using the latest native tools, compilers and libraries rather than quickly outdated VM images.

Furthermore, in the CK concept, we want all artifacts and experimental setups to be easily evolvable, extensible and reusable rather than kept as static snapshots.

Hence, we reserved three keys in the CK JSON API to identify the host OS, the target OS and the target device ID (for example, when multiple Android devices are connected to a host):

  • host_os
  • target_os
  • target_device_id

The first two keys are actually UOAs of entries of the os module (from the ck-env repository), used to describe differences between operating systems. You can see available OS descriptions via:

 $ ck list os

To get an idea of which kinds of parameters we describe per OS, you can check the meta description of generic 64-bit Linux via:

 $ ck load os:linux-64 --min

You can also check out the shared Android OS description via

 $ ck load os:android19-arm --min

We also added several CK modules platform* that help researchers get or set parameters of the host and target platforms:

 $ ck list module:platform*

For example, it is possible to detect all host platform parameters simply via

 $ ck detect platform

If you have an Android device connected to your host machine via adb, you can obtain its parameters via

 $ ck detect platform --target_os=android-32

It is also possible to share this description with cknowledge.org/repo (to keep track of CK-compatible devices) via:

 $ ck detect platform --share

If a platform is not yet supported, we hope that researchers will collaboratively improve, extend and share the above modules to gradually cover all possible platforms and operating systems.

Now researchers can reuse the above modules from their own CK-based scripts and modules to detect the default host and target OS and their parameters, or change them on demand:

 def my_func(i):

    # Check if user would like to change default host/target OS and device ID
    hos=i.get('host_os','')
    tos=i.get('target_os','')
    tdid=i.get('target_device_id','')

    # Detecting host OS params
    r=ck.access({'action':'detect',
                 'module_uoa':cfg['module_deps']['platform.os'],
                 'os':hos})
    if r['return']>0: return r
    hos=r['os_uid']      # UID of the host OS
    hosd=r['os_dict']    # Meta description (dict) of the host OS

    # Detecting target OS params
    r=ck.access({'action':'detect',
                 'module_uoa':cfg['module_deps']['platform.os'],
                 'os':tos,
                 'device_id':tdid})
    if r['return']>0: return r
    tos=r['os_uid']
    tosd=r['os_dict']    # Meta description (dict) of target OS
    tdid=r['device_id']  # Device ID (if unique, otherwise will ask to select)

Note that the JSON meta description of your module should then contain the following keys:

{
  "module_deps":{
    "platform.os":"41e31cc4496b8a8e",
    ...
  }
}

Alternatively, you can simply substitute cfg['module_deps']['platform.os'] with platform.os or 41e31cc4496b8a8e .

Sharing workloads

If you would like to add more workloads (programs and data sets) to be able to participate in collaborative benchmarking and tuning of computer systems, you should read this section.

Proxying (wrapping) native tools

You can read how to create portable workflows here.

Manually sharing CK repositories

Now you should be ready to share your experimental setups with your colleagues, the community or Artifact Evaluation Committee as CK repositories.

If your repository is local and not already shared via GitHub, you can easily share it with your colleagues, submit for Artifact Evaluation or add it to Digital Libraries as a standard zip (or any other) archive.

You can archive your CK repository into a zip file using the following command:

 $ ck zip repo:my_shared_repo

This will create an archive ckr-<my_shared_repo>.zip in your current directory. You can then easily share it via your web pages, Digital Libraries, Google/Microsoft Drive, etc. Anyone can then import such an archive into their CK installation directly from the web via:

 $ ck add repo:my_shared_repo --zip={URL of the CK zip archive} --quiet

Note that if you want to share such archives via BitTorrent, we use a specific naming convention, i.e. ckr-<repo UID>-YYYYMMDD.zip . You can create such an archive via:

 $ ck zip repo:my_shared_repo --bittorrent
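The naming convention above can be sketched as follows (illustrative only - CK generates this name automatically, and the repo UID shown is made up):

```python
import datetime

def bittorrent_archive_name(repo_uid, date=None):
    """Build an archive name following the ckr-<repo UID>-YYYYMMDD.zip convention."""
    d = date or datetime.date.today()
    return "ckr-%s-%s.zip" % (repo_uid, d.strftime("%Y%m%d"))

# With a hypothetical repo UID and a fixed date:
name = bittorrent_archive_name("0123456789abcdef", datetime.date(2018, 8, 12))
```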

Alternatively, it is possible to import a CK archive from a local directory (useful when large archives are downloaded first or shared via DVDs, USB keys and other media):

 $ ck add repo:my_shared_usb_repo --zip=ck-local-archive.zip --quiet

Note that we suggest adding an extra page to your related article or report describing your research technique as the following workflow template, to let readers quickly understand which artifacts you use:

You may also want to add a link to your zipped CK archive, torrent or GitHub/Bitbucket repository (unless you share your archive as a supplementary material via some official Digital Library).

You can find more details about managing repositories here.

Preparing Docker images

It is possible to create Docker images for your project using the ck-docker repository. You can obtain it via

 $ ck pull repo:ck-docker

You can read documentation about how to use CK to automate Docker functions including build, run and push here.

CK docker entries describe how to build and run images. Hence, you can list already available ones via

 $ ck list docker

and then take the one closest to your project, copy it, update its meta, and start building your own image in just a few steps. However, unlike traditional Docker images, yours will contain unified and reusable CK components with UIDs and a JSON API!

Evaluating others' artifacts (light "virtualization")

Note that you can install user repositories and their dependencies into a new CK space without mixing them with your current installation (similar to a virtual environment) as follows (substitute export with set on Windows):

 $ export CK_REPOS={path to a new CK space with all repos and dependencies}
 $ ck setup kernel --var.install_to_env=yes
 $ ck pull repo --url={URL of the artifact repository}

When the CK kernel variable install_to_env is set to yes, CK installs required packages inside this CK space. When you stop evaluating artifacts, you can then easily remove the whole CK space together with all installed packages.

CK-based open research and publication model

Our long-term dream (1, 2, 3, 4, 5) is to enable truly collaborative, rigorous and reproducible research and experimentation in computer engineering via realistic workload sharing, crowd-tuning and learning, similar to the physics and open-source software communities.

We also hope that CK will help fix our broken publication model in computer engineering. Currently, authors spend most of their effort trying to publish an idea (since it may take a few years to publish it at good conferences, while ideas are easily stolen during this process) and only later spend a little time trying to fit some experiments to support it (a very common pitfall in computer engineering, leading to numerous incremental and often irreproducible or even wrong papers).

Instead, CK allows authors to quickly prototype their ideas from shared components, crowdsource experiments, publish an open-access paper with a shared experimental setup, have the idea validated across many platforms and environments by volunteers (possibly collecting results remotely in the author's CK repository), publicly discuss it with the community (for example, via Reddit or Slashdot), fix problems, and later publish a validated and extended (possibly journal) paper (see our vision paper). We have successfully validated this model at the ADAPT'16 workshop and will continue promoting it!

Questions and comments

You are welcome to get in touch with the CK community if you have questions or comments!
