Reproducibility report and the Collective Knowledge workflow for the SysML'19 paper "Priority-based Parameter Propagation for Distributed DNN Training"

This repository contains the reproducibility report for the SysML'19 paper "Priority-based Parameter Propagation for Distributed DNN Training". Feel free to continue evaluating all experimental results from this paper and report your feedback here.

Artifact check-list (meta-information)

  • Program: python (3.5+), "Priority-based Parameter Propagation (P3)", MXNet, Models {ResNet-50, InceptionV3, VGG-19 and Sockeye}
  • Compilation: GCC 5.4 or above, CUDA 8 or above, cuDNN6 or above
  • Data set: ImageNet1K data set; IWSLT15 data set is included
  • Run-time environment: Ubuntu 16.04+
  • Hardware: Requires more than one machine (four recommended), each equipped with Nvidia GPUs and a high-bandwidth interconnect (at least 10 Gbps) (tested on AWS and G5k)
  • Metrics: The primary metric of comparison is the average training throughput
  • How much disk space is required (approximately)?: About 200 GB per machine should be enough for running experiments; data set preparation might require about 500 GB
  • How much time is needed to prepare workflow (approximately)?: About one hour to prepare the dataset and compile the source code
  • How much time is needed to complete experiments (approximately)?: About 15 minutes per benchmark
  • Code license: Apache License 2.0

Installation

We implemented a simple CK workflow (pipeline) with shared CK packages for this project's code, models, and data sets to automate and facilitate validation of the results.

CK framework

Install CK as described here.

CK workflow (pipeline) for this paper

$ ck pull repo:reproduce-sysml19-paper-p3

Note that CK will pull all other related repositories. If you have already installed CK repositories, you can update them all at any time as follows:

$ ck pull all

Installing packages

Install P3 tool from this paper via CK either from GitHub or Zenodo:

$ ck install package:sysml19-p3-github
or
$ ck install package:sysml19-p3-zenodo

CK will automatically attempt to detect GCC, CUDA and cuDNN, and will install OpenCV and OpenBLAS into user space.
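
Before installing the P3 package, you may want to check that the compiler, CUDA toolkit and GPU driver are visible on each machine (a manual sanity check, not part of the CK workflow):

$ gcc --version
$ nvcc --version
$ nvidia-smi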

Install small ImageNet1K train data set just to test workflow (with batch size 1):

$ ck install package:imagenet-2012-train-min

Install a package which will convert this dataset to P3 format:

$ ck install package:dataset-imagenet-2012-train-p3

Later, you can install the complete ImageNet1K train data set (it may take a day to download and may require about 500 GB of space):

$ ck install package:imagenet-2012-train
$ ck install package:dataset-imagenet-2012-train-p3

Note that if you already have ImageNet1K downloaded and extracted somewhere, you can ask CK to detect it rather than downloading it again:

$ ck detect soft:dataset.imagenet.train --search_dirs={path to downloaded and extracted ImageNet1K}
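
For example, assuming the extracted training images live under a hypothetical path such as /data/ILSVRC2012/train:

$ ck detect soft:dataset.imagenet.train --search_dirs=/data/ILSVRC2012/train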

Evaluation

We created a CK program workflow (pipeline) with meta-information that describes dependencies (code, models and data sets), automates their installation during the first execution (P3, data sets, etc.), and assembles different command lines.

The pre-processing CK script (preprocess.py) prepares the list of hosts to run experiments on; the post-processing CK script (postprocess.py) parses the output and unifies the different metrics.

Cluster preparation

You need to register a list of hosts to run experiments on. First, create a "hosts.json" file with a list of IPs or host names (make sure that you can ssh to them without a password):

["chifflet-2", "chifflet-4"]

Now register this configuration in CK under some name, such as "grid5000", as follows:

$ ck add machine:grid5000 --type=cluster --config_file=hosts.json

When asked about the remote node OS, select linux-64. You can view all registered configurations of target platforms as follows:

$ ck show machine

ImageNet experiments

You can now run ImageNet experiments as follows:

$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=resnet
$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=inception-v3
$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=vgg

You can change the default batch size (32) as follows:

$ ck run program:sysml19-p3 --target=grid5000 \
                            --cmd_key=resnet \
                            --env.BATCH_SIZE=32

IWSLT15 experiments

You can also run IWSLT15 experiments as follows:

$ ck run program:sysml19-p3 --target=grid5000 --cmd_key=sockeye --env.OUTPUT_FILE=/tmp/sockeye_1.5-iwslt15_en-vi.sh

Validated results on GRID5000: link.

Suggestions

We expect the community to continue validating results from this and other SysML'19 papers (see our notes and example).

Reproducibility badges

We awarded the following badges based on the above evaluation:

ACM badges

  • ACM artifacts available
  • ACM artifacts evaluated - reusable

cTuning foundation badges

  • automation workflow
