Distributed TensorFlow benchmarks

This repository provides code and results for benchmarking distributed training on Piz Daint (a Slurm-based supercomputer) and on Amazon EC2 instances.

While we provide scripts to easily submit TensorFlow applications across multiple nodes on any of these systems, our main contribution is a comparison of different distributed settings, to find the one that gives the best performance for a given number of nodes and a given system. We use Google's benchmarking scripts for TensorFlow to measure throughput as the number of images trained per second. We test clusters with one and with multiple GPUs per node, and with different inter-node networks.

Slides, a report and an IPython notebook show the results for InceptionV3 in TensorFlow 1.1.0 (chosen so that our measurements can be compared with the ones available on the TensorFlow benchmarks page). Each test is run 5 times and the times are averaged. For each test, we pick the configuration that gives the best performance. Following Google's approach, each test performs 10 warmup steps and then averages the next 100 steps. All the measurements are available in this spreadsheet.
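The measurement procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by Google's benchmarking scripts or by this repository: the function names, the per-step timing input and the choice to average per-run throughput (rather than raw times) across the 5 runs are all hypothetical.

```python
from statistics import mean

WARMUP_STEPS = 10   # steps discarded before timing starts (from the text)
TIMED_STEPS = 100   # steps averaged after the warmup (from the text)


def images_per_second(step_times, batch_size,
                      warmup=WARMUP_STEPS, timed=TIMED_STEPS):
    """Throughput of one run.

    step_times: wall-clock duration of each training step, in seconds.
    batch_size: total number of images processed per step (assumed to be
                the global batch across all workers).
    """
    window = step_times[warmup:warmup + timed]
    if len(window) < timed:
        raise ValueError("need at least warmup + timed steps")
    return batch_size / mean(window)


def best_configuration(runs_by_config, batch_size):
    """Average the repeated runs of each configuration and return the
    (name, mean images/sec) pair of the fastest one.

    runs_by_config: dict mapping a hypothetical configuration name to a
                    list of runs, each run being a list of step times.
    """
    averaged = {
        name: mean(images_per_second(run, batch_size) for run in runs)
        for name, runs in runs_by_config.items()
    }
    best = max(averaged, key=averaged.get)
    return best, averaged[best]
```

For example, a run whose timed steps each take 0.5 s with 64 images per step yields 128 images per second; the slow warmup steps do not affect the result.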

Description of this repository

environments_setup/: folder containing scripts to quickly set up a local workstation, Piz Daint and AWS EC2 instances to run TensorFlow.
run_jupyter_notebooks/: folder containing scripts to easily start a Jupyter notebook on a local workstation, Piz Daint or an AWS EC2 instance.
distributed_tensorflow_launchers/: folder containing scripts to promptly launch a TensorFlow application across multiple processes on a local workstation, Piz Daint or AWS EC2 instances. In particular, a script takes care of starting Parameter Servers and Workers on the nodes allocated by Slurm.
MNIST/: folder containing TensorFlow's deep MNIST tutorial and two variations: a GPU-enhanced version that lets you specify the data format (NCHW or NHWC), and a distributed version of that GPU-enhanced one.
google-benchmarks/: folder containing scripts to run Google's benchmarking code on different systems.
report/: folder containing the report for the summer internship at CSCS.
presentation/: folder containing the slides for the end-of-internship seminar given at CSCS.
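
To make the Parameter Server and Worker setup mentioned for distributed_tensorflow_launchers/ more concrete, here is a minimal sketch of how roles could be assigned to a list of allocated nodes. It is not the launcher script from this repository: the function name, the port numbers, the example hostnames and the placement policy (Parameter Servers on the first nodes, one Worker on every node) are assumptions made for illustration. The returned dictionary has the job-name-to-address-list shape that `tf.train.ClusterSpec` accepts in TensorFlow 1.x.

```python
def build_cluster(hostnames, num_ps, ps_port=2222, worker_port=2223):
    """Map allocated nodes to Parameter Server and Worker addresses.

    hostnames: node names in allocation order (for instance, the expanded
               Slurm node list; obtaining it is outside this sketch).
    num_ps:    how many Parameter Servers to start.
    Ports are hypothetical defaults, not values taken from the repository.
    """
    if not 0 < num_ps <= len(hostnames):
        raise ValueError("num_ps must be between 1 and the number of nodes")
    # Assumed policy: the first num_ps nodes each host a Parameter Server.
    ps = ["{}:{}".format(host, ps_port) for host in hostnames[:num_ps]]
    # Assumed policy: every node hosts one Worker, on a different port so
    # that it can share a node with a Parameter Server.
    workers = ["{}:{}".format(host, worker_port) for host in hostnames]
    return {"ps": ps, "worker": workers}


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    print(build_cluster(["nid00001", "nid00002", "nid00003"], num_ps=1))
```

Sharing nodes between the two roles is only one possible layout; dedicating separate nodes to Parameter Servers is an equally valid choice, and which one performs best is the kind of question the benchmarks in this repository address.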

feifeibear/large-scale-tensorflow-benchmark
