# 指定GPU

## 方式1：指定若干个GPU合并为一个GPU使用
这种方式与多GPU加速不一样，不能加速模型训练，只是为了“扩张”环境的显存。<br>
``
import os
os.environ["CUDA_VISIBLE_DEVICES"]='1,2'
``<br>
或者在终端中运行<br>
`CUDA_VISIBLE_DEVICES=2 python script.py`

## 方式2：指定网络在不同的GPU中
该方式能够指定不同变量在不同的GPU中，同样地不能加速模型训练，只是为了“扩张”环境的显存。与方式一不同的是，该方式能够对每个变量指定GPU位置。<br>
`tf.device('/GPU:1')`

# 显存控制

## 方式1：GPU 显存自动调用
程序根据自身需要自动申请显存空间<br>
``
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
``

## 方式2：设置程序最大GPU 显存空间
程序只能占据部分GPU显存空间<br>
``
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
session = tf.Session(config=config)
``

# 多GPU控制

该脚本在服务器上运行，终端无法执行ipynb文件。因此，只能讲结果写入注释(markdown)中。

``
服务器配置
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   36C    P8    16W / 250W |     25MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 26%   44C    P8    18W / 250W |     25MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 29%   48C    P8    20W / 250W |     25MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:84:00.0 Off |                  N/A |
| 23%   33C    P8    17W / 250W |     25MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
``
<br>
``
nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-0fb5007e-f81e-4154-0b90-4e9e3f0b76aa)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-b188c27e-1cd2-e2a5-3272-bf5739833af6)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-858b2a5a-db6c-de6e-8757-894aed84b2fd)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-c2fb8f72-4fc0-bdca-c124-59071ed50bb4)
``

程序运行时
![avatar](mgpu-nvidia-smi.png)

多GPU训练原理图
![avatar](mgpu.png)

图片来源：https://github.com/normanheckscher/mnist-multi-gpu/blob/master/images/Parallelism.png

In [1]:
# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from six.moves import xrange

In [2]:
import time

from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.client import device_lib
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_integer('batch_size', 100,
                            """Number of images to process in a batch.""")
tf.app.flags.DEFINE_string('data_dir', './MNIST_data',
                           """Path to the MNIST data directory.""")
tf.app.flags.DEFINE_integer('num_epochs', 10,
                            """Number of epochs to run trainer.""")


def _variable_on_cpu(name, shape):
    with tf.device('/cpu:0'):
        dtype = tf.float32
        var = tf.get_variable(name, shape, dtype=dtype)
    return var


def inference1(images, reuse=False):
    with tf.variable_scope('dense1', reuse=reuse):
        dense1 = tf.layers.dense(images, units=512, activation=tf.nn.relu,
                                 reuse=reuse)
    with tf.variable_scope('dense2', reuse=reuse):
        dense2 = tf.layers.dense(dense1, units=10, reuse=reuse)
    return dense2


def inference2(images):
    with tf.variable_scope('dense1') as scope:
        weight_w_1 = _variable_on_cpu(name='weight', shape=(784, 512))
        weight_b_1 = _variable_on_cpu(name='biases', shape=(512,))
        dense1 = tf.nn.relu(tf.matmul(images, weight_w_1) + weight_b_1, name=scope.name)
    with tf.variable_scope('dense2') as scope:
        weight_w_2 = _variable_on_cpu(name='weight', shape=(512, 10))
        weight_b_2 = _variable_on_cpu(name='biases', shape=(10,))
        dense2 = tf.nn.softmax(tf.matmul(dense1, weight_w_2) + weight_b_2, name=scope.name)
    return dense2


def loss(logits, labels):
    """Add L2Loss to all the trainable variables.

    Add summary for "Loss" and "Loss/avg".
    Args:
      logits: Logits from inference().
      labels: Labels from distorted_inputs or inputs(). 1-D tensor
              of shape [batch_size]

    Returns:
      Loss tensor of type float.
    """
    # Calculate the average cross entropy loss across the batch.
    labels = tf.cast(labels, tf.int64)
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits, name='cross_entropy_per_example')
    cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
    tf.add_to_collection('losses', cross_entropy_mean)

    # The total loss is defined as the cross entropy loss plus all of the weight
    # decay terms (L2 loss).
    return tf.add_n(tf.get_collection('losses'), name='total_loss')


def average_gradients(tower_grads):
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # Note that each grad_and_vars looks like the following:
        #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
        grads = []
        for g, _ in grad_and_vars:
            # Add 0 dimension to the gradients to represent the tower.
            expanded_g = tf.expand_dims(g, 0)

            # Append on a 'tower' dimension which we will average over below.
            grads.append(expanded_g)

        # Average over the 'tower' dimension.
        grad = tf.concat(grads, 0)
        grad = tf.reduce_mean(grad, 0)

        # Keep in mind that the Variables are redundant because they are shared
        # across towers. So .. we will just return the first tower's pointer to
        # the Variable.
        v = grad_and_vars[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)
    return average_grads


def check_available_gpus():
    local_devices = device_lib.list_local_devices()
    gpu_names = [x.name for x in local_devices if x.device_type == 'GPU']
    gpu_list = len(gpu_names)
    print('{0} GPUs are detected : {1}'.format(gpu_list, gpu_names))
    gpu_list = ['/GPU:%s' % (i) for i in xrange(gpu_list)]
    # gpu_list = ['/GPU:0', '/CPU:0']
    return gpu_list


def show_variables_info():
    for var in tf.all_variables():
        print((var.name, var.device))



  from ._conv import register_converters as _register_converters


In [3]:
def train_on_gpus():
    '''
    build a new graph on the first GPU device, and other graphs reuse this graph
     and they are built on other devices. These graphs' outputs(loss) will be processed
     on the first device. (Faster than store graph on memory)
    RECOMMEND!
    '''
    with tf.Graph().as_default():
        with tf.name_scope('placehold'):
            x = tf.placeholder(dtype=tf.float32, shape=[None, 784], name='placehold_x')
            y = tf.placeholder(dtype=tf.int64, shape=[None, ], name='placehold_y')
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)

        gpu_list = check_available_gpus()
        x_blocks = tf.split(x, num_or_size_splits=len(gpu_list))
        y_blocks = tf.split(y, num_or_size_splits=len(gpu_list))

        losses = []
        logits = []
        for ind, gpu_id in enumerate(gpu_list):
            print('creating tensorflow variables on', gpu_id)
            with tf.device(tf.DeviceSpec(device_type='GPU', device_index=ind)):
                with tf.variable_scope(tf.get_variable_scope(), reuse=(ind > 0)):
                    batch_logits = inference1(x_blocks[ind], reuse=(ind > 0))
                    logits.append(batch_logits)
                    batch_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
                        logits=batch_logits,
                        labels=y_blocks[ind])
                    losses.append(batch_loss)
        logits_op = tf.concat(logits, axis=0)
        loss_op = tf.reduce_mean(tf.concat(losses, axis=0))
        optimizer_op = tf.train.AdamOptimizer(1e-3) \
            .minimize(loss_op, global_step=global_step, colocate_gradients_with_ops=True)
        accuracy_op = tf.reduce_mean(
            tf.cast(tf.equal(tf.argmax(logits_op, axis=-1), y), dtype=tf.float32))

        # Start running operations on the Graph.
        config = tf.ConfigProto()
        # 根据需要自动申请
        config.gpu_options.allow_growth = True
        with tf.Session(config=config) as sess:
            sess.run(tf.global_variables_initializer())
            show_variables_info()

            mnist = input_data.read_data_sets(FLAGS.data_dir)
            batch_size = FLAGS.batch_size * len(gpu_list)

            print('gpus are:', gpu_list)
            for step in xrange(FLAGS.num_epochs):
                start = time.time()
                for i in xrange(mnist.train.num_examples // batch_size):
                    feed_dict = {x: mnist.train.images[i * batch_size:(i + 1) * batch_size],
                                 y: mnist.train.labels[i * batch_size:(i + 1) * batch_size]}
                    _ = sess.run([optimizer_op], feed_dict=feed_dict)
                print('Episode %s / %s, training time cost: %-.5f' %
                      (step + 1, FLAGS.num_epochs, time.time() - start))

                loss_value, acc_value = 0.0, 0.0
                for i in xrange(mnist.test.num_examples // batch_size):
                    feed_dict = {x: mnist.test.images[i * batch_size:(i + 1) * batch_size],
                                 y: mnist.test.labels[i * batch_size:(i + 1) * batch_size]}
                    _, batch_loss, batch_acc = sess.run([optimizer_op, loss_op, accuracy_op], feed_dict=feed_dict)
                    loss_value += batch_loss
                    acc_value += batch_acc
                loss_value /= mnist.test.num_examples // batch_size
                acc_value /= mnist.test.num_examples // batch_size
                print('Evaluate, loss: %-.5f, accuracy: %-.5f' % (loss_value, acc_value))
                

``
2018-07-17 21:31:58.625245: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-17 21:31:58.922081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-07-17 21:31:59.117733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-07-17 21:31:59.315410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-07-17 21:31:59.538070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:84:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-07-17 21:31:59.542568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2018-07-17 21:32:00.758096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-17 21:32:00.758161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3
2018-07-17 21:32:00.758175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y N N
2018-07-17 21:32:00.758181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N N N
2018-07-17 21:32:00.758187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   N N N Y
2018-07-17 21:32:00.758193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   N N Y N
2018-07-17 21:32:00.759096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 10390 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-07-17 21:32:00.858688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:1 with 10390 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-07-17 21:32:00.957837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:2 with 10390 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2018-07-17 21:32:01.057371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:3 with 10390 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)
2018-07-17 21:32:01.388051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2018-07-17 21:32:01.388251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-17 21:32:01.388267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3
2018-07-17 21:32:01.388276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y N N
2018-07-17 21:32:01.388283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N N N
2018-07-17 21:32:01.388289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   N N N Y
2018-07-17 21:32:01.388295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   N N Y N
2018-07-17 21:32:01.388684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10390 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-07-17 21:32:01.388806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10390 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-07-17 21:32:01.388905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10390 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2018-07-17 21:32:01.389443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10390 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)
WARNING:tensorflow:From multi_gpu_test.py:168: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
WARNING:tensorflow:From multi_gpu_test.py:217: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /home/duanwei/dw/venv/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /home/duanwei/dw/venv/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /home/duanwei/dw/venv/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /home/duanwei/dw/venv/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
4 GPUs are detected : ['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']
creating tensorflow variables on /GPU:0
creating tensorflow variables on /GPU:1
creating tensorflow variables on /GPU:2
creating tensorflow variables on /GPU:3
('global_step:0', '')
('dense1/dense/kernel:0', '/device:GPU:0')
('dense1/dense/bias:0', '/device:GPU:0')
('dense2/dense/kernel:0', '/device:GPU:0')
('dense2/dense/bias:0', '/device:GPU:0')
('beta1_power:0', '/device:GPU:0')
('beta2_power:0', '/device:GPU:0')
('dense1/dense/kernel/Adam:0', '/device:GPU:0')
('dense1/dense/kernel/Adam_1:0', '/device:GPU:0')
('dense1/dense/bias/Adam:0', '/device:GPU:0')
('dense1/dense/bias/Adam_1:0', '/device:GPU:0')
('dense2/dense/kernel/Adam:0', '/device:GPU:0')
('dense2/dense/kernel/Adam_1:0', '/device:GPU:0')
('dense2/dense/bias/Adam:0', '/device:GPU:0')
('dense2/dense/bias/Adam_1:0', '/device:GPU:0')
Extracting ./MNIST_data/train-images-idx3-ubyte.gz
Extracting ./MNIST_data/train-labels-idx1-ubyte.gz
Extracting ./MNIST_data/t10k-images-idx3-ubyte.gz
Extracting ./MNIST_data/t10k-labels-idx1-ubyte.gz
gpus are: ['/GPU:0', '/GPU:1', '/GPU:2', '/GPU:3']
Episode 1 / 10, training time cost: 1.25556
Evaluate, loss: 0.20625, accuracy: 0.93990
Episode 2 / 10, training time cost: 0.49396
Evaluate, loss: 0.12953, accuracy: 0.96330
Episode 3 / 10, training time cost: 0.49255
Evaluate, loss: 0.09254, accuracy: 0.97290
Episode 4 / 10, training time cost: 0.50786
Evaluate, loss: 0.07031, accuracy: 0.97900
Episode 5 / 10, training time cost: 0.49639
Evaluate, loss: 0.05516, accuracy: 0.98440
Episode 6 / 10, training time cost: 0.49450
Evaluate, loss: 0.04426, accuracy: 0.98860
Episode 7 / 10, training time cost: 0.48309
Evaluate, loss: 0.03612, accuracy: 0.99130
Episode 8 / 10, training time cost: 0.49725
Evaluate, loss: 0.02988, accuracy: 0.99290
Episode 9 / 10, training time cost: 0.49990
Evaluate, loss: 0.02475, accuracy: 0.99420
Episode 10 / 10, training time cost: 0.49276
Evaluate, loss: 0.02050, accuracy: 0.99500
``

In [4]:
def train_on_cpus_and_gpus():
    '''
    build graph on memory firstly, compute on GPU device
    '''
    with tf.Graph().as_default():
        with tf.name_scope('placehold'):
            x = tf.placeholder(dtype=tf.float32, shape=[None, 784], name='placehold_x')
            y = tf.placeholder(dtype=tf.int64, shape=[None, ], name='placehold_y')
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)

        gpu_list = check_available_gpus()
        x_blocks = tf.split(x, num_or_size_splits=len(gpu_list))
        y_blocks = tf.split(y, num_or_size_splits=len(gpu_list))

        optimizer_op = tf.train.AdamOptimizer(1e-3)
        gradients_list = []
        logits = []
        with tf.variable_scope(tf.get_variable_scope()):
            for ind, gpu_id in enumerate(gpu_list):
                print('creating tensorflow variables on', gpu_id)
                with tf.device(gpu_id):
                    batch_logits = inference2(x_blocks[ind])
                    logits.append(batch_logits)
                    _ = loss(batch_logits, y_blocks[ind])
                    losses = tf.get_collection('losses')
                    loss_op = tf.add_n(losses, name='total_loss')

                    # Reuse variables for the next tower.
                    tf.get_variable_scope().reuse_variables()

                    gradients = optimizer_op.compute_gradients(loss_op)
                    gradients_list.append(gradients)
        gradients = average_gradients(gradients_list)
        train_op = optimizer_op.apply_gradients(gradients, global_step=global_step)
        accuracy_op = tf.reduce_mean(tf.cast(tf.equal(
            tf.argmax(tf.concat(logits, axis=0), axis=-1), y),
            dtype=tf.float32))

        # Start running operations on the Graph.
        config = tf.ConfigProto()
        # 根据需要自动申请
        config.gpu_options.allow_growth = True
        with tf.Session(config=config) as sess:
            sess.run(tf.global_variables_initializer())
            show_variables_info()

            mnist = input_data.read_data_sets(FLAGS.data_dir)
            batch_size = FLAGS.batch_size * len(gpu_list)

            print('gpus are:', gpu_list)
            for step in xrange(FLAGS.num_epochs):
                start = time.time()
                for i in xrange(mnist.train.num_examples // batch_size):
                    feed_dict = {x: mnist.train.images[i * batch_size:(i + 1) * batch_size],
                                 y: mnist.train.labels[i * batch_size:(i + 1) * batch_size]}
                    _ = sess.run([train_op], feed_dict=feed_dict)
                print('Episode %s / %s, training time cost: %-.5f' %
                      (step + 1, FLAGS.num_epochs, time.time() - start))

                loss_value, acc_value = 0.0, 0.0
                for i in xrange(mnist.test.num_examples // batch_size):
                    feed_dict = {x: mnist.test.images[i * batch_size:(i + 1) * batch_size],
                                 y: mnist.test.labels[i * batch_size:(i + 1) * batch_size]}
                    _, batch_loss, batch_acc = sess.run([train_op, loss_op, accuracy_op], feed_dict=feed_dict)
                    loss_value += batch_loss
                    acc_value += batch_acc
                loss_value /= mnist.test.num_examples // batch_size
                acc_value /= mnist.test.num_examples // batch_size
                print('Evaluate, loss: %-.5f, accuracy: %-.5f' % (loss_value, acc_value))



``
2018-07-17 21:25:38.061286: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-17 21:25:38.368119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-07-17 21:25:38.566151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-07-17 21:25:38.793769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-07-17 21:25:39.025796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:84:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-07-17 21:25:39.030258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2018-07-17 21:25:40.249204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-17 21:25:40.249267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3
2018-07-17 21:25:40.249282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y N N
2018-07-17 21:25:40.249290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N N N
2018-07-17 21:25:40.249296: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   N N N Y
2018-07-17 21:25:40.249303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   N N Y N
2018-07-17 21:25:40.250104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 10390 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-07-17 21:25:40.350039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:1 with 10390 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-07-17 21:25:40.449366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:2 with 10390 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2018-07-17 21:25:40.550163: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:3 with 10390 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)
2018-07-17 21:25:41.313835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2018-07-17 21:25:41.314027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-17 21:25:41.314042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3
2018-07-17 21:25:41.314051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y N N
2018-07-17 21:25:41.314058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N N N
2018-07-17 21:25:41.314065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   N N N Y
2018-07-17 21:25:41.314071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   N N Y N
2018-07-17 21:25:41.314438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10390 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-07-17 21:25:41.314570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10390 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-07-17 21:25:41.314677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10390 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2018-07-17 21:25:41.315636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10390 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)
WARNING:tensorflow:From multi_gpu_test.py:168: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
WARNING:tensorflow:From multi_gpu_test.py:290: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /home/duanwei/dw/venv/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /home/duanwei/dw/venv/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /home/duanwei/dw/venv/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /home/duanwei/dw/venv/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
4 GPUs are detected : ['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']
creating tensorflow variables on /GPU:0
creating tensorflow variables on /GPU:1
creating tensorflow variables on /GPU:2
creating tensorflow variables on /GPU:3
('global_step:0', '')
('dense1/weight:0', '/device:CPU:0')
('dense1/biases:0', '/device:CPU:0')
('dense2/weight:0', '/device:CPU:0')
('dense2/biases:0', '/device:CPU:0')
('beta1_power:0', '/device:CPU:0')
('beta2_power:0', '/device:CPU:0')
('dense1/weight/Adam:0', '/device:CPU:0')
('dense1/weight/Adam_1:0', '/device:CPU:0')
('dense1/biases/Adam:0', '/device:CPU:0')
('dense1/biases/Adam_1:0', '/device:CPU:0')
('dense2/weight/Adam:0', '/device:CPU:0')
('dense2/weight/Adam_1:0', '/device:CPU:0')
('dense2/biases/Adam:0', '/device:CPU:0')
('dense2/biases/Adam_1:0', '/device:CPU:0')
Extracting ./MNIST_data/train-images-idx3-ubyte.gz
Extracting ./MNIST_data/train-labels-idx1-ubyte.gz
Extracting ./MNIST_data/t10k-images-idx3-ubyte.gz
Extracting ./MNIST_data/t10k-labels-idx1-ubyte.gz
gpus are: ['/GPU:0', '/GPU:1', '/GPU:2', '/GPU:3']
Episode 1 / 10, training time cost: 1.56065
Evaluate, loss: 6.19163, accuracy: 0.92630
Episode 2 / 10, training time cost: 0.66851
Evaluate, loss: 6.09537, accuracy: 0.94500
Episode 3 / 10, training time cost: 0.64818
Evaluate, loss: 6.04356, accuracy: 0.95770
Episode 4 / 10, training time cost: 0.64884
Evaluate, loss: 6.01031, accuracy: 0.96330
Episode 5 / 10, training time cost: 0.65422
Evaluate, loss: 5.98554, accuracy: 0.96980
Episode 6 / 10, training time cost: 0.65530
Evaluate, loss: 5.96381, accuracy: 0.97430
Episode 7 / 10, training time cost: 0.65614
Evaluate, loss: 5.94788, accuracy: 0.97850
Episode 8 / 10, training time cost: 0.65916
Evaluate, loss: 5.93362, accuracy: 0.98070
Episode 9 / 10, training time cost: 0.68927
Evaluate, loss: 5.92296, accuracy: 0.98340
Episode 10 / 10, training time cost: 0.67347
Evaluate, loss: 5.91540, accuracy: 0.98610
``

In [5]:
# main function
def main(argv=None):
    print('choose a multi-gpu function to run')
    # train_on_gpus()
    # train_on_cpus_and_gpus()


if __name__ == '__main__':
    tf.app.run()

choose a multi-gpu function to run


SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
