## Using GPUs on IBM POWER S822LC
![](https://github.com/dustinvanstee/random-public-files/raw/master/s822lc_nvidia.png)

In this lab, you will learn how to compare GPU and CPU performance using some simple matrix math.  We will cover how TensorFlow gives access to the GPU and CPU, and how you can tell the system which device to use.

You don't need to know TensorFlow to complete this lab, but understanding some TensorFlow basics and matrix math will help.

### Let's get started. First, let's import the TensorFlow library that has been provided by IBM PowerAI.

In [23]:
import tensorflow as tf
import math
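# Time helper
import timeit
def time_run(command_str):
    # Runs the code in command_str once and returns the elapsed time in seconds.
    # (Notebook alternative: rv = %timeit -o -n 1 command)
    rv = timeit.Timer(stmt=command_str).timeit(number=1)
    return rv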

# Print Python Version
import os
import sys
print(sys.version)


2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609]
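
The `time_run` helper takes a string of code rather than a function, which is how `timeit.Timer` expects its `stmt` argument. As a minimal sketch, here it is applied to a plain Python statement with no TensorFlow involved; the summing workload is a hypothetical stand-in chosen only for illustration.

```python
import timeit

def time_run(command_str):
    # Execute the statement string exactly once and return elapsed seconds.
    return timeit.Timer(stmt=command_str).timeit(number=1)

# Hypothetical workload (not part of the lab): sum the first million integers.
elapsed = time_run("total = sum(range(1000000))")
print("Runtime = " + str(elapsed) + " sec")
```

Because the statement runs only once (`number=1`), the result includes any one-time startup cost, so it is best read as a rough figure.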


### Verify GPU Usage with Small Matrix
Now let's build two small matrices and multiply them together.  Here A is a 2x3 matrix and B is a 3x2 matrix.  By default, TensorFlow will use the GPU on the system if one is available.  Let's run this code and verify that we are using the GPU.

Note: To find out which devices your operations and tensors are assigned to, create the session with the log_device_placement configuration option set to True, as shown below.

In [9]:
%%time
# Creates a graph.
tf.reset_default_graph() 
with tf.device('/gpu:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    print(sess.run(c))
    writer.close()

[[ 22.  28.]
 [ 49.  64.]]
CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 31 ms


When we started the Jupyter notebook, we also redirected all of its messages to a file so that we can check whether we are indeed using the GPU.  Let's read the last few lines of the log file (/tmp/tensorflow.log).

In [11]:
%%bash
tail -n 3 /tmp/tensorflow.log | grep gpu
#echo " ############" >>  /tmp/tensorflow.log 

2017-10-08 13:23:22.223420: I tensorflow/core/common_runtime/simple_placer.cc:841] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
2017-10-08 13:23:22.223441: I tensorflow/core/common_runtime/simple_placer.cc:841] b: (Const)/job:localhost/replica:0/task:0/gpu:0
2017-10-08 13:23:22.223466: I tensorflow/core/common_runtime/simple_placer.cc:841] a: (Const)/job:localhost/replica:0/task:0/gpu:0
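
If you prefer to check placements programmatically instead of reading the log by eye, the lines can be parsed with a regular expression. The following is a rough sketch that assumes the log format matches the three sample lines above; the names `PLACEMENT_RE` and `parse_placements` are made up for this illustration and are not part of TensorFlow.

```python
import re

# Sample lines copied from the log output above.
LOG_LINES = [
    "2017-10-08 13:23:22.223420: I tensorflow/core/common_runtime/simple_placer.cc:841] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0",
    "2017-10-08 13:23:22.223441: I tensorflow/core/common_runtime/simple_placer.cc:841] b: (Const)/job:localhost/replica:0/task:0/gpu:0",
    "2017-10-08 13:23:22.223466: I tensorflow/core/common_runtime/simple_placer.cc:841] a: (Const)/job:localhost/replica:0/task:0/gpu:0",
]

# Matches "] <op name>: (<op type>)/job:.../<cpu|gpu>:<index>" at the end of a line.
PLACEMENT_RE = re.compile(r"\] (\S+): \((\w+)\)\S*/(cpu|gpu):(\d+)$")

def parse_placements(lines):
    # Map each op name to the short device string it was placed on.
    placements = {}
    for line in lines:
        match = PLACEMENT_RE.search(line.strip())
        if match:
            name, _op_type, kind, index = match.groups()
            placements[name] = "/%s:%s" % (kind, index)
    return placements

placements = parse_placements(LOG_LINES)
for name in sorted(placements):
    print(name + " -> " + placements[name])
```

Lines that do not look like placement messages are simply skipped, so the same function can be pointed at the whole log file.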


### Verify that we can use the CPU as well with same example

Here we will explicitly specify that we want to use the CPU.  TensorFlow lets you do that with the tf.device() method.

Devices are represented by the following strings:

* /cpu:0 : The CPU of your machine
* /gpu:0 : The first GPU of your machine
* /gpu:1 : The second GPU of your machine
* ...
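
When you want to switch devices from a variable rather than hard-coding the string, a tiny helper can build it. This is only a sketch: `device_string` is a hypothetical name introduced here, and it covers just the `/cpu:N` and `/gpu:N` forms listed above.

```python
def device_string(kind, index=0):
    # Build a device string such as "/cpu:0" or "/gpu:1".
    kind = kind.lower()
    if kind not in ("cpu", "gpu"):
        raise ValueError("kind must be 'cpu' or 'gpu'")
    if index < 0:
        raise ValueError("index must be non-negative")
    return "/%s:%d" % (kind, index)

print(device_string("cpu"))
print(device_string("GPU", 1))
```

The returned string can be passed to tf.device() in place of a literal.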

In this cell we specify **/cpu:0** with the exact same code

In [12]:
%%time
tf.reset_default_graph() 
# Creates a graph.
with tf.device('/cpu:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)

# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    print(sess.run(c))
    writer.close()

[[ 22.  28.]
 [ 49.  64.]]
CPU times: user 20 ms, sys: 12 ms, total: 32 ms
Wall time: 29.7 ms


Now let's make sure that we used the CPU.

In [13]:
%%bash
tail -n 3 /tmp/tensorflow.log | grep cpu

2017-10-08 13:23:42.170233: I tensorflow/core/common_runtime/simple_placer.cc:841] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
2017-10-08 13:23:42.170257: I tensorflow/core/common_runtime/simple_placer.cc:841] b: (Const)/job:localhost/replica:0/task:0/cpu:0
2017-10-08 13:23:42.170283: I tensorflow/core/common_runtime/simple_placer.cc:841] a: (Const)/job:localhost/replica:0/task:0/cpu:0


Notice that all the operations are now mapped to the CPU.


Also notice that the two runs take about the same time (31 ms vs. 29.7 ms wall time), so there is no real speedup.  That's because the example is too small to make use of the parallelism advantage inherent in the GPU.  Next, we will rerun, but this time with very large matrices.
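
To get a feel for how much more work the large case involves, you can count the arithmetic directly. Multiplying an m x k matrix by a k x n matrix produces m*n output cells, each needing k multiply-adds. The sketch below applies that to the 2x3 by 3x2 example and to the 20000x10000 by 10000x10000 shapes used in the next cell; `matmul_cost` is a hypothetical helper written for this illustration.

```python
def matmul_cost(m, k, n):
    # (m x k) times (k x n): m*n output cells, k multiply-adds per cell.
    return {"output_cells": m * n, "multiply_adds": m * k * n}

small = matmul_cost(2, 3, 2)
large = matmul_cost(20000, 10000, 10000)

print("small: " + str(small["multiply_adds"]) + " multiply-adds")
print("large: " + str(large["multiply_adds"]) + " multiply-adds")
```

The small case needs only 12 multiply-adds, far too few to keep thousands of GPU cores busy, while the large case needs 2 trillion.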

### GPU Speedup test with large matrices (200 million and 100 million cells)

In [2]:
%%time
tf.reset_default_graph() 

mat1 = tf.random_normal([20000,10000],mean=0.0, stddev=1.0, dtype=tf.float32)
mat2 = tf.random_normal([10000,10000],mean=0.0, stddev=1.0, dtype=tf.float32)
mat3 = tf.matmul(mat1, mat2)

# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    print(sess.run(mat3))
    writer.close()

[[ -5.07367477e+01   1.39379993e-01  -1.53627777e+02 ...,   5.84249687e+01
    2.82826061e+01   1.28241638e+02]
 [  4.70148773e+01   2.84630737e+01   9.40551758e+01 ...,  -8.43005447e+01
   -4.17779198e+01  -7.09997330e+01]
 [  7.76378632e+01   8.77513790e+00   1.40971050e+01 ...,   2.48998404e+00
    1.97348862e+02   2.57942322e+02]
 ..., 
 [  9.67471390e+01  -1.07339268e+01   1.13605019e+02 ...,  -4.05519371e+01
    3.17055073e+01  -8.93909531e+01]
 [ -4.58382835e+01   4.80116558e+00  -1.30733433e+01 ...,   3.78868942e+01
   -3.17662792e+01  -5.00178635e-01]
 [  7.40732651e+01   1.87222099e+01   5.84146767e+01 ...,   1.65686737e+02
   -6.16308594e+01  -4.19055481e+01]]
CPU times: user 836 ms, sys: 380 ms, total: 1.22 s
Wall time: 1.16 s


In [None]:
Notice

In [4]:
%%time
tf.reset_default_graph() 

with tf.device('/cpu:0'):
    mat1 = tf.random_normal([20000,10000],mean=0.0, stddev=1.0, dtype=tf.float32)
    mat2 = tf.random_normal([10000,10000],mean=0.0, stddev=1.0, dtype=tf.float32)
    mat3 = tf.matmul(mat1, mat2)

# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    print(sess.run(mat3))
    writer.close()

[[-139.45901489  -88.81645203  -43.53620148 ...,   52.38720703
    75.01547241   50.5382843 ]
 [  45.14087296   -1.33799887 -191.29249573 ...,   61.72611237
     2.22061539  -27.41448212]
 [-121.50714111 -121.08638763  -77.48414612 ...,   40.93670273
    76.60482025   36.93449402]
 ..., 
 [  64.00627899  105.14516449  122.32860565 ..., -151.68490601
   -96.64914703 -186.93591309]
 [ -12.83614635  -57.50984955  170.74624634 ...,  -59.76148987
   -66.51875305   41.46344376]
 [-172.24754333    4.39118052  -36.03226852 ..., -148.16416931
    40.98770905  -10.31962013]]
CPU times: user 17min 34s, sys: 1.04 s, total: 17min 35s
Wall time: 33.4 s


In [None]:
If you were to check vmstat in your terminal, you would notice that all 32 CPU cores are working to multiply this matrix.
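
The timings printed by the two large runs above can be turned into a rough speedup figure with a little arithmetic. This sketch just plugs in the reported numbers (GPU wall time 1.16 s, CPU wall time 33.4 s, CPU user time 17 min 34 s). Treat the result as approximate, since each measurement comes from a single run and includes graph setup and random-number generation.

```python
# Timings copied from the %%time output of the two large runs above.
gpu_wall_sec = 1.16
cpu_wall_sec = 33.4
cpu_user_sec = 17 * 60 + 34  # 17min 34s of CPU time summed over all cores

# Wall-clock speedup of the GPU run over the CPU run.
speedup = cpu_wall_sec / gpu_wall_sec

# Average number of cores kept busy during the CPU run.
busy_cores = cpu_user_sec / cpu_wall_sec

print("GPU speedup: about " + str(round(speedup, 1)) + "x")
print("Average busy CPU cores: about " + str(round(busy_cores, 1)))
```

The user-time to wall-time ratio comes out close to 32, which is consistent with all 32 CPU cores being busy during the multiplication.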

In [4]:
%timeit -o -n 1 a=1

1 loop, best of 3: 954 ns per loop


In [6]:
print "Runtime = " + str(a) + " sec"

### GPU SPEEDUP

In [24]:
def buildRuns(mata_size, matb_size):
    # Returns two code strings (GPU run, CPU run) that multiply a
    # [mata_size, matb_size] matrix by a [matb_size, mata_size] matrix,
    # so the inner dimensions always match.
    mata_size = str(mata_size)
    matb_size = str(matb_size)

    GPURUN = """
import tensorflow as tf
tf.reset_default_graph()

mat1 = tf.random_normal([""" + mata_size + ',' + matb_size + """],mean=0.0, stddev=1.0, dtype=tf.float32)
mat2 = tf.random_normal([""" + matb_size + ',' + mata_size + """],mean=0.0, stddev=1.0, dtype=tf.float32)
mat3 = tf.matmul(mat1, mat2)

# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    sess.run(mat3)
    writer.close()
    """

    CPURUN = """
import tensorflow as tf
tf.reset_default_graph()

with tf.device('/cpu:0'):
    mat1 = tf.random_normal([""" + mata_size + ',' + matb_size + """],mean=0.0, stddev=1.0, dtype=tf.float32)
    mat2 = tf.random_normal([""" + matb_size + ',' + mata_size + """],mean=0.0, stddev=1.0, dtype=tf.float32)
    mat3 = tf.matmul(mat1, mat2)

# Creates a session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    sess.run(mat3)
    writer.close()
    """
    return (GPURUN, CPURUN)

In [25]:
%%time
(GPURUN,CPURUN) = buildRuns(10,10)
#print CPURUN
runs = [10,100]
gpu_time=time_run(GPURUN)
cpu_time=time_run(CPURUN)
speedup = cpu_time / gpu_time
print("Speedup for a 10x10 matrix multiply = " + str(speedup))

For more reading about GPUs ....

Useful Stack Overflow threads:

* https://stackoverflow.com/questions/46178961/duplicate-tensorflow-placeholder-variables
* https://stackoverflow.com/questions/37660312/run-tensorflow-on-cpu