## Using GPUs on IBM POWER S822LC
![](https://github.com/dustinvanstee/random-public-files/raw/master/s822lc_nvidia.png)

In this lab, you will compare GPU and CPU performance using some simple matrix math. We will cover how TensorFlow gives access to the GPU and the CPU, and how you can tell the system which device to use.

You don't need to know TensorFlow to complete this lab, but understanding some TensorFlow basics and matrix math will help.

### Let's get started: first, import the TensorFlow library provided by IBM PowerAI

In [1]:
import tensorflow as tf
import math
# Time helper
import timeit
def time_run(command_str) :
    #rv=%timeit -o -n 1 command
    rv= timeit.Timer(stmt=command_str).timeit(number=1)
    return rv

# Print Python Version
import os
import sys
print(sys.version)


2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609]
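
The `time_run` helper above hands a string of Python source to `timeit.Timer` and executes it exactly once, returning the elapsed time in seconds. A minimal sketch of how it can be called, using a hypothetical stand-in workload (`SAMPLE` is not part of the lab) instead of TensorFlow code:

```python
import timeit

def time_run(command_str):
    """Run the given source string once and return the elapsed seconds."""
    return timeit.Timer(stmt=command_str).timeit(number=1)

# Hypothetical stand-in workload; any valid Python source string works here.
SAMPLE = """
total = 0
for i in range(100000):
    total += i
"""

elapsed = time_run(SAMPLE)
print("Runtime = %.6f sec" % elapsed)
```

Because the statement runs only once, the measurement includes any one-time startup cost, so treat a single number as a rough indication rather than a precise benchmark.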


### Verify GPU Usage with Small Matrix
Now let's build two small matrices and multiply them together. Here A is a 2x3 matrix and B is a 3x2 matrix, so their product is a 2x2 matrix. By default, TensorFlow will use the GPU on the system if one is available; the cell below also requests `/gpu:0` explicitly. Let's run this code and verify that we are using the GPU.

Note: To find out which devices your operations and tensors are assigned to, create the session with the log_device_placement configuration option set to True, as shown below.

In [2]:
%%time
# Creates a graph.
tf.reset_default_graph()
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Creates a single session with log_device_placement set to True.
# The context manager closes the session on exit.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Writes the graph definition to ./graphs.
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    print(sess.run(c))
    writer.close()

[[ 22.  28.]
 [ 49.  64.]]
CPU times: user 136 ms, sys: 112 ms, total: 248 ms
Wall time: 1.19 s
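
As a sanity check on the printed result, the same product can be worked out without TensorFlow. The following is a minimal pure-Python sketch; the `matmul` helper is hypothetical and written only for illustration, and it assumes the six values are laid out row by row, as `tf.constant` does with the given `shape`.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix, both given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    # Each output cell is the dot product of a row of `a` and a column of `b`.
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

# Same values as the TensorFlow cell, filled in row by row.
a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]   # shape [2, 3]
b = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]        # shape [3, 2]

c = matmul(a, b)
print(c)  # [[22.0, 28.0], [49.0, 64.0]]
```

For example, the top-left cell is 1*1 + 2*3 + 3*5 = 22, which matches the TensorFlow output.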


When we started the Jupyter notebook, we also redirected all log messages to a file so that we can check whether we are indeed using the GPU. Let's read the last few lines of the log file (/tmp/tensorflow.log).

In [3]:
%%bash
tail -n 3 /tmp/tensorflow.log | grep gpu
#echo " ############" >>  /tmp/tensorflow.log 

### Verify that we can use the CPU as well with the same example

Here we explicitly specify that we want to use the CPU. TensorFlow lets you do that by wrapping the graph definition in a tf.device() block.

Devices are represented by the following strings:

* /cpu:0 : The first CPU of your machine
* /gpu:0 : The first GPU of your machine
* /gpu:1 : The second GPU of your machine
* ...

In this cell we specify **/cpu:0** with otherwise identical code.

In [4]:
%%time
tf.reset_default_graph()
# Creates a graph.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# Creates a single session with log_device_placement set to True.
# The context manager closes the session on exit.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Writes the graph definition to ./graphs.
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    print(sess.run(c))
    writer.close()

[[ 22.  28.]
 [ 49.  64.]]
CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 36 ms


Now let's make sure that we used the CPU:

In [13]:
%%bash
tail -n 3 /tmp/tensorflow.log | grep cpu

2017-10-08 13:23:42.170233: I tensorflow/core/common_runtime/simple_placer.cc:841] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
2017-10-08 13:23:42.170257: I tensorflow/core/common_runtime/simple_placer.cc:841] b: (Const)/job:localhost/replica:0/task:0/cpu:0
2017-10-08 13:23:42.170283: I tensorflow/core/common_runtime/simple_placer.cc:841] a: (Const)/job:localhost/replica:0/task:0/cpu:0


Notice that all of the operations are now mapped to the CPU.
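
Rather than reading the placement lines by eye, the same check can be scripted. The sketch below is an assumption-laden illustration: the `parse_placements` helper and its regular expression are hypothetical, and they only assume the line layout visible in the log excerpt above (`<op name>: (<op type>)<device path>`), which may differ in other TensorFlow versions.

```python
import re

# Matches "<op name>: (<op type>)<device path>" at the end of a placement line.
PLACEMENT = re.compile(
    r"\]\s+(?P<name>[^:\s]+): \((?P<type>\w+)\)(?P<device>/\S+)$")

def parse_placements(lines):
    """Return {op name: device type} for every placement line, e.g. {'a': 'cpu'}."""
    placements = {}
    for line in lines:
        match = PLACEMENT.search(line.strip())
        if match:
            # The device path ends in "/<type>:<index>", e.g. ".../cpu:0".
            last_part = match.group("device").rsplit("/", 1)[-1]
            placements[match.group("name")] = last_part.split(":")[0].lower()
    return placements

# The three lines shown in the log excerpt above.
log_lines = [
    "2017-10-08 13:23:42.170233: I tensorflow/core/common_runtime/simple_placer.cc:841] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0",
    "2017-10-08 13:23:42.170257: I tensorflow/core/common_runtime/simple_placer.cc:841] b: (Const)/job:localhost/replica:0/task:0/cpu:0",
    "2017-10-08 13:23:42.170283: I tensorflow/core/common_runtime/simple_placer.cc:841] a: (Const)/job:localhost/replica:0/task:0/cpu:0",
]

placements = parse_placements(log_lines)
all_on_cpu = all(device == "cpu" for device in placements.values())
print(sorted(placements.items()))
print("All ops on CPU:", all_on_cpu)
```

In the notebook, `log_lines` could instead be read from /tmp/tensorflow.log, the file that the `tail` cell inspects.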


Also notice that the GPU gives no real speedup here; the wall times recorded above actually favor the CPU. That's because the example is too small to make use of the parallelism inherent in the GPU. Next, we will rerun the test, but this time with very large matrices.

### GPU speedup test with large matrices (200 million cells)

In [5]:
%%time
tf.reset_default_graph()

# No tf.device() block here, so TensorFlow uses the GPU if one is available.
mat1 = tf.random_normal([20000, 10000], mean=0.0, stddev=1.0, dtype=tf.float32)
mat2 = tf.random_normal([10000, 10000], mean=0.0, stddev=1.0, dtype=tf.float32)
mat3 = tf.matmul(mat1, mat2)

# Creates a single session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Writes the graph definition to ./graphs.
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    print(sess.run(mat3))
    writer.close()

[[ -83.30841827 -111.36001587  -97.82980347 ..., -136.71658325
    12.19015217  122.71456909]
 [ 133.52589417   71.28384399  -13.48345375 ...,  -22.18257904
   -44.20858383  -21.3833828 ]
 [  41.2475853   -25.73634148  -60.75304794 ...,   73.06689453
   -36.35222626 -178.65823364]
 ..., 
 [ 156.53274536  -26.55888176   77.44717407 ..., -123.70824432
   -49.55231476   97.03790283]
 [   9.20714569 -139.72694397   32.86782455 ...,  161.72398376
   -65.95629883  -49.11345673]
 [-126.27427673  -82.06050873  -15.8848114  ...,  214.49969482
   -34.09952545  121.54418182]]
CPU times: user 836 ms, sys: 216 ms, total: 1.05 s
Wall time: 997 ms
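
To put the size of this job in perspective, here is a back-of-the-envelope count for the shapes used in the cell above. It is only a sketch: the `matmul_cost` helper is hypothetical, and it assumes the textbook algorithm (one multiply-add per term of each dot product) and 4 bytes per float32 value. Real GPU and CPU libraries may organize the work differently.

```python
def matmul_cost(m, k, n, bytes_per_value=4):
    """Cell counts, result memory, and multiply-add count for [m, k] x [k, n]."""
    return {
        "mat1_cells": m * k,
        "mat2_cells": k * n,
        "result_cells": m * n,
        "result_megabytes": m * n * bytes_per_value / 1e6,
        "multiply_adds": m * k * n,
    }

# Shapes from the cell above: [20000, 10000] x [10000, 10000].
cost = matmul_cost(20000, 10000, 10000)
for key in sorted(cost):
    print("%-18s %s" % (key, cost[key]))
```

Under these assumptions the result has 200 million cells (about 800 MB as float32), and producing it takes 2 trillion multiply-adds, which is the kind of workload where GPU parallelism pays off.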


Notice the wall time of about 1 second for this run. Now let's run the same multiplication on the CPU.

In [6]:
%%time
tf.reset_default_graph() 

with tf.device('/cpu:0'):
    mat1 = tf.random_normal([20000,10000],mean=0.0, stddev=1.0, dtype=tf.float32)
    mat2 = tf.random_normal([10000,10000],mean=0.0, stddev=1.0, dtype=tf.float32)
    mat3 = tf.matmul(mat1, mat2)

# Creates a single session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Writes the graph definition to ./graphs.
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    print(sess.run(mat3))
    writer.close()

[[ -17.78213501  -96.12410736  -97.17634583 ...,  -43.82975006
   -64.35961151  -96.83940125]
 [ -45.54906464 -108.26152039   86.52057648 ...,   51.022995    -44.05748367
   -54.17123032]
 [  88.69494629   32.88764954   79.31280518 ...,  -89.83449554
   -38.50896454   17.49689865]
 ..., 
 [  72.64311218  150.71180725   10.20463371 ...,   82.57032776
    50.01182556 -248.75682068]
 [ -27.59160805  104.66156006  -16.61383629 ...,   40.97966766
   -24.98047829   11.04336452]
 [ -17.87324905   -4.24131775   99.99557495 ...,  -48.55776596
   -56.71062469   36.33722305]]
CPU times: user 17min 36s, sys: 2.04 s, total: 17min 38s
Wall time: 33.5 s


If you were to check vmstat in your terminal, you would notice that all 32 CPU cores are working to multiply these matrices.
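
The timings recorded above already hint at both the GPU speedup and the CPU core usage. The sketch below simply does the arithmetic on those numbers. The `to_seconds` helper is hypothetical, the figures come from single runs, and dividing CPU "user" time by wall time is only a rough estimate of how many cores were busy on average.

```python
def to_seconds(minutes=0, seconds=0.0, milliseconds=0.0):
    """Convert a %%time style reading into seconds."""
    return minutes * 60 + seconds + milliseconds / 1000.0

# Readings copied from the two large-matrix cells above.
gpu_wall = to_seconds(milliseconds=997)        # wall time, default device placement
cpu_wall = to_seconds(seconds=33.5)            # wall time, pinned to /cpu:0
cpu_user = to_seconds(minutes=17, seconds=36)  # CPU "user" time, pinned to /cpu:0

speedup = cpu_wall / gpu_wall      # how much faster the default (GPU) run finished
busy_cores = cpu_user / cpu_wall   # average number of cores kept busy on the CPU run

print("GPU speedup over CPU: %.1fx" % speedup)      # 33.6x
print("Average busy CPU cores: %.1f" % busy_cores)  # 31.5
```

An average of about 31.5 busy cores is consistent with the observation that all 32 cores take part in the multiplication.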

In [4]:
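# The -o flag returns a TimeitResult object; keep it so it can be printed later.
rv = %timeit -o -n 1 a=1

1 loop, best of 3: 954 ns per loop


In [6]:
# rv.best is the best measured time in seconds.
print("Runtime = " + str(rv.best) + " sec")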

### GPU SPEEDUP

In [24]:
def buildRuns(mata_size, matb_size):
    # mat1 has shape [mata_size, matb_size]; mat2 uses the transposed shape
    # so that the inner dimensions always match for tf.matmul.
    shape1 = '[' + str(mata_size) + ',' + str(matb_size) + ']'
    shape2 = '[' + str(matb_size) + ',' + str(mata_size) + ']'

    GPURUN = """
import tensorflow as tf
tf.reset_default_graph()

# No device is specified, so TensorFlow uses the GPU if one is available.
mat1 = tf.random_normal(""" + shape1 + """, mean=0.0, stddev=1.0, dtype=tf.float32)
mat2 = tf.random_normal(""" + shape2 + """, mean=0.0, stddev=1.0, dtype=tf.float32)
mat3 = tf.matmul(mat1, mat2)

# Creates a single session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    sess.run(mat3)
    writer.close()
"""

    CPURUN = """
import tensorflow as tf
tf.reset_default_graph()

with tf.device('/cpu:0'):
    mat1 = tf.random_normal(""" + shape1 + """, mean=0.0, stddev=1.0, dtype=tf.float32)
    mat2 = tf.random_normal(""" + shape2 + """, mean=0.0, stddev=1.0, dtype=tf.float32)
    mat3 = tf.matmul(mat1, mat2)

# Creates a single session with log_device_placement set to True.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    writer = tf.summary.FileWriter('./graphs', graph=tf.get_default_graph())
    # Runs the op.
    sess.run(mat3)
    writer.close()
"""
    return (GPURUN, CPURUN)

In [25]:
%%time
(GPURUN, CPURUN) = buildRuns(10, 10)
# print(CPURUN)
runs = [10, 100]  # candidate sizes; only the 10x10 case is timed here
gpu_time = time_run(GPURUN)
cpu_time = time_run(CPURUN)
speedup = cpu_time / gpu_time
print("Speedup for a 10x10 matrix multiply = " + str(speedup))

For more reading about GPUs ....

Useful Stack Overflow threads:

* https://stackoverflow.com/questions/46178961/duplicate-tensorflow-placeholder-variables
* https://stackoverflow.com/questions/37660312/run-tensorflow-on-cpu