<h1><center>Tensorflow Large Model Support (TFLMS) Tutorial - 01</center></h1>

## Large Model Support

TensorFlow Large Model Support (TFLMS) is a Python module that provides an approach to training large models and data that cannot normally be fit in to GPU memory. It takes a computational graph defined by users, and automatically adds swap-in and swap-out nodes for transferring tensors from GPUs to the host and vice versa. During training and inferencing this makes the graph execution operate like operating system memory paging. The system memory is effectively treated as a paging cache for the GPU memory and tensors are swapped back and forth between the GPU memory and CPU memory.

Click <a href= https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_tflms.html>here</a> for further information on how to use TFLMS and best practices.

TFLMS source code is publicly available as a pull request in the <a href= https://github.com/tensorflow/tensorflow/pull/19845>TensorFlow repository</a>.

Here are links to blog posts, papers, and videos that describe TensorFlow Large Model support, use cases, and performance characteristics.

1) <a href = https://arxiv.org/pdf/1807.02037.pdf> TFLMS: Large Model Support in TensorFlow by Graph Rewriting here </a>

2) <a href = https://www.youtube.com/watch?vwdVPh3tUQ5A/> 4 minute introduction to TensorFlow Large Model support </a> – This video is a good quick introduction to TensorFlow Large Model Support. Note that the performance numbers at the end of this video are now out dated. See the performance links below for updated performance numbers.

3) <a href = https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionNames9426-using+tensor+swapping+and+nvlink+to+overcome+gpu+memory+limits+with+tensorflow/> NVIDIA GPU Technology Conference 2019 presentation </a> – A 40 minute presentation that discusses the use of TFLMS to overcome GPU memory limits and performance characteristics of TFLMS.

4) <a href = https://developer.ibm.com/linuxonpower/2019/05/17/performance-results-with-tensorflow-large-model-support-v2/> Performance results with TensorFlow Large Model Support v2 </a>

5) <a href = https://developer.ibm.com/linuxonpower/2018/07/27/tensorflow-large-model-support-case-study-3d-image-segmentation/> A case study using TensorFlow Large Model Support with 3D U-Net for 3D image segmentation </a>

6) <a href = https://arxiv.org/abs/1812.07816> Fast and Accurate 3D Medical Image Segmentation with Data-swapping Method </a> – This paper contains a comparison of using TFLMS versus patching method for large images. It also contains a comparison of TFLMS vs gradient checkpointing.

7) <a href = https://arxiv.org/abs/1811.12174> Data-parallel distributed training of very large models beyond GPU capacity </a> – This paper contains a real world use case of using TFLMS with IBM Distributed Deep Learning.

8) <a href = https://developer.ibm.com/linuxonpower/2018/12/19/performance-of-3dunet-multi-gpu-model-for-medical-image-segmentation-using-tensorflow-large-model-support/> Performance of 3DUnet Multi GPU Model for Medical Image Segmentation using TensorFlow Large Model Support </a> – This blog post contains performance comparisons of whole system training using TFLMS with IBM Distributed Deep Learning and Horovod on x86 and IBM AC922 servers.

## MNIST Classification using TFLMS and Session

In this example we will train a CNN on MNIST dataset, using Tensorflow Large Model Support along with a Tensorflow Session. 

<b>Note: This example just demonstrates how to deploy LMS into Keras applications. The program runs as well without TFLMS. The model itself is not big enough to necessitate TFLMS.</b>

In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import tempfile

from tensorflow.examples.tutorials.mnist import input_data
# Enable Large Model Support
from tensorflow_large_model_support import LMS

import tensorflow as tf
import numpy

tf.logging.set_verbosity(tf.logging.INFO)
FLAGS = None

<br>
Load the MNIST dataset and separate into training set and testing set.


## MNIST Dataset Overview

This example is using MNIST handwritten digits. The dataset contains 60,000 examples for training and 10,000 examples for testing. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels) with values from 0 to 255. 

In this example, each image will be converted to float32 and normalized to [0, 1].

![MNIST Dataset](http://neuralnetworksanddeeplearning.com/images/mnist_100_digits.png)

More info: http://yann.lecun.com/exdb/mnist/

In [4]:
# Import data
mnist = input_data.read_data_sets(".")

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use urllib or similar directly.
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting ./train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting ./train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting ./t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting ./t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.



## CNN Overview

![CNN](http://personal.ie.cuhk.edu.hk/~ccloy/project_target_code/images/fig3.png)

In [5]:
def deepnn(x):
  """deepnn builds the graph for a deep net for classifying digits.

  Args:
    x: an input tensor with the dimensions (N_examples, 784), where 784 is the
    number of pixels in a standard MNIST image.

  Returns:
    A tuple (y, keep_prob). y is a tensor of shape (N_examples, 10), with values
    equal to the logits of classifying the digit into one of 10 classes (the
    digits 0-9). keep_prob is a scalar placeholder for the probability of
    dropout.
  """
  # Reshape to use within a convolutional neural net.
  # Last dimension is for "features" - there is only one here, since images are
  # grayscale -- it would be 3 for an RGB image, 4 for RGBA, etc.
  with tf.name_scope('reshape'):
    x_image = tf.reshape(x, [-1, 28, 28, 1])

  # First convolutional layer - maps one grayscale image to 32 feature maps.
  with tf.name_scope('conv1'):
    W_conv1 = weight_variable([5, 5, 1, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

  # Pooling layer - downsamples by 2X.
  with tf.name_scope('pool1'):
    h_pool1 = max_pool_2x2(h_conv1)

  # Second convolutional layer -- maps 32 feature maps to 64.
  with tf.name_scope('conv2'):
    W_conv2 = weight_variable([5, 5, 32, 64])
    b_conv2 = bias_variable([64])
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

  # Second pooling layer.
  with tf.name_scope('pool2'):
    h_pool2 = max_pool_2x2(h_conv2)

  # Fully connected layer 1 -- after 2 round of downsampling, our 28x28 image
  # is down to 7x7x64 feature maps -- maps this to 1024 features.
  with tf.name_scope('fc1'):
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  # Dropout - controls the complexity of the model, prevents co-adaptation of
  # features.
  with tf.name_scope('dropout'):
    keep_prob = tf.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

  # Map the 1024 features to 10 classes, one for each digit
  with tf.name_scope('fc2'):
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
  return y_conv, keep_prob


def conv2d(x, W):
  """conv2d returns a 2d convolution layer with full stride."""
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')


def max_pool_2x2(x):
  """max_pool_2x2 downsamples a feature map by 2X."""
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')


def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)


def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

LMS( ) is the Keras callback to activate Large Model Support. If we do not specify specific tuning parameters to LMS( ), the auto tuning will determine that TFLMS is not needed and disable it.

The computational graph is statically modified by TFLMS. Hence, it needs to be done before a session starts.

In [6]:
# Create the model
x = tf.placeholder(tf.float32, [None, 784])

# Define loss and optimizer
y_ = tf.placeholder(tf.int64, [None])

# Build the graph for the deep net
y_conv, keep_prob = deepnn(x)

with tf.name_scope('loss'):
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(
        labels=y_, logits=y_conv)
    cross_entropy = tf.reduce_mean(cross_entropy)

with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

with tf.name_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), y_)
    correct_prediction = tf.cast(correct_prediction, tf.float32)
accuracy = tf.reduce_mean(correct_prediction)

  
lms_model = LMS(swapout_threshold=50, swapin_ahead=3, swapin_groupby=2)
lms_model.excl_output_by_scopes = {'loss', 'accuracy', 'dropout'}
lms_model.run(tf.get_default_graph())

graph_location = tempfile.mkdtemp()
print('Saving graph to: %s' % graph_location)
train_writer = tf.summary.FileWriter(graph_location)
train_writer.add_graph(tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={
                x: batch[0], y_: batch[1], keep_prob: 1.0})
            print('step %d, training accuracy %g' % (i, train_accuracy))
            train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    # compute in batches to avoid OOM on GPUs
    accuracy_l = []
    for _ in range(20):
        batch = mnist.test.next_batch(500, shuffle=False)
        accuracy_l.append(accuracy.eval(feed_dict={x: batch[0],
                                                 y_: batch[1],
                                                 keep_prob: 1.0}))
print('test accuracy %g' % numpy.mean(accuracy_l))

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:[LMS][0] Editing model for LMS
INFO:tensorflow:[LMS][0] The graph has 377 vertices and 483 edges.
INFO:tensorflow:[LMS][0] The graph has 12.49 MiB of learning parameters
INFO:tensorflow:[LMS][0] The largest operation is adam_optimizer/Adam/update_fc1/Variable/ApplyAdam consuming 0.06 GiB
INFO:tensorflow:[LMS][0] Original categorized topological sort has 94 levels
INFO:tensorflow:[LMS][0] swapout_threshold: 50
INFO:tensorflow:[LMS][0] swapin_ahead: 3
INFO:tensorflow:[LMS][0] swapin_groupby: 2
INFO:tensorflow:[LMS][0] A histogram of distances was written to /home/ceron/UIUC/Exercises/LMS/.lms/tmpdv6dfv4a
INFO:tensorflow:[LMS][0] Added 13 operations to the model (6 swap-out operations (0.0 GiB) and  7 swap-in operations (0.0 GiB))
INFO:tensor

step 18800, training accuracy 0.88
step 18900, training accuracy 0.92
step 19000, training accuracy 0.92
step 19100, training accuracy 0.88
step 19200, training accuracy 0.94
step 19300, training accuracy 0.86
step 19400, training accuracy 0.9
step 19500, training accuracy 0.92
step 19600, training accuracy 0.92
step 19700, training accuracy 0.88
step 19800, training accuracy 0.94
step 19900, training accuracy 0.94
test accuracy 0.9146
