# TensorFlow

TensorFlow is an open-source machine learning framework developed primarily by Google, with APIs available in a variety of languages.  We focus only on Python here, since that is the primary way TensorFlow is used on ALCF systems.  For help with other modes, please contact support@alcf.anl.gov.

The TensorFlow documentation is here:
https://www.tensorflow.org/

To get started with TensorFlow, import it:

In [1]:
import tensorflow as tf

2022-09-28 15:14:42.460282: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-28 15:14:44.450891: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## TensorFlow basics

### `Tensor`
TensorFlow uses `Tensor` objects as its core data type and supports a wide variety of operations on them.  This document is not meant to be a TensorFlow tutorial.  Instead, it covers the core concepts of using TensorFlow on Polaris, assuming you already have some familiarity with TensorFlow.

You can learn more about tensors in detail here:
https://www.tensorflow.org/guide/tensor
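
As a quick refresher, here is a minimal sketch of creating a tensor, inspecting it, and applying a few operations.  The values and variable names are chosen only for illustration.

```python
import tensorflow as tf

# Build a small constant tensor; the values are arbitrary.
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Every tensor carries a shape and a dtype.
print(matrix.shape)
print(matrix.dtype)

# Element-wise and linear-algebra operations return new tensors.
doubled = matrix * 2.0
product = tf.linalg.matmul(matrix, matrix)

# Tensors convert to NumPy arrays when the values are needed in Python.
print(product.numpy())
```

In eager mode, `.numpy()` returns the values as a regular NumPy array, which is convenient for debugging and plotting.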

### GPU Computing

TensorFlow can run a large set of mathematical operations on Tensors on the GPU.  The two cells below time the same batched matrix multiplication, first on the CPU and then on the GPU.



In [9]:
# CPU Computing:

with tf.device("CPU"):
    cpu_input_data = tf.random.uniform(shape=(2,5000,500))

    print(cpu_input_data.device)

    # This runs on the CPU.  %timeit does not keep the variables it assigns,
    # so the product is computed again before it is inspected:

    %timeit tf.linalg.matmul(cpu_input_data, cpu_input_data, transpose_a=True)

    product = tf.linalg.matmul(cpu_input_data, cpu_input_data, transpose_a=True)

    print(product.shape)
    print(product.device)

/job:localhost/replica:0/task:0/device:CPU:0
8.57 ms ± 58.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(2, 500, 500)
/job:localhost/replica:0/task:0/device:CPU:0


In [10]:
# GPU Computing:

with tf.device("GPU"):

    gpu_input_data = tf.random.uniform(shape=(2,5000,500))


    print(gpu_input_data.device)

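    # This runs on the GPU.  %timeit does not keep the variables it assigns,
    # so the product is computed again before it is inspected:

    %timeit tf.linalg.matmul(gpu_input_data, gpu_input_data, transpose_a=True)

    product = tf.linalg.matmul(gpu_input_data, gpu_input_data, transpose_a=True)

    print(product.shape)
    print(product.device)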


/job:localhost/replica:0/task:0/device:GPU:0
139 µs ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
(2, 500, 500)
/job:localhost/replica:0/task:0/device:GPU:0
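
The cells above assume that a GPU is visible to the process.  As a hedged sketch, you can check which devices TensorFlow sees and fall back to the CPU when no GPU is present.  The fallback logic and the names `device_name`, `data`, and `result` are illustrative, not part of the original notebook.

```python
import tensorflow as tf

# List the GPUs visible to this process; the list is empty on a CPU-only node.
gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {len(gpus)}")

# Pick a device string based on what is available (illustrative fallback).
device_name = "GPU:0" if gpus else "CPU:0"

with tf.device(device_name):
    data = tf.random.uniform(shape=(4, 8))
    result = tf.linalg.matmul(data, data, transpose_a=True)

# The device attribute records where the result was placed.
print(result.device)
```

On a Polaris compute node you would expect several GPUs to be listed.  If the list is empty, the environment or job configuration is the first thing to check.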


### Getting access to Data

We'll cover data pipelines more completely in a later presentation.  For now, we'll use the CIFAR-10 dataset, which is available through TensorFlow:

In [11]:
# # Trying to run this on a mac?  Try these lines if you get an SSL error
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()


Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


In [12]:
print(x_train.shape)
# This data has 50000 images, of size 32x32 pixels and 3 RGB colors

(50000, 32, 32, 3)


In [20]:
batch_size = 5000
batch_data   = tf.convert_to_tensor(x_train[0:batch_size], dtype=tf.float32) # Take the first batch_size images
batch_labels = tf.convert_to_tensor(y_train[0:batch_size], dtype=tf.float32) # and the matching labels

In [21]:
batch_labels.shape

TensorShape([5000, 1])
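
The batch above keeps the raw pixel values in the range 0 to 255.  A common preprocessing step, sketched here under assumptions, is to scale the pixels to [0, 1] and keep the labels as integers.  To avoid a download, the sketch uses random arrays with the CIFAR-10 layout as a stand-in for the real data, and `tf.cast` in place of `tf.convert_to_tensor`.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for CIFAR-10 (assumption: same layout, random values).
rng = np.random.default_rng(seed=0)
images = rng.integers(0, 256, size=(64, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 10, size=(64, 1), dtype=np.int64)

batch_size = 16

# Cast to float32 and scale pixel values from [0, 255] to [0, 1].
batch_data = tf.cast(images[:batch_size], tf.float32) / 255.0

# Class labels stay as integers.
batch_labels = tf.convert_to_tensor(labels[:batch_size])

print(batch_data.shape, batch_data.dtype)
print(batch_labels.shape, batch_labels.dtype)
```

Whether scaling helps depends on the model, so treat it as an optional step rather than a requirement of the examples that follow.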

### Machine Learning Models

TensorFlow is developed primarily as a machine learning framework, so many operations such as convolutions and dense layers are well supported.

The easiest way to build a model is to use the Keras API for object-oriented model construction.  For example, a few layers of a ResNet-like model can be built like so:

In [22]:
class ResidualBlock(tf.keras.Model):

    def __init__(self):
        # Call the parent class's __init__ to make this class functional with training loops:
        super().__init__()
        self.conv1  = tf.keras.layers.Conv2D(filters=16, kernel_size=[3,3], padding="same")
        self.conv2  = tf.keras.layers.Conv2D(filters=16, kernel_size=[3,3], padding="same")

    def call(self, inputs):
    
        # Apply the first weights + activation:
        outputs = tf.keras.activations.relu(self.conv1(inputs))
        # Apply the second weights:

        outputs = self.conv2(outputs)

        # Perform the residual step:

        outputs = outputs + inputs

        # Second activation layer:
        return tf.keras.activations.relu(outputs)



In [23]:
class MyModel(tf.keras.Model):
    
    def __init__(self):
        # Call the parent class's __init__ to make this class functional with training loops:
        super().__init__()
        
        self.conv_init = tf.keras.layers.Conv2D(filters=16, kernel_size=1)
        
        self.res1 = ResidualBlock()
        
        self.res2 = ResidualBlock()
        
        # 10 filters, one for each class:
        self.conv_final = tf.keras.layers.Conv2D(filters=10, kernel_size=1)
        
        self.pool = tf.keras.layers.GlobalAveragePooling2D()
        
    def call(self, inputs):
        
        x = self.conv_init(inputs)
        
        x = self.res1(x)
        
        x = self.res2(x)
        
        x = self.conv_final(x)
        
        return self.pool(x)

In [24]:
# Create a model:
model = MyModel()

In [25]:

output_data = model(batch_data)
print(output_data.shape)

(5000, 10)
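
The residual sum `outputs + inputs` only works when both tensors have the same shape, which is why the model first widens the 3 input channels to 16.  The following sketch illustrates this with a small random batch.  The layer names `widen` and `conv` are hypothetical and the input values are synthetic.

```python
import tensorflow as tf

# Small random batch with the CIFAR-10 image layout (synthetic values).
images = tf.random.uniform(shape=(4, 32, 32, 3))

# A 1x1 convolution changes only the channel count: 3 -> 16.
widen = tf.keras.layers.Conv2D(filters=16, kernel_size=1)

# A 3x3 convolution with padding="same" keeps height and width unchanged.
conv = tf.keras.layers.Conv2D(filters=16, kernel_size=[3, 3], padding="same")

features = widen(images)

# Both operands have 16 channels, so the residual sum is well defined.
residual = conv(features) + features

print(features.shape)
print(residual.shape)
```

Adding `conv(images) + images` directly would fail, because the input has 3 channels while the convolution output has 16.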


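You can easily print a summary of your network's layers and parameter counts with Keras:

In [26]:
# summary() prints the table itself, so there is no need to wrap it in print():
model.summary()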

Model: "my_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_6 (Conv2D)           multiple                  64        
                                                                 
 residual_block_2 (ResidualB  multiple                 4640      
 lock)                                                           
                                                                 
 residual_block_3 (ResidualB  multiple                 4640      
 lock)                                                           
                                                                 
 conv2d_11 (Conv2D)          multiple                  170       
                                                                 
 global_average_pooling2d_1   multiple                 0         
 (GlobalAveragePooling2D)                                        
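
The parameter counts in the summary can be checked by hand.  A `Conv2D` layer with a bias has `kernel_height * kernel_width * in_channels * filters + filters` parameters.  The helper `conv2d_params` below is a hypothetical name used only for this check.

```python
def conv2d_params(kernel_size, in_channels, filters):
    # One square kernel of depth in_channels per filter, plus one bias per filter.
    return kernel_size * kernel_size * in_channels * filters + filters

# conv_init: 1x1 kernel, 3 input channels (RGB), 16 filters.
conv_init = conv2d_params(kernel_size=1, in_channels=3, filters=16)

# Each ResidualBlock holds two 3x3 convolutions with 16 channels in and out.
residual_block = 2 * conv2d_params(kernel_size=3, in_channels=16, filters=16)

# conv_final: 1x1 kernel, 16 input channels, 10 filters (one per class).
conv_final = conv2d_params(kernel_size=1, in_channels=16, filters=10)

total = conv_init + 2 * residual_block + conv_final
print(conv_init, residual_block, conv_final, total)
```

The three per-layer values match the `Param #` column above (64, 4640, and 170), and the pooling layer contributes no parameters.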
                                                        

## Using TensorFlow on Polaris

Tensorflow supports both GPU 

## Automatic Differentiation

A major advantage of machine learning frameworks is automatic differentiation.  TensorFlow supports automatic differentiation through the `GradientTape` API:

In [27]:
# The model returns raw logits (no softmax), so the loss is told to expect logits:
loss_function = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


In [28]:

with tf.GradientTape(persistent=True) as tape:
    logits = model(batch_data)
    loss = loss_function(batch_labels, logits)

Get the "normal" derivatives (with respect to parameters) with the tape:

In [29]:
grads = tape.gradient(loss, model.trainable_variables)
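
Gradients with respect to parameters are what an optimizer consumes.  As a minimal sketch, here is one step of plain gradient descent on a toy problem with a single parameter, written out by hand instead of using a Keras optimizer.  The variable names and values are illustrative only.

```python
import tensorflow as tf

# One trainable parameter and a single data point (toy values).
w = tf.Variable(0.0)
x = tf.constant(1.0)
y = tf.constant(2.0)
learning_rate = 0.1

with tf.GradientTape() as tape:
    loss = (w * x - y) ** 2

# Analytically, d(loss)/dw = 2 * (w * x - y) * x, which is -4 at w = 0.
grad = tape.gradient(loss, w)

# Plain gradient descent: w <- w - learning_rate * grad.
w.assign_sub(learning_rate * grad)

print(grad.numpy(), w.numpy())
```

Trainable variables such as `w` are tracked by the tape automatically, which is why no `watch` call is needed here.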

You can also take gradients with respect to tensors that are not trainable variables, such as the input data, by asking the tape to `watch` them:

In [30]:

with tf.GradientTape(persistent=True) as tape:
    tape.watch(batch_data)
    logits = model(batch_data)
    loss = loss_function(batch_labels, logits)

In [31]:
input_grads = tape.gradient(loss, batch_data)

In [33]:
input_grads.shape

TensorShape([5000, 32, 32, 3])
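
To see what `watch` changes, here is a small sketch with a function whose derivative is known: for `y = sum(x**2)`, the gradient is `2 * x`.  The tensor values are arbitrary.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])

with tf.GradientTape() as tape:
    # Constants are not tracked automatically, so watch them explicitly.
    tape.watch(x)
    y = tf.reduce_sum(x ** 2)

# dy/dx = 2 * x
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())

# Without watch, the tape has no record of x and the gradient is None.
with tf.GradientTape() as unwatched_tape:
    y_unwatched = tf.reduce_sum(x ** 2)

unwatched_grad = unwatched_tape.gradient(y_unwatched, x)
print(unwatched_grad)
```

A `None` gradient is a common sign that the tensor was never watched, or that it is not connected to the loss.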

## TF Functions

Often, TensorFlow code runs faster when you compile it into a graph with `tf.function`:

In [64]:
def gradient_step():
    with tf.GradientTape() as tape:
        logits = model(batch_data)
        loss = loss_function(batch_labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    return grads

In [65]:
%timeit gradient_step()

49.7 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [53]:
gradient_step_traced = tf.function(gradient_step)

In [57]:
%timeit gradient_step_traced()

47.5 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [58]:
%timeit gradient_step_traced()

47.5 ms ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
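
When timing a `tf.function`, keep in mind that the first call traces the Python code and builds the graph, and later calls with the same input signature reuse it.  The sketch below makes tracing visible through a Python side effect.  The function `scaled_sum` and the list `trace_log` are hypothetical names used for illustration.

```python
import tensorflow as tf

# Python side effects run only while TensorFlow traces the function.
trace_log = []

@tf.function
def scaled_sum(values):
    trace_log.append("traced")
    return tf.reduce_sum(values) * 2.0

first = scaled_sum(tf.constant([1.0, 2.0, 3.0]))   # traces and builds a graph
second = scaled_sum(tf.constant([4.0, 5.0, 6.0]))  # same shape and dtype: graph is reused
traces_after_two_calls = len(trace_log)

third = scaled_sum(tf.constant([1.0, 2.0]))        # new shape: traced again
traces_after_three_calls = len(trace_log)

print(first.numpy(), second.numpy(), third.numpy())
print(traces_after_two_calls, traces_after_three_calls)
```

Frequent retracing can cancel out the benefit of graph compilation, so it helps to keep input shapes and dtypes stable across calls.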


Often, you can get further improvements by compiling the function with XLA (`jit_compile=True`):


In [55]:
gradient_step_XLA = tf.function(gradient_step, jit_compile=True)

In [56]:
%timeit gradient_step_XLA()

554 µs ± 80.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Reduced Precision

A100 GPUs support faster matrix operations with mixed precision, which you can enable in TensorFlow:

In [59]:
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')


INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0



In [60]:
gradient_step_XLA = tf.function(gradient_step, jit_compile=True)

In [63]:
%timeit gradient_step_XLA()

36.3 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
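
To see what the `mixed_float16` policy does, the following sketch applies it to a single layer instead of setting it globally.  Computations run in float16 while the layer's variables stay in float32.  The layer size and input shape are arbitrary.

```python
import tensorflow as tf

# Build a policy object without changing the global setting.
policy = tf.keras.mixed_precision.Policy("mixed_float16")
print(policy.compute_dtype, policy.variable_dtype)

# Passing the policy as dtype applies it to this layer only.
layer = tf.keras.layers.Dense(4, dtype=policy)
outputs = layer(tf.random.uniform(shape=(2, 8)))

# Outputs are computed in float16; the weights are stored in float32.
print(outputs.dtype)
print(tf.as_dtype(layer.kernel.dtype))
```

Keras layers pick up the dtype policy when they are constructed, so a model built before the global policy is changed may need to be re-created for the new policy to take effect.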
