# TensorFlow

[TensorFlow](https://www.tensorflow.org/) is an open-source machine learning framework developed primarily by Google and released for a variety of languages.  This guide focuses only on Python, since that is the primary way TensorFlow is used on ALCF systems.  For help with other usage modes, please contact support@alcf.anl.gov.

The TensorFlow documentation is here:
https://www.tensorflow.org/

To get started with TensorFlow, import it:

In [None]:
import tensorflow as tf

## TensorFlow basics

### `Tensor`
TensorFlow's core data type is the `Tensor`, and it supports a variety of operations on tensors.  This document is not meant to be a [TensorFlow tutorial](https://www.tensorflow.org/tutorials/) - instead, it is meant to introduce the core concepts of using TensorFlow on Polaris, assuming you already have some familiarity with TensorFlow.

You can learn more about tensors in detail here:
https://www.tensorflow.org/guide/tensor

### GPU Computing

TensorFlow supports GPU operations for a large set of mathematical operations on Tensors.
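
Before comparing devices, it can help to confirm what TensorFlow actually sees on the node. The following is a minimal sketch using `tf.config.list_physical_devices`; on a machine without a GPU the list is simply empty and tensors are placed on the CPU.

```python
import tensorflow as tf

# Ask TensorFlow which accelerators are visible to this process.
gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {len(gpus)}")

# Without an explicit tf.device scope, TensorFlow chooses the placement itself
# (a GPU when one is visible, otherwise the CPU).
x = tf.ones((2, 3))
print(x.device)
```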



In [None]:
# CPU Computing:

with tf.device("CPU"):
    cpu_input_data = tf.random.uniform(shape=(2,5000,500))

    print("input data location: ")
    print(cpu_input_data.device)
    print()
    
    # This runs on the CPU:
    product = tf.linalg.matmul(cpu_input_data, cpu_input_data, transpose_a=True)
    print("output data location: ")
    print(product.device)
    print()
    
    print(cpu_input_data.shape)
    print(product.shape)
    print()
    
    # Time the operation
    del product
    %timeit product = tf.linalg.matmul(cpu_input_data, cpu_input_data, transpose_a=True)

In [None]:
# GPU Computing:

with tf.device("GPU"):
    gpu_input_data = tf.random.uniform(shape=(2,5000,500))

    print("input data location: ")
    print(gpu_input_data.device)
    print()

    # This runs on the GPU:
    product = tf.linalg.matmul(gpu_input_data, gpu_input_data, transpose_a=True)
    print("output data location: ")
    print(product.device)
    print()
    
    print(gpu_input_data.shape)
    print(product.shape)
    print()
    
    # Time the operation
    del product
    %timeit product = tf.linalg.matmul(gpu_input_data, gpu_input_data, transpose_a=True)
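
The shapes printed in the cells above follow from how `tf.linalg.matmul` treats batched inputs: the leading dimension is a batch index, and `transpose_a=True` transposes only the last two dimensions of the first argument. Here is a scaled-down sketch; the smaller shape is chosen only to keep it quick.

```python
import numpy as np
import tensorflow as tf

# Same layout as the cells above, but much smaller: (batch, rows, cols).
a = tf.random.uniform(shape=(2, 50, 5))

# transpose_a=True computes a^T @ a independently for each batch entry.
gram = tf.linalg.matmul(a, a, transpose_a=True)

# The equivalent explicit form: swap the last two axes, then multiply.
explicit = tf.linalg.matmul(tf.transpose(a, perm=[0, 2, 1]), a)

print(gram.shape)  # a batch of two 5x5 matrices
print(np.allclose(gram.numpy(), explicit.numpy(), atol=1e-4))
```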

### Getting access to Data

Data pipelines are covered more completely in a later presentation.  For now, we'll use the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), available from TensorFlow. The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

In [None]:
# # Trying to run this on a mac?  Try these lines if you get an SSL error
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# This data has 50000 images, of size 32x32 pixels and 3 RGB colors
print(x_train.shape)

# with 50000 labels from ten different classes (integers ranging from 0-9)
print(y_train.shape)
print(y_train[:5,0])

In [None]:
# The training data needs to be converted from numpy arrays to tensor objects
print(type(x_train))
print(x_train.dtype)

print(type(y_train))
print(y_train.dtype) 

In [None]:
# We'll start with a batch of 10% of the data
batch_size   = 5000
batch_data   = tf.convert_to_tensor(x_train[0:batch_size], dtype=tf.float32) # Take the first 10% of images
batch_labels = tf.convert_to_tensor(y_train[0:batch_size], dtype=tf.float32) # first 10% of labels

In [None]:
print(batch_data.shape)
print(batch_labels.shape)
print()

# We explicitly converted from uint8 to float32 so no surprises show up with mathematical operations later on
print(batch_data.dtype)
print(batch_labels.dtype)
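
One concrete surprise the conversion avoids is that arithmetic on `uint8` arrays wraps around at 256. The sketch below uses made-up pixel values (not taken from CIFAR-10) to show the difference.

```python
import numpy as np
import tensorflow as tf

# Two made-up pixel intensities stored as uint8, like the raw image data.
pixels_a = np.array([200], dtype=np.uint8)
pixels_b = np.array([100], dtype=np.uint8)

# uint8 addition wraps modulo 256, so 200 + 100 gives 44 instead of 300.
wrapped = pixels_a + pixels_b

# After conversion to float32 the sum behaves as expected.
float_a = tf.convert_to_tensor(pixels_a, dtype=tf.float32)
float_b = tf.convert_to_tensor(pixels_b, dtype=tf.float32)
float_sum = float_a + float_b

print(wrapped, float_sum.numpy())
```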

### Machine Learning Models

TensorFlow is developed primarily as a machine learning framework, so common building blocks such as convolutions and dense layers are well supported.

The easiest way to build a model is to use the [Keras API](https://keras.io/) for object-oriented model construction.  For example, building a few layers of a [ResNet](https://doi.org/10.48550/arXiv.1512.03385)-like model can be done like so:

In [None]:
class ResidualBlock(tf.keras.Model):

    def __init__(self):
        # Call the parent class's __init__ to make this class functional with training loops:
        super().__init__()
        self.conv1  = tf.keras.layers.Conv2D(filters=16, kernel_size=[3,3], padding="same")
        self.conv2  = tf.keras.layers.Conv2D(filters=16, kernel_size=[3,3], padding="same")

    def call(self, inputs):
    
        # Apply the first weights + activation:
        outputs = tf.keras.activations.relu(self.conv1(inputs))
        
        # Apply the second weights:
        outputs = self.conv2(outputs)

        # Perform the residual step:
        outputs = outputs + inputs

        # Second activation layer:
        return tf.keras.activations.relu(outputs)
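
The residual step `outputs + inputs` only works when both tensors have the same shape, which is why the block uses `padding="same"` and keeps the filter count at 16. The sketch below shows the difference on a random input; its shape is an assumption for illustration.

```python
import tensorflow as tf

# A hypothetical activation map: 4 images, 32x32 pixels, 16 channels.
x = tf.random.uniform((4, 32, 32, 16))

same = tf.keras.layers.Conv2D(filters=16, kernel_size=[3, 3], padding="same")(x)
valid = tf.keras.layers.Conv2D(filters=16, kernel_size=[3, 3], padding="valid")(x)

print(same.shape)   # spatial size preserved, so same + x is well defined
print(valid.shape)  # each spatial dimension shrinks by 2 with a 3x3 kernel

# The residual addition requires matching shapes.
residual = same + x
```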



In [None]:
class MyModel(tf.keras.Model):
    
    def __init__(self):
        # Call the parent class's __init__ to make this class functional with training loops:
        super().__init__()
        
        self.conv_init = tf.keras.layers.Conv2D(filters=16, kernel_size=1)
        
        self.res1 = ResidualBlock()
        
        self.res2 = ResidualBlock()
        
        # 10 filters, one for each possible label (classification):
        self.conv_final = tf.keras.layers.Conv2D(filters=10, kernel_size=1)
        
        self.pool = tf.keras.layers.GlobalAveragePooling2D()
        
    def call(self, inputs):
        
        x = self.conv_init(inputs)
        
        x = self.res1(x)
        
        x = self.res2(x)
        
        x = self.conv_final(x)
        
        return self.pool(x)
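
The final `GlobalAveragePooling2D` layer is what turns the per-pixel, 10-channel output of `conv_final` into one score per class for each image. The sketch below runs it on a random tensor, an assumed stand-in for the real activations.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the conv_final output: 4 images, 32x32 pixels, 10 channels.
x = tf.random.uniform((4, 32, 32, 10))

# Average over the two spatial axes, leaving (batch, channels).
pooled = tf.keras.layers.GlobalAveragePooling2D()(x)

# The same reduction written out by hand.
manual = tf.reduce_mean(x, axis=[1, 2])

print(pooled.shape)
print(np.allclose(pooled.numpy(), manual.numpy(), atol=1e-5))
```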

In [None]:
# Create a model:
model = MyModel()
model.build(batch_data.shape)

You can visualize your networks easily with Keras:

In [None]:
model.summary()  # summary() prints the table itself and returns None

## Automatic Differentiation

A major advantage of machine learning frameworks is automatic differentiation.  TensorFlow supports automatic differentiation through the `GradientTape` API:

In [None]:
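# We begin by defining a loss function (see https://www.tensorflow.org/api_docs/python/tf/keras/losses for more built-in loss functions)
# The model outputs raw logits (it has no softmax layer), so the loss must apply the softmax itself
loss_function = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)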

In [None]:
with tf.GradientTape(persistent=True) as tape:
    logits = model(batch_data)
    loss = loss_function(batch_labels, logits)
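
The model defined earlier returns raw scores (logits) because it has no softmax layer, and a cross-entropy loss needs to be told which kind of input it receives. The sketch below uses made-up logits and labels to show that `from_logits=True` on raw scores matches the default setting applied to softmax probabilities.

```python
import tensorflow as tf

# Made-up scores for two samples and three classes, with integer labels.
logits = tf.constant([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = tf.constant([0, 2])

# Option 1: hand the raw scores to the loss and let it apply the softmax.
loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
    labels, logits
)

# Option 2: apply the softmax first and use the default (from_logits=False).
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy()(
    labels, tf.nn.softmax(logits)
)

print(float(loss_from_logits), float(loss_from_probs))
```

Passing logits to a loss that expects probabilities does not raise an error, so it is worth checking this setting explicitly.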

Get the "normal" derivatives (with respect to parameters) with the tape:

In [None]:
grads = tape.gradient(loss, model.trainable_variables)
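
For intuition, the same mechanism can be checked on a function with a known derivative. This toy example (not part of the model above) also shows what `persistent=True` is for: it allows more than one `gradient` call on the same tape.

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Operations on trainable variables are recorded automatically.
with tf.GradientTape(persistent=True) as tape:
    y = x * x  # y = x^2
    z = y * x  # z = x^3

dy_dx = tape.gradient(y, x)  # 2x   = 6 at x = 3
dz_dx = tape.gradient(z, x)  # 3x^2 = 27 at x = 3
del tape  # release the resources held by the persistent tape

print(dy_dx.numpy(), dz_dx.numpy())
```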

You can also get gradients of other components by asking the tape to `watch` tensors:

In [None]:
# Track gradients of the input data
with tf.GradientTape(persistent=True) as tape:
    tape.watch(batch_data)
    logits = model(batch_data)
    loss = loss_function(batch_labels, logits)

In [None]:
input_grads = tape.gradient(loss, batch_data)

In [None]:
input_grads.shape
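
The `watch` call matters because a tape tracks trainable variables automatically but ignores ordinary tensors. Here is a minimal sketch with a scalar constant.

```python
import tensorflow as tf

x = tf.constant(2.0)  # a plain tensor, not a tf.Variable

# Without watch(), the tape records nothing about x.
with tf.GradientTape() as tape:
    y = x ** 3
unwatched = tape.gradient(y, x)

# With watch(), operations on x are recorded and can be differentiated.
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x ** 3
watched = tape.gradient(y, x)  # 3x^2 = 12 at x = 2

print(unwatched, watched.numpy())
```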

## TF Functions

Often, TensorFlow code can run faster when you [graph compile](https://www.tensorflow.org/guide/intro_to_graphs) it:

In [None]:
def gradient_step():
    with tf.GradientTape() as tape:
        logits = model(batch_data)
        loss = loss_function(batch_labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    return grads

In [None]:
# Get a baseline time for the execution of the gradient step
%timeit gradient_step()

In [None]:
# Graph compile the gradient step
gradient_step_traced = tf.function(gradient_step)

In [None]:
# Get the time to run the graph compiled code
%timeit gradient_step_traced()
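
The speedup comes from tracing: `tf.function` runs the Python body once per input signature to build a graph, then reuses that graph on later calls. One way to see this is to record a Python side effect, which only happens during tracing. The function and list names below are made up for illustration.

```python
import tensorflow as tf

trace_log = []  # filled by a Python side effect, so only while tracing


@tf.function
def scaled_sum(x):
    trace_log.append(x.shape)  # runs during tracing, not on graph execution
    return tf.reduce_sum(x) * 2.0


a = scaled_sum(tf.ones((4,)))        # first call: traces
b = scaled_sum(tf.ones((4,)) * 3.0)  # same shape and dtype: reuses the graph
c = scaled_sum(tf.ones((2, 2)))      # new shape: traces again

print(len(trace_log))
print(a.numpy(), b.numpy(), c.numpy())
```

This is also why the first call to a traced function is slower than the following ones, and why frequently changing input shapes can erase the benefit.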

Further improvements can be found [with XLA (Accelerated Linear Algebra)](https://www.tensorflow.org/xla):

In [None]:
# Graph compile with XLA
gradient_step_XLA = tf.function(gradient_step, jit_compile=True)

In [None]:
# Get the time to run the graph compiled code with XLA optimizations
%timeit gradient_step_XLA()

### Reduced Precision

NVIDIA A100 GPUs support faster matrix operations with [mixed precision](https://www.tensorflow.org/guide/mixed_precision), which can be enabled in TensorFlow:

In [None]:
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
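
Under the `mixed_float16` policy, layers compute in `float16` while keeping their variables in `float32`. Layers pick up their policy when they are constructed, so the sketch below passes the policy name to a single `Dense` layer (chosen only for illustration) instead of changing the global setting.

```python
import tensorflow as tf

# Give one layer the mixed policy directly; the global policy is untouched.
layer = tf.keras.layers.Dense(4, dtype="mixed_float16")
y = layer(tf.ones((2, 8)))  # float32 input is cast to float16 by the layer

print(layer.compute_dtype)   # dtype used for the layer's computations
print(layer.variable_dtype)  # dtype used to store the weights
print(y.dtype)
```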


In [None]:
# Layers read the policy when they are constructed, so rebuild the model before recompiling
model = MyModel()
model.build(batch_data.shape)
gradient_step_XLA = tf.function(gradient_step, jit_compile=True)

In [None]:
# Redo the timing of the graph compiled code with XLA optimizations
%timeit gradient_step_XLA()