
# Custom and Distributed Training with TensorFlow - Week 1

An introduction to tensors and gradient tape.


## Tensor - introduction

A tensor has two main characteristics:

- Shape, e.g., shape = (2,)
- Data type, e.g., dtype = tf.int32
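
Both characteristics can be read directly from a tensor. A minimal sketch, where the name t is only an illustration:

```python
import tensorflow as tf

t = tf.constant([123, 456])  # illustrative tensor built from Python ints

print(t.shape)  # (2,)
print(t.dtype)  # <dtype: 'int32'>
```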

In [1]:
import tensorflow as tf

t1 = tf.Variable([[1,2],[3,4]], dtype = tf.float32) # mutable
t2 = tf.constant([123, 456]) # immutable

t3 = tf.Variable(1+2j, dtype = tf.complex64) # for complex numbers, use j instead of i

t4 = tf.constant([1,2,3,4], shape = (2,2))
t5 = tf.constant(1, shape = (3,2))
# t6 = tf.Variable([1,2,3,4], shape = (2,2)) # raises an error: unlike tf.constant, tf.Variable does not reshape the initial value, so its shape must already match

In [2]:
t1

<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[1., 2.],
       [3., 4.]], dtype=float32)>
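
The difference between mutable and immutable can be sketched as follows. The names v and c are illustrative; the point is that a variable can be updated in place, while a constant tensor offers no assign method.

```python
import tensorflow as tf

v = tf.Variable([[1.0, 2.0], [3.0, 4.0]])  # mutable
c = tf.constant([123, 456])                # immutable

v.assign([[5.0, 6.0], [7.0, 8.0]])  # overwrite the stored value in place
v.assign_add(tf.ones((2, 2)))       # add 1 to every element in place

print(v.numpy())
print(hasattr(c, "assign"))  # a constant tensor has no assign method
```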

Here we inspect the variables of a simple model. The model has two variables: the weight (kernel) and the bias.

Note: in Python, (1,) is a tuple with one element, while (1) is just the integer 1. A shape is a tuple, so a one-dimensional shape needs the trailing comma, as in the bias shape (1,) printed below.
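
The trailing-comma rule can be checked in plain Python (the variable names are illustrative):

```python
one_tuple = (1,)   # trailing comma: a tuple with one element
not_a_tuple = (1)  # parentheses only: just the integer 1

print(type(one_tuple).__name__)    # tuple
print(type(not_a_tuple).__name__)  # int
```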

In [3]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

seq_model = Sequential()
seq_model.add(Input(shape=(1,1), name="input"))
seq_model.add(Dense(1, activation='relu', name="dense"))

seq_model.summary()

seq_model.variables

[<Variable path=sequential/dense/kernel, shape=(1, 1), dtype=float32, value=[[0.07548249]]>,
 <Variable path=sequential/dense/bias, shape=(1,), dtype=float32, value=[0.]>]
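
To see where these shapes come from, here is a small sketch with a standalone Dense layer. The sizes (2 input features, 3 units) are assumptions chosen only so that the two dimensions are distinguishable: the kernel has shape (input features, units) and the bias has shape (units,).

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense

layer = Dense(3)             # 3 output units (illustrative)
_ = layer(tf.zeros((1, 2)))  # calling the layer on a batch with 2 features builds its variables

kernel_shape = tuple(layer.kernel.shape)  # (input features, units)
bias_shape = tuple(layer.bias.shape)      # (units,)

print(kernel_shape, bias_shape)
```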

## Tensor operations



In [4]:
tf.add([1,2,3], [4,5,6]) # element-wise addition

<tf.Tensor: shape=(3,), dtype=int32, numpy=array([5, 7, 9], dtype=int32)>

In [5]:
tf.square([7,3])

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([49,  9], dtype=int32)>

In [6]:
tf.reduce_sum([[1,2,3], [4,5,6]]) # the result has shape (), so it is a scalar

<tf.Tensor: shape=(), dtype=int32, numpy=21>

In [7]:
import numpy as np

x = tf.Variable(np.arange(24))
x

<tf.Variable 'Variable:0' shape=(24,) dtype=int64, numpy=
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])>

In [8]:
x = tf.reshape(x, (4,6))
x

<tf.Tensor: shape=(4, 6), dtype=int64, numpy=
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])>

In [9]:
x = tf.cast(x, tf.float32)
x

<tf.Tensor: shape=(4, 6), dtype=float32, numpy=
array([[ 0.,  1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10., 11.],
       [12., 13., 14., 15., 16., 17.],
       [18., 19., 20., 21., 22., 23.]], dtype=float32)>

## Two types of code execution

* Graph-based (via @tf.function): the operations are first traced into a graph, which is then executed as a whole (in TensorFlow 1.x, inside a session)

  * faster execution
  * used when saving models (like with SavedModel)
  * harder to debug (Python print runs only while the function is being traced, so use tf.print to print on every call)

* Eager-based (default since TensorFlow 2.x): all of the code is executed line by line
  * get actual values right away
  * broadcasting support (add or subtract tensors with different shapes)
  * NumPy compatibility (e.g., can use ** 2 for squaring)
  * operator overloading
  * slightly slower for large models or production because it lacks some graph optimizations
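
The two modes can be contrasted with a short sketch. The names a, b, trace_log and double are illustrative. In eager mode, values are available immediately. Inside a function wrapped with @tf.function, ordinary Python side effects run only while the graph is being traced, not on every call.

```python
import tensorflow as tf

a = tf.constant([[1, 2], [3, 4]])
b = tf.constant([10, 20])

# Eager mode: each line runs immediately and returns concrete values.
broadcast_sum = a + b  # b is broadcast across the rows of a
squared = a ** 2       # operator overloading, NumPy style
print(broadcast_sum.numpy())
print(squared.numpy())

trace_log = []  # illustrative list used to count how often the Python body runs

@tf.function
def double(x):
    trace_log.append("traced")  # Python side effect: runs only while tracing
    return x * 2

first = double(tf.constant(1.0))
second = double(tf.constant(2.0))  # same dtype and shape, so the traced graph is reused

print(len(trace_log))  # the Python body ran once for two calls
```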

When to use Eager Execution or Graph Execution?
* Prototyping / debugging: Eager
* Training simple models: Eager (or Graph)
* Deploying models: Graph
* Performance-critical code: Graph

## Gradient Tape

tf.GradientTape() is used to record operations so TensorFlow can compute gradients later.

Here is a simple example that uses gradient tape to update two variables.

In [10]:
import random

w = tf.Variable(random.random(), trainable=True)
b = tf.Variable(random.random(), trainable=True)

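LEARNING_RATE = 0.001

def fit_data(real_x, real_y):
  with tf.GradientTape(persistent=True) as tape:
    pred_y = w * real_x + b # predict from the input x, not from the label y
    reg_loss = tf.abs(real_y - pred_y) # absolute-error loss

  # persistent=True allows calling tape.gradient more than once
  w_gradient = tape.gradient(reg_loss, w)
  b_gradient = tape.gradient(reg_loss, b)
  del tape # drop the reference to the persistent tape when done

  w.assign_sub(w_gradient * LEARNING_RATE)
  b.assign_sub(b_gradient * LEARNING_RATE)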
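
A possible way to use such a function is sketched below. Everything about the data is an assumption made for illustration: the points lie on y = 2x + 1, both variables start at 0 so the run is reproducible, and the prediction is computed from the input x. The sketch only checks that the total absolute error decreases; it is not a tuned training loop.

```python
import tensorflow as tf

w = tf.Variable(0.0)  # fixed start values (assumed) instead of random ones
b = tf.Variable(0.0)
LEARNING_RATE = 0.001

def fit_data(real_x, real_y):
    with tf.GradientTape(persistent=True) as tape:
        pred_y = w * real_x + b
        reg_loss = tf.abs(real_y - pred_y)
    w_gradient = tape.gradient(reg_loss, w)
    b_gradient = tape.gradient(reg_loss, b)
    del tape
    w.assign_sub(w_gradient * LEARNING_RATE)
    b.assign_sub(b_gradient * LEARNING_RATE)

# Synthetic data on the line y = 2x + 1 (illustrative)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def total_loss():
    # Sum of absolute errors over the whole data set
    return sum(float(tf.abs(y - (w * x + b))) for x, y in zip(xs, ys))

loss_before = total_loss()
for _ in range(50):  # 50 passes over the data
    for x, y in zip(xs, ys):
        fit_data(x, y)
loss_after = total_loss()

print(loss_before, loss_after)
```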

Here is an example of computing a gradient with gradient tape, which is the core step of gradient descent:

In [11]:
w = tf.Variable([[3.0]])

with tf.GradientTape() as tape:
  loss = w * w
  # gradient of loss is 2 * w

tape.gradient(loss, w)

<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[6.]], dtype=float32)>
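
Repeating this gradient computation and subtracting a fraction of the gradient each time gives gradient descent. A minimal sketch, where the learning rate of 0.1 and the 20 steps are assumed values:

```python
import tensorflow as tf

w = tf.Variable([[3.0]])
LEARNING_RATE = 0.1  # assumed value for illustration

for step in range(20):
    with tf.GradientTape() as tape:
        loss = w * w
    grad = tape.gradient(loss, w)       # equals 2 * w
    w.assign_sub(LEARNING_RATE * grad)  # each step multiplies w by 0.8

print(w.numpy())  # w has moved from 3.0 toward the minimum at 0
```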

If a tensor is not a tf.Variable, one has to explicitly tell the tape to "watch" it.

In the following example,
* z = y² = (sum of x)² = (4)² = 16
* dz/dy = 2 * y = 8
* dy/dx = 1 for all elements
* So dz/dx = dz/dy * dy/dx = 8 * 1 = 8 at every position

In [12]:
x = tf.ones((2,2))

with tf.GradientTape() as tape:
  tape.watch(x)
  y = tf.reduce_sum(x)
  z = tf.square(y)

tape.gradient(z,x) #dz/dx

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[8., 8.],
       [8., 8.]], dtype=float32)>
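
For comparison, here is the same computation without the call to tape.watch. Since x is a plain tensor rather than a tf.Variable, the tape does not track it and the gradient comes back as None.

```python
import tensorflow as tf

x = tf.ones((2, 2))  # a plain tensor, not a tf.Variable

with tf.GradientTape() as tape:
    # no tape.watch(x) here, so operations on x are not recorded
    y = tf.reduce_sum(x)
    z = tf.square(y)

unwatched_gradient = tape.gradient(z, x)
print(unwatched_gradient)  # None
```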

With persistent = True, one can reuse the tape to compute multiple gradients from the same recorded operations. (Don't forget to drop it when done.)

In [13]:
x = tf.ones((2,2))

with tf.GradientTape(persistent = True) as tape:
  tape.watch(x)
  y = tf.reduce_sum(x)
  z = tf.square(y)

tape.gradient(z,x) #dz/dx


<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[8., 8.],
       [8., 8.]], dtype=float32)>

In [14]:
tape.gradient(y,x) #dy/dx

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1., 1.],
       [1., 1.]], dtype=float32)>

In [15]:
tape.gradient(z,y) #dz/dy
del tape # since we set persistent = True, drop the reference to the tape when done
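
To see why persistent = True is needed here, the sketch below repeats the computation with a default (non-persistent) tape. The first gradient call succeeds; a second call on the same tape raises a RuntimeError.

```python
import tensorflow as tf

x = tf.ones((2, 2))

with tf.GradientTape() as tape:  # non-persistent by default
    tape.watch(x)
    y = tf.reduce_sum(x)
    z = tf.square(y)

dz_dx = tape.gradient(z, x)  # the first call works

try:
    tape.gradient(y, x)      # the second call on the same tape fails
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

print(second_call_failed)  # True
```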

## Higher order gradients

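To compute higher-order gradients, we nest one gradient tape inside another. The first gradient is computed inside the outer tape's block, so the outer tape records that computation and can differentiate it again.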

In [16]:
x = tf.Variable(1.0)

with tf.GradientTape() as tape2:
  with tf.GradientTape() as tape1:
    y = x * x * x

  dy_dx = tape1.gradient(y,x)

d2y_dx2 = tape2.gradient(dy_dx,x)

print("dy_dx =", dy_dx.numpy())
print("d2y_dx2 =", d2y_dx2.numpy())

dy_dx = 3.0
d2y_dx2 = 6.0
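
The same nesting pattern works for other functions. As a further sketch, take y = x⁴ at x = 2 (an example chosen here, not from the cell above): the first derivative is 4x³ = 32 and the second derivative is 12x² = 48.

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        y = x * x * x * x
    # Computed inside the outer tape's block so that the outer tape records it
    dy_dx = inner_tape.gradient(y, x)    # 4 * x^3

d2y_dx2 = outer_tape.gradient(dy_dx, x)  # 12 * x^2

print("dy_dx =", dy_dx.numpy())      # 32.0
print("d2y_dx2 =", d2y_dx2.numpy())  # 48.0
```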
