# Chapter 12 - Custom Models and Training with TensorFlow

<center><img src="img/tfAPI.png"></img></center>
<center><img src="img/tfAPI2.png"></img></center>

### Using TensorFlow like NumPy

In [2]:
import tensorflow as tf
# tensorflow operations
m = tf.constant([[1., 2., 3.], [4., 5., 6.]]) # matrix
m


<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [3]:
c = tf.constant(42)
c

<tf.Tensor: shape=(), dtype=int32, numpy=42>

In [6]:
print(m.shape)
print(m.dtype)

(2, 3)
<dtype: 'float32'>


In [8]:
# Indexing
m[:, 1:]

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 3.],
       [5., 6.]], dtype=float32)>

In [14]:
m[..., 1, tf.newaxis]

<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[2.],
       [5.]], dtype=float32)>

In [15]:
m + 10

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[11., 12., 13.],
       [14., 15., 16.]], dtype=float32)>

In [16]:
tf.square(m)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)>

In [19]:
# @ is the matrix multiplication operator, equivalent to tf.matmul() 
m @ tf.transpose(m)

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

The Keras low-level API (keras.backend) is limited: it offers only a few operations, so we stick with TensorFlow's own operations.

In [2]:
from tensorflow import keras
K = keras.backend
K.square(K.transpose(m)) + 10

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[11., 26.],
       [14., 35.],
       [19., 46.]], dtype=float32)>

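### Tensors and NumPy
Tensors and NumPy arrays work well together: we can create a tensor from a NumPy array and vice versa, and apply TensorFlow operations to NumPy arrays or NumPy operations to tensors.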

In [3]:
import numpy as np
a = np.array([2., 4., 5.])
tf.constant(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([2., 4., 5.])>

In [25]:
m.numpy()

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

In [26]:
# NumPy uses 64-bit floats by default; 32-bit precision is generally enough for
# neural networks and uses less RAM, so set dtype=tf.float32
a2 = tf.constant(a, dtype=tf.float32)
a2

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([2., 4., 5.], dtype=float32)>

### Type conversions
TensorFlow does not perform any type conversions automatically; operations on tensors of incompatible types raise an exception.

In [27]:
tf.constant(2.) + tf.constant(40)

InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:AddV2]

In [28]:
# Not even between floats of different precision (float32 and float64)
tf.constant(2.) + tf.constant(40., dtype=tf.float64)

InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a double tensor [Op:AddV2]

In [29]:
# Using cast
t2 = tf.constant(40., dtype=tf.float64)
tf.constant(2.0) + tf.cast(t2, tf.float32)

<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

### Variables
A _tf.Tensor_ is immutable, so tensors cannot hold the weights of a DNN or any other parameter that must change during training; we use tf.Variable for that. In practice we will rarely create variables manually, since Keras provides an add_weight() method and the optimizers update the model parameters for us.

In [33]:
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [34]:
# using assign methods with tf.Variable
print(v.assign(2 * v))
print('\n', v[0, 1].assign(42))
print('\n', v[:, 2].assign([0., 1.]))
print('\n', v.scatter_nd_update(indices=[[0,0], [1, 2]], updates=[100., 200.]))

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

 <tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

 <tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>

 <tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>


### Other Data Structures
- _Sparse tensors_ (tf.SparseTensor): Efficiently represent tensors containing mostly zeros. The tf.sparse package contains operations for sparse tensors.
- _Tensor arrays_ (tf.TensorArray): Lists of tensors, fixed size by default but can be made dynamic. All tensors must have the same shape and data type.
- _Ragged tensors_ (tf.RaggedTensor): Lists of tensors that share the same rank and data type but can differ in size. The tf.ragged package contains operations for ragged tensors.
- _String tensors_: Regular tensors of type tf.string. They represent byte strings, not Unicode strings; a Python Unicode string is encoded to UTF-8 automatically. Another option is to use tensors of type tf.int32 where each item represents a Unicode code point.
- _Sets_: Represented as regular tensors (or sparse tensors). tf.constant([[1, 2], [3, 4]]) represents the 2 sets {1, 2} and {3, 4}.
- _Queues_: Store tensors across multiple steps. There are several kinds: First In, First Out queues (FIFOQueue), queues that can prioritize some items (PriorityQueue), shuffle their items (RandomShuffleQueue), or batch items of different shapes by padding (PaddingFIFOQueue). These classes are in the tf.queue package.
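
A minimal sketch of three of these structures, assuming TensorFlow 2.x running eagerly. The sample values and variable names are made up for illustration only.

```python
import tensorflow as tf

# Sparse tensor: only the non-zero entries are stored (indices must be in row-major order).
s = tf.SparseTensor(indices=[[0, 1], [1, 0], [2, 3]],
                    values=[1., 2., 3.],
                    dense_shape=[3, 4])
dense = tf.sparse.to_dense(s)

# Ragged tensor: the rows have different lengths.
r = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])
row_lengths = r.row_lengths()

# String tensor: the Python string is stored as UTF-8 bytes.
b = tf.constant("café")
n_bytes = tf.strings.length(b)                    # length in bytes
n_chars = tf.strings.length(b, unit="UTF8_CHAR")  # length in characters
code_points = tf.strings.unicode_decode(b, "UTF-8")  # int32 Unicode code points
```

The byte length and the character length differ here because "é" takes two bytes in UTF-8.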

Now we are ready to start customizing models and training algorithms!

### Custom Loss Function
Suppose we want to train a regression model: the dataset has been cleaned and outliers removed, but it is still noisy. The MSE penalizes large errors too much and makes the model imprecise. The MAE would not penalize outliers as much, but training would converge more slowly and the trained model might not be very precise. This is a good situation for the __Huber loss__. It is available in TF (keras.losses.Huber class), but let's pretend it is not there and implement it:

In [None]:
def huber_fn(y_true, y_pred):
    '''The Huber loss is quadratic when the error is smaller than
    a threshold (here 1) but linear when the error is larger
    than that threshold. The linear part makes it less sensitive to
    outliers than the MSE, and the quadratic part allows faster
    convergence than the MAE.'''
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)
# Always use a vectorized implementation. It is also preferable to
# return a tensor with one loss per instance rather than the mean

In [None]:
model.compile(loss=huber_fn, optimizer="nadam")
model.fit(X_train, y_train, ...)
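
To make the two branches concrete, here is a plain-Python sketch for a single instance. `huber_scalar` is a hypothetical helper written for this illustration, not part of TensorFlow or Keras, and it takes the threshold as a parameter (default 1, as in `huber_fn`).

```python
def huber_scalar(y_true, y_pred, threshold=1.0):
    """Huber loss for one instance in plain Python (illustrative helper)."""
    error = abs(y_true - y_pred)
    if error < threshold:
        return error ** 2 / 2                       # quadratic branch
    return threshold * error - threshold ** 2 / 2   # linear branch

# Compare the Huber loss with half the squared error for growing errors
for err in (0.5, 1.0, 3.0, 10.0):
    print(err, huber_scalar(0.0, err), err ** 2 / 2)
```

For an error of 10 the Huber loss is 9.5 while half the squared error is 50, which is why a few outliers pull much less on the model. At the threshold itself both branches give 0.5, so the loss is continuous.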

### Saving and Loading Models that contain Custom Components
Keras saves only the name of the function, so when loading the model we must provide a dictionary that maps the function name to the actual function.

In [None]:
model = keras.models.load_model("my_model_with_custom_loss.h5", 
                                custom_objects={"huber_fn": huber_fn})

In [None]:
# custom threshold
def create_huber(threshold=1.0):
    def huber_fn(y_true, y_pred):
        '''Huber loss with a configurable threshold: quadratic when the
        error is smaller than the threshold, linear when it is larger.'''
        error = y_true - y_pred
        is_small_error = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    # return the configured loss function (without this, create_huber returns None)
    return huber_fn

In [None]:
model.compile(loss=create_huber(2.0), optimizer="nadam")
model.fit(X_train, y_train, ...)
# the threshold is not saved with the model, so specify it again when loading
model = keras.models.load_model("my_model_with_custom_loss.h5", 
                                custom_objects={"huber_fn": create_huber(2.0)})


In [None]:
# Keras API only specifies how to use subclassing to define layers, models,
# callbacks, and regularizers. If you build other things, such as losses,
# metrics, initializers or constraints, they may not be portable.
class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        '''**kwargs are passed to the parent constructor, standard 
        hyperparameters: name of the loss and the reduction algorithm
        to use to aggregate the individual instance losses, by default
        is "sum_over_batch_size"'''
        self.threshold = threshold
        super().__init__(**kwargs)
    def call(self, y_true, y_pred):
        # takes the labels and predictions, computes all the instance losses
        # and returns them
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2    
        return tf.where(is_small_error, squared_loss, linear_loss)
    def get_config(self):
        '''Returns a dictionary mapping each hyperparameter name to its value. 
        It first calls the parent class's get_config() method, then adds the
        new hyperparameters to this dictionary'''
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}    

In [None]:
model.compile(loss=HuberLoss(2.), optimizer="nadam")
model = keras.models.load_model("model_huber_class.h5",
                                custom_objects={"HuberLoss": HuberLoss})

When we save a model, Keras calls the loss instance's _get_config()_ method and saves the config as JSON in the HDF5 file. When loading, it calls the _from_config()_ class method on the HuberLoss class; this method is implemented by the base class (Loss) and creates an instance of the class, passing **config to the constructor.
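
The round trip can be sketched without Keras. `ToyLoss` below is a hypothetical stand-in that only mimics the get_config() / from_config() pattern described above; it is not how Keras implements its Loss class.

```python
import json

class ToyLoss:
    """Stand-in for a loss class; it only mimics the config round trip."""
    def __init__(self, threshold=1.0, name="toy_loss"):
        self.threshold = threshold
        self.name = name

    def get_config(self):
        # every constructor argument goes into the config dictionary
        return {"name": self.name, "threshold": self.threshold}

    @classmethod
    def from_config(cls, config):
        # rebuild the object by passing **config to the constructor
        return cls(**config)

saved = json.dumps(ToyLoss(threshold=2.0).get_config())  # what gets written to disk
restored = ToyLoss.from_config(json.loads(saved))        # what loading does
```

Because the threshold is part of the saved config, `restored.threshold` is 2.0 again without having to pass it at load time.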

### Custom Activation Functions, Initializers, Regularizers, and Constraints

In [14]:
# We just need to create a simple function
def my_softplus(z):
    # equivalent to: keras.activations.softplus() or tf.nn.softplus()
    return tf.math.log(tf.exp(z) + 1.0)

def my_glorot_initializer(shape, dtype=tf.float32):
    # equivalent to keras.initializers.glorot_normal()
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)

def my_l1_regularizer(weights):
    # equivalent to keras.regularizers.l1(0.01)
    return tf.reduce_sum(tf.abs(0.01*weights))

def my_positive_weights(weights):
    # keras.constraints.nonneg() or tf.nn.relu()
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

In [None]:
layer = keras.layers.Dense(30, activation=my_softplus,
                           kernel_initializer=my_glorot_initializer,
                           kernel_regularizer=my_l1_regularizer,
                           kernel_constraint=my_positive_weights)

In [None]:
# If a function has hyperparameters that need to be saved along with the model,
# we can subclass the appropriate class instead
class MyL1Regularizer(keras.regularizers.Regularizer):
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, weights):
        # The call() method must be implemented for losses, layers, activation
        # functions and models. __call__() for regularizers, initializers and constraints
        return tf.reduce_sum(tf.abs(self.factor * weights))
    def get_config(self):
        return {"factor": self.factor}

### Custom metrics
Losses and metrics are conceptually not the same thing. Losses are used by Gradient Descent to __train__ a model, so they must be differentiable and their gradients should not be 0 everywhere. Metrics are used to __evaluate__ a model, so they should be easily interpretable, and they can be non-differentiable or have 0 gradients everywhere.

In most cases, defining a custom metric is the same as defining a custom loss function. Even the Huber loss function created earlier can be used as a metric.

In [None]:
model.compile(loss="mse", optimizer="nadam", metrics=[create_huber(2.0)])

For each batch during training, Keras will compute this metric and keep track of its mean since the beginning of the epoch. This is not always what we want. For example:

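__Precision__ is the number of true positives divided by the number of positive predictions. Suppose that, while training a binary classifier, the model makes 5 positive predictions in the first batch, 4 of which are correct (80%), and 3 positive predictions in the second batch, none of which are correct (0%). The mean of the two batch precisions is 40%, but that is not the precision over both batches: the model made 8 positive predictions in total and 4 were correct, so the overall precision is 4/8 (50%).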
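
The arithmetic of that example, as a small plain-Python sketch (the variable names are chosen for this illustration):

```python
# (true positives, positive predictions) for each batch in the example
batches = [(4, 5), (0, 3)]

# Averaging the per-batch precisions gives every batch the same weight
mean_of_batch_precisions = sum(tp / pp for tp, pp in batches) / len(batches)

# A streaming metric accumulates the counts and divides only once
total_tp = sum(tp for tp, _ in batches)
total_pp = sum(pp for _, pp in batches)
streaming_precision = total_tp / total_pp

print(mean_of_batch_precisions, streaming_precision)
```

Keeping running totals of true positives and positive predictions, instead of a running mean of ratios, is what makes the result independent of how the data is split into batches.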

In [4]:
# precision object
precision = keras.metrics.Precision()
# Using it like a function: pass the labels and predictions of the first batch, then of the second
precision([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1])

<tf.Tensor: shape=(), dtype=float32, numpy=0.8>

In [5]:
# overall precision so far, streaming metrics
precision([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0])

<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

In [6]:
# we can call result() at any point
precision.result()

<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

In [7]:
# number of true and false positives
precision.variables

[<tf.Variable 'true_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>,
 <tf.Variable 'false_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>]

In [10]:
# resetting the variables
precision.reset_states()
precision.variables

[<tf.Variable 'true_positives:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>,
 <tf.Variable 'false_positives:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>]

In [None]:
# Creating a streaming metric (for show purposes)
class HuberMetric(keras.metrics.Metric):
    # keeps track of the total Huber loss and the number of instances
    def __init__(self, threshold=1.0, **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold
        self.huber_fn = create_huber(threshold)
        # variables that keep track of the metrics state over multiple batches,
        # in this case, the sum of all Huber losses.
        self.total = self.add_weight("total", initializer="zeros")
        self.count = self.add_weight("count", initializer="zeros")
        # tf.Variable for manual creation, Keras keep track of it
    def update_state(self, y_true, y_pred, sample_weight=None):
        # this is called each time an instance of this class is called
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))
    def result(self):
        return self.total / self.count
        # when we use the metric as a function, the update_state() method gets called 
        # first, then the result() method and its output is returned

    def get_config(self):
        # using this we ensure the threshold gets saved along with the model
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}
# In this case we don't need to override the reset_states() method

### Custom layers

Layers with no default implementation, blocks of layers treated as a single layer are some use cases for a custom layer.

In [None]:
# For layers with no weights, the simplest option is to wrap a function in keras.layers.Lambda
# The following layer applies the exponential function to its inputs
exponential_layer = keras.layers.Lambda(lambda x: tf.exp(x)) 

In [None]:
# Building a stateful layer:
class MyDense(keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        # hyperparameters (units and activation) and **kwargs as arguments.
        # **kwargs are passed to parent constructor (input_shape, trainable, name)
        super().__init__(**kwargs)
        # hyperparameters as attributes
        self.units = units
        self.activation = keras.activations.get(activation)

    def build(self, batch_input_shape):
        '''Its role is to create the layer's variables by calling the add_weight()
        method for each weight. This method is called the first time the layer is 
        used. At this point Keras will know the shape of this layers input and
        will pass it to the build() method'''
        self.kernel = self.add_weight(name="kernel", shape=[batch_input_shape[-1], self.units],
                                      initializer="glorot_normal")
        self.bias = self.add_weight(name="bias", shape=[self.units], initializer="zeros")
        # call the parent build method to tell keras the layer is built (self.built=True)
        super().build(batch_input_shape) # always at the end

    def call(self, X):
        # performs the desired operations. Matrix multiplication of inputs X and layer's
        # kernel, sum the bias vector and apply the activation function
        return self.activation(X @ self.kernel + self.bias)

    def compute_output_shape(self, batch_input_shape):
        # the last dimension is replaced with the number of neurons in the layer
        return tf.TensorShape(batch_input_shape.as_list()[:-1]) + [self.units]
        # it can be omitted, Keras infers the output shape, except when the layer is dynamic

    def get_config(self):
        base_config = super().get_config()
        # we save the activation function's full configuration by calling keras.activations.serialize()
        return {**base_config, "units": self.units,
                "activation": keras.activations.serialize(self.activation)}

In [None]:
# Layer with multiple inputs (e.g. Concatenate).
class MyMultiLayer(keras.layers.Layer):
    # This layer takes two inputs and
    # returns three outputs
    def call(self, X):
        # argument of call() should be a tuple containing all inputs
        X1, X2 = X
        # Multiple outputs, return the list of outputs
        return [X1 + X2, X1 * X2, X1 / X2]
    def compute_output_shape(self, batch_input_shape):
        # the argument should be a tuple containing each input's batch shape
        b1, b2 = batch_input_shape
        # return the list of batch output shapes (one per output)
        return [b1, b1, b1]

In [None]:
# If a layer needs to behave differently during training and
# testing (e.g. if it uses Dropout or BatchNormalization layers), we must add
# a training argument to the call() method:
class MyGaussianNoise(keras.layers.Layer):
    def __init__(self, stddev, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev

    def call(self, X, training=None):
        # set the behaviour if it is training the DNN, in this case, adding
        # gaussian noise
        if training:
            noise = tf.random.normal(tf.shape(X), stddev=self.stddev)
            return X + noise
        else:
            return X

    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape

### Custom models
As we saw in Chapter 10 (Subclassing API), the steps are: subclass the _keras.Model_ class, create layers and variables in the constructor, and implement the _call()_ method to do whatever you want the model to do.

Suppose we want to build the following model:
<center><img src="img/custom_model.png"></img></center>
The architecture does not make much sense, but it shows that we can build any model we want.

In [None]:
#Let's do the residual block
class ResidualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(n_neurons, activation="elu",
                                          kernel_initializer="he_normal")
                       for _ in range(n_layers)]
    
    def call(self, inputs):
        Z = inputs
        # pass the inputs through each hidden layer in turn
        for layer in self.hidden:
            Z = layer(Z)
        # add the inputs to the output of the last hidden layer (the skip connection)
        return inputs + Z

# Keras automatically detects that the hidden attribute contains Trackable
# objects (layers in this case), so their variables are added to this layer's
# list of variables.

In [None]:
# Subclassing API to define the model itself
class ResidualRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        # create layers in constructor
        self.hidden1 = keras.layers.Dense(30, activation="elu",
                                          kernel_initializer="he_normal")
        self.block1 = ResidualBlock(2, 30)
        self.block2 = ResidualBlock(2, 30)
        self.out = keras.layers.Dense(output_dim)
    
    def call(self, inputs):
        Z = self.hidden1(inputs)
        for _ in range(1 + 3):
            Z = self.block1(Z)
        Z = self.block2(Z)
        return self.out(Z)

# to save the model, we must implement the get_config() method in both classes

The Model class is a subclass of the Layer class, so models can be defined and used exactly like layers, with extra functionality such as compile(), fit(), and evaluate(). Still, it is cleaner to distinguish the internal components of a model (layers or reusable blocks of layers) from the model itself (the object we train).

### Losses and Metrics Based on Model Internals
The losses and metrics defined above were all based on the labels and the predictions (and optionally sample weights). Losses can also be defined on other parts of the model, such as the weights or activations of its hidden layers. This may be useful for regularization or for monitoring some internal aspect of the model.

To define such a loss, compute it wherever needed in the model and pass the result to the _add_loss()_ method.

In [None]:
class ReconstructingRegressor(keras.Model):
    '''Adds a reconstruction loss that encourages the model to preserve as much
    information as possible through the hidden layers. It is the mean squared
    difference between the reconstruction and the inputs.'''
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        # 5 hidden layers
        self.hidden = [keras.layers.Dense(30, activation="selu",
                                         kernel_initializer="lecun_normal") 
                       for _ in range(5)]
        # 1 output layer
        self.out = keras.layers.Dense(output_dim)
    
    def build(self, batch_input_shape):
        n_inputs = batch_input_shape[-1]
        # extra dense layer to reconstruct the inputs of the model, it is created here
        # because the number of inputs is unknown before the build() method is called
        # and it depends on the number of inputs
        self.reconstruct = keras.layers.Dense(n_inputs)
        # Due to an issue introduced in TF 2.2 (#46858), we must not call super().build()
        # inside the build() method.
        # super().build(batch_input_shape)

    def call(self, inputs):
        Z = inputs
        # processing the inputs through all hidden layers
        for layer in self.hidden:
            Z = layer(Z)
        # result to the reconstruction layer
        reconstruction = self.reconstruct(Z)
        # computing the loss
        recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
        # adding it to the model's list of losses, scaled down so it doesn't dominate the main loss
        self.add_loss(0.05*recon_loss)
        return self.out(Z)


We can also add a custom metric based on model internals: create a metric object in the constructor (a _keras.metrics.Something_ object), call it in the call() method, passing it the _recon_loss_, and add it to the model by calling the model's add_metric() method.

### Computing Gradients Using Autodiff

In [5]:
def f(w1, w2):
    return 3 * w1**2 + 2 * w1 * w2
w1, w2 = 5, 3
eps = 1e-6
# approximating each partial derivative with a finite difference: nudge one
# parameter by eps and measure how much the output changes
print((f(w1 + eps, w2) - f(w1, w2)) / eps ) 
print((f(w1, w2 + eps) - f(w1, w2)) / eps ) 

36.000003007075065
10.000000003174137
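
For comparison, this function is simple enough to differentiate by hand. With $f(w_1, w_2) = 3 w_1^2 + 2 w_1 w_2$ evaluated at $(w_1, w_2) = (5, 3)$:

$$\frac{\partial f}{\partial w_1} = 6 w_1 + 2 w_2 = 6 \cdot 5 + 2 \cdot 3 = 36, \qquad \frac{\partial f}{\partial w_2} = 2 w_1 = 2 \cdot 5 = 10$$

The finite-difference estimates agree with these exact values up to a small error, and each estimate costs one extra call to f per parameter, which does not scale to large models.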


In [7]:
# It is better to use autodiff
w1, w2 = tf.Variable(5.), tf.Variable(3.) # by default the tape only tracks Variables
with tf.GradientTape() as tape:
    # the context automatically records every operation involving a variable
    z = f(w1, w2)
# ask for the gradient of the result z with regard to both variables
gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

The tape is erased right after its gradient() method is called once. To call gradient() more than once, the tape must be persistent, and we should delete it when we are done to free resources.

In [9]:
with tf.GradientTape(persistent=True) as tape:
    z = f(w1, w2)
dz_w1 = tape.gradient(z, w1)
dz_w2 = tape.gradient(z, w2)
del tape
print(dz_w1, dz_w2)

tf.Tensor(36.0, shape=(), dtype=float32) tf.Tensor(10.0, shape=(), dtype=float32)


In [11]:
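c1, c2 = tf.constant(5.), tf.constant(3.)
with tf.GradientTape() as tape:
    # by default the tape only tracks operations involving variables; watch()
    # forces it to record operations on any tensor. This is useful, for example,
    # for a regularization loss that penalizes activations that vary a lot
    # when the inputs vary little
    tape.watch(c1)
    tape.watch(c2)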
    z = f(c1, c2)
gradients = tape.gradient(z, [c1, c2])
gradients 

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

In [12]:
# second-order partial derivatives
with tf.GradientTape(persistent=True) as hessian_tape:
    with tf.GradientTape(persistent=True) as jacobian_tape:
        z = f(w1, w2)
    jacobians = jacobian_tape.gradient(z, [w1, w2])
hessians = [hessian_tape.gradient(jacobian, [w1, w2])
            for jacobian in jacobians]
print(jacobians, '\n', hessians)

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>, <tf.Tensor: shape=(), dtype=float32, numpy=10.0>] 
 [[<tf.Tensor: shape=(), dtype=float32, numpy=6.0>, <tf.Tensor: shape=(), dtype=float32, numpy=2.0>], [<tf.Tensor: shape=(), dtype=float32, numpy=2.0>, None]]


In [13]:
# stop gradients from backpropagating through some part of the network
def f(w1, w2):
    return 3 * w1**2 + tf.stop_gradient(2 * w1 * w2)

with tf.GradientTape() as tape:
    z = f(w1, w2)

gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=30.0>, None]

In [15]:
# Numerical issues
x = tf.Variable([100.])
with tf.GradientTape() as tape:
    z = my_softplus(x)
tape.gradient(z, [x])

[<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>]

Computing the gradient ends up evaluating inf/inf, which returns NaN. But we can find analytically that the derivative is $\frac{1}{1 + 1/e^x}$, which is numerically stable.

In [17]:

@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)
    def my_softplus_gradients(grad):
        return grad / (1 + 1/exp)
    return tf.math.log(exp + 1), my_softplus_gradients

In [18]:
x = tf.Variable([1000.])
with tf.GradientTape() as tape:
    z = my_better_softplus(x)
z, tape.gradient(z, [x])

(<tf.Tensor: shape=(1,), dtype=float32, numpy=array([inf], dtype=float32)>,
 [<tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>])

In [19]:
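# the main output still explodes, so return the inputs where
# the inputs are large. Without the custom gradient, though, the gradient is
# NaN again: tf.where() backpropagates through both branches, including the
# one that overflows. The fix is to combine tf.where() with @tf.custom_gradient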
def my_better_softplus(z):
    return tf.where(z > 30. , z, tf.math.log(tf.exp(z) + 1))
x = tf.Variable([1000.])
with tf.GradientTape() as tape:
    z = my_better_softplus(x)
z, tape.gradient(z, [x])

(<tf.Tensor: shape=(1,), dtype=float32, numpy=array([1000.], dtype=float32)>,
 [<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>])
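
As a framework-free illustration of the same idea, the sketch below rewrites softplus and its derivative so that no intermediate value overflows. It is plain Python with hypothetical helper names, not the notebook's TensorFlow code, and it relies on the identity $\log(1 + e^z) = \max(z, 0) + \log(1 + e^{-|z|})$.

```python
import math

def stable_softplus(z):
    # exp(-|z|) is at most 1, so it can underflow to 0 but never overflow
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def stable_softplus_grad(z):
    # the derivative is the sigmoid; use the form whose exponent is <= 0
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

print(stable_softplus(1000.0), stable_softplus_grad(1000.0))
```

Both the value and the gradient stay finite at z = 1000, where the naive formula overflows.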

### Custom Training Loops

Custom training loops are more error-prone and harder to maintain than fit(), but they give full control over each training step.

In [None]:
l2_reg = keras.regularizers.l2(0.05)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="elu", kernel_initializer="he_normal",
                       kernel_regularizer=l2_reg),
    keras.layers.Dense(1, kernel_regularizer=l2_reg)
])

In [None]:
def random_batch(X, y, batch_size=32):
    # Randomly sample a batch of instances from the training set
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]


In [None]:
def print_status_bar(iteration, total, loss, metrics=None):
    # display the training status
    metrics = " - ".join(["{}: {:.4f}".format(m.name, m.result())
                          for m in [loss] + (metrics or [])])
    end = "" if iteration < total else "\n"
    print("\r{}/{} - ".format(iteration, total) + metrics,
          end=end)

In [None]:
# Defining hyperparameters
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.MeanAbsoluteError()]

In [None]:
# Custom loop
for epoch in range(1, n_epochs + 1):
    # epochs loop
    print("Epoch {}/{}".format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        # batches loop
        X_batch, y_batch = random_batch(X_train_scaled, y_train)
        with tf.GradientTape() as tape:
            # prediction for one batch using the model as a function
            y_pred = model(X_batch, training=True)
            # mean over the batch
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            # model.losses holds the regularization losses (one per regularized layer)
            loss = tf.add_n([main_loss] + model.losses)
        # compute the gradient of the loss with regard to each trainable variable
        gradients = tape.gradient(loss, model.trainable_variables)
        # gradient descent step, then update the mean loss and the metrics
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)
        print_status_bar(step * batch_size, len(y_train), mean_loss, metrics)
    print_status_bar(len(y_train), len(y_train), mean_loss, metrics)
    for metric in [mean_loss] + metrics:
        # reset the state of the mean loss and the metrics
        metric.reset_states()

- If we want to apply any other transformation to the gradients, we should do it before calling the _apply_gradients()_ method.
- If we want to add weight constraints, we should update the training loop to apply these constraints just after _apply_gradients()_:

In [None]:
for variable in model.variables:
    if variable.constraint is not None:
        variable.assign(variable.constraint(variable))

- Layers that behave differently during training and testing (BatchNormalization or Dropout) need special care in a custom loop: we must call the model with _training=True_, as the loop above does, and make sure the argument propagates to every layer that needs it.

### TensorFlow Functions and Graphs

In [3]:
def cube(x):
    return x ** 3
print(cube(2))
print(cube(tf.constant(2.0)))

8
tf.Tensor(8.0, shape=(), dtype=float32)


In [5]:
tf_cube = tf.function(cube)
tf_cube

<tensorflow.python.eager.def_function.Function at 0x7f1104b87460>

In [6]:
print(tf_cube(2))
print(tf_cube(tf.constant(2.0)))


tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(8.0, shape=(), dtype=float32)


In [None]:
# tf.function() generated an equivalent computation graph
# It is more common to use it as a decorator
@tf.function
def tf_cube(x):
    return x ** 3

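We should convert a Python function into a TensorFlow function whenever we can, since the graph version usually runs much faster. Pass tensors as arguments (tf.constant(x), etc.): a Python numerical value triggers a new graph for every distinct value.

__Rules__
- Only use TensorFlow constructs (tf.reduce_sum(), tf.sort(), etc.); calls to other libraries run only during tracing and are not part of the graph.
- TF only captures for loops that iterate over a tensor or a dataset, so use _for i in tf.range(x)_ rather than _for i in range(x)_.
- Vectorized implementations are better.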
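
A minimal sketch of the loop rule, assuming TensorFlow 2.x. The function name `sum_squares` is made up for this illustration.

```python
import tensorflow as tf

@tf.function
def sum_squares(n):
    # Iterating over tf.range(n) lets TensorFlow capture the loop in the graph;
    # a plain Python range() would be unrolled while the function is traced.
    total = tf.constant(0)
    for i in tf.range(n):
        total += i ** 2
    return total

# Pass a tensor rather than a Python int so that one graph serves every value of n
result = sum_squares(tf.constant(5))
print(result)
```

A vectorized version such as tf.reduce_sum(tf.range(n) ** 2) avoids the loop entirely and is usually the better choice, in line with the last rule.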