# Custom Models and Training with TensorFlow

In [2]:
import tensorflow as tf
import numpy as np

## Using TensorFlow like NumPy

TensorFlow revolves around *tensors*, similar to NumPy `ndarray`: usually a multidimensional array, but it can also hold a scalar (simple value, such as `42`).

### Tensors and Operations

Create tensor with `tf.constant()`

In [2]:
tf.constant([[1.,2.,3.], [4.,5.,6.]]) # matrix

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [4]:
tf.constant(42) # scalar

<tf.Tensor: shape=(), dtype=int32, numpy=42>

Like an `ndarray`, a `tf.Tensor` has a shape and a data type (`dtype`)

In [5]:
t = tf.constant([[1.,2.,3.], [4.,5.,6.]])
t.shape

TensorShape([2, 3])

In [6]:
t.dtype

tf.float32

Indexing works like NumPy

In [7]:
t[:, 1:]

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 3.],
       [5., 6.]], dtype=float32)>

In [8]:
t[..., 1, tf.newaxis]

<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[2.],
       [5.]], dtype=float32)>

Tensor operations

In [9]:
t + 10

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[11., 12., 13.],
       [14., 15., 16.]], dtype=float32)>

In [10]:
tf.square(t)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)>

In [11]:
t @ tf.transpose(t)

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[14., 32.],
       [32., 77.]], dtype=float32)>

`@` is equivalent to calling `tf.matmul()` (matrix multiplication)

#### Keras' Low-Level API

The Keras API has its own low-level API, `keras.backend`, which includes functions like `square()`, `exp()`, and `sqrt()`. In tf.keras, these functions just call the corresponding TensorFlow operations. If you want to write code that is portable to other Keras implementations, use these Keras functions. However, they cover only a subset of the functions available in TensorFlow

In [12]:
from tensorflow import keras
K = keras.backend # tf.keras backend (avoids importing the standalone keras package)
K.square(K.transpose(t)) + 10

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[11., 26.],
       [14., 35.],
       [19., 46.]], dtype=float32)>

### Tensors and NumPy

You can create a tensor from a NumPy array, and vice versa. You can apply TensorFlow operations to NumPy arrays and NumPy operations to tensors.

In [14]:
a = np.array([2.,4.,5.])
tf.constant(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([2., 4., 5.])>

In [15]:
t.numpy() # or np.array(t)

array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

In [16]:
tf.square(a)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([ 4., 16., 25.])>

In [17]:
np.square(t)

array([[ 1.,  4.,  9.],
       [16., 25., 36.]], dtype=float32)

Note: NumPy uses 64-bit precision by default, while TensorFlow uses 32-bit. When you create a tensor from a NumPy array, set `dtype=tf.float32` unless you really need 64-bit precision
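A minimal sketch of the precision difference, reusing the array `a` from above. The names `t64` and `t32` are illustrative only.

```python
import numpy as np
import tensorflow as tf

a = np.array([2., 4., 5.])               # NumPy defaults to float64

t64 = tf.constant(a)                     # keeps the array's float64 dtype
t32 = tf.constant(a, dtype=tf.float32)   # explicit 32-bit conversion

print(t64.dtype, t32.dtype)
```

Setting the dtype at creation time avoids the type-mismatch errors that appear when a 64-bit tensor later meets a 32-bit one.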

### Type Conversions

TensorFlow does not perform any type conversions automatically

In [18]:
tf.constant(2.) + tf.constant(40)

InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a int32 tensor [Op:AddV2]

In [19]:
tf.constant(2.) + tf.constant(40., dtype=tf.float64)

InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a double tensor [Op:AddV2]
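When a conversion really is needed, it has to be requested explicitly with `tf.cast()`. A small sketch (the variable names are illustrative):

```python
import tensorflow as tf

t_float = tf.constant(2.)    # float32
t_int = tf.constant(40)      # int32

# Cast the int32 tensor to float32 before adding
result = t_float + tf.cast(t_int, tf.float32)

print(result)
```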

### Variables

`tf.Tensor` values are immutable. Use `tf.Variable` for mutable variables

In [21]:
v = tf.Variable([[1.,2.,3.,], [4.,5.,6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

A `tf.Variable` behaves the same way as a `tf.Tensor` (same operations, works with NumPy as well, just as picky with types, etc.) but can also be modified in place using the `assign()` method (or `assign_add()` or `assign_sub()`, which increment or decrement the variable by the given value). You can also modify individual cells (or slices) by using the cell's (or slice's) `assign()` method, or with `scatter_update()` or `scatter_nd_update()` (direct item assignment will not work)

In [22]:
v.assign(2*v)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [23]:
v[0, 1].assign(42)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [25]:
v.scatter_nd_update(indices=[[0,0], [1,2]], updates=[100., 200.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   6.],
       [  8.,  10., 200.]], dtype=float32)>
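The sketch below illustrates the remaining points from the paragraph above: direct item assignment fails, while `assign_add()` and a slice's `assign()` update the variable in place. It starts from a fresh variable, and the flag `item_assignment_worked` is a hypothetical name used only for this illustration.

```python
import tensorflow as tf

v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])

# Direct item assignment is not supported on variables
try:
    v[0, 1] = 42.
    item_assignment_worked = True
except TypeError:
    item_assignment_worked = False

v.assign_add(tf.ones_like(v))   # in-place increment of every cell
v[:, 2].assign([0., 1.])        # in-place update of one column (a slice)

print(item_assignment_worked)
print(v.numpy())
```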

### Other Data Structures

- Sparse tensors (`tf.SparseTensor`)
- Tensor arrays (`tf.TensorArray`)
- Ragged tensors (`tf.RaggedTensor`)
- String tensors
- Sets
- Queues
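A brief sketch of three of the structures listed above (sparse, ragged, and string tensors). The values are arbitrary and chosen only for illustration.

```python
import tensorflow as tf

# Sparse tensor: only the non-zero cells are stored
s = tf.SparseTensor(indices=[[0, 1], [1, 0]], values=[1., 2.], dense_shape=[2, 3])
dense = tf.sparse.to_dense(s)

# Ragged tensor: rows may have different lengths
r = tf.ragged.constant([[1, 2, 3], [4]])
row_lengths = r.row_lengths()

# String tensor: holds byte strings (Unicode text is encoded as UTF-8)
b = tf.constant("café")
length_bytes = tf.strings.length(b)
length_chars = tf.strings.length(b, unit="UTF8_CHAR")

print(dense.numpy(), row_lengths.numpy(), int(length_bytes), int(length_chars))
```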

## Customizing Models and Training Algorithms

### Custom Loss Functions

Example of creating Huber loss function using TensorFlow operations

In [None]:
def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

In [None]:
model.compile(loss=huber_fn, optimizer='nadam')
model.fit(X_train, y_train, [...])
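To see what the function returns, here is a quick check on two hand-picked errors, one on each side of the threshold. The inputs are made up for illustration, and `huber_fn` is repeated so the snippet runs on its own.

```python
import tensorflow as tf

def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

y_true = tf.constant([0., 0.])
y_pred = tf.constant([0.5, 3.])      # errors of size 0.5 (small) and 3 (large)
losses = huber_fn(y_true, y_pred)

print(losses.numpy())   # 0.5**2 / 2 = 0.125 and 3 - 0.5 = 2.5
```

Note that the function returns one loss per instance; Keras takes care of averaging them.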

### Saving and Loading Models That Contain Custom Components

When you load a model containing custom objects, you need to map the names to the objects

In [None]:
model = keras.models.load_model('my_model_with_a_custom_loss.h5', custom_objects={'huber_fn': huber_fn})

With the current implementation, any error between -1 and 1 is considered small. What if you want a different threshold?

In [None]:
def create_huber(threshold=1.0):
    def huber_fn(y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    return huber_fn

model.compile(loss=create_huber(2.0), optimizer='nadam')

Threshold will not be saved when saving the model. You will have to specify the value when loading it. 

In [None]:
model = keras.models.load_model('my_model_with_a_custom_loss_threshold_2.h5', custom_objects={'huber_fn': create_huber(2.0)})

Solve this by creating a subclass of `keras.losses.Loss` class and then implementing its `get_config()` method

In [None]:
class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}


In [None]:
model.compile(loss=HuberLoss(2.), optimizer='nadam')

Saving the model will save the threshold

In [None]:
model = keras.models.load_model('my_model_with_a_custom_loss_class.h5', custom_objects={'HuberLoss': HuberLoss})

When you save the model, Keras calls the loss instance's `get_config()` method and saves the config as a JSON in the HDF5 file. When you load it, it calls the `from_config()` class method on the `HuberLoss` class
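The round trip can be tried by hand, without saving a model. The sketch below repeats the `HuberLoss` class so it is self-contained, and assumes the default `from_config()` simply passes the config dictionary to the constructor.

```python
import tensorflow as tf

class HuberLoss(tf.keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}

config = HuberLoss(2.0).get_config()       # what gets written when saving
restored = HuberLoss.from_config(config)   # what happens when loading

print(config)
print(restored.threshold)
```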

### Custom Activation Functions, Initializers, Regularizers, and Constraints

Most Keras functionalities, such as losses, regularizers, constraints, initializers, metrics, activation functions, layers, and models, can be customized in a similar way. 

In [None]:
def my_softplus(z): # note: tf.nn.softplus(z) better handles large inputs
    return tf.math.log(tf.exp(z) + 1.0)

def my_glorot_initializer(shape, dtype=tf.float32):
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)

def my_l1_regularizer(weights):
    return tf.reduce_sum(tf.abs(0.01 * weights))

def my_positive_weights(weights): # return value is just tf.nn.relu(weights)
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

In [None]:
layer = keras.layers.Dense(30, activation=my_softplus, 
                            kernel_initializer=my_glorot_initializer,
                            kernel_regularizer=my_l1_regularizer,
                            kernel_constraint=my_positive_weights)
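The comment on `my_softplus()` about large inputs can be checked directly. This is a small sketch with arbitrary inputs; `exp(100.)` overflows in 32-bit floats, so the naive version returns infinity.

```python
import tensorflow as tf

def my_softplus(z):
    return tf.math.log(tf.exp(z) + 1.0)

z = tf.constant([0., 100.])
naive = my_softplus(z)         # second value overflows to inf
stable = tf.nn.softplus(z)     # built-in version stays finite

print(naive.numpy(), stable.numpy())
```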

To save the hyperparameters along with the model, use the same approach as before and subclass the appropriate class: `keras.regularizers.Regularizer`, `keras.constraints.Constraint`, `keras.initializers.Initializer`, or `keras.layers.Layer`

In [None]:
class MyL1Regularizer(keras.regularizers.Regularizer):
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, weights):
        return tf.reduce_sum(tf.abs(self.factor * weights))
    def get_config(self):
        return {'factor': self.factor}

Note: implement `call()` for losses, layers (including activation functions), and models, or `__call__()` for regularizers, initializers, and constraints

### Custom Metrics

*Streaming metric* (or *stateful metric*) is one that is gradually updated batch after batch

To create one, create a subclass of the `keras.metrics.Metric` class

In [None]:
class HuberMetric(keras.metrics.Metric):
    def __init__(self, threshold=1.0, **kwargs):
        super().__init__(**kwargs) # handles base args (e.g., dtype)
        self.threshold = threshold
        self.huber_fn = create_huber(threshold)
        self.total = self.add_weight('total', initializer='zeros')
        self.count = self.add_weight('count', initializer='zeros')
    def update_state(self, y_true, y_pred, sample_weight=None):
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))
    def result(self):
        return self.total / self.count
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'threshold': self.threshold}

Some metrics, like precision, cannot simply be averaged over batches: in those cases, there's no other option than to implement a streaming metric
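A small arithmetic sketch of why precision cannot be averaged over batches. The true-positive and false-positive counts below are hypothetical.

```python
# (true positives, false positives) per batch; hypothetical counts
batches = [(4, 1), (0, 3)]

# Precision of each batch taken on its own
per_batch = [tp / (tp + fp) for tp, fp in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Precision over all predictions seen so far (what a streaming metric tracks)
total_tp = sum(tp for tp, _ in batches)
total_fp = sum(fp for _, fp in batches)
overall = total_tp / (total_tp + total_fp)

print(per_batch, mean_of_batches, overall)
```

The per-batch precisions are 80% and 0%, which average to 40%, but the model's actual precision over both batches is 4 / 8 = 50%. A streaming metric gets this right because it keeps the running counts rather than the running ratios.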

### Custom Layers

For a custom layer without any weights: write a function and wrap it in a `keras.layers.Lambda` layer

In [None]:
exponential_layer = keras.layers.Lambda(lambda x: tf.exp(x))

To build a custom stateful layer, create a subclass of the `keras.layers.Layer` class

In [None]:
class MyDense(keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)
    
    def build(self, batch_input_shape):
        self.kernel = self.add_weight(
            name='kernel', shape=[batch_input_shape[-1], self.units],
            initializer='glorot_normal'
        )
        self.bias = self.add_weight(
            name='bias', shape=[self.units], initializer='zeros'
        )
        super().build(batch_input_shape) # must be at the end
    
    def call(self, X):
        return self.activation(X @ self.kernel + self.bias)
    
    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(batch_input_shape.as_list()[:-1] + [self.units])
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, 'units': self.units, 'activation': keras.activations.serialize(self.activation)}

To create a layer with multiple inputs or outputs (e.g., two inputs and three outputs) make sure the number of inputs and outputs are correct for the following methods:

In [None]:
class MyMultiLayer(keras.layers.Layer):
    def call(self, X):
        X1, X2 = X
        return [X1+X2, X1*X2, X1/X2]
    
    def compute_output_shape(self, batch_input_shape):
        b1, b2 = batch_input_shape
        return [b1, b1, b1] # should probably handle broadcasting rules

This layer can be used with the Functional and Subclassing APIs, but not with the Sequential API (which does not accept layers with multiple inputs or outputs)

If layer needs to have different behavior during training and testing (e.g., `Dropout`, `BatchNormalization`) add `training` argument to `call()`. Example creates a layer that adds Gaussian noise during training (for regularization) but does not during testing (Keras has a layer that does this: `keras.layers.GaussianNoise`)

In [None]:
class MyGaussianNoise(keras.layers.Layer):
    def __init__(self, stddev, **kwargs):
        super().__init__(**kwargs)
        self.stddev = stddev
    
    def call(self, X, training=None):
        if training:
            noise = tf.random.normal(tf.shape(X), stddev=self.stddev)
            return X + noise
        else: return X
    
    def compute_output_shape(self, batch_input_shape):
        return batch_input_shape
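The effect of the `training` argument can be observed with the built-in `keras.layers.GaussianNoise` layer mentioned above. A minimal sketch with an arbitrary input:

```python
import tensorflow as tf

layer = tf.keras.layers.GaussianNoise(stddev=1.0)
X = tf.ones([4, 3])

out_test = layer(X, training=False)   # inputs pass through unchanged
out_train = layer(X, training=True)   # Gaussian noise is added

print(out_test.numpy())
print(out_train.numpy())
```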

### Custom Models

Custom model using `ResidualBlock` layers (adds inputs to its outputs)

In [None]:
class ResidualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(n_neurons, activation='elu', kernel_initializer='he_normal') for _ in range(n_layers)]
    
    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z

In [None]:
class ResidualRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(30, activation='elu', kernel_initializer='he_normal')
        self.block1 = ResidualBlock(2, 30)
        self.block2 = ResidualBlock(2, 30)
        self.out = keras.layers.Dense(output_dim)
    
    def call(self, inputs):
        Z = self.hidden1(inputs)
        for _ in range(1+3):
            Z = self.block1(Z)
        Z = self.block2(Z)
        return self.out(Z)

### Losses and Metrics Based on Model Internals

Build a custom regression MLP model composed of a stack of 5 hidden layers, plus an output layer and an auxiliary output on top of the upper hidden layer. The loss associated with the auxiliary output is called the *reconstruction loss*: the mean squared difference between the reconstruction and the inputs. By adding the reconstruction loss to the main loss, we encourage the model to preserve as much information as possible through the hidden layers - even information that is not directly useful for the regression task. This loss sometimes improves generalization:

In [None]:
class ReconstructingRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(30, activation='selu', kernel_initializer='lecun_normal') for _ in range(5)]
        self.out = keras.layers.Dense(output_dim)
    
    def build(self, batch_input_shape):
        n_inputs = batch_input_shape[-1]
        self.reconstruct = keras.layers.Dense(n_inputs)
        super().build(batch_input_shape)
    
    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        reconstruction = self.reconstruct(Z)
        recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
        self.add_loss(0.05 * recon_loss)
        return self.out(Z)

### Computing Gradients Using Autodiff

Consider simple toy function:

In [1]:
def f(w1, w2):
    return 3 * w1**2 + 2 * w1 * w2

To find gradient of this function (partial derivatives for all variables)

In [4]:
w1, w2 = 5, 3
eps = 1e-6

In [5]:
(f(w1 + eps, w2) - f(w1, w2)) / eps

36.000003007075065

In [6]:
(f(w1, w2 + eps) - f(w1, w2)) / eps

10.000000003174137
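These approximations can be checked against the exact partial derivatives of $f(w_1, w_2) = 3w_1^2 + 2w_1w_2$ at the point $(w_1, w_2) = (5, 3)$:

$$
\frac{\partial f}{\partial w_1} = 6w_1 + 2w_2 = 6 \cdot 5 + 2 \cdot 3 = 36,
\qquad
\frac{\partial f}{\partial w_2} = 2w_1 = 2 \cdot 5 = 10
$$

The finite-difference estimates are close to these values but not exact, and they need at least one extra call to `f()` per parameter, which does not scale to large neural networks.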

TensorFlow makes this simple

In [8]:
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    z = f(w1, w2)
gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

The tape is automatically erased immediately after you call its `gradient()` method, so calling `gradient()` a second time raises an exception

In [9]:
with tf.GradientTape() as tape:
    z = f(w1, w2)
dz_dw1 = tape.gradient(z, w1) # => tensor 36.0
dz_dw2 = tape.gradient(z, w2) # RuntimeError!

RuntimeError: A non-persistent GradientTape can only be used to compute one set of gradients (or jacobians)

Make tape persistent and delete it each time you are done with it to free resources:

In [10]:
with tf.GradientTape(persistent=True) as tape:
    z = f(w1, w2)

dz_dw1 = tape.gradient(z, w1) # => tensor 36.0
dz_dw2 = tape.gradient(z, w2) # => tensor 10.0, works fine now!
del tape

By default, the tape only tracks operations involving variables. If you ask for the gradient with regard to anything else, the result is `None`

In [11]:
c1, c2 = tf.constant(5.), tf.constant(3.)
with tf.GradientTape() as tape:
    z = f(c1, c2)

gradients = tape.gradient(z, [c1, c2])
gradients

[None, None]

Force tapes to watch any tensors you like, to record every operation that involves them. You can compute gradients with regard to these tensors as if they were variables:

In [13]:
with tf.GradientTape() as tape:
    tape.watch(c1)
    tape.watch(c2)
    z = f(c1, c2)

gradients = tape.gradient(z, [c1, c2]) # returns [tensor 36., tensor 10.]
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

To stop gradients from backpropagating through some part of your neural network:

In [14]:
def f(w1, w2):
    return 3 * w1**2 + tf.stop_gradient(2 * w1 * w2)

with tf.GradientTape() as tape:
    z = f(w1, w2) # same result as without stop_gradient()

gradients = tape.gradient(z, [w1, w2]) 
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=30.0>, None]

You may occasionally run into numerical issues when computing gradients. For example, if you compute the gradients of the `my_softplus()` function for large inputs, the result will be NaN. This is because computing the gradients of this function using autodiff leads to some numerical difficulties: due to floating-point precision errors, autodiff ends up computing infinity divided by infinity (which returns NaN).

Fortunately we can find the derivative of softplus as $ 1 / (1+1/e^{x})$, which is numerically stable. Next, tell TensorFlow to use this stable function by decorating it with `@tf.custom_gradient` and making it return both its normal output and the function that computes the derivatives

In [None]:
@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)
    def my_softplus_gradients(grad):
        return grad / (1 + 1 / exp)
    return tf.math.log(exp + 1), my_softplus_gradients
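The difference can be checked with a quick sketch that compares both versions on a large input. The input value `100.` is an arbitrary choice that is large enough to overflow `tf.exp()` in 32-bit floats; both functions are repeated so the snippet runs on its own.

```python
import tensorflow as tf

def my_softplus(z):
    return tf.math.log(tf.exp(z) + 1.0)

@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)
    def my_softplus_gradients(grad):
        return grad / (1 + 1 / exp)
    return tf.math.log(exp + 1), my_softplus_gradients

z = tf.constant(100.)
with tf.GradientTape(persistent=True) as tape:
    tape.watch(z)                      # z is a constant, so watch it explicitly
    y_naive = my_softplus(z)
    y_better = my_better_softplus(z)

naive_grad = tape.gradient(y_naive, z)     # autodiff runs into NaN
better_grad = tape.gradient(y_better, z)   # custom gradient stays finite
del tape

print(naive_grad.numpy(), better_grad.numpy())
```

Only the gradient is fixed here: the forward output of `my_better_softplus()` still overflows to infinity for such inputs, so `tf.nn.softplus()` remains the safer choice in practice.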

### Custom Training Loops

In some rare cases `fit()` method may not be flexible enough. May also write custom training loops to feel more confident that they do precisely what is intended. Can feel safer to make everything explicit. However, writing a custom training loop will make code longer, more error-prone, and harder to maintain.

First build simple model:

In [None]:
l2_reg = keras.regularizers.l2(0.05)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation='elu', kernel_initializer='he_normal', kernel_regularizer=l2_reg),
    keras.layers.Dense(1, kernel_regularizer=l2_reg)
])

Next create tiny function that will randomly sample a batch of instances from training set:

In [None]:
def random_batch(X, y, batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

Also define a function that will display the training status, including the number of steps, the total number of steps, the mean loss since the start of the epoch, and other metrics:

In [None]:
def print_status_bar(iteration, total, loss, metrics=None):
    metrics = " - ".join(['{}: {:.4f}'.format(m.name, m.result())
                    for m in [loss] + (metrics or [])])
    end = "" if iteration < total else "\n"
    print("\r{}/{} - ".format(iteration, total) + metrics, end=end)

First define some hyperparameters and choose the optimizer, loss function, and metrics (MAE)

In [None]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.MeanAbsoluteError()]

Custom loop

In [None]:
for epoch in range(1, n_epochs + 1):
    print("Epoch {}/{}".format(epoch, n_epochs))
    for step in range(1, n_steps + 1):
        X_batch, y_batch = random_batch(X_train_scaled, y_train)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch, training=True)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)
        print_status_bar(step * batch_size, len(y_train), mean_loss, metrics)
    print_status_bar(len(y_train), len(y_train), mean_loss, metrics)
    for metric in [mean_loss] + metrics:
        metric.reset_states()

If you add weight constraints to the model, update the training loop to apply the constraints just after `apply_gradients()`

In [None]:
for variable in model.variables:
    if variable.constraint is not None:
        variable.assign(variable.constraint(variable))

Most importantly, layers that behave differently during training and testing (e.g., `BatchNormalization` or `Dropout`) are only handled correctly if the training loop calls the model with `training=True` and the model propagates this to every layer that needs it.

Easy to make a mistake, but you get full control

## TensorFlow Functions and Graphs

In [16]:
def cube(x): return x**3

Use `tf.function()` to convert this Python function to a *TensorFlow Function*

In [17]:
tf_cube = tf.function(cube)
tf_cube

<tensorflow.python.eager.def_function.Function at 0x7f7b0c093668>

Can be used exactly like original Python function, and it returns the same result (but as tensors)

In [18]:
tf_cube(2)

<tf.Tensor: shape=(), dtype=int32, numpy=8>

In [19]:
tf_cube(tf.constant(2.))

<tf.Tensor: shape=(), dtype=float32, numpy=8.0>

Alternatively use a decorator (more common):

In [20]:
@tf.function
def tf_cube(x): return x**3

To call original Python function:

In [21]:
tf_cube.python_function(2)

8

TensorFlow optimizes computation graph, making TF Functions run much faster than original Python functions, especially for complex computations

When you write a custom loss function, custom metric, custom layer, or any other custom function and use it in a Keras model, Keras automatically converts the function into a TF Function

### AutoGraph and Tracing

After analyzing the function's code, AutoGraph outputs an upgraded version of that function in which all the control flow statements are replaced by the appropriate TensorFlow operations. Next, TensorFlow calls this "upgraded" function, but instead of passing the argument, it passes a *symbolic tensor* - a tensor without any actual value, only a name, a data type, and a shape. The function will run in *graph mode*, meaning that each TF operation will add a node in the graph to represent itself and its output tensor(s) (as opposed to the regular mode, called *eager execution*, or *eager mode*)
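Tracing can be made visible with a Python side effect, which only runs while the graph is being built. In this sketch the list `trace_log` is a hypothetical helper used only to record each trace.

```python
import tensorflow as tf

trace_log = []   # filled by a Python side effect, so only during tracing

@tf.function
def tf_cube(x):
    trace_log.append(x.dtype.name)   # x is a symbolic tensor here
    return x ** 3

tf_cube(tf.constant(2.))   # first call: traces a float32 graph
tf_cube(tf.constant(3.))   # same dtype and shape: the graph is reused
tf_cube(tf.constant(2))    # int32 input: a new graph is traced

print(trace_log)
```

Three calls produce only two traces, one per input signature.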

## Exercises

1. **How would you describe TensorFlow in a short sentence? What are its main features? Can you name other Deep Learning libraries?**
<br>
A powerful library for numerical computation, particularly well suited and fine-tuned for large-scale Machine learning (but it can be used for anything else that requires heavy computations). Its main features are:
- NumPy features, but with GPU support
- Supports distributed computing
- a kind of JIT compiler that allows it to optimize computations for speed and memory usage
- Computation graphs can be exported to portable format
- implements autodiff, and provides some excellent optimizers, such as RMSProp and Nadam
- tf.keras, data loading and preprocessing ops, image processing ops, signal processing ops, and more
<br><br>
Other Deep Learning libraries include PyTorch, MXNet, Microsoft Cognitive Toolkit, Caffe2, Chainer, and Theano

2. **Is TensorFlow a drop-in replacement for NumPy? What are the main differences between the two?**
<br>
It is not a drop-in replacement. TensorFlow offers most of NumPy's functionality and adds GPU support, autodiff, and computation graphs, but the two differ in several ways:
- TensorFlow revolves around tensors which flow from operation to operation
- TF's multidimensional array-like structure can hold a scalar as well
- the function names used on tensors from TF are different from the ones in NumPy but NumPy operations can be used on them as well 
- some of these functions don't behave the same way
- NumPy arrays are mutable, tensors are not
- NumPy is based on 64-bit precision while TensorFlow is 32-bit
- TensorFlow includes several other data structures as well

3. **Do you get the same result with `tf.range(10)` and `tf.constant(np.arange(10))`?**

In [22]:
tf.range(10)

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)>

In [23]:
tf.constant(np.arange(10))

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

No, the data types differ: `tf.range(10)` returns a 32-bit integer tensor, while `np.arange(10)` returns 64-bit integers here, and `tf.constant()` keeps the dtype of the NumPy array it is given.

4. **Can you name six other data structures available in TensorFlow, beyond regular tensors?**

- Sparse tensors
- Tensor arrays
- Ragged tensors
- String tensors
- Sets
- Queues

5. **A custom loss function can be defined by writing a function or by subclassing the `keras.losses.Loss` class. When would you use each option?**

A plain function is enough when the loss has no hyperparameters. If the loss has hyperparameters (such as a threshold) that should be saved with the model, subclass `keras.losses.Loss` and implement its `get_config()` method.

6. **Similarly, a custom metric can be defined in a function or a subclass of `keras.metrics.Metric`. When would you use each option?**

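The same reasoning applies: a plain function is enough for a simple metric, while subclassing `keras.metrics.Metric` lets you save the metric's hyperparameters with the model. Subclassing is also the only option for streaming metrics that cannot simply be averaged over batches (e.g., precision).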

7. **When should you create a custom layer versus a custom model?**

Distinguish the internal components of your model (i.e., layers or reusable blocks of layers) from the model itself (i.e., the object you will train). The former should subclass from `keras.layers.Layer` while the latter should subclass `keras.models.Model`

8. **What are some use cases that require writing your own training loop?**

It is quite difficult to do, so only when you really need to. An example of when you should do it would be when you want to use two different optimizers for different parts of your neural network (like in the Wide & Deep paper)

9. **Can custom Keras components contain arbitrary Python code, or must they be convertible to TF Functions?**

They should be convertible which means they must respect all the TF Function rules. If they can't but you want to create a TF Function anyway (with arbitrary Python code), then wrap it in a `tf.py_function()` operation (note this will hinder performance).

10. **What are the main rules to respect if you want a function to be convertible to a TF Function?**

- if you call any external library, this call will run only during tracing; it will not be part of the graph
- you can call other Python functions or TF Functions, but they should follow the same rules
- if the function creates a TF variable, it must do so upon the very first call and only then
- the source code of your Python function should be available to TF
- TF will only capture for loops that iterate over a tensor or a dataset
- prefer vectorized implementations whenever you can over using loops
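The first rule in the list above can be illustrated with random numbers. In this sketch (the function names are made up), the NumPy call runs once during tracing and its result is frozen into the graph, whereas the TensorFlow op runs on every call.

```python
import numpy as np
import tensorflow as tf

@tf.function
def add_numpy_noise(x):
    return x + np.random.rand()        # evaluated once, at trace time

@tf.function
def add_tf_noise(x):
    return x + tf.random.uniform([])   # part of the graph, runs on each call

zero = tf.constant(0.)
a1, a2 = add_numpy_noise(zero), add_numpy_noise(zero)
b1, b2 = add_tf_noise(zero), add_tf_noise(zero)

print(float(a1), float(a2))   # identical values
print(float(b1), float(b2))   # different values
```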

11. **When would you need to create a dynamic Keras model? How do you do that? Why not make all your models dynamic?**

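It is useful for debugging, as it will not compile any custom component to a TF Function, and you can use any Python debugger to debug your code. It can also be useful if you want to include arbitrary Python code in your model (or training code), including calls to external libraries. To make a model dynamic, set `dynamic=True` when creating it, or set `run_eagerly=True` when calling `compile()`. You should not make all your models dynamic, because Keras then cannot use TensorFlow's graph features, which slows down training and inference.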

12. **Implement a custom layer that performs *Layer Normalization* (we will use this type of layer in Chapter 15):**

In [2]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

a. The `build()` method should define two variable weights **alpha** and **beta**, both of shape `input_shape[-1:]` and data type `tf.float32`. **alpha** should be initialized with 1s, and **beta** with 0s.

b. The `call()` method should compute the mean $\mu$ and standard deviation $\sigma$ of each instance's features. For this, you can use `tf.nn.moments(inputs, axes=-1, keepdims=True)`, which returns the mean $\mu$ and the variance $\sigma^{2}$ of each instance (compute the square root of the variance to get the standard deviation). Then the function should compute and return **alpha** * (**X** - $\mu$)/($\sigma$ + $\epsilon$) + **beta**, where * represents itemwise multiplication and $\epsilon$ is a smoothing term (a small constant that avoids division by zero, e.g., 1e-3).

In [29]:
class MyLayerNormalization(keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
    
    def build(self, input_shape): 
        # trainable weights, as required by part (a)
        self.alpha = self.add_weight(name='alpha', shape=input_shape[-1:], initializer='ones', dtype=tf.float32)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:], initializer='zeros', dtype=tf.float32)
        super().build(input_shape)

    def call(self, X):
        epsilon = 1e-3
        mu, sigma_squared = tf.nn.moments(X, axes=-1, keepdims=True)
        sigma = tf.math.sqrt(sigma_squared)
        return self.alpha * (X-mu) / (sigma+epsilon) + self.beta

c. Ensure that your custom layer produces the same (or very nearly the same) output as the `keras.layers.LayerNormalization` layer.

In [30]:
X = np.array([[1., 2., 3.], [4., 5., 6.]])

In [34]:
m1 = keras.models.Sequential()
m1.add(MyLayerNormalization())
m1.predict(X)



array([[-1.2232468,  0.       ,  1.2232468],
       [-1.2232468,  0.       ,  1.2232468]], dtype=float32)

In [33]:
m2 = keras.models.Sequential()
m2.add(keras.layers.LayerNormalization())
m2.predict(X)

array([[-1.2238274,  0.       ,  1.2238274],
       [-1.2238274,  0.       ,  1.2238274]], dtype=float32)

13. **Train a model using a custom training loop to tackle the Fashion MNIST dataset (see chapter 10).**

a. Display the epoch, iteration, mean training loss, and mean accuracy over each epoch (updated at each iteration), as well as the validation loss and accuracy at the end of each epoch.

b. Try using a different optimizer with different learning rate for the upper layers and the lower layers.
<br>
*TODO*: could never figure this out

In [114]:
from sklearn.model_selection import train_test_split

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test_full = X_test / 255.0
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.1, random_state=42)

In [115]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dense(300, activation='relu'),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

In [116]:
def print_status_bar(iteration, total, metrics=None, validation_metrics=None):
    print('\tBase metrics:')
    print('\n'.join([f'\t{m.name}: {m.result():.4f}' for m in metrics]), end='\n\n')
    print('\tValidation metrics:')
    print('\n'.join([f'\t{m.name}: {m.result():.4f}' for m in validation_metrics]), end='\n\n')

In [117]:
def random_batch(X, y, batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

In [118]:
n_epochs = 2
batch_size = 32
n_steps = len(X_train) // batch_size
n_steps_val = len(X_valid) // batch_size
optimizer1 = keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9)
optimizer2 = keras.optimizers.Nadam(learning_rate=3e-3)
loss_fn = keras.losses.SparseCategoricalCrossentropy()
mean_loss = keras.metrics.Mean()
val_loss = keras.metrics.Mean()
mean_acc = keras.metrics.SparseCategoricalAccuracy()
val_acc = keras.metrics.SparseCategoricalAccuracy()

In [119]:
from tqdm import tqdm

for epoch in range(1, n_epochs+1):
    print(f'Epoch {epoch}/{n_epochs}')

    # Iterate over batches of the dataset
    for step in tqdm(range(1, n_steps + 1)):
        X_batch, y_batch = random_batch(X_train, y_train)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch, training=True)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer1.apply_gradients(zip(gradients, model.trainable_variables))

        # Update mean loss
        mean_loss(loss)
        # Update mean accuracy metric
        mean_acc.update_state(y_batch, y_pred)

    # Run a validation loop at the end of each epoch
    for val_step in range(1, n_steps_val + 1):
        X_batch_val, y_batch_val = random_batch(X_valid, y_valid)
        # Find loss (no gradient tape needed, since no gradients are computed here)
        val_pred = model(X_batch_val, training=False)
        val_main_loss = tf.reduce_mean(loss_fn(y_batch_val, val_pred))
        v_loss = tf.add_n([val_main_loss] + model.losses)
        # Update val metrics
        val_loss(v_loss)
        val_acc.update_state(y_batch_val, val_pred)

    # Display metrics at the end of each epoch
    print_status_bar(step*batch_size, len(y_train), metrics=[mean_loss, mean_acc], validation_metrics=[val_loss, val_acc])
        
    # Reset metrics
    for metric in [mean_loss, mean_acc, val_loss, val_acc]:
        metric.reset_state()

Epoch 1/2


100%|██████████| 1687/1687 [00:19<00:00, 87.29it/s]


	Base metrics:
	mean: 0.7467
	sparse_categorical_accuracy: 0.7593

	Validation metrics:
	mean: 0.5237
	sparse_categorical_accuracy: 0.8173

Epoch 2/2


100%|██████████| 1687/1687 [00:19<00:00, 87.66it/s]


	Base metrics:
	mean: 0.4841
	sparse_categorical_accuracy: 0.8302

	Validation metrics:
	mean: 0.4638
	sparse_categorical_accuracy: 0.8326
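For part (b), one possible approach is to compute the gradients once and then hand each group of variables to its own optimizer. The sketch below is an assumption about how this could be done, not a tested solution for Fashion MNIST: it uses a tiny synthetic dataset and a two-layer model so that it runs quickly, and the names `lower_vars`, `upper_vars`, `lower_optimizer`, and `upper_optimizer` are illustrative.

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic stand-in for the real data (hypothetical: 8 features, 3 classes)
rng = np.random.default_rng(42)
X = rng.normal(size=(64, 8)).astype(np.float32)
y = rng.integers(0, 3, size=64)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Split the trainable variables: all layers but the last vs. the last layer
lower_vars = [v for layer in model.layers[:-1] for v in layer.trainable_variables]
upper_vars = list(model.layers[-1].trainable_variables)
initial_lower = [v.numpy().copy() for v in lower_vars]
initial_upper = [v.numpy().copy() for v in upper_vars]

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
lower_optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9)
upper_optimizer = tf.keras.optimizers.Nadam(learning_rate=3e-3)

for step in range(10):
    with tf.GradientTape() as tape:
        y_pred = model(X, training=True)
        loss = loss_fn(y, y_pred)
    # One backward pass, then route each gradient to its optimizer
    gradients = tape.gradient(loss, lower_vars + upper_vars)
    lower_grads = gradients[:len(lower_vars)]
    upper_grads = gradients[len(lower_vars):]
    lower_optimizer.apply_gradients(zip(lower_grads, lower_vars))
    upper_optimizer.apply_gradients(zip(upper_grads, upper_vars))

print(float(loss))
```

In the Fashion MNIST loop above, the same split would replace the single `apply_gradients()` call, with the two hidden layers assigned to one optimizer and the output layer to the other.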

