# Chapter 12: Custom Models and Training with TensorFlow
This work partially combines text and code from the book [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/). It is intended only as a reference, and it is recommended that you follow along with a purchased copy of the book.

# Using TensorFlow like NumPy
TensorFlow's API revolves around ***tensors***. A tensor is usually <mark>a multidimensional array, but it can also hold a scalar</mark>.

These tensors flow from operation to operation (or *op* for short), hence the name TensorFlow.


## Tensors and Operations
**`tf.constant`** : Create a tensor.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

In [None]:
tf.constant([[1, 2, 3], 
             [4, 5, 6]])

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)>

In [None]:
tf.constant(45)

<tf.Tensor: shape=(), dtype=int32, numpy=45>

In [None]:
tensor = tf.constant([[1, 2, 3], 
                      [4, 5, 6]])
tensor.shape

TensorShape([2, 3])

In [None]:
tensor.dtype

tf.int32

In [None]:
tensor[:, 1]

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([2, 5], dtype=int32)>

In [None]:
tensor[..., 1, tf.newaxis]

<tf.Tensor: shape=(2, 1), dtype=int32, numpy=
array([[2],
       [5]], dtype=int32)>

In [None]:
import numpy as np

In [None]:
np.array([[1, 2, 3], 
          [4, 5, 6]])[..., 1, np.newaxis]

array([[2],
       [5]])

In [None]:
tensor + 10

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[11, 12, 13],
       [14, 15, 16]], dtype=int32)>

In [None]:
tf.square(tensor)

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[ 1,  4,  9],
       [16, 25, 36]], dtype=int32)>

In [None]:
tensor @ tf.transpose(tensor)

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[14, 32],
       [32, 77]], dtype=int32)>

## Tensors and NumPy

In [None]:
a = np.array([2, 4, 5])
tf.constant(a)

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([2, 4, 5])>

In [None]:
tensor.numpy()

array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)

In [None]:
tf.square(tensor)

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[ 1,  4,  9],
       [16, 25, 36]], dtype=int32)>

>🟠 When you create a tensor from a NumPy array, make sure to set `dtype=tf.float32`. NumPy uses 64-bit precision by default, whereas TensorFlow uses 32-bit precision, because it takes less memory, is faster to compute, and is generally more than enough for neural networks.

## Type Conversion
**TensorFlow does not perform any type conversions automatically**: it just raises an exception if you try to execute an operation on tensors with incompatible types.

In [None]:
tf.constant(2.) + tf.constant(40)

InvalidArgumentError: ignored

In [None]:
tf.constant(2.) + tf.constant(40., dtype=tf.float64)

InvalidArgumentError: ignored

Use `tf.cast` when you really need to convert types:

In [None]:
t2 = tf.constant(40., dtype=tf.float64)
tf.constant(2.0) + tf.cast(t2, tf.float32)

<tf.Tensor: shape=(), dtype=float32, numpy=42.0>

## Variables
**`tf.Variable`**

A `tf.Tensor` is immutable, so it cannot hold the parameters of a neural network layer, which must be updated during training. Use a `tf.Variable` instead.

In [None]:
v = tf.Variable([[1., 2., 3.], 
                 [4., 5., 6.]])
v

<tf.Variable 'Variable:0' shape=(2, 3) dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

The **`assign()`** method modifies a variable in place. The `assign_add()` and `assign_sub()` methods increment or decrement the variable by the given value.

In [None]:
v.assign(2*v)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [None]:
v[0, 1].assign(42)

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  6.],
       [ 8., 10., 12.]], dtype=float32)>

In [None]:
v[:, 2].assign([0., 1.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[ 2., 42.,  0.],
       [ 8., 10.,  1.]], dtype=float32)>

**`scatter_nd_update`** to update multiple values at multiple indices.

In [None]:
v.scatter_nd_update(indices=[[0, 0], [1, 2]], updates=[100., 200.])

<tf.Variable 'UnreadVariable' shape=(2, 3) dtype=float32, numpy=
array([[100.,  42.,   0.],
       [  8.,  10., 200.]], dtype=float32)>

>🔵 In practice you will rarely have to create variables manually, since Keras provides an `add_weight()` method that will take care of it for you.

## Other Data Structures
- **Sparse Tensors** (`tf.SparseTensor`)

  Efficiently <mark>represent tensors containing mostly zeros.</mark>

- **Tensor Arrays** (`tf.TensorArray`)

  <mark>List of tensors.</mark> Fixed size by default, but can optionally be made dynamic. All the tensors they contain must have the same shape and data type.

- **Ragged Tensors** (`tf.RaggedTensor`)

  <mark>Represent static lists of lists of tensors,</mark> where every tensor has the same shape and data type.

- **String tensors**

  <mark>Regular tensors of type `tf.string`, representing byte strings</mark>, NOT Unicode strings. The `tf.strings` package contains ops for both byte strings and Unicode strings, and for converting between them. To represent a Unicode string, use a tensor of type `tf.int32` where each item is a Unicode code point.

  Note: a `tf.string` is atomic, i.e. its length does not appear in the tensor's shape, whereas the length of a Unicode (`tf.int32`) tensor does.

- **Sets**

  Are represented as regular tensors (or sparse tensors).

- **Queues**

  <mark>Store tensors across multiple steps.</mark> All queues are available in the `tf.queue` package.
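
The difference between a byte string and a sequence of Unicode code points, described for string tensors in the list above, can be seen in plain Python. This is only an illustration using Python's built-in `str` and `bytes` types (no TensorFlow involved), and the sample word is an arbitrary choice.

```python
text = "caf\u00e9"  # "café", written with an escape so the source stays ASCII

byte_string = text.encode("utf-8")       # roughly what a tf.string element holds
code_points = [ord(ch) for ch in text]   # roughly what an int32 Unicode tensor holds

print(len(byte_string))   # 5  (the accented letter takes two bytes in UTF-8)
print(code_points)        # [99, 97, 102, 233]
```

The byte string is a single opaque item, while the list of code points has one entry per character, which is why only the latter exposes its length as a dimension.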
  

# Customizing Models and Training Algorithms

## Custom Loss function
Let's create a Huber loss. It is already available as `tf.keras.losses.Huber`, but let's pretend it is not.


In [None]:
def huber_fn(y_true, y_pred):
  """
  Huber loss with a threshold of 1:
  squared loss where the absolute error is less than 1,
  linear loss otherwise.
  """
  error = y_true - y_pred
  is_small_error = tf.abs(error) < 1
  squared_loss = tf.square(error) / 2
  linear_loss = tf.abs(error) - 0.5
  return tf.where(is_small_error, squared_loss, linear_loss)

It is also preferable to return a tensor containing one loss per instance, rather than returning the mean loss. This way, Keras can apply class weights or sample weights when requested.
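
To see why per-instance losses matter, here is a minimal pure-Python sketch (no TensorFlow) of the same Huber formula applied to a few errors, followed by a weighted mean. The helper names `huber` and `weighted_mean`, the error values, and the sample weights are all illustrative assumptions, and the weighting shown is one plausible reduction rather than exactly what Keras does internally.

```python
def huber(error, threshold=1.0):
    """Huber loss for a single error value."""
    if abs(error) < threshold:
        return error ** 2 / 2                            # quadratic near zero
    return threshold * abs(error) - threshold ** 2 / 2   # linear for large errors

def weighted_mean(losses, weights):
    """Hypothetical reduction: sample-weighted average of per-instance losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

errors = [0.5, -3.0, 0.0]                   # illustrative y_true - y_pred values
per_instance = [huber(e) for e in errors]   # one loss per instance
print(per_instance)                         # [0.125, 2.5, 0.0]

# Down-weighting the outlier is only possible because the losses were
# not already averaged inside the loss function.
print(weighted_mean(per_instance, [1.0, 0.1, 1.0]))
```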

In [None]:
model.compile(loss=huber_fn, optimizer='nadam')
model.fit(X_train, y_train, [...])

But a question arises: what happens to this custom loss when you save the model?

## Saving and Loading Models That Contain Custom Components
Keras saves only the name of the function, so <mark>when you load a model containing custom objects, you need to map the names to the objects:</mark>

In [None]:
model = keras.models.load_model("my_model_with_a_custom_loss.h5", 
                                custom_objects={"huber_fn": huber_fn})

What if you want a different threshold? One option is a function that creates a configured loss function:

In [None]:
def create_huber(threshold=1.0):
  def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < threshold
    squared_loss = tf.square(error) / 2
    linear_loss = threshold * tf.abs(error) - threshold**2 / 2
    return tf.where(is_small_error, squared_loss, linear_loss)
  return huber_fn

model.compile(loss=create_huber(2.0), optimizer='nadam')

Unfortunately, **when you save the model, the `threshold` will not be saved.**

This means you have to provide the threshold value again when loading the model:

In [None]:
model = keras.models.load_model("my_model_with_a_custom_loss_threshold_2.h5",
                                custom_objects={"huber_fn": create_huber(2.0)})

Note that the key is `"huber_fn"`, the name of the function that was given to Keras, not the name of the `create_huber()` function that created it.

You can solve this by creating a subclass of the `keras.losses.Loss` class, and then implementing its `get_config()` method:

In [None]:
class HuberLoss(keras.losses.Loss):
  def __init__(self, threshold=1.0, **kwargs):
    self.threshold = threshold
    super().__init__(**kwargs)
  def call(self, y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < self.threshold
    squared_loss = tf.square(error) / 2
    linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
    return tf.where(is_small_error, squared_loss, linear_loss)
  def get_config(self):
    base_config = super().get_config()
    return {**base_config, "threshold": self.threshold}

>🟠 The Keras API currently only specifies how to use subclassing to define layers, models, callbacks, and regularizers. If you build other components (such as losses, metrics, initializers, or constraints) using subclassing, **they may not be portable** to other Keras implementations.

The code:
- The constructor accepts `**kwargs` and passes them to the parent constructor, which handles the standard hyperparameters: the `name` of the loss and the `reduction` algorithm used to aggregate the individual instance losses, which defaults to `"sum_over_batch_size"`.
- The `call()` method computes all the instance losses and returns them.
- The `get_config()` method returns a dictionary mapping each hyperparameter name to its value.

You can then use it like this:

In [None]:
model.compile(loss=HuberLoss(2.), optimizer="nadam")

<mark>When you save the model, the threshold will be saved along with it; and when you load the model, you just need to map the class name to the class itself:</mark>

In [None]:
model = keras.models.load_model("my_model_with_a_custom_loss_class.h5", 
                                custom_objects={"HuberLoss": HuberLoss})

When you save the model, Keras calls the loss instance's `get_config()` method and saves the config in the HDF5 file. When you load the model, it calls the `from_config()` class method on the `HuberLoss` class. This method is implemented by the base class (`Loss`) and creates an instance of `HuberLoss` by passing `**config` to the constructor.
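
The save/load round trip described above can be sketched without Keras. In this sketch, `BaseLoss` is a hypothetical stand-in for `keras.losses.Loss` (it is not the real implementation), and a JSON string stands in for the HDF5 file; only the `get_config()` / `from_config()` pattern is the point.

```python
import json

class BaseLoss:
    """Hypothetical stand-in for keras.losses.Loss (not the real class)."""
    def __init__(self, name=None):
        self.name = name

    def get_config(self):
        return {"name": self.name}

    @classmethod
    def from_config(cls, config):
        # Rebuild the object by passing the saved config to the constructor.
        return cls(**config)

class HuberLoss(BaseLoss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)

    def get_config(self):
        return {**super().get_config(), "threshold": self.threshold}

# "Saving": store the class name together with its config.
saved = json.dumps({"class_name": "HuberLoss",
                    "config": HuberLoss(2.0, name="huber").get_config()})

# "Loading": map the saved name back to the class, then call from_config().
custom_objects = {"HuberLoss": HuberLoss}
entry = json.loads(saved)
restored = custom_objects[entry["class_name"]].from_config(entry["config"])
print(restored.threshold)  # 2.0
```

Because the threshold travels inside the config, nothing has to be re-specified at load time beyond the name-to-class mapping.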


## Custom Activation Functions, Initializers, Regularizers, and Constraints
Here are some examples:

1. **Custom Activation function**

  Equivalent to `keras.activations.softplus()` or `tf.nn.softplus`


In [None]:
def my_softplus(z):
  return tf.math.log(tf.exp(z) + 1.0) 

2. **Custom Glorot Initialization**

  Equivalent to `keras.initializers.glorot_normal()` 

In [None]:
def my_glorot_initializer(shape, dtype=tf.float32):
  stddev = tf.sqrt(2. / (shape[0] + shape[1]))
  return tf.random.normal(shape, stddev=stddev, dtype=dtype)

3. **Custom $\ell_1$ regularizer**

  Equivalent to `keras.regularizers.l1(0.01)`

In [None]:
def my_l1_regularizer(weights):
  return tf.reduce_sum(tf.abs(0.01 * weights))

4. **Custom Constraint that ensures weights are all positive**

  Equivalent to `keras.constraints.nonneg()` or `tf.nn.relu()`

In [None]:
def my_positive_weights(weights):
  return tf.where(weights < 0, tf.zeros_like(weights), weights)

Using these functions:

In [None]:
layer = keras.layers.Dense(30, 
                           activation=my_softplus,
                           kernel_initializer=my_glorot_initializer,
                           kernel_regularizer=my_l1_regularizer,
                           kernel_constraint=my_positive_weights)

If a function has hyperparameters that need to be saved along with the model, then you will want to subclass the appropriate class.


In [None]:
class MyL1Regularizer(keras.regularizers.Regularizer):
  def __init__(self, factor):
    self.factor = factor
  def __call__(self, weights):
    return tf.reduce_sum(tf.abs(self.factor * weights))
  def get_config(self):
    return {"factor": self.factor}

<mark>You must implement the `call()` method for losses, layers (including activation functions), and models,</mark> or the <mark>`__call__()` method for regularizers, initializers, and constraints.</mark>

## Custom Metrics
In many cases, defining a custom metric function is exactly the same as defining a custom loss function.

Here, we use the previously defined Huber loss as a metric:

In [None]:
model.compile(loss="mse",
              optimizer="nadam",
              metrics=[create_huber(2.0)])

In [None]:
precision = keras.metrics.Precision()
precision([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1])

<tf.Tensor: shape=(), dtype=float32, numpy=0.8>

In [None]:
precision([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0])

<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

**Streaming Metric** or **Stateful metric**

After the first batch, it returns a precision of 80%; then after the second batch, it returns 50% (<mark>which is the overall precision so far, not the second batch's precision</mark>). A *streaming metric* is gradually updated, batch after batch.

Calling the `result()` method returns the current value of the metric.

The `variables` attribute lets us look at the metric's variables.

The `reset_states()` method resets these variables.

In [None]:
precision.result()

<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

In [None]:
precision.variables

[<tf.Variable 'true_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>,
 <tf.Variable 'false_positives:0' shape=(1,) dtype=float32, numpy=array([4.], dtype=float32)>]

To create such a streaming metric, create a subclass of the `keras.metrics.Metric` class.

In [None]:
class HuberMetric(keras.metrics.Metric):
  def __init__(self, threshold=1.0, **kwargs):
    super().__init__(**kwargs) # handles base args (e.g., dtype)
    self.threshold = threshold
    self.huber_fn = create_huber(threshold)
    self.total = self.add_weight("total", initializer="zeros")
    self.count = self.add_weight("count", initializer="zeros")
  def update_state(self, y_true, y_pred, sample_weight=None):
    metric = self.huber_fn(y_true, y_pred)
    self.total.assign_add(tf.reduce_sum(metric))
    self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))
  def result(self):
    return self.total / self.count
  def get_config(self):
    base_config = super().get_config()
    return {**base_config, "threshold": self.threshold}

When you define a metric using a simple function, Keras automatically calls it for each batch, and it keeps track of the mean during each epoch.

So for a metric like this one, the only added benefit of the class is that the `threshold` is saved in the config.

Some metrics, however, such as precision, cannot simply be averaged over batches; in those cases, there is no other option than to implement a streaming metric.
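
The following plain-Python sketch reproduces the two batches from the `Precision` example above with running counters, and compares the streaming result with a naive mean of per-batch precisions. The helper `counts` and the variable names are illustrative; this is not how Keras implements the metric internally.

```python
def counts(y_true, y_pred):
    """Return (true_positives, false_positives) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    return tp, fp

batches = [
    ([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1]),
    ([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0]),
]

total_tp = total_fp = 0
batch_precisions = []
for y_true, y_pred in batches:
    tp, fp = counts(y_true, y_pred)
    batch_precisions.append(tp / (tp + fp))
    total_tp += tp          # state carried from batch to batch
    total_fp += fp
    print("streaming precision so far:", total_tp / (total_tp + total_fp))

print("mean of batch precisions:", sum(batch_precisions) / len(batch_precisions))
```

The streaming values are 0.8 and then 0.5, whereas averaging the per-batch precisions (0.8 and 0.0) gives 0.4, which is not the precision over all the instances seen so far.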

## Custom Layers
You may want to build an architecture that contains an exotic layer for which TensorFlow does not provide a default implementation, or one with very repetitive blocks of layers that you would like to treat as a single layer.

**If you want to create a custom layer without any weights**, the simplest option is to write a function and wrap it in a `keras.layers.Lambda` layer:

In [None]:
exponential = keras.layers.Lambda(lambda x: tf.exp(x))

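You can use this layer like any other layer. Since the exponential function is also available as an activation function, you could instead write any of the following:
- `activation=tf.exp`
- `activation=keras.activations.exponential`
- `activation="exponential"`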

**If you want to build a custom stateful layer (i.e., a layer with weights)**, create a subclass of `keras.layers.Layer`:

In [None]:
#  A simplified version of the Dense layer

class MyDense(keras.layers.Layer):
  def __init__(self, units, activation, **kwargs):
    super().__init__(**kwargs)
    self.units = units
    self.activation = keras.activations.get(activation)

  def build(self, batch_input_shape):
    self.kernel = self.add_weight(
        name="kernel", 
        shape=[batch_input_shape[-1], self.units],
        initializer="glorot_normal"
    )
    print("batch_input_shape", batch_input_shape)
    self.bias = self.add_weight(
        name="bias",
        shape=[self.units],
        initializer="zeros"
    )
    super().build(batch_input_shape)              # must be at end. Sets: self.built=True

  def call(self, X):
    return self.activation(X @ self.kernel + self.bias)
  
  def compute_output_shape(self, batch_input_shape):
    return tf.TensorShape(batch_input_shape.as_list()[:-1] + [self.units])

  def get_config(self):
    base_config = super().get_config()
    return {**base_config,
            "units": self.units,
            "activation": keras.activations.serialize(self.activation)}

In [None]:
model = keras.models.Sequential([
    keras.layers.Dense(200, activation="relu", input_shape=(20, 20)),
    keras.layers.Dense(100, activation="relu"),
    MyDense(300, activation="relu")
])

batch_input_shape (None, 20, 100)


To create a layer with **Multiple inputs**:
- The argument to the `call()` method should be a tuple containing all the inputs.
- The argument to the `compute_output_shape()` method should be a tuple containing each input's batch shape

To create a layer with **Multiple outputs**:
- The `call()` method should return the list of outputs.
- `compute_output_shape()` should return the list of batch output shapes (one per output).

For example, consider this toy layer:

In [None]:
class MyMultiLayer(keras.layers.Layer):
  def call(self, X):
    X1, X2 = X
    return [X1+X2, X1*X2, X1/X2]
  
  def compute_output_shape(self, batch_input_shape):
    b1, b2 = batch_input_shape
    return [b1, b1, b1] # should probably handle broadcasting

Let's create a layer that adds Gaussian noise during training (for regularization) but does nothing during testing.

In [None]:
class MyGaussianNoise(keras.layers.Layer):
  def __init__(self, stddev, **kwargs):
    super().__init__(**kwargs)
    self.stddev = stddev
  
  def call(self, X, training=None):
    if training:
      noise = tf.random.normal(tf.shape(X), stddev=self.stddev)
      return X + noise
    else:
      return X
    
  def compute_output_shape(self, batch_input_shape):
    return batch_input_shape
  

## Custom Models
We have already seen how to create custom models using the Subclassing API in Chapter 10.

Here, we first create a residual block: a layer that adds its input to its output. The output itself is computed by a stack of Dense layers.

In [None]:
class ResidualBlock(keras.layers.Layer):
  def __init__(self, n_layers, n_neurons, **kwargs):
    super().__init__(**kwargs)
    self.hidden = [keras.layers.Dense(n_neurons, 
                                      activation="elu",
                                      kernel_initializer="he_normal")
                   for _ in range(n_layers)]

  def call(self, inputs):
    Z = inputs
    for layer in self.hidden:
      Z = layer(Z)
    return inputs + Z

Next, let's build the model using the subclassing API, where we want to repeat the operations of the Residual Block.

In [None]:
class ResidualRegressor(keras.Model):
  def __init__(self, output_dim, **kwargs):
    super().__init__(**kwargs)
    self.hidden1 = keras.layers.Dense(30, 
                                      activation="elu",
                                      kernel_initializer="he_normal")
    self.block1 = ResidualBlock(2, 30)
    self.block2 = ResidualBlock(2, 30)
    self.out = keras.layers.Dense(output_dim)

  def call(self, inputs):
    Z = self.hidden1(inputs)
    for _ in range(1 + 3):
      Z = self.block1(Z)
    Z = self.block2(Z)
    return self.out(Z)

In [None]:
model = ResidualRegressor(1)
model.compile(loss="mse", optimizer="rmsprop")

## Losses and Metrics Based on Model Internals
There will be times when you want to define losses based on other parts of your model, such as the weights or activations of its hidden layers.

**To define a custom loss based on model internals, compute it based on any part of the model you want, then pass the result to the `add_loss()` method.**

Let's build a custom model with five hidden layers and an auxiliary output on top of the upper hidden layer. The loss associated with this auxiliary output will be called the ***reconstruction loss***: <mark>it is the mean squared difference between the reconstruction and the inputs.</mark>

By adding this reconstruction loss to the main loss, we encourage the model to preserve as much information as possible through the hidden layers, even information that is not directly useful for the regression task.

In [None]:
class ReconstructingRegressor(keras.Model):
  def __init__(self, output_dim, **kwargs):
    super().__init__(**kwargs)
    self.hidden = [keras.layers.Dense(30, 
                                      activation="relu",
                                      kernel_initializer="lecun_normal")
                   for _ in range(5)]
    self.out = keras.layers.Dense(output_dim)
  
  def build(self, batch_input_shape):
    """
    The extra dense layer needs to be here because
    its number of units must be equal to the number of inputs,
    and this number is unknown before the build() method is called.
    """
    n_inputs = batch_input_shape[-1]
    self.reconstruct = keras.layers.Dense(n_inputs)
    super().build(batch_input_shape)

  def call(self, inputs):
    Z = inputs
    for layer in self.hidden:
      Z = layer(Z)
    reconstruction = self.reconstruct(Z)
    recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
    self.add_loss(0.05 * recon_loss)
    return self.out(Z)

Similarly you can add a custom metric based on model internals by computing it in any way you want, as long as the result is the output of a metric object.

In some cases you may need to customize the training loop itself. Before that, we need to look at how to compute gradients automatically.

## Computing Gradients Using Autodiff
Let's consider a simple toy function:

In [None]:
def f(w1, w2):
  return 3 * w1 ** 2 + 2 * w1 * w2

In [None]:
# One way is to compute an approximation of each
# partial derivative by measuring how much the function's output
# changes when you tweak the corresponding parameter
w1, w2 = 5, 3
eps = 1e-6
(f(w1+eps, w2) - f(w1, w2)) / eps

36.000003007075065

In [None]:
(f(w1, w2+eps) - f(w1, w2)) / eps

10.000000003174137
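
For this toy function the partial derivatives can also be worked out by hand: ∂f/∂w1 = 6·w1 + 2·w2 and ∂f/∂w2 = 2·w1. The sketch below, in plain Python, compares those exact values with the finite-difference approximation; the helper names `analytic_gradient` and `numeric_gradient` are illustrative.

```python
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

def analytic_gradient(w1, w2):
    """Partial derivatives derived by hand: (6*w1 + 2*w2, 2*w1)."""
    return 6 * w1 + 2 * w2, 2 * w1

def numeric_gradient(w1, w2, eps=1e-6):
    """Finite differences: needs one extra call to f per parameter."""
    base = f(w1, w2)
    return (f(w1 + eps, w2) - base) / eps, (f(w1, w2 + eps) - base) / eps

exact = analytic_gradient(5, 3)
approx = numeric_gradient(5, 3)
print(exact)   # (36, 10)
print(approx)  # close to (36, 10), but not exact
```

The approximation is only accurate to a few decimal places and requires at least one call to `f()` per parameter, which becomes impractical for a network with many parameters.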

**Instead use autodiff**

In [None]:
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
  z = f(w1, w2)

gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

The first line defines two variables. Then we create a `tf.GradientTape` context that automatically records every operation that involves a variable, and finally we ask this tape to compute the gradients of the result `z` with regard to both variables `[w1, w2]`.

>🟢 To save memory, only put the strict minimum inside the `tf.GradientTape()` block. Alternatively, pause recording by creating a `with tape.stop_recording()` block inside the `tf.GradientTape()` block.

**By default, the tape only tracks operations involving variables**, so if you try to compute the gradient of `z` with regard to anything other than a variable, the result will be `None`.

In [None]:
c1, c2 = tf.constant(5.), tf.constant(3.)
with tf.GradientTape() as tape:
  z = f(c1, c2)

gradients = tape.gradient(z, [c1, c2]) 
gradients

[None, None]

**But you can force it**

In [None]:
with tf.GradientTape() as tape:
  tape.watch(c1)
  tape.watch(c2)
  z = f(c1, c2)

gradients = tape.gradient(z, [c1, c2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=10.0>]

This can be useful in some cases, for example if we want to implement a regularization loss that penalizes activations that vary a lot when the inputs vary little: the loss will be based on the gradient of the activations with regard to the inputs. Since the inputs are not variables, you would need to tell the tape to watch them.

In some cases **you may want to stop gradients from backpropagating** through some part of your neural network. Use `tf.stop_gradient()`.

The function returns its inputs during the forward pass (like `tf.identity()`), but it does not let gradients through during backpropagation (it acts like a constant).

In [None]:
def f(w1, w2):
  return 3 * w1**2 + tf.stop_gradient(2 * w1 * w2)

with tf.GradientTape() as tape:
  z = f(w1, w2) 

gradients = tape.gradient(z, [w1, w2])
gradients

[<tf.Tensor: shape=(), dtype=float32, numpy=30.0>, None]

Finally, we might occasionally run into some numerical issues when computing gradients. For example, if you compute the gradients of the `my_softplus()` function for large inputs, the result will be NaN:

In [None]:
x = tf.Variable([100.])
with tf.GradientTape() as tape:
  z = my_softplus(x)

tape.gradient(z, [x])

[<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>]

The derivative of the softplus function is just 1 / (1 + 1 / exp(x)), which is numerically stable.

**We can solve the issue by decorating the function with `@tf.custom_gradient` and making it return both its normal output and the function that computes the derivatives.**

> **Note**: The derivative function receives as input the gradients that were backpropagated so far, down to the softplus function, and according to the chain rule we must multiply them by this function's gradients:

In [None]:
@tf.custom_gradient
def my_better_softplus(z):
  exp = tf.exp(z)
  def my_softplus_gradients(gradient):
    return gradient / (1 + 1/exp)
  return tf.math.log(exp + 1), my_softplus_gradients

In [None]:
x = tf.Variable([100.])
with tf.GradientTape() as tape:
  z = my_better_softplus(x)

tape.gradient(z, [x])

[<tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>]
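
Why the rewritten gradient is stable can be checked with plain Python floats. The derivative of log(1 + e^z) is e^z / (e^z + 1), which equals 1 / (1 + 1/e^z). In float32, `exp(100)` overflows to infinity; the sketch below simulates that overflow with `float("inf")`, which is an assumption made so that the example needs no TensorFlow. The function names are illustrative.

```python
import math

def naive_grad(exp_z):
    """Derivative of softplus written as e^z / (e^z + 1)."""
    return exp_z / (exp_z + 1)

def stable_grad(exp_z):
    """The same derivative rewritten as 1 / (1 + 1/e^z)."""
    return 1 / (1 + 1 / exp_z)

# For moderate inputs both forms agree.
e2 = math.exp(2.0)
print(naive_grad(e2), stable_grad(e2))

# Simulate the float32 overflow of exp(100) with infinity.
overflowed = float("inf")
print(naive_grad(overflowed))   # nan  (inf / inf)
print(stable_grad(overflowed))  # 1.0  (1 / (1 + 0))
```

The stable form never combines two infinities, so an overflowing exponential collapses cleanly to a gradient of 1.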

## Custom Training Loops
Sometimes the `fit()` method is not flexible enough.

For example, the Wide & Deep paper discussed in Chapter 10 uses two different optimizers: one for the wide path and the other for the deep path. Since the `fit()` method only uses one optimizer (the one specified when compiling the model), implementing this paper requires writing your own custom loop.

You may also want to write a custom training loop simply to feel more confident that it does precisely what you intend it to do. The price is code that is longer and more error-prone.

>🟢 Unless you really need the extra flexibility and customization, avoid it.

**1. Let's create a simple model. No need to compile it, since we will handle the training loop manually:**

In [None]:
l2_reg = keras.regularizers.l2(0.5)
model = keras.models.Sequential([
    keras.layers.Dense(30,
                       activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=l2_reg),
    keras.layers.Dense(1, kernel_regularizer=l2_reg)
])

**2. Let's create a tiny function that randomly samples a batch of instances from the training set.**

  Let's also define a function that displays the training status.

In [None]:
def random_batch(X, y, batch_size=32):
  idx = np.random.randint(len(X), size=batch_size)
  return X[idx], y[idx]

In [None]:
def print_status_bar(iteration, total, loss, metrics=None):
  metrics = " - ".join(["{}: {:.4f}".format(m.name, m.result())
                        for m in [loss] + (metrics or [])])
  end = " " if iteration < total else "\n"
  print("\r{}/{} - ".format(iteration, total) + metrics, end=end)

**3. Now for the actual work.**

  First we define some hyperparameters and choose the optimizer, the loss function, and the metrics.

In [None]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size  # number of training instances // batch size
optimizer = keras.optimizers.Nadam(learning_rate=0.001)
loss_fn = keras.losses.mean_squared_error
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.MeanAbsoluteError()]

**4. And let's build the custom loop!**

In [None]:
for epoch in range(1, n_epochs + 1):
  print("Epoch {}/{}".format(epoch, n_epochs))
  for step in range(1, n_steps + 1 ):
    X_batch, y_batch = random_batch(X_train_scaled, y_train)
    with tf.GradientTape() as tape:
      y_pred = model(X_batch, training=True)
      main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
      loss = tf.add_n([main_loss] + model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    # if you add weight constraints to your model
    for variable in model.variables:
      if variable.constraint is not None:
        variable.assign(variable.constraint(variable)) 
    
    mean_loss(loss)
    for metric in metrics:
      metric(y_batch, y_pred)
    print_status_bar(step * batch_size, len(y_train), mean_loss, metrics)
  print_status_bar(len(y_train), len(y_train), mean_loss, metrics)
  for metric in [mean_loss] + metrics:
    metric.reset_states()

#### This needs some explanation
- We create two nested loops: one for the epochs, the other for the batches within an epoch.
- We sample a random batch from the training set.
- Inside the `tf.GradientTape()` block, we make a prediction for the batch and compute the main loss with `loss_fn`, averaged over the instances in the batch. We then add the regularization losses (each already reduced to a single scalar) to `main_loss` using `tf.add_n()`.
- We ask the tape to compute the gradient of the loss with regard to each trainable variable, and we pass these gradients to the optimizer to perform a gradient descent step.
- We update the mean loss and the metrics (over the current epoch) and display the status bar.
- At the end of each epoch, we display the status bar again so that it looks complete, and we reset the states of the mean loss and the metrics.

**If you want to apply any other transformation to the gradients, simply do so before calling the `apply_gradients()` method.**

**If you add weight constraints to your model (e.g., by setting `kernel_constraint` or `bias_constraint` when creating a layer), you should update the training loop to apply these constraints, as shown in the small block just after `apply_gradients()`.**
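
As one example of transforming gradients before they are applied, the sketch below clips them by norm. It is a toy setup rather than the training loop above: the variable `w`, the loss, and the learning rate are made up for illustration, and a manual `assign_sub()` step stands in for `optimizer.apply_gradients()`.

```python
import tensorflow as tf

w = tf.Variable([3.0, -4.0])   # illustrative parameters
learning_rate = 0.1            # illustrative value

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.square(w))   # toy loss: w1^2 + w2^2

gradients = tape.gradient(loss, [w])     # gradient is 2*w = [6, -8], norm 10

# Transform the gradients *before* they are applied: clip each to norm <= 1.
clipped = [tf.clip_by_norm(g, 1.0) for g in gradients]

# Stand-in for optimizer.apply_gradients(zip(clipped, [w])): a plain SGD step.
for g, v in zip(clipped, [w]):
    v.assign_sub(learning_rate * g)

print(w.numpy())
```

In a real loop the clipped list would simply replace `gradients` in the `zip(...)` passed to the optimizer.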

# TensorFlow Functions and Graphs
In TensorFlow 2, graphs are still there, but they are not as central and they are much simpler to use.

Let's start with a function that computes the cube of its input:

In [None]:
def cube(x):
  return x**3

In [None]:
cube(2)

8

In [None]:
cube(tf.constant(2.0))

<tf.Tensor: shape=(), dtype=float32, numpy=8.0>

Now, let's use `tf.function()` to convert this function to a ***TensorFlow Function***: <mark>it just returns a tensor instead of any Python data type.</mark>

In [None]:
tf_cube = tf.function(cube)
tf_cube

<tensorflow.python.eager.def_function.Function at 0x7fa28ee4ab90>

In [None]:
tf_cube(2)

<tf.Tensor: shape=(), dtype=int32, numpy=8>

In [None]:
tf_cube(tf.constant(2.0))

<tf.Tensor: shape=(), dtype=float32, numpy=8.0>

Under the hood, `tf.function()` analyzed the computations performed by the `cube()` function and generated an equivalent computation graph.

We can also use `tf.function` as a decorator, which is a simpler solution:


In [None]:
@tf.function
def tf_cube(x):
  return x**3

The original Python function is still available via the TF Function's `python_function` attribute, in case you ever need it:

In [None]:
tf_cube.python_function(2)

8

TensorFlow optimizes the computation graph, pruning unused nodes, simplifying expressions, and more. Once the optimized graph is ready, the TF Function efficiently executes the operations in the graph, in the appropriate order (and in parallel when it can). As a result, a TF Function usually runs much faster than the original Python function, especially if it performs complex computations.

>🟢 When you write a custom loss function, a custom metric, a custom layer, or any other custom function and use it in a Keras model, Keras automatically converts it into a TF Function. If you want, you can disable this by setting `dynamic=True` when creating a custom layer or a custom model. Alternatively, you can set `run_eagerly=True` when calling the model's `compile()` method.

## AutoGraph and Tracing
How does TensorFlow generate graphs?

1. **AutoGraph**: TensorFlow analyzes the Python function's source code to capture all the control flow statements (`for` loops, `while` loops, `if` statements, and so on).
2. AutoGraph outputs an upgraded version of that function in which all the control flow statements are replaced by appropriate TensorFlow operations.
3. **Tracing**: TensorFlow calls this "upgraded" function, but instead of passing the actual argument, it passes a ***symbolic tensor***: a tensor without any actual value, only a name, a data type, and a shape.

  The function runs in ***graph mode***, <mark>meaning that each TensorFlow operation adds a node to the graph to represent itself and its output tensor(s)</mark> (as opposed to the regular mode, called *eager execution* or *eager mode*).

>🟢  To view the generated function's source code, you can call `tf.autograph.to_code()` on the original Python function, for example `tf.autograph.to_code(tf_cube.python_function)`.





## TF Function Rules
There are a few rules to respect.
- If you call any external library, including NumPy or even the standard library, this call will run only during tracing; it will not be part of the graph.
  - If you define a TF Function `f(x)` that just returns `np.random.rand()`, a random number will only be generated when the function is traced, so `f(tf.constant(2.0))` and `f(tf.constant(3.))` will return the same number (<mark>as a TF Function generates a new graph for every unique set of input shapes and data types and caches it for subsequent calls.</mark>)

  - If your non-TensorFlow code has side effects (such as logging something or updating a Python counter), then you should not expect those side effects to occur every time you call the TF Function, as they only occur when the function is traced.
  
  - You can wrap arbitrary Python code in a `tf.py_function()` operation, but doing so will hinder performance and reduce portability.

- You can call other Python functions (which need not be decorated with `@tf.function`) or TF Functions, but they should follow the same rules, as TensorFlow will capture their operations in the computation graph.

- If the function creates a TensorFlow variable (or any other stateful TensorFlow object, such as a dataset or a queue), it must do so upon the very first call, and only then, or else you will get an exception.

- The source code of your Python function should be available to TensorFlow. If it is not (for instance, if you define your function in the Python shell, or deploy only the compiled *.pyc files to production), then the graph generation process will fail or have limited functionality.

- TensorFlow will only capture `for` loops that iterate over a tensor or a dataset. So make sure you use `for i in tf.range(x)` rather than `for i in range(x)`, or else the loop will not be captured in the graph.

- As always, for performance reasons, you should prefer a vectorized implementation whenever you can.
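
The rule about Python side effects running only during tracing can be observed directly. In the sketch below, the function `double` and the list `trace_log` are illustrative names; the list is an ordinary Python object, so appending to it is a side effect that lives outside the graph.

```python
import tensorflow as tf

trace_log = []  # plain Python list: appending to it is not a graph operation

@tf.function
def double(x):
    trace_log.append(x.dtype.name)   # runs only while the function is traced
    return x * 2

double(tf.constant(2.0))   # first call: a graph is traced for float32 scalars
double(tf.constant(3.0))   # same dtype and shape: the cached graph is reused
print(len(trace_log))      # 1

double(tf.constant(2))     # int32 input: a new graph is traced
print(trace_log)           # ['float32', 'int32']
```

Three calls produce only two log entries, one per traced graph, which is why logging or counters inside a TF Function should not be relied on to fire at every call.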