<a href="https://colab.research.google.com/github/antahiap/debugging-dl-models/blob/master/notebooks/2_most_common_bugs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Please, make a copy of the notebook.
import gdown
import numpy as np
import matplotlib.pyplot as plt
import os
import pandas as pd
import seaborn as sns
import tensorflow as tf

from sklearn.datasets import make_regression

%matplotlib inline
%load_ext tensorboard

##### Load tensorboard logs and create some helper functions.

In [2]:
url = 'https://drive.google.com/uc?id=1baY_ylqS9_kk-LIuZIU8xMLXRW6_saTh'

output = 'tb_logs.zip'
gdown.download(url, output, quiet=False)

# unzip
!unzip tb_logs.zip
!rm tb_logs.zip

Downloading...
From: https://drive.google.com/uc?id=1baY_ylqS9_kk-LIuZIU8xMLXRW6_saTh
To: /content/tb_logs.zip
100%|██████████| 334k/334k [00:00<00:00, 81.2MB/s]

Archive:  tb_logs.zip
   creating: tb_logs/
   creating: tb_logs/loss_bug/
   creating: tb_logs/loss_bug/logits_true/
  inflating: tb_logs/loss_bug/logits_true/events.out.tfevents.1692612968.556d0c6dace4.225.1.v2  
   creating: tb_logs/loss_bug/logits_false/
  inflating: tb_logs/loss_bug/logits_false/events.out.tfevents.1692612764.556d0c6dace4.225.0.v2  
   creating: tb_logs/scaling/
   creating: tb_logs/scaling/regression_non_standard/
  inflating: tb_logs/scaling/regression_non_standard/events.out.tfevents.1692626166.a78713ec1a12.215.7.v2  
   creating: tb_logs/scaling/regression_standard/
  inflating: tb_logs/scaling/regression_standard/events.out.tfevents.1692626221.a78713ec1a12.215.8.v2  
   creating: tb_logs/exploding_grads/
   creating: tb_logs/exploding_grads/clipped/
  inflating: tb_logs/exploding_grads/clipped/events.out.tfevents.1692626302.a78713ec1a12.215.10.v2  
   creating: tb_logs/exploding_grads/no_clipping/
  inflating: tb_logs/exploding_grads/no_clipping/events.out.tf




In [3]:
def make_writer(filepath, dir_name):
    """ Creates a directory to save tensorboard events """
    path = os.path.join(filepath, dir_name)
    os.makedirs(path, exist_ok=True)
    print(f'Creating a tensorboard directory: {path}')
    writer = tf.summary.create_file_writer(path)
    return writer

def standardize(array):
    """ Standardizes each column to zero mean and unit standard deviation """
    means = np.mean(array, axis=0)
    std = np.std(array, axis=0)
    return (array - means) / std

# Most common bugs I

## Resources

- [Chapter 4 of Deep learning book. Numerical computation](https://www.deeplearningbook.org/contents/numerical.html)
- [Gradient norm clipping](http://proceedings.mlr.press/v28/pascanu13.html)
- [The log-sum-exp trick](https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/)

## Incorrect tensor shapes

### Most common reasons:

- Flipped dimensions when using tf.reshape.
- Sum, avg, softmax over wrong dimension.
- Forgot to flatten after conv layers.
- Forgot to get rid of extra "1" dimensions, e.g. if shape is (None, 1, 1, 4).

- Accidental broadcasting: in TF2, as in NumPy and other libraries, tensors with mismatched shapes can be broadcast silently, so the code does not fail and simply outputs wrong results.

In [4]:
y_true = np.array([0.1, 0.7, 0.02, 0.08, 0.05, 0.05])
y_true_extra_dim = np.expand_dims(y_true, -1)
y_pred = np.array([0.1, 0.6, 0.05, 0.05, 0.1, 0.1])

In [5]:
print(f'y_true: {y_true} \n')
print(f'Shape of y_true: {y_true.shape} \n')
print(f'y_true_extra_dim: {y_true_extra_dim} \n')
print(f'Shape of y_true_extra_dim: {y_true_extra_dim.shape} \n')

y_true: [0.1  0.7  0.02 0.08 0.05 0.05] 

Shape of y_true: (6,) 

y_true_extra_dim: [[0.1 ]
 [0.7 ]
 [0.02]
 [0.08]
 [0.05]
 [0.05]] 

Shape of y_true_extra_dim: (6, 1) 



In [6]:
y_pred

array([0.1 , 0.6 , 0.05, 0.05, 0.1 , 0.1 ])

In [7]:
y_pred.shape

(6,)

Say we want to divide y_true by y_pred element-wise. What shape do we expect the result to have?

In [8]:
y_true / y_pred

array([1.        , 1.16666667, 0.4       , 1.6       , 0.5       ,
       0.5       ])

In [9]:
y_true_extra_dim / y_pred

array([[ 1.        ,  0.16666667,  2.        ,  2.        ,  1.        ,
         1.        ],
       [ 7.        ,  1.16666667, 14.        , 14.        ,  7.        ,
         7.        ],
       [ 0.2       ,  0.03333333,  0.4       ,  0.4       ,  0.2       ,
         0.2       ],
       [ 0.8       ,  0.13333333,  1.6       ,  1.6       ,  0.8       ,
         0.8       ],
       [ 0.5       ,  0.08333333,  1.        ,  1.        ,  0.5       ,
         0.5       ],
       [ 0.5       ,  0.08333333,  1.        ,  1.        ,  0.5       ,
         0.5       ]])

#### KL-divergence

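KL divergence is used as a loss term in some models, such as VAEs or Bayesian models. Here the extra dimension in y_true does not raise an error; it silently changes the value of the loss.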

In [10]:
kl = tf.keras.losses.KLDivergence()

print(f'KLD for y_true: {kl(y_true, y_pred).numpy()} \n')
print(f'KLD for y_true_extra_dim: {kl(y_true_extra_dim, y_pred).numpy()}')

KLD for y_true: 0.05786523352526206 

KLD for y_true_extra_dim: 1.1752046259108784
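
The gap between the two values can be reproduced by hand. The NumPy sketch below assumes that the loss sums `y_true * log(y_true / y_pred)` over the last axis and then averages whatever is left (Keras additionally clips the inputs to a small epsilon, which is ignored here because it does not affect these values). `kl_numpy` and `assert_same_shape` are hypothetical helper names, not part of any library.

```python
import numpy as np

y_true = np.array([0.1, 0.7, 0.02, 0.08, 0.05, 0.05])
y_pred = np.array([0.1, 0.6, 0.05, 0.05, 0.1, 0.1])
y_true_extra_dim = np.expand_dims(y_true, -1)  # shape (6, 1)


def kl_numpy(p, q):
    """Sum p * log(p / q) over the last axis, then average what is left."""
    return np.mean(np.sum(p * np.log(p / q), axis=-1))


def assert_same_shape(a, b):
    """Hypothetical guard: refuse inputs that would be broadcast."""
    if a.shape != b.shape:
        raise ValueError(f'Shape mismatch: {a.shape} vs {b.shape}')


kl_ok = kl_numpy(y_true, y_pred)             # element-wise, shape (6,)
kl_bad = kl_numpy(y_true_extra_dim, y_pred)  # broadcast to shape (6, 6)
kl_fixed = kl_numpy(np.squeeze(y_true_extra_dim, axis=-1), y_pred)

print(kl_ok, kl_bad, kl_fixed)
```

In the broadcast case every entry of y_true is compared against every entry of y_pred, which is why the result is so much larger. Removing the extra dimension with `np.squeeze`, or calling a guard such as `assert_same_shape` before the loss, avoids the problem.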


## Pre-processing inputs incorrectly

- Forgot to standardize/scale.
    -  It makes the resulting model dependent on the choice of units used in the input.
- Too much augmentation.
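
To illustrate the point about units, here is a minimal sketch on synthetic data. `weight_kg` and `weight_g` are made-up features that describe the same quantity in different units, and `standardize` mirrors the helper defined at the top of the notebook.

```python
import numpy as np


def standardize(array):
    """Scale each column to zero mean and unit standard deviation."""
    return (array - array.mean(axis=0)) / array.std(axis=0)


rng = np.random.default_rng(0)
weight_kg = rng.normal(loc=1500.0, scale=300.0, size=(100, 1))
weight_g = weight_kg * 1000.0  # same quantity, different unit

# The raw features differ by a factor of 1000 ...
print(weight_kg.std(), weight_g.std())
# ... but after standardization they are numerically the same.
print(np.allclose(standardize(weight_kg), standardize(weight_g)))
```

A model trained on the standardized feature therefore sees the same input regardless of the unit that was chosen.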


### Regression example with Auto MPG data

#### Load the data and create a pandas DataFrame

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(url, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)
dataset = dataset.dropna()
dataset.drop('Origin', axis=1, inplace=True)

#### Create labels and train set

In [None]:
labels = dataset.pop('MPG')
labels = np.array(labels).astype('float32')

Let's make the difference in scales for some features even more pronounced and print the statistics.

In [None]:
dataset['Horsepower'] = dataset['Horsepower'] * 1000
dataset['Displacement'] = dataset['Displacement'] / 1000
train_set = np.array(dataset).astype('float32')
train_set_stand = standardize(train_set)

In [None]:
print('Train_set std:')
{col: std for col, std in zip(dataset.columns, train_set.std(axis=0))}

In [None]:
print('Train_set_stand std:')
{col: std for col, std in zip(dataset.columns, train_set_stand.std(axis=0))}

### Model

In [None]:
class RegressorNet(tf.keras.Model):

    def __init__(self, input_shape, optimizer):
        super(RegressorNet, self).__init__()

        self.optimizer = optimizer
        self.regressor = tf.keras.Sequential([
            tf.keras.layers.Input(input_shape),
            tf.keras.layers.Dense(64, activation='relu', name='dense_1'),
            tf.keras.layers.Dense(64, activation='relu', name='dense_2'),
            tf.keras.layers.Dense(1, activation='linear', name='dense_out')
        ])

    def summary(self):
        self.regressor.summary()

    def call(self, X):
        return self.regressor(X)

    def get_loss(self, X, y_true):
        y_pred = self(X)
        l2_loss = tf.keras.losses.mean_squared_error(y_true, y_pred)
        return l2_loss

    def grad_step(self, X, y_true):
        with tf.GradientTape() as tape:
            loss = self.get_loss(X, y_true)
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return loss, gradients

optimizer_not_scaled = tf.keras.optimizers.Adam()
optimizer_scaled = tf.keras.optimizers.Adam()

net_not_scaled = RegressorNet(input_shape=train_set.shape[1], optimizer=optimizer_not_scaled)
net_scaled = RegressorNet(input_shape=train_set.shape[1], optimizer=optimizer_scaled)

net_scaled.summary()

### Train and output to Tensorboard

In [None]:
def train(model, epochs, X, y, save_dir):

    writer = make_writer(os.path.join('tb_logs'), save_dir)

    for epoch in range(0, epochs + 1):

        if epoch % 100 == 0:
            print('Epoch {} is running...'.format(epoch))

        # Gradient update step
        loss, gradients = model.grad_step(X, y.reshape(-1, 1))
        loss = tf.math.reduce_mean(loss, axis=0)

        if epoch % 100 == 0:
            print(f'{loss}')

        # Tensorboard
        with writer.as_default():
            tf.summary.scalar('Train loss', loss, step=epoch)

            for layer_number, layer in enumerate(model.trainable_variables):
                tf.summary.histogram(layer.name, gradients[layer_number], step=epoch, buckets=1)


In [None]:
#train(net_not_scaled, 1000, train_set, labels, 'scaling/regression_non_standard')
#train(net_scaled, 1000, train_set_stand, labels, 'scaling/regression_standard')

In [None]:
%tensorboard --logdir tb_logs/scaling --port 6006

## Incorrect input to the loss / incorrect loss

- Softmaxed outputs to a loss that expects logits or vice-versa.
- One-hot encoded labels to a sparse categorical cross-entropy loss.
- ReLU in the last layer for regression problems.
- Wrong loss for the task, e.g. MSE loss when a categorical loss is expected.
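
The first item in the list can be sketched in plain NumPy. The example assumes a single 3-class example whose logits are confident and correct; `softmax` and `cross_entropy_from_logits` are illustrative helpers written for this sketch, not library functions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


def cross_entropy_from_logits(logits, label):
    """Cross-entropy of a single example, given raw logits."""
    return -np.log(softmax(logits)[label])


logits = np.array([10.0, 0.0, 0.0])  # confident and correct for class 0
label = 0

loss_ok = cross_entropy_from_logits(logits, label)
# Bug: probabilities are passed where logits are expected,
# so softmax is effectively applied twice.
loss_bug = cross_entropy_from_logits(softmax(logits), label)

print(f'correct: {loss_ok:.6f}, softmax applied twice: {loss_bug:.6f}')
```

Because the inputs to the second softmax are confined to [0, 1], the buggy loss for three classes cannot fall below log(1 + 2/e), roughly 0.55, no matter how good the predictions are. Training still runs, which is what makes this bug easy to miss.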

### Log-sum-exp trick:

Softmax probabilities:

$$ p_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} $$

This can be rewritten as:

$$ p_i = \exp \left( x_i - \log\sum_{j=1}^{n}\exp(x_j) \right) $$

We can also rewrite the log-sum-exp (LSE) term by factoring out an arbitrary constant $c$:

$$ y = \log\sum_{j=1}^{n}\exp(x_j) $$
$$ \exp(y) = \sum_{j=1}^{n}\exp(x_j) $$
$$ \exp(y) = \exp(c)\sum_{j=1}^{n}\exp(x_j - c) $$
$$ y = c + \log\sum_{j=1}^{n}\exp(x_j - c) $$

**So a common trick is to choose $c = \max_j x_j$, i.e. to subtract the maximum logit value from all logits before performing the softmax operation. The largest exponent is then $\exp(0) = 1$, so nothing overflows.**
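
The derivation translates almost line by line into NumPy. This is a minimal sketch for 1-D inputs; `logsumexp_stable` and `softmax_stable` are illustrative names, and in practice a library routine such as `scipy.special.logsumexp` would normally be used instead.

```python
import numpy as np


def logsumexp_stable(x):
    """log(sum(exp(x))) computed as c + log(sum(exp(x - c))) with c = max(x)."""
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))


def softmax_stable(x):
    """Softmax as exp(x - LSE(x)), following the identity above."""
    return np.exp(x - logsumexp_stable(x))


x = np.array([1000.0, 1000.0, 1000.0])
print(logsumexp_stable(x))  # 1000 + log(3)
print(softmax_stable(x))    # three equal probabilities
```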

In [None]:
from scipy.special import logsumexp

In [None]:
x = np.array([1000, 1000, 1000])
np.exp(x)

In [None]:
logsumexp(x)

In [None]:
np.exp(x - logsumexp(x))

### MNIST example

In [None]:
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train / 255.0
# Add a channels dim
x_train = tf.expand_dims(x_train.astype('float32'), axis=-1)

train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(1000).batch(32)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [None]:
class MnistNet(tf.keras.Model):
    def __init__(self, input_shape, loss, optimizer):
        super(MnistNet, self).__init__()
        self.mnist_net = tf.keras.Sequential([
            tf.keras.layers.Input(input_shape),
            tf.keras.layers.Flatten(name='flatten'),
            tf.keras.layers.Dense(128, activation='relu', name='dense_1'),
            tf.keras.layers.Dense(10, name='dense_out', activation=None)
        ])
        self.optimizer = optimizer
        self.loss = loss

    def summary(self):
        self.mnist_net.summary()

    def call(self, images):
        return self.mnist_net(images)

    def get_pred_loss(self, images, labels):
        y_pred = self(images)
        return y_pred, self.loss(labels, y_pred)

    def grad_step(self, images, labels):
        with tf.GradientTape() as tape:
            y_pred, loss = self.get_pred_loss(images, labels)
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return y_pred, loss, gradients

# Create models
loss_logits_false = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
loss_logits_true = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer_0 = tf.keras.optimizers.Adam()
optimizer_1 = tf.keras.optimizers.Adam()

model_logits_false = MnistNet(x_train.shape[1:], loss_logits_false, optimizer_0)
model_logits_true = MnistNet(x_train.shape[1:], loss_logits_true, optimizer_1)

In [None]:
def train(model, epochs, dataset, save_dir):

    writer = make_writer(os.path.join('tb_logs'), save_dir)
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
    train_loss = tf.keras.metrics.Mean()


    for epoch in range(0, epochs + 1):

        train_accuracy.reset_states()
        train_loss.reset_states()

        for images, labels in dataset:
            # Gradient update step
            y_pred, loss, gradients = model.grad_step(images, labels)
            train_loss(loss)
            train_accuracy(labels, y_pred)

        # Tensorboard: log the epoch-averaged loss rather than the last batch loss
        with writer.as_default():
            tf.summary.scalar('Train loss', train_loss.result(), step=epoch)
            tf.summary.scalar('Train Accuracy', train_accuracy.result() * 100, step=epoch)

            # Gradient histograms come from the last batch of the epoch
            for layer_number, layer in enumerate(model.trainable_variables):
                tf.summary.histogram('/'.join(layer.name.split('/')[1:]), gradients[layer_number], step=epoch, buckets=1)


        message = (f'Epoch: {epoch}, Loss: {train_loss.result()}, Accuracy: {train_accuracy.result() * 100}')
        print(message)

In [None]:
#train(model_logits_false, 5, train_ds, 'loss_bug/logits_false')
#train(model_logits_true, 5, train_ds, 'loss_bug/logits_true')

In [None]:
%tensorboard --logdir tb_logs/loss_bug --port 6007

## Numerical instabilities

- Vanishing and exploding gradients.
- Softmax over a very large value.
- Operations including divisions by values close to zero.
- Big policy updates in RL.
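
The item about divisions by values close to zero can be illustrated with a small NumPy sketch. `normalize` and the constant `EPS` are assumptions made for this example; a suitable epsilon depends on the dtype and on the scale of the data.

```python
import numpy as np

EPS = 1e-7  # assumed safety margin; a suitable value depends on the dtype


def normalize(v, eps=0.0):
    """Divide a vector by its L2 norm, optionally padded by eps."""
    return v / (np.linalg.norm(v) + eps)


zero_grad = np.zeros(3)

# Unsafe: 0 / 0 produces nan, which then propagates through every later op.
with np.errstate(divide='ignore', invalid='ignore'):
    unsafe = normalize(zero_grad)

# Safer: the small constant keeps the denominator away from zero.
safe = normalize(zero_grad, eps=EPS)

print(unsafe, safe)
```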

#### Exploding gradients and gradient clipping

In [None]:
M = tf.random.normal((4, 4))
print(f'A single matrix \n \n {M.numpy()}')
for i in range(100):
    M = tf.matmul(M, tf.random.normal((4, 4)))

print(f'\nAfter multiplying 100 matrices \n \n {M.numpy()}')

#### Gradient clipping

- Clip a gradient by norm: if $||\textbf{g}|| > \theta$, then
$\textbf{g} \gets \frac{\theta}{||\textbf{g}||}\textbf{g} $
    - For example: $$\textbf{g}= [-2, 3, 6]$$ $$\theta = 5$$ $$||\textbf{g}|| = 7$$ $$\textbf{g} \gets [-2, 3, 6]\cdot \frac{5}{7}$$
    
- Clip gradient by value (not commonly used because, depending on the choice of $\theta$, the direction of the gradient might change):
    - If $g_i < \theta_1$, then $g_i \gets \theta_1$, and if $g_i > \theta_2$, then $g_i \gets \theta_2$
    - For example: $$\textbf{g}= [-2, 3, 10]$$ $$\theta_1 = 0, \theta_2 = 5$$  $$ \textbf{g} \gets [0, 3, 5]$$

    
- Clip gradient by global norm:
    - Rescales a list of tensors so that the total norm of the vector of all their norms does not exceed a threshold.
    - For example: $$\textbf{g}_1 = [-2, 3, 6]$$ $$\textbf{g}_2= [-4, 6, 12]$$ $$\theta = 14$$ $$||\textbf{g}_1|| = 7$$ $$||\textbf{g}_2|| = 14$$ $$\textbf{g}_1 \gets [-2, 3, 6]\cdot \frac{14}{\sqrt{7^2 + 14^2}}$$ $$\textbf{g}_2 \gets [-4, 6, 12]\cdot \frac{14}{\sqrt{7^2 + 14^2}} $$
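
The three worked examples above can be checked with a few lines of NumPy. The functions below are illustrative re-implementations written for this sketch; in TensorFlow the corresponding built-ins are `tf.clip_by_norm`, `tf.clip_by_value` and `tf.clip_by_global_norm`.

```python
import numpy as np


def clip_by_norm(g, theta):
    """Rescale g so that its L2 norm is at most theta."""
    norm = np.linalg.norm(g)
    return g * theta / norm if norm > theta else g


def clip_by_value(g, theta_1, theta_2):
    """Clamp every component of g to [theta_1, theta_2]."""
    return np.clip(g, theta_1, theta_2)


def clip_by_global_norm(grads, theta):
    """Rescale all tensors by one factor based on their joint L2 norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = theta / global_norm if global_norm > theta else 1.0
    return [g * scale for g in grads], global_norm


print(clip_by_norm(np.array([-2.0, 3.0, 6.0]), 5.0))
print(clip_by_value(np.array([-2.0, 3.0, 10.0]), 0.0, 5.0))
clipped, global_norm = clip_by_global_norm(
    [np.array([-2.0, 3.0, 6.0]), np.array([-4.0, 6.0, 12.0])], 14.0)
print(clipped, global_norm)
```

Clipping by norm and by global norm only shrink the gradient and keep its direction, whereas clipping by value changes individual components and can therefore point the update in a different direction.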
    

In [11]:
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX = X[:n_train, :].astype('float32')
trainY = y[:n_train].astype('float32').reshape(-1, 1)

# Create a tf.data.Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((trainX, trainY)).shuffle(trainX.shape[0]).batch(32)

In [12]:
class RegressorNet(tf.keras.Model):
    def __init__(self, input_shape, optimizer):
        super(RegressorNet, self).__init__()
        self.optimizer = optimizer
        self.regressor = tf.keras.Sequential([
            tf.keras.layers.Input(input_shape),
            tf.keras.layers.Dense(25, activation='relu', kernel_initializer='he_uniform', name='dense_1'),
            tf.keras.layers.Dense(1, activation='linear', name='out')
        ])

    def summary(self):
        self.regressor.summary()

    def call(self, X):
        return self.regressor(X)

    def get_loss(self, X, y_true):
        y_pred = self(X)
        l2_loss = tf.keras.losses.mean_squared_error(y_true, y_pred)
        return l2_loss

    def grad_step(self, X, y_true):
        with tf.GradientTape() as tape:
            loss = self.get_loss(X, y_true)
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return loss, gradients

    def grad_step_clipped(self, X, y_true):
        with tf.GradientTape() as tape:
            loss = self.get_loss(X, y_true)
        gradients = tape.gradient(loss, self.trainable_variables)
        gradients, _ = tf.clip_by_global_norm(gradients, 1.0)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return loss, gradients

# Specify an optimizer and instances of the model
optimizer_not_clipped = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
optimizer_clipped = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

net_not_clipped = RegressorNet(input_shape=trainX.shape[1], optimizer=optimizer_not_clipped)
net_clipped = RegressorNet(input_shape=trainX.shape[1], optimizer=optimizer_clipped)

# Show summary
net_clipped.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_1 (Dense)             (None, 25)                525       
                                                                 
 out (Dense)                 (None, 1)                 26        
                                                                 
Total params: 551
Trainable params: 551
Non-trainable params: 0
_________________________________________________________________


In [13]:
def train(model, epochs, train_dataset, save_dir, clip=False):

    writer = make_writer(os.path.join('tb_logs'), save_dir)

    for epoch in range(0, epochs + 1):


        if epoch % 10 == 0:
            print('Epoch {} is running...'.format(epoch))

        for X, y in train_dataset:
            # Gradient update step
            if clip:
                loss, gradients = model.grad_step_clipped(X, y)
            else:
                loss, gradients = model.grad_step(X, y)
            loss = tf.math.reduce_mean(loss, axis=0)

        if epoch % 10 == 0:
            print(f'Train loss: {loss}')

        # Tensorboard
        with writer.as_default():
            tf.summary.scalar('Train loss', loss, step=epoch)

            for layer_number, layer in enumerate(model.trainable_variables):
                tf.summary.histogram(layer.name, gradients[layer_number], step=epoch, buckets=1)


In [15]:
train(net_not_clipped, 100, train_dataset, 'exploding_grads/no_clipping', clip=False)
#train(net_clipped, 100, train_dataset, 'exploding_grads/clipped', clip=True)

Creating a tensorboard directory: tb_logs/exploding_grads/no_clipping
Epoch 0 is running...
Train loss: nan
Epoch 10 is running...
Train loss: nan
Epoch 20 is running...
Train loss: nan
Epoch 30 is running...
Train loss: nan
Epoch 40 is running...
Train loss: nan
Epoch 50 is running...
Train loss: nan
Epoch 60 is running...
Train loss: nan
Epoch 70 is running...
Train loss: nan
Epoch 80 is running...
Train loss: nan
Epoch 90 is running...
Train loss: nan
Epoch 100 is running...
Train loss: nan


In [17]:
# PID from the original session (likely an earlier TensorBoard instance); replace it with your own
!kill 8123

In [18]:
%tensorboard --logdir tb_logs/exploding_grads --port 6008


## OOM errors

### Common issues and causes

- Too big a tensor:
    - Too large a batch size for your model
    - Too many fully connected layers
- Too much data:
    - Loading a dataset that is too large into memory instead of streaming it, e.g. with a tf.data input pipeline
    - Allocating too large a buffer for dataset creation
- Duplicating operations:
    - Memory leak due to creating multiple models at the same time
    - Repeatedly creating an operation (e.g. in a function that gets called many times)
- Other processes:
    - Other processes taking GPU memory
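
A back-of-the-envelope estimate often shows in advance whether a tensor will fit in memory. The sketch below uses only the standard library; the shapes are hypothetical, and 4 bytes per element assumes float32.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for one dense tensor (4 bytes per element = float32)."""
    return math.prod(shape) * bytes_per_element


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / 2 ** 20


# Hypothetical example: a batch of images and one fully connected layer.
batch = tensor_bytes((32, 224, 224, 3))
dense_kernel = tensor_bytes((224 * 224 * 3, 4096))

print(f'Input batch:  {to_mib(batch):.1f} MiB')
print(f'Dense kernel: {to_mib(dense_kernel):.1f} MiB')
```

The estimate covers the raw values only. During training, gradients and optimizer state (Adam, for instance, keeps two extra values per weight) add further copies of every trainable tensor, so the real footprint is a multiple of these numbers.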