**YOUR NAMES HERE**

Spring 2025

CS 444: Deep Learning

Project 1: Deep Neural Networks 

#### Week 2: Training deeper networks with blocks

The focus this week is on using block design to organize deeper neural networks.

In [7]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=3)

# Automatically reload your external source code
%load_ext autoreload
%autoreload 2

## Task 5: Building deeper neural networks with blocks

In the quest to classify CIFAR-10 images with the highest accuracy possible, we would like to build a neural network that is deeper than VGG4 and has a greater capacity to learn more complex, nonlinear patterns in the images. Let's focus on designing a slightly deeper network than VGG4 that we will call VGG6 that has the following architecture:

Conv2D → Conv2D → MaxPool2D → **Conv2D → Conv2D → MaxPool2D** → Flatten → *Dense → Dropout* → Dense

Notice how the bold set of `Conv2D`/`MaxPool2D` layers are repeats of the layers to their left. It turns out, it may be beneficial to replicate the `Dense`/`Dropout` layers (italicized) toward the end of the network multiple times as well in deeper versions.

Review your code for assembling `VGG4`. Building `VGG6` would require some copy-pasting of layer creation code. Imagine building even deeper versions with even more layers (e.g. `VGG9`): this copy-paste process would get more tedious, unwieldy, and error-prone the bigger the network gets!

For this reason, modern deep neural networks are often built using **blocks**: sequences of layers that repeat over and over again as you get farther into the network. For example, imagine replacing the layers **Conv2D → Conv2D → MaxPool2D** with a SINGLE new object that represents performing that sequence of those 3 layers. If we also do this for the `Dense`/`Dropout` layers, the architecture would look like:

VGGConvBlock_0 → **VGGConvBlock_1** → Flatten → *VGGDenseBlock_0* → Dense

Much simpler, more manageable, and easier to scale up to deeper nets!
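
To make the idea concrete, here is a minimal, framework-free sketch of grouping layers into blocks. `ToyBlock` and `make_layer` are hypothetical names made up for this illustration (they are not the `Block` API in `block.py`), and each toy "layer" only records its own name so that the order of the forward pass is visible.

```python
class ToyBlock:
    """Hypothetical stand-in for a block: holds layers and applies them in order."""

    def __init__(self, name, layers):
        self.name = name
        self.layers = list(layers)

    def __call__(self, x):
        # Forward pass: feed the output of each layer into the next one.
        for layer in self.layers:
            x = layer(x)
        return x


def make_layer(name):
    """Toy layer: appends its own name to the running trace instead of doing math."""
    def layer(trace):
        return trace + [name]
    return layer


def make_conv_block(idx):
    # Conv2D -> Conv2D -> MaxPool2D, packaged as a single callable object.
    return ToyBlock(f'VGGConvBlock_{idx}',
                    [make_layer('Conv2D'), make_layer('Conv2D'), make_layer('MaxPool2D')])


# Blocks and plain layers can be mixed freely because both are just callables.
net = [
    make_conv_block(0),
    make_conv_block(1),
    make_layer('Flatten'),
    ToyBlock('VGGDenseBlock_0', [make_layer('Dense'), make_layer('Dropout')]),
    make_layer('Dense'),
]

trace = []
for part in net:
    trace = part(trace)
print(' -> '.join(trace))
```

The printed trace reproduces the layer-by-layer VGG6 architecture listed earlier, even though the network itself is assembled from only five parts.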



### 5a. Build and test `VGG` blocks

The file `block.py` contains both the `Block` class and the `VGGConvBlock` and `VGGDenseBlock` classes referenced above. The `Block` class is the parent class to all `Block` classes (*both ones you write this week and for the rest of the semester!*) and is designed to work with `DeepNetwork`. Just like `DeepNetwork`, it contains all the "boilerplate" code that needs to be written for ANY block.

Aside from the constructor, I am providing you with the `Block` class fully implemented :) You only need to write code that assembles the layers that belong to a block and specify how the forward pass thru them is done. Blocks can be mixed-and-matched and interspersed with regular layers! Nice!

Implement and test the following classes and methods.

**Block:**
- constructor.

**VGGConvBlock:**
- constructor: What layers belong to a `VGGConvBlock` block?
- `__call__`: How do we perform the forward pass thru the block?

**VGGDenseBlock:**
- constructor: What layers belong to a `VGGDenseBlock` block?
- `__call__`: How do we perform the forward pass thru the block?


In [8]:
from block import VGGConvBlock, VGGDenseBlock

#### Test: `VGGConvBlock` part 1/2

In [27]:
tf.random.set_seed(0)
x_test_1 = tf.random.normal(shape=(1, 4, 4, 3))
conv_block = VGGConvBlock('TestBlock', units=5, prev_layer_or_block=None, wt_scale=1e-1)
conv_block(x_test_1)
print(conv_block)

TestBlock:
	MaxPool2D layer output(TestBlock/max_pool_layer_1) shape: [1, 2, 2, 5]
	Conv2D layer output(TestBlock/conv_layer_1) shape: [1, 4, 4, 5]
	Conv2D layer output(TestBlock/conv_layer_0) shape: [1, 4, 4, 5]


The above should print (naming might be different):

```
TestBlock:
	MaxPool2D layer output(TestBlock/maxpool2) shape: [1, 2, 2, 5]
	Conv2D layer output(TestBlock/conv1) shape: [1, 4, 4, 5]
	Conv2D layer output(TestBlock/conv0) shape: [1, 4, 4, 5]
```

In [28]:
tf.random.set_seed(1)
x_test_2 = tf.random.normal(shape=(2, 4, 4, 3))
acts = conv_block(x_test_2)
print(f'Your block net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[[[0.372 0.487 0.071 0.158 0.   ]
   [0.156 0.412 0.175 0.116 0.019]]

  [[0.51  0.548 0.085 0.299 0.   ]
   [0.375 0.327 0.169 0.209 0.   ]]]


 [[[0.25  0.551 0.    0.321 0.022]
   [0.461 0.47  0.116 0.132 0.044]]

  [[0.37  0.546 0.003 0.221 0.009]
   [0.37  0.486 0.123 0.054 0.   ]]]]''')

Your block net_acts are
[[[[0.372 0.487 0.071 0.158 0.   ]
   [0.156 0.412 0.175 0.116 0.019]]

  [[0.51  0.548 0.085 0.299 0.   ]
   [0.375 0.327 0.169 0.209 0.   ]]]


 [[[0.25  0.551 0.    0.321 0.022]
   [0.461 0.47  0.116 0.132 0.044]]

  [[0.37  0.546 0.003 0.221 0.009]
   [0.37  0.486 0.123 0.054 0.   ]]]]
and they should be:
[[[[0.372 0.487 0.071 0.158 0.   ]
   [0.156 0.412 0.175 0.116 0.019]]

  [[0.51  0.548 0.085 0.299 0.   ]
   [0.375 0.327 0.169 0.209 0.   ]]]


 [[[0.25  0.551 0.    0.321 0.022]
   [0.461 0.47  0.116 0.132 0.044]]

  [[0.37  0.546 0.003 0.221 0.009]
   [0.37  0.486 0.123 0.054 0.   ]]]]


#### Test: `VGGConvBlock` part 2/2

In [29]:
tf.random.set_seed(0)
x_test_1 = tf.random.normal(shape=(1, 4, 4, 3))
conv_block = VGGConvBlock('TestBlock', units=7, prev_layer_or_block=None, dropout=True)
conv_block(x_test_1)
print(conv_block)

TestBlock:
	Dropout layer output(TestBlock/dropout_layer_1) shape: [1, 2, 2, 7]
	MaxPool2D layer output(TestBlock/max_pool_layer_1) shape: [1, 2, 2, 7]
	Conv2D layer output(TestBlock/conv_layer_1) shape: [1, 4, 4, 7]
	Conv2D layer output(TestBlock/conv_layer_0) shape: [1, 4, 4, 7]


The above should print (naming might be different):

```
TestBlock:
	Dropout layer output(TestBlock/dropout) shape: [1, 2, 2, 7]
	MaxPool2D layer output(TestBlock/maxpool2) shape: [1, 2, 2, 7]
	Conv2D layer output(TestBlock/conv1) shape: [1, 4, 4, 7]
	Conv2D layer output(TestBlock/conv0) shape: [1, 4, 4, 7]
```

In [30]:
tf.random.set_seed(1)
x_test_2 = tf.random.normal(shape=(1, 4, 4, 3))
acts = conv_block(x_test_2)
print(f'Your block net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[[[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]

  [[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]]]''')

Your block net_acts are
[[[[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]

  [[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]]]
and they should be:
[[[[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]

  [[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]]]


#### Test: `VGGDenseBlock` part 1/2

In [31]:
tf.random.set_seed(0)
x_test_1 = tf.random.normal(shape=(1, 6))
dense_block = VGGDenseBlock('TestDenseBlock', units=(2,), prev_layer_or_block=None, wt_scale=1e-1)
dense_block(x_test_1)
print(dense_block)

TestDenseBlock:
	Dropout layer output(TestDenseBlock/dropout_layer_0) shape: [1, 2]
	Dense layer output(TestDenseBlock/dense_layer_0) shape: [1, 2]


The above should print (naming might be different):

```
TestDenseBlock:
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 2]
	Dense layer output(TestDenseBlock/dense0) shape: [1, 2]
```

In [32]:
tf.random.set_seed(1)
x_test_2 = tf.random.normal(shape=(3, 6))
acts = dense_block(x_test_2)
print(f'Your block net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[0.   0.  ]
 [0.   0.  ]
 [0.   0.07]]''')

Your block net_acts are
[[0.   0.  ]
 [0.   0.  ]
 [0.   0.07]]
and they should be:
[[0.   0.  ]
 [0.   0.  ]
 [0.   0.07]]


#### Test: `VGGDenseBlock` part 2/2

In [33]:
tf.random.set_seed(0)
x_test_1 = tf.random.normal(shape=(1, 7))
dense_block = VGGDenseBlock('TestDenseBlock', units=(4,5), prev_layer_or_block=None, num_dense_blocks=2)
dense_block(x_test_1)
print(dense_block)

TestDenseBlock:
	Dropout layer output(TestDenseBlock/dropout_layer_1) shape: [1, 5]
	Dense layer output(TestDenseBlock/dense_layer_1) shape: [1, 5]
	Dropout layer output(TestDenseBlock/dropout_layer_0) shape: [1, 4]
	Dense layer output(TestDenseBlock/dense_layer_0) shape: [1, 4]


The above should print (naming might be different):

```
TestDenseBlock:
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 5]
	Dense layer output(TestDenseBlock/dense1) shape: [1, 5]
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 4]
	Dense layer output(TestDenseBlock/dense0) shape: [1, 4]
```

In [34]:
tf.random.set_seed(1)
x_test_2 = tf.random.normal(shape=(2, 7))
acts = dense_block(x_test_2)
print(f'Your block net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[0.002 0.002 0.    0.    0.   ]
 [0.002 0.002 0.    0.    0.   ]]''')

Your block net_acts are
[[0.002 0.002 0.    0.    0.   ]
 [0.002 0.002 0.    0.    0.   ]]
and they should be:
[[0.002 0.002 0.    0.    0.   ]
 [0.002 0.002 0.    0.    0.   ]]


### 5b. Build `VGG6`

Now that you have both types of VGG blocks implemented and tested, make use of them to write the `VGG6` constructor and `__call__` methods in `vgg_nets.py`. This should be a quick process.

In [46]:
from vgg_nets import VGG6

#### Test: `VGG6`

In [47]:
test_vgg6_0 = VGG6(C=5, input_feats_shape=(8, 8, 3))
test_vgg6_0.compile()



---------------------------------------------------------------------------
Dense layer output(output) shape: [1, 5]
dense_block:
	Dropout layer output(dense_block/dropout_layer_0) shape: [1, 256]
	Dense layer output(dense_block/dense_layer_0) shape: [1, 256]
Flatten layer output(flatten) shape: [1, 512]
conv_block_2:
	MaxPool2D layer output(conv_block_2/max_pool_layer_1) shape: [1, 2, 2, 128]
	Conv2D layer output(conv_block_2/conv_layer_1) shape: [1, 4, 4, 128]
	Conv2D layer output(conv_block_2/conv_layer_0) shape: [1, 4, 4, 128]
conv_block_1:
	MaxPool2D layer output(conv_block_1/max_pool_layer_1) shape: [1, 4, 4, 64]
	Conv2D layer output(conv_block_1/conv_layer_1) shape: [1, 8, 8, 64]
	Conv2D layer output(conv_block_1/conv_layer_0) shape: [1, 8, 8, 64]
---------------------------------------------------------------------------


The above should print something like (*layer/block names may be different and that's ok*):

```
---------------------------------------------------------------------------
Dense layer output(output) shape: [1, 5]
DenseBlock1:
	Dropout layer output(DenseBlock1/dropout) shape: [1, 256]
	Dense layer output(DenseBlock1/dense0) shape: [1, 256]
Flatten layer output(flat) shape: [1, 512]
ConvBlock2:
	MaxPool2D layer output(ConvBlock2/maxpool2) shape: [1, 2, 2, 128]
	Conv2D layer output(ConvBlock2/conv1) shape: [1, 4, 4, 128]
	Conv2D layer output(ConvBlock2/conv0) shape: [1, 4, 4, 128]
ConvBlock1:
	MaxPool2D layer output(ConvBlock1/maxpool2) shape: [1, 4, 4, 64]
	Conv2D layer output(ConvBlock1/conv1) shape: [1, 8, 8, 64]
	Conv2D layer output(ConvBlock1/conv0) shape: [1, 8, 8, 64]
---------------------------------------------------------------------------
```
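
If your shapes do not match, it can help to work them out by hand. The sketch below redoes the arithmetic for the `(8, 8, 3)` test input without the batch dimension. It assumes that each `Conv2D` layer preserves the spatial size and that each `MaxPool2D` layer halves it, which is what the printout above shows. The helper names are hypothetical, and the unit counts (64, 128, 256) are read off the expected output rather than taken from `vgg_nets.py`.

```python
def conv_block_shape(shape, units):
    """Shape after Conv2D -> Conv2D -> MaxPool2D.

    Assumes the convolutions keep height/width unchanged and the pooling
    layer halves both (2x2 window, stride 2).
    """
    height, width, _ = shape
    return (height // 2, width // 2, units)


def vgg6_shapes(input_shape, conv_units=(64, 128), dense_units=256, num_classes=5):
    """Per-stage output shapes (no batch dimension) for a VGG6-like stack."""
    shapes = {}
    shape = input_shape
    for i, units in enumerate(conv_units, start=1):
        shape = conv_block_shape(shape, units)
        shapes[f'conv_block_{i}'] = shape
    # Flatten multiplies height * width * channels into a single feature axis.
    shapes['flatten'] = (shape[0] * shape[1] * shape[2],)
    shapes['dense_block'] = (dense_units,)
    shapes['output'] = (num_classes,)
    return shapes


shapes = vgg6_shapes((8, 8, 3))
for name, shape in shapes.items():
    print(f'{name}: {shape}')
```

In particular, the flattened size is 2 × 2 × 128 = 512, which matches the `Flatten` line in the expected printout.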

In [48]:
tf.random.set_seed(0)
x_test_3 = tf.random.normal(shape=(6, 8, 8, 3))

tf.random.set_seed(1)
test_vgg6 = VGG6(C=5, input_feats_shape=(8, 8, 3), wt_scale=1e-1)
acts = test_vgg6(x_test_3)
print(f'Your VGG6 output layer net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    0.999 0.    0.001]
 [0.    0.    0.652 0.003 0.345]]''')

Your VGG6 output layer net_acts are
[[0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    0.999 0.    0.001]
 [0.    0.    0.652 0.003 0.345]]
and they should be:
[[0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    0.999 0.    0.001]
 [0.    0.    0.652 0.003 0.345]]


### 5c. Train `VGG6` on CIFAR-10 with the default learning rate

In the cells below:
1. Load in the CIFAR-10 dataset.
2. Train for `25` epochs with default lr and other default hyperparameters. Your initial training and val losses should be 2.30 and should hold steady.
3. Print out the final test accuracy.
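
Where does 2.30 come from? Assuming the loss is the standard cross-entropy, a network that assigns equal probability to each of the 10 CIFAR-10 classes has a loss of −ln(1/10). The quick check below uses only the Python standard library; the variable names are just for illustration.

```python
import math

num_classes = 10                      # CIFAR-10 has 10 classes
uniform_prob = 1 / num_classes        # probability given to the correct class by a uniform guess
chance_loss = -math.log(uniform_prob)

print(f'Cross-entropy of a uniform guess: {chance_loss:.2f}')  # 2.30
```

A loss that starts at this value and holds steady therefore suggests that the net's predictions are staying at chance level.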

#### Important notes

#### 1. Running on CoCalc and GPU

You should do this training session (and all subsequent "real" training sessions this semester) on the GPU in CoCalc. Training at this point on your CPU is basically infeasible (*feel free to try it!*).

#### 2. JIT compiling the train and test steps

While training VGG6 on the GPU should take ~15 secs per epoch, which is not too bad, soon deeper networks and larger datasets will make the training too slow for us (*even on the GPU!*). To speed things up considerably now and going forward, use the process we discussed in class to decorate `train_step` and `test_step` with `@tf.function(jit_compile=True)`. The 1st epoch might be a little slow, but subsequent epochs should now fly by.

**Note:**
- If you have trouble just-in-time (JIT) compiling the train and test steps, you should be able to decorate with `@tf.function` to statically compile the network (non-JIT). This may be slower than JIT compiling the network, but should still be faster than no compilation. **If JIT compiling does not work, please seek help. JIT compiling on CoCalc will be very helpful going forward.**
- If you are training locally on macOS, JIT compiling will not work, but falling back to `@tf.function` should work fine.

In [None]:
from datasets import get_dataset

In [None]:
# KEEP ME
tf.random.set_seed(0)


### 5d. Train `VGG6` on CIFAR-10 with a smaller learning rate

In the cells below, repeat what you did in the previous subtask, but this time change the learning rate to `1e-5`. You should get a very different result.

In [None]:
# KEEP ME
tf.random.set_seed(0)


### 5e. Questions

**Question 4:** How does the modified learning rate compare with the default? Why do you think you observed what you did for VGG6 and not VGG4 trained on CIFAR-10 with the default lr?

**Answer 4:** 

## Task 6: Early stopping and He/Kaiming initialization

The experiment that you just ran illuminates two major issues with our training workflow:
1. Training for a prespecified number of epochs is frustrating: after a long wait, training may be cut off while the net is still learning. It would be nice to not have to manually set the number of training epochs as long as the net is making progress.
2. Picking an lr that could make or break training is frustrating. It would be nice to have the net work well across a wide range of lr choices and numbers of layers.

In this section, we will introduce the following techniques to combat these respective issues:
1. Early stopping.
2. He/Kaiming weight initialization (*next week*).

In [None]:
from network import DeepNetwork

### 6a. Implement early stopping

Implement the `early_stopping` method in `DeepNetwork` to determine the appropriate conditions to stop during training.
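
As a rough guide, here is a standalone sketch of one patience-window rule. It is an assumption inferred from the expected output of Test 1 below, not the reference solution, and the rule discussed in class may differ in its details. The function name `toy_early_stopping` is hypothetical and deliberately separate from the `DeepNetwork` method.

```python
def toy_early_stopping(recent_losses, curr_loss, patience):
    """Hypothetical early stopping rule (an assumption, not the course solution).

    Keeps a rolling window of the `patience` most recent losses. While the
    window is still filling up, we never stop. Once it is full, each new loss
    pushes out the oldest one, and we stop if the oldest loss remaining in the
    window is lower than every newer loss (i.e. no improvement since then).
    """
    recent_losses = list(recent_losses)  # copy so the caller's list is not mutated
    if len(recent_losses) < patience:
        recent_losses.append(curr_loss)
        return recent_losses, False

    # Slide the window: drop the oldest loss, add the newest.
    recent_losses = recent_losses[1:] + [curr_loss]
    oldest, newer = recent_losses[0], recent_losses[1:]
    stop = all(oldest < loss for loss in newer)
    return recent_losses, stop


# Steadily increasing losses, mirroring the setup of Test 1.
history, stopped_at = [], None
for i in range(10):
    history, stop = toy_early_stopping(history, float(i), patience=5)
    if stop:
        stopped_at = i
        break

print(stopped_at, history)  # 5 [1.0, 2.0, 3.0, 4.0, 5.0]
```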

#### Test: `early_stopping`

In [13]:
dn = DeepNetwork((1,), 0.)

# Test 1
patience_1 = 5
es_lost_hist_1 = []
for iter in range(10):
    curr_loss = float(iter)
    es_lost_hist_1, stop = dn.early_stopping(es_lost_hist_1, curr_loss, patience=patience_1)

    if stop:
        break
print(f'Early stopping Test 1 ({patience_1=}):\n Stopped after {iter} iterations (should be 5 iterations).')
print(f' Recent loss history is {es_lost_hist_1} and should be [1.0, 2.0, 3.0, 4.0, 5.0]')
print()

# Test 2
tf.random.set_seed(1)
patience_2 = 3
es_lost_hist_2 = []
test_2_loss_vals = list(tf.random.uniform(shape=(20,)).numpy())
for iter in range(30):
    curr_loss = test_2_loss_vals[iter]
    es_lost_hist_2, stop = dn.early_stopping(es_lost_hist_2, curr_loss, patience=patience_2)

    if stop:
        break
print(f'Early stopping Test 2 ({patience_2=}):\n Stopped after {iter} iterations (should be 6 iterations).')
print(f' Recent loss history is {es_lost_hist_2} and should be [0.29193902, 0.64250207, 0.9757855]')
print()

# Test 3
tf.random.set_seed(1)
patience_3 = 6
es_lost_hist_3 = []
test_3_loss_vals = list(tf.random.uniform(shape=(20,)).numpy())
for iter in range(30):
    curr_loss = test_3_loss_vals[iter]
    es_lost_hist_3, stop = dn.early_stopping(es_lost_hist_3, curr_loss, patience=patience_3)

    if stop:
        break
print(f'Early stopping Test 3 ({patience_3=}):\n Stopped after {iter} iterations (should be 9 iterations).')
print(f' Recent loss history is\n {es_lost_hist_3}\n and should be')
print(' [0.29193902, 0.64250207, 0.9757855, 0.43509948, 0.6601019, 0.60489583]')
print()



Early stopping Test 1 (patience_1=5):
 Stopped after 5 iterations (should be 5 iterations).
 Recent loss history is [1.0, 2.0, 3.0, 4.0, 5.0] and should be [1.0, 2.0, 3.0, 4.0, 5.0]

Early stopping Test 2 (patience_2=3):
 Stopped after 6 iterations (should be 6 iterations).
 Recent loss history is [0.29193902, 0.64250207, 0.9757855] and should be [0.29193902, 0.64250207, 0.9757855]

Early stopping Test 3 (patience_3=6):
 Stopped after 9 iterations (should be 9 iterations).
 Recent loss history is
 [0.29193902, 0.64250207, 0.9757855, 0.43509948, 0.6601019, 0.60489583]
 and should be
 [0.29193902, 0.64250207, 0.9757855, 0.43509948, 0.6601019, 0.60489583]



### 6b. Integrate early stopping into training

Modify your `fit` function to support early stopping. Here are the changes to make:

1. Before the training loop create an empty list to record the rolling list of recent validation loss values within the patience window of epochs.
2. Each time the validation loss is computed, update and check the early stopping conditions. If the conditions are met, end the training early before `max_epochs` epochs are reached.
3. Make sure you are returning as the 4th return argument the number of epochs before training ended.
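
The three changes above can be sketched with a toy loop that skips the actual training. Everything here is hypothetical: `toy_fit` and `toy_early_stopping` are illustration-only names, the validation losses are supplied as a precomputed list rather than measured, and the choices of when validation runs (`epoch % val_every == 0`) and how epochs are counted (`epoch + 1`) are assumptions that your `fit` may handle differently.

```python
def toy_early_stopping(recent, curr_loss, patience):
    """Assumed rule: stop once the oldest loss in a full window beats all newer ones."""
    if len(recent) < patience:
        return recent + [curr_loss], False
    recent = recent[1:] + [curr_loss]
    return recent, all(recent[0] < loss for loss in recent[1:])


def toy_fit(fake_val_losses, max_epochs, patience, val_every):
    """Toy training loop; `fake_val_losses` stands in for real validation passes."""
    recent_val_losses = []            # 1. rolling window, created before the loop
    val_loss_hist = []
    losses = iter(fake_val_losses)
    epochs_run = 0

    for epoch in range(max_epochs):
        epochs_run = epoch + 1
        # ... one epoch of mini-batch training would happen here ...

        if epoch % val_every == 0:
            val_loss = next(losses)   # placeholder for computing the val loss
            val_loss_hist.append(val_loss)
            # 2. update the window and check the stopping condition
            recent_val_losses, stop = toy_early_stopping(
                recent_val_losses, val_loss, patience)
            if stop:
                break

    # 3. report how many epochs ran before training ended
    return val_loss_hist, epochs_run


val_loss_hist, epochs_run = toy_fit(
    fake_val_losses=[0.9, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    max_epochs=20, patience=3, val_every=2)
print(epochs_run, val_loss_hist)  # 9 [0.9, 0.5, 0.4, 0.45, 0.5]
```

In this toy run the validation loss bottoms out at 0.4 and then rises, so the loop exits well before `max_epochs`.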

#### Test: `fit` with early stopping

The following test should end:
- in about 10 secs.
- after 300 epochs.
- with final training loss of 0.04, Val loss of 0.06, Val acc of 96.00%

In [14]:
from layers import Dense

In [16]:
# Quickly make a mock network for testing
class SoftmaxNet(DeepNetwork):
    def __init__(self, input_feats_shape, C, reg=0):
        super().__init__(input_feats_shape, reg)
        self.output_layer = Dense('TestDense', units=C, activation='softmax', prev_layer_or_block=None)

    def __call__(self, x):
        return self.output_layer(x)

# Load in Iris train/validation sets
train_samps = tf.constant(np.load('data/iris/iris_train_samps.npy'), dtype=tf.float32)
train_labels = tf.constant(np.load('data/iris/iris_train_labels.npy'), dtype=tf.int32)
val_samps = tf.constant(np.load('data/iris/iris_val_samps.npy'), dtype=tf.float32)
val_labels = tf.constant(np.load('data/iris/iris_val_labels.npy'), dtype=tf.int32)

# Set some vars
C = 3
M = train_samps.shape[1]
mini_batch_sz = 25
lr = 1e-1
max_epochs = 5000
patience = 3
val_every = 100  # how often (in epochs) we check the val loss/acc/early stopping

# Create our test net
tf.random.set_seed(0)
slnet = SoftmaxNet((M,), C)
slnet.compile(lr=lr)

_, val_loss_hist, val_acc_hist, e = slnet.fit(train_samps, train_labels, val_samps, val_labels,
                                              batch_size=mini_batch_sz,
                                              max_epochs=max_epochs,
                                              patience=patience,
                                              val_every=val_every)

print(75*'-')
print(f'Iris test ended after {e} epochs with final val loss/acc of {val_loss_hist[-1]:.2f}/{val_acc_hist[-1]:.2f}')
print(75*'-')



---------------------------------------------------------------------------
Dense layer output(TestDense) shape: [1, 3]
---------------------------------------------------------------------------
Epoch 1/5000: Train Loss: 1.1915, Time: 0.34s
Epoch 2/5000: Train Loss: 0.9937, Time: 0.00s
Epoch 3/5000: Train Loss: 0.6202, Time: 0.00s
Epoch 4/5000: Train Loss: 0.4982, Time: 0.00s
Epoch 5/5000: Train Loss: 0.4816, Time: 0.00s
Epoch 6/5000: Train Loss: 0.4356, Time: 0.00s
Epoch 7/5000: Train Loss: 0.3672, Time: 0.00s
Epoch 8/5000: Train Loss: 0.4685, Time: 0.00s
Epoch 9/5000: Train Loss: 0.4572, Time: 0.00s
Epoch 10/5000: Train Loss: 0.3995, Time: 0.00s
Epoch 11/5000: Train Loss: 0.3715, Time: 0.00s
Epoch 12/5000: Train Loss: 0.3322, Time: 0.00s
Epoch 13/5000: Train Loss: 0.4086, Time: 0.00s
Epoch 14/5000: Train Loss: 0.3940, Time: 0.00s
Epoch 15/5000: Train Loss: 0.2666, Time: 0.00s
Epoch 16/5000: Train Loss: 0.3424, Time: 0.00s
Epoch 17/5000: Train Loss: 0.3614, Time: 0.00s
Epoch 18/5000: