# 1. Deep Learning.
a. Build a DNN with five hidden layers of 100 neurons each, He initialization, and the
ELU activation function.
b. Using Adam optimization and early stopping, try training it on MNIST but only on
digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You
will need a softmax output layer with five neurons, and as always make sure to save
checkpoints at regular intervals and save the final model so you can reuse it later.
c. Tune the hyperparameters using cross-validation and see what precision you can
achieve.
d. Now try adding Batch Normalization and compare the learning curves: is it
converging faster than before? Does it produce a better model?
e. Is the model overfitting the training set? Try adding dropout to every layer and try
again. Does it help?


a. To build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function, you can use Python with the help of deep learning libraries like TensorFlow or Keras. Here's an example of how you can create such a model using Keras:

```python
import tensorflow as tf
from tensorflow import keras

# Load the MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Filter the dataset to include only digits 0 to 4
train_mask = train_labels < 5
test_mask = test_labels < 5
train_images = train_images[train_mask]
train_labels = train_labels[train_mask]
test_images = test_images[test_mask]
test_labels = test_labels[test_mask]

# Preprocess the data
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.Dense(5, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Set up callbacks for checkpoint saving and early stopping
checkpoint_cb = keras.callbacks.ModelCheckpoint('model_checkpoint.h5', save_best_only=True)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

# Train the model, holding out 10% of the training data for validation so that
# early stopping does not rely on the test set
history = model.fit(train_images, train_labels, epochs=100, validation_split=0.1,
                    callbacks=[checkpoint_cb, early_stopping_cb])

# Save the final model
model.save('final_model.h5')
```
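For reference, the two ingredients named in part (a) can be written out by hand. The following is a minimal sketch in plain Python, not Keras's own implementation: the helper names `elu`, `he_stddev`, and `he_normal_weights` are hypothetical, and the sketch draws from an ordinary Gaussian, whereas Keras's `he_normal` initializer uses a truncated normal distribution.

```python
import math
import random

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def he_stddev(fan_in):
    """Standard deviation that He initialization targets for fan_in inputs."""
    return math.sqrt(2.0 / fan_in)

def he_normal_weights(fan_in, fan_out, seed=0):
    """Illustrative only: a fan_in x fan_out matrix drawn from N(0, 2 / fan_in)."""
    rng = random.Random(seed)
    std = he_stddev(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

# First hidden layer of the model above: 28 * 28 = 784 inputs, 100 neurons
weights = he_normal_weights(784, 100)
print("ELU(-1) =", elu(-1.0))
print("He stddev for 784 inputs =", he_stddev(784))
```

The smaller standard deviation for wide layers keeps the variance of the activations roughly stable from layer to layer, which is why He initialization is usually paired with ReLU-family activations such as ELU.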

b. The code snippet provided in part (a) already covers training the model on MNIST for digits 0 to 4 and saving checkpoints and the final model.

c. To tune the hyperparameters and evaluate the model's precision using cross-validation, you can use techniques like grid search or random search. Here's an example of using scikit-learn's `GridSearchCV` to tune the learning rate and batch size:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Define a function to create the model with hyperparameters
def create_model(learning_rate=0.001, batch_size=32):
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(5, activation='softmax')
    ])
    optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Create the KerasClassifier wrapper for scikit-learn
model = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)

# Define the hyperparameters to tune
param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [32, 64, 128]
}

# Perform grid search cross-validation
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(train_images, train_labels)

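# Print the best mean cross-validated score. With no `scoring` argument,
# GridSearchCV uses the estimator's own score method, which here is accuracy
best_accuracy = grid_result.best_score_
print("Best cross-validated accuracy: {:.2f}%".format(best_accuracy * 100))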
```
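The exercise asks about precision, which is not the same metric as accuracy. As a rough sketch of the difference, per-class precision can be computed from true and predicted labels as follows. The helper names are hypothetical, the toy labels are placeholders rather than model output, and reporting 0 for a class that is never predicted is an assumption of this sketch.

```python
def precision_per_class(y_true, y_pred, classes):
    """Precision for each class: correct predictions of c / all predictions of c."""
    result = {}
    for c in classes:
        truths_when_predicted_c = [t for t, p in zip(y_true, y_pred) if p == c]
        if truths_when_predicted_c:
            hits = sum(1 for t in truths_when_predicted_c if t == c)
            result[c] = hits / len(truths_when_predicted_c)
        else:
            result[c] = 0.0  # assumption: 0 when the class is never predicted
    return result

def macro_precision(y_true, y_pred, classes):
    """Unweighted mean of the per-class precisions."""
    per_class = precision_per_class(y_true, y_pred, classes)
    return sum(per_class.values()) / len(per_class)

# Toy labels standing in for the true labels and the argmax of the model's predictions
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(precision_per_class(y_true, y_pred, classes=[0, 1, 2]))
print(macro_precision(y_true, y_pred, classes=[0, 1, 2]))
```

With real predictions, scikit-learn's `precision_score` with `average='macro'` computes the same macro average.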

d. To add Batch Normalization to the model and compare the learning curves, you can modify the model architecture as follows:

```python
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(5, activation='softmax')
])
```

After making this change, retrain the model using the same code provided in part (a) to compare the learning curves.
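One simple way to compare convergence speed is to count how many epochs each run needs to reach a given validation accuracy. The sketch below assumes the curves come from `history.history['val_accuracy']` of two separate `fit` calls. The helper `epochs_to_reach` is hypothetical, and the numbers are made-up placeholders, not measured results.

```python
def epochs_to_reach(values, target):
    """Return the 1-based epoch at which values first reaches target, or None."""
    for epoch, value in enumerate(values, start=1):
        if value >= target:
            return epoch
    return None

# Placeholder curves (invented for illustration only)
plain_val_acc = [0.90, 0.95, 0.97, 0.98, 0.985]
bn_val_acc = [0.94, 0.975, 0.982, 0.986, 0.987]

target = 0.98
print("Without Batch Normalization:", epochs_to_reach(plain_val_acc, target))
print("With Batch Normalization:", epochs_to_reach(bn_val_acc, target))
```

Epoch counts alone can be misleading, since each epoch is slower with Batch Normalization, so wall-clock time is worth recording as well.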

e. To address potential overfitting, you can add dropout regularization to every layer. Modify the model architecture as follows:

```python
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(5, activation='softmax')
])
```

Then, retrain the model using the same code provided in part (a) to see if dropout helps in reducing overfitting.
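To judge whether the model is overfitting, it helps to look at the gap between training and validation loss. The sketch below summarizes a Keras-style history dictionary. The helper `overfitting_report` is hypothetical, and the toy history is invented for illustration.

```python
def overfitting_report(history):
    """Summarize a history dict holding 'loss' and 'val_loss' lists of equal length."""
    val_loss = history["val_loss"]
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best + 1,  # 1-based epoch with the lowest validation loss
        "gap_at_best": val_loss[best] - history["loss"][best],
        "final_gap": val_loss[-1] - history["loss"][-1],
        "val_loss_rising": val_loss[-1] > val_loss[best],
    }

# Invented curves: training loss keeps falling while validation loss turns upward
toy_history = {
    "loss": [0.50, 0.30, 0.20, 0.10, 0.05],
    "val_loss": [0.55, 0.35, 0.30, 0.32, 0.40],
}
report = overfitting_report(toy_history)
print(report)
```

A widening gap together with rising validation loss suggests overfitting. Note that with dropout the training loss is computed with units dropped, so it can sit above the validation loss without indicating a problem.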

# 2. Transfer learning.
a. Create a new DNN that reuses all the pretrained hidden layers of the previous
model, freezes them, and replaces the softmax output layer with a new one.
b. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how
long it takes. Despite this small number of examples, can you achieve high precision?
c. Try caching the frozen layers, and train the model again: how much faster is it now?
d. Try again reusing just four hidden layers instead of five. Can you achieve a higher
precision?
e. Now unfreeze the top two hidden layers and continue training: can you get the
model to perform even better?


a. To create a new DNN that reuses the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a new one, you can follow these steps:

```python
import tensorflow as tf
from tensorflow import keras

# Load the MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Filter the dataset to include only digits 5 to 9
train_mask = train_labels >= 5
test_mask = test_labels >= 5
train_images = train_images[train_mask]
train_labels = train_labels[train_mask] - 5  # shift labels 5-9 to 0-4 for the 5-unit softmax
test_images = test_images[test_mask]
test_labels = test_labels[test_mask] - 5

# Preprocess the data
train_images = train_images / 255.0
test_images = test_images / 255.0

# Load the pretrained model
pretrained_model = keras.models.load_model('final_model.h5')

# Freeze the pretrained layers
for layer in pretrained_model.layers:
    layer.trainable = False

# Create a new output layer
output_layer = keras.layers.Dense(5, activation='softmax')

# Create the new model from the pretrained hidden layers (dropping the old
# softmax output layer) followed by the new output layer
new_model = keras.Sequential(pretrained_model.layers[:-1] + [output_layer])

# Compile the new model
new_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
```

b. To train the new DNN on digits 5 to 9 using only 100 images per digit and measure the training time, you can use the following code:

```python
import time
import numpy as np

# Keep only 100 training images per digit
num_examples_per_digit = 100
train_indices = np.concatenate([
    np.random.permutation(np.where(train_labels == digit)[0])[:num_examples_per_digit]
    for digit in np.unique(train_labels)
])
np.random.shuffle(train_indices)
train_images = train_images[train_indices]
train_labels = train_labels[train_indices]

# Start the timer
start_time = time.time()

# Train the new model
new_model.fit(train_images, train_labels, epochs=100, validation_data=(test_images, test_labels))

# Calculate the training time
training_time = time.time() - start_time
print("Training time: {:.2f} seconds".format(training_time))
```

Achieving high precision with only 100 examples per digit might be challenging, but transfer learning can help leverage the knowledge gained from the previous model.

c. To cache the frozen layers and train the model again, you can modify the code as follows:

```python
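# Start the timer (the one-off caching cost is included in the measurement)
start_time = time.time()

# Cache the frozen layers: run the data through them once and reuse the
# resulting activations at every epoch
frozen_layers = keras.Sequential(new_model.layers[:-1])
cached_train = frozen_layers.predict(train_images)
cached_test = frozen_layers.predict(test_images)

# Only a new output layer needs training, on top of the cached activations
top_model = keras.Sequential([keras.layers.Dense(5, activation='softmax')])
top_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Train the output layer on the cached activations
top_model.fit(cached_train, train_labels, epochs=100, validation_data=(cached_test, test_labels))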

# Calculate the training time
training_time = time.time() - start_time
print("Training time with cached frozen layers: {:.2f} seconds".format(training_time))
```

Caching the frozen layers can speed up training since the computations for these layers are not repeated during each epoch.
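The saving can be illustrated without any deep learning library by counting how often the frozen part runs. In this toy sketch `frozen_features` is a hypothetical stand-in for the frozen hidden layers, and the loops only mimic the structure of training without updating any weights.

```python
def frozen_features(x, counter):
    """Stand-in for the frozen hidden layers; counts how often it is evaluated."""
    counter["calls"] += 1
    return [max(0.0, 2.0 * v - 1.0) for v in x]

def train_without_cache(data, epochs):
    counter = {"calls": 0}
    for _ in range(epochs):
        for x in data:
            frozen_features(x, counter)  # recomputed at every epoch
    return counter["calls"]

def train_with_cache(data, epochs):
    counter = {"calls": 0}
    cached = [frozen_features(x, counter) for x in data]  # computed once
    for _ in range(epochs):
        for features in cached:
            pass  # only the output layer would be updated here
    return counter["calls"]

data = [[0.1 * i, 0.2 * i] for i in range(500)]  # 500 toy instances
print(train_without_cache(data, epochs=100))
print(train_with_cache(data, epochs=100))
```

With 500 instances and 100 epochs, the frozen part is evaluated 50,000 times without caching and 500 times with it. The actual speed-up depends on how much of each training step was spent in the frozen layers.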

d. To reuse just four hidden layers instead of five and aim for higher precision, you can modify the code as follows:

```python
# Load the pretrained model
pretrained_model = keras.models.load_model('final_model.h5')

# Reuse the Flatten layer plus the first four hidden layers, and freeze them
reused_layers = pretrained_model.layers[:5]
for layer in reused_layers:
    layer.trainable = False

# Create a new output layer
output_layer = keras.layers.Dense(5, activation='softmax')

# Create the new model with four pretrained hidden layers and the new output layer
new_model = keras.Sequential(reused_layers + [output_layer])

# Compile the new model
new_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Start the timer
start_time = time.time()

# Train the new model
new_model.fit(train_images, train_labels, epochs=100, validation_data=(test_images, test_labels))

# Calculate the training time
training_time = time.time() - start_time
print("Training time with four hidden layers: {:.2f} seconds".format(training_time))
```

Dropping the top hidden layer can help because the upper layers of a network tend to be the most specialized to the original task (digits 0 to 4), so the lower four layers may transfer better to digits 5 to 9.

e. To unfreeze the top two hidden layers and continue training to further improve performance, you can modify the code as follows:

```python
# Unfreeze the top two hidden layers (the two layers just below the output layer)
for layer in new_model.layers[-3:-1]:
    layer.trainable = True

# Compile the new model
new_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Start the timer
start_time = time.time()

# Train the new model
new_model.fit(train_images, train_labels, epochs=100, validation_data=(test_images, test_labels))

# Calculate the training time
training_time = time.time() - start_time
print("Training time with unfrozen top two layers: {:.2f} seconds".format(training_time))
```

By unfreezing the top two hidden layers, the model can fine-tune these layers specifically for the new task, potentially improving the overall performance.
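Picking the right layers is mostly a matter of index arithmetic. In a Sequential model whose last layer is the output layer, the top two hidden layers sit just below it. The sketch below uses a plain list of hypothetical layer names in place of `model.layers`, and the helper `top_hidden` is hypothetical as well.

```python
def top_hidden(layers, n):
    """Return the n layers just below the final (output) layer."""
    return layers[-(n + 1):-1]

# Hypothetical layer names standing in for model.layers
layer_names = ["flatten", "hidden1", "hidden2", "hidden3", "hidden4", "hidden5", "output"]

print(top_hidden(layer_names, 2))  # the top two hidden layers
print(layer_names[-2:])            # the last hidden layer plus the output layer
```

After changing any `trainable` attribute, the model must be compiled again for the change to take effect.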

# 3. Pretraining on an auxiliary task.
a. In this exercise you will build a DNN that compares two MNIST digit images and
predicts whether they represent the same digit or not. Then you will reuse the lower
layers of this network to train an MNIST classifier using very little training data. Start
by building two DNNs (let’s call them DNN A and B), both similar to the one you built
earlier but without the output layer: each DNN should have five hidden layers of 100
neurons each, He initialization, and ELU activation. Next, add one more hidden layer
with 10 units on top of both DNNs. To do this, you should use
TensorFlow’s concat() function with axis=1 to concatenate the outputs of both DNNs
for each instance, then feed the result to the hidden layer. Finally, add an output
layer with a single neuron using the logistic activation function.
b. Split the MNIST training set in two sets: split #1 should contain 55,000 images,
and split #2 should contain 5,000 images. Create a function that generates a
training batch where each instance is a pair of MNIST images picked from split #1.
Half of the training instances should be pairs of images that belong to the same
class, while the other half should be images from different classes. For each pair, the
training label should be 0 if the images are from the same class, or 1 if they are from
different classes.
c. Train the DNN on this training set. For each image pair, you can simultaneously feed
the first image to DNN A and the second image to DNN B. The whole network will
gradually learn to tell whether two images belong to the same class or not.
d. Now create a new DNN by reusing and freezing the hidden layers of DNN A and
adding a softmax output layer on top with 10 neurons. Train this network on split #2
and see if you can achieve high performance despite having only 500 images per
class.


a. To build two DNNs (DNN A and B) for comparing MNIST digit images, and then reuse their lower layers for training an MNIST classifier, you can follow these steps:

```python
import tensorflow as tf
from tensorflow import keras

# Define the DNN architecture for comparing MNIST digit images
def create_comparison_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
        keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    ])
    return model

# Create DNN A
dnn_a = create_comparison_model()

# Create DNN B
dnn_b = create_comparison_model()

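# Concatenate the outputs of DNN A and DNN B along axis 1
# (Keras layer counterpart of tf.concat, which also works on symbolic Keras tensors)
concatenated_output = keras.layers.Concatenate(axis=1)([dnn_a.output, dnn_b.output])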

# Add a hidden layer on top of the concatenated output
hidden_layer = keras.layers.Dense(10, activation='elu', kernel_initializer='he_normal')(concatenated_output)

# Add the output layer with logistic activation
output_layer = keras.layers.Dense(1, activation='sigmoid')(hidden_layer)

# Create the final comparison model
comparison_model = keras.Model(inputs=[dnn_a.input, dnn_b.input], outputs=output_layer)
```

b. To generate a training batch where each instance is a pair of MNIST images with corresponding labels indicating whether they belong to the same class or not, you can use the following function:

```python
import numpy as np

def generate_training_batch(images, labels, batch_size=32):
    half_batch_size = batch_size // 2
    classes = np.unique(labels)
    # Indices of the images belonging to each class
    indices_by_class = {c: np.where(labels == c)[0] for c in classes}

    # Half of the batch: pairs of distinct images from the same class (label 0)
    same_pairs = []
    for c in np.random.choice(classes, size=half_batch_size):
        same_pairs.append(np.random.choice(indices_by_class[c], size=2, replace=False))

    # Other half: pairs of images from two different classes (label 1)
    diff_pairs = []
    for _ in range(batch_size - half_batch_size):
        c1, c2 = np.random.choice(classes, size=2, replace=False)
        diff_pairs.append([np.random.choice(indices_by_class[c1]),
                           np.random.choice(indices_by_class[c2])])

    # Gather the images: image_pairs has shape (batch_size, 2, height, width)
    pair_indices = np.array(same_pairs + diff_pairs)
    image_pairs = images[pair_indices]
    labels_pairs = np.concatenate([np.zeros(len(same_pairs)), np.ones(len(diff_pairs))])

    # Shuffle the pairs
    shuffle_indices = np.random.permutation(batch_size)
    image_pairs = image_pairs[shuffle_indices]
    labels_pairs = labels_pairs[shuffle_indices]

    return image_pairs, labels_pairs
```
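The labeling convention from the exercise (0 for a same-class pair, 1 for a different-class pair) is easy to get backwards, so it can be worth checking on class labels alone before involving any images. The sketch below is a minimal, library-free illustration, and the helper names `pair_label` and `sample_label_pairs` are hypothetical.

```python
import random

def pair_label(label_a, label_b):
    """Training target for a pair: 0 if same class, 1 if different classes."""
    return 0 if label_a == label_b else 1

def sample_label_pairs(classes, batch_size, seed=0):
    """Draw class pairs: half same-class, the rest from two distinct classes."""
    rng = random.Random(seed)
    half = batch_size // 2
    pairs = []
    for _ in range(half):
        c = rng.choice(classes)
        pairs.append((c, c))
    for _ in range(batch_size - half):
        a, b = rng.sample(classes, 2)  # sample() returns two distinct classes
        pairs.append((a, b))
    rng.shuffle(pairs)
    return pairs

pairs = sample_label_pairs(list(range(10)), batch_size=32)
targets = [pair_label(a, b) for a, b in pairs]
print(sum(targets), "different-class pairs out of", len(targets))
```

A balanced batch matters here because purely random pairing of ten digit classes would yield different-class pairs about 90% of the time, which would let the network score well by always predicting 1.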

c. To train the comparison model on the generated training set, you can use the following code:

```python
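# Reload the full MNIST dataset (all ten digits) and scale pixels to [0, 1]
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
train_images = train_images / 255.0
test_images = test_images / 255.0

# Split the MNIST training set into split #1 and split #2
split_1_size = 55000
split_2_size = 5000
split_1_images = train_images[:split_1_size]
split_1_labels = train_labels[:split_1_size]
split_2_images = train_images[split_1_size:split_1_size + split_2_size]
split_2_labels = train_labels[split_1_size:split_1_size + split_2_size]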

# Compile the comparison model
comparison_model.compile(optimizer='adam',
                         loss='binary_crossentropy',
                         metrics=['accuracy'])

# Train the comparison model
batch_size = 32
num_epochs = 10
num_steps_per_epoch = len(split_1_images) // batch_size

for epoch in range(num_epochs):
    for step in range(num_steps_per_epoch):
        image_pairs, labels_pairs = generate_training_batch(split_1_images, split_1_labels, batch_size)
        comparison_model.train_on_batch([image_pairs[:, 0], image_pairs[:, 1]], labels_pairs)
```

d. To create a new DNN by reusing and freezing the hidden layers of DNN A and adding a softmax output layer on top for training on split #2, you can use the following code:

```python
# Freeze the hidden layers of DNN A
for layer in dnn_a.layers:
    layer.trainable = False

# Add a softmax output layer on top of DNN A
output_layer = keras.layers.Dense(10, activation='softmax')(dnn_a.output)

# Create the final classifier model
classifier_model = keras.Model(inputs=dnn_a.input, outputs=output_layer)

# Compile the classifier model
classifier_model.compile(optimizer='adam',
                         loss='sparse_categorical_crossentropy',
                         metrics=['accuracy'])

# Train the classifier model on split #2
classifier_model.fit(split_2_images, split_2_labels, epochs=10, validation_data=(test_images, test_labels))
```

Even with only about 500 images per class in split #2, reusing the frozen hidden layers of DNN A may still give good accuracy on the test set, because those layers were pretrained on the pair-comparison task.