# Yashwant Desai – DL_Theory_Assignment_3

# 1.	Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

No. He initialization draws each weight independently from a zero-mean distribution (Gaussian or uniform) with a variance of 2/n, where n is the number of inputs to the layer. If every weight in a layer starts at the same value, even one picked at random, all neurons in that layer compute the same output and receive the same gradient, so backpropagation keeps them identical and the layer behaves as if it had a single neuron. Independent random values are what break this symmetry.
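The symmetry problem can be checked on a toy network. The sketch below is an illustration only: the three-unit tanh network, the made-up samples, and the helper names `he_draw` and `train` are assumptions, not part of the assignment. It trains once with a single He-distributed value copied into every weight and once with independent draws.

```python
import math
import random

def he_draw(fan_in):
    """One sample from a zero-mean Gaussian with variance 2 / fan_in (He normal)."""
    return random.gauss(0.0, math.sqrt(2.0 / fan_in))

def train(W1, w2, data, lr=0.05, epochs=20):
    """Plain SGD on a tiny net: tanh hidden layer, linear output, squared error."""
    W1 = [row[:] for row in W1]
    w2 = w2[:]
    for _ in range(epochs):
        for x, y in data:
            hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
            err = sum(v * h for v, h in zip(w2, hidden)) - y
            # Gradients: dL/dw2[j] = err*h[j];  dL/dW1[j][i] = err*w2[j]*(1 - h[j]^2)*x[i]
            new_w2 = [v - lr * err * h for v, h in zip(w2, hidden)]
            W1 = [[w - lr * err * v * (1 - h * h) * xi for w, xi in zip(row, x)]
                  for row, v, h in zip(W1, w2, hidden)]
            w2 = new_w2
    return W1, w2

data = [([1.0, -2.0], 0.5), ([0.5, 1.5], -1.0), ([-1.0, 0.3], 2.0)]  # made-up samples
n_inputs, n_hidden = 2, 3

random.seed(42)
c = he_draw(n_inputs)  # ONE random draw, copied into every weight
same_W1, same_w2 = train([[c] * n_inputs for _ in range(n_hidden)], [c] * n_hidden, data)

# Independent draws for every weight, as He initialization intends.
rand_W1, rand_w2 = train(
    [[he_draw(n_inputs) for _ in range(n_inputs)] for _ in range(n_hidden)],
    [he_draw(n_hidden) for _ in range(n_hidden)],
    data,
)

same_units_identical = all(row == same_W1[0] for row in same_W1)
rand_units_identical = all(row == rand_W1[0] for row in rand_W1)
print(same_units_identical, rand_units_identical)
```

With the shared starting value, the three hidden units still hold identical weights after training (the first flag is True), while independently initialized units diverge (the second flag is False).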

# 2.	Is it OK to initialize the bias terms to 0?

Yes, it is generally fine to initialize the bias terms to 0. The random weights already break the symmetry between neurons, and the bias only shifts the input of the activation function, so starting it at 0 is common practice. Some initialization schemes use non-zero biases, but this is not required for most deep learning applications.

# 3.	Name three advantages of the SELU activation function over ReLU.

Self-normalization: Under the right conditions (a plain stack of dense layers, standardized inputs, LeCun normal initialization), each layer's output tends to keep a mean close to 0 and a standard deviation close to 1 during training, which helps mitigate the vanishing/exploding gradients problem.

Negative outputs: SELU can output negative values, so the average output of a layer stays closer to 0 than with ReLU, whose outputs are never negative.

Avoiding dead neurons: SELU's derivative is non-zero for every input, so a neuron cannot get stuck the way a "dying ReLU" does once its inputs are negative and its gradient is exactly 0. Note that SELU is not perfectly smooth: its slope jumps at z = 0.
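A small numerical check can make these properties concrete. The sketch below is not part of the assignment; the helper `gaussian_moments` is a hypothetical name. It feeds a standard normal pre-activation through SELU and ReLU and computes the mean and variance of the result by numerical integration, using the standard SELU constants.

```python
import math

ALPHA = 1.6732632423543772   # standard SELU alpha
SCALE = 1.0507009873554805   # standard SELU scale (lambda)

def selu(z):
    return SCALE * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

def relu(z):
    return max(z, 0.0)

def gaussian_moments(f, lo=-10.0, hi=10.0, steps=20000):
    """Mean and variance of f(z) for z ~ N(0, 1), by trapezoidal integration."""
    h = (hi - lo) / steps
    m1 = m2 = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        weight = (0.5 if k in (0, steps) else 1.0) * h
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        m1 += weight * pdf * f(z)
        m2 += weight * pdf * f(z) ** 2
    return m1, m2 - m1 * m1

selu_mean, selu_var = gaussian_moments(selu)
relu_mean, relu_var = gaussian_moments(relu)

# Derivative of SELU for a negative input: small, but never exactly zero.
selu_slope_at_minus_3 = SCALE * ALPHA * math.exp(-3.0)

print(f"SELU: mean={selu_mean:.4f} var={selu_var:.4f}")
print(f"ReLU: mean={relu_mean:.4f} var={relu_var:.4f}")
print(f"SELU slope at z=-3: {selu_slope_at_minus_3:.4f} (ReLU slope there is 0)")
```

SELU maps a mean-0, variance-1 input back to roughly mean 0 and variance 1, which is the fixed point behind self-normalization, whereas ReLU shifts the mean upward and shrinks the variance.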

# 4.	In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

SELU: Use SELU in deep feedforward networks made of a plain stack of dense layers, where the conditions for self-normalization can be met.

Leaky ReLU and variants: Use them when you want to avoid the dying ReLU problem at nearly the same cost as ReLU. A small slope for negative inputs keeps the gradient non-zero; variants such as randomized leaky ReLU (RReLU) and parametric leaky ReLU (PReLU) randomize or learn that slope.

ReLU: It's a popular choice for most hidden layers in deep neural networks due to its simplicity and computational efficiency.

tanh: Use tanh when you need centered activations that range from -1 to 1, such as in recurrent neural networks (RNNs).

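logistic (sigmoid): Use the logistic function in the output layer when you need a probability, as in binary classification (or one output unit per label in multilabel classification). It is rarely used in hidden layers today.

softmax: Use softmax in the output layer for multi-class classification with mutually exclusive classes, to obtain class probabilities that sum to 1.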
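The difference between the two output-layer functions is easy to see on raw scores. The following is a minimal sketch in plain Python with made-up logits; a multilabel case is included only to contrast independent sigmoid outputs with softmax.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up raw scores

binary_proba = sigmoid(0.0)               # one output unit: P(positive class)
class_probas = softmax(logits)            # one unit per class: probabilities sum to 1
multilabel = [sigmoid(v) for v in logits] # independent labels: need not sum to 1

print(binary_proba, class_probas, multilabel)
```

Softmax forces the outputs to compete for a total probability of 1, which suits mutually exclusive classes; independent sigmoids do not, which suits labels that can co-occur.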

# 5.	What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

With momentum that close to 1, the optimizer barely forgets past gradients, so it keeps accelerating and builds up a very large velocity. It will hopefully head toward the minimum, but its momentum carries it right past it; it then slows down, comes back, and overshoots again. It may oscillate this way many times before settling, or diverge altogether, so convergence takes much longer than with a smaller value such as 0.9.
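The effect is visible even on a one-dimensional quadratic. The sketch below assumes the classic momentum update (velocity = momentum * velocity - lr * gradient) and a toy loss f(theta) = theta**2 / 2; the function name `tail_amplitude` and the step counts are illustrative choices.

```python
def tail_amplitude(momentum, lr=0.1, steps=500, tail=100):
    """Run momentum SGD on f(theta) = theta**2 / 2 from theta = 1.

    Returns the largest |theta| seen during the last `tail` steps,
    i.e. how far from the minimum (theta = 0) the optimizer still swings.
    """
    theta, velocity = 1.0, 0.0
    distances = []
    for _ in range(steps):
        grad = theta  # f'(theta) = theta
        velocity = momentum * velocity - lr * grad
        theta += velocity
        distances.append(abs(theta))
    return max(distances[-tail:])

amp_typical = tail_amplitude(0.9)
amp_extreme = tail_amplitude(0.99999)
print(f"momentum=0.9     -> max |theta| in last 100 steps: {amp_typical:.2e}")
print(f"momentum=0.99999 -> max |theta| in last 100 steps: {amp_extreme:.2e}")
```

With momentum 0.9 the parameter has essentially reached the minimum after 500 steps, while with 0.99999 it is still swinging back and forth about as far from the minimum as where it started.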

# 6.	Name three ways you can produce a sparse model.

L1 Regularization: Adding an L1 penalty term to the loss function pushes many model weights to exactly zero, which yields a sparse model.

Weight Pruning: Identify and set small-weight connections to zero after training, which sparsifies the model.

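Pruning during training: Use a tool such as the TensorFlow Model Optimization Toolkit, whose pruning API progressively zeroes out low-magnitude connections while the model trains. (Binary-weight schemes such as BinaryConnect, with weights of -1 or 1, compress a model but do not make it sparse, since no weight is zero.)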
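Two of these ideas can be sketched on a plain list of weights. The helpers below are hypothetical illustrations, not library APIs: `soft_threshold` is the shrinkage step that an L1 penalty induces in proximal gradient methods, and `prune_by_magnitude` zeroes out the smallest weights after training. The weight values are made up.

```python
import math

def soft_threshold(weights, lam):
    """Shrink every weight toward 0 by lam; weights smaller than lam become 0."""
    return [math.copysign(max(abs(w) - lam, 0.0), w) for w in weights]

def prune_by_magnitude(weights, sparsity):
    """Set the `sparsity` fraction of weights with the smallest magnitude to 0."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]  # made-up trained weights

l1_style = soft_threshold(weights, lam=0.1)
pruned = prune_by_magnitude(weights, sparsity=0.5)

l1_zero_count = sum(1 for w in l1_style if w == 0)
print(l1_zero_count, pruned)
```

Both routes leave the three large weights in place and set the three small ones to zero; the L1-style step also shrinks the survivors, while pruning leaves them unchanged.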

# 7.	Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC Dropout?

Dropout does slow down training, often by roughly a factor of two, because the network needs more epochs to converge. It has no impact on inference speed, because dropout is active only during training and the full network is used for predictions. MC Dropout trains exactly like regular dropout, but at inference time dropout stays on and each prediction is the average of several stochastic forward passes, so inference is slower by about the number of passes (10 passes cost about 10 times as much). In exchange, it tends to give better probability estimates and a measure of uncertainty.
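The cost difference comes down to counting forward passes. The sketch below uses a hypothetical single-unit linear "model" with inverted dropout on its inputs (the class `TinyDropoutModel` and all numbers are made up for illustration) to compare one regular prediction with an MC Dropout estimate.

```python
import random

class TinyDropoutModel:
    """A single linear unit with inverted dropout on its inputs (illustration only)."""

    def __init__(self, weights, rate):
        self.weights = weights
        self.rate = rate
        self.forward_passes = 0

    def __call__(self, x, training=False):
        self.forward_passes += 1
        if not training:
            # Dropout is inactive at inference: one deterministic pass.
            return sum(w * xi for w, xi in zip(self.weights, x))
        keep = 1.0 - self.rate
        # Each input is dropped with probability `rate`; survivors are scaled by 1/keep.
        return sum(w * xi / keep for w, xi in zip(self.weights, x)
                   if random.random() < keep)

def mc_predict(model, x, n_samples):
    """Average n_samples stochastic passes; the spread is a rough uncertainty estimate."""
    samples = [model(x, training=True) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    std = (sum((s - mean) ** 2 for s in samples) / n_samples) ** 0.5
    return mean, std

random.seed(0)
model = TinyDropoutModel([0.5, -1.0, 2.0, 0.25], rate=0.5)
x = [1.0, 2.0, 0.5, 4.0]

plain = model(x)  # regular inference: a single forward pass
passes_before = model.forward_passes
mc_mean, mc_std = mc_predict(model, x, n_samples=2000)
mc_passes = model.forward_passes - passes_before

print(plain, mc_mean, mc_std, mc_passes)
```

Regular inference costs one pass, while the MC estimate costs as many passes as samples requested; what it buys is a spread (`mc_std`) around the averaged prediction.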

# 8.	Practice training a deep neural network on the CIFAR10 image dataset:
a.	Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the point of this exercise). Use He initialization and the ELU activation function.

b.	Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons. Remember to search for the right learning rate each time you change the model’s architecture or hyperparameters.

c.	Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?

d.	Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).

e.	Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.


In [4]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, AlphaDropout
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.initializers import lecun_normal
from sklearn.model_selection import train_test_split

# Step a: Build a DNN with 20 hidden layers of 100 neurons each using He initialization and ELU activation.
model = Sequential()
model.add(Dense(100, activation='elu', kernel_initializer='he_normal', input_shape=(32*32*3,)))
for _ in range(19):
    model.add(Dense(100, activation='elu', kernel_initializer='he_normal'))

model.add(Dense(10, activation='softmax'))

# Step b: Load and preprocess the CIFAR-10 dataset.
(X_train_full, y_train_full), (X_test, y_test) = cifar10.load_data()
X_train_full = X_train_full.reshape(X_train_full.shape[0], -1).astype('float32') / 255.0
X_test = X_test.reshape(X_test.shape[0], -1).astype('float32') / 255.0
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.1, random_state=42)

# Compile the model and configure early stopping.
model.compile(loss='sparse_categorical_crossentropy', optimizer=Nadam(learning_rate=1e-4), metrics=['accuracy'])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

# Train the model.
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=[early_stopping_cb])

# Step c: Add Batch Normalization and compare the learning curves.
model_with_bn = Sequential()
model_with_bn.add(Dense(100, activation='elu', kernel_initializer='he_normal', input_shape=(32*32*3,)))
model_with_bn.add(BatchNormalization())
for _ in range(19):
    model_with_bn.add(Dense(100, activation='elu', kernel_initializer='he_normal'))
    model_with_bn.add(BatchNormalization())

model_with_bn.add(Dense(10, activation='softmax'))

model_with_bn.compile(loss='sparse_categorical_crossentropy', optimizer=Nadam(learning_rate=1e-4), metrics=['accuracy'])
history_with_bn = model_with_bn.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=[early_stopping_cb])

# Step d: Replace Batch Normalization with SELU and make necessary adjustments.
# SELU only self-normalizes with standardized inputs (mean 0, std 1 per feature),
# so standardize using statistics computed on the training set only.
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

model_with_selu = Sequential()
model_with_selu.add(Dense(100, activation='selu', kernel_initializer=lecun_normal(), input_shape=(32*32*3,)))
for _ in range(19):
    model_with_selu.add(Dense(100, activation='selu', kernel_initializer=lecun_normal()))

model_with_selu.add(Dense(10, activation='softmax'))

model_with_selu.compile(loss='sparse_categorical_crossentropy', optimizer=Nadam(learning_rate=1e-4), metrics=['accuracy'])
history_with_selu = model_with_selu.fit(X_train_scaled, y_train, epochs=100, validation_data=(X_valid_scaled, y_valid), callbacks=[early_stopping_cb])

# Step e: Regularize the model with alpha dropout and experiment with MC Dropout.
# Alpha dropout preserves the mean and standard deviation of its inputs, so it is compatible with SELU.
model_with_alpha_dropout = Sequential()
model_with_alpha_dropout.add(Dense(100, activation='selu', kernel_initializer=lecun_normal(), input_shape=(32*32*3,)))
for _ in range(19):
    model_with_alpha_dropout.add(Dense(100, activation='selu', kernel_initializer=lecun_normal()))
    model_with_alpha_dropout.add(AlphaDropout(0.5))  # Alpha dropout with a rate of 0.5

model_with_alpha_dropout.add(Dense(10, activation='softmax'))

model_with_alpha_dropout.compile(loss='sparse_categorical_crossentropy', optimizer=Nadam(learning_rate=1e-4), metrics=['accuracy'])
history_with_alpha_dropout = model_with_alpha_dropout.fit(X_train_scaled, y_train, epochs=100, validation_data=(X_valid_scaled, y_valid), callbacks=[early_stopping_cb])

# MC Dropout: keep dropout active at prediction time (training=True) and average several stochastic passes.
# model.predict() runs in inference mode, where dropout layers are switched off, so it cannot be used here.
import numpy as np

def predict_with_mc_dropout(model, X, n_iterations=100):
    probas = np.stack([model(X, training=True).numpy() for _ in range(n_iterations)])
    return probas.mean(axis=0)
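To compare the runs in steps c to e, the `history.history` dictionaries returned by `fit()` can be condensed into a few numbers each. The helper below is a hypothetical sketch (`summarize_history` is not a Keras function), and the dictionary in the usage lines contains made-up values, not results from this notebook.

```python
def summarize_history(history_dict):
    """Summarize a Keras-style history dict that has 'val_loss' and 'val_accuracy' lists."""
    val_loss = history_dict["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "epochs_run": len(val_loss),
        "best_epoch": best_index + 1,  # 1-based, like the Keras training log
        "best_val_loss": val_loss[best_index],
        "val_accuracy_at_best": history_dict["val_accuracy"][best_index],
    }

# Made-up numbers, only to show the shape of the input and output.
toy_history = {
    "val_loss": [1.90, 1.70, 1.60, 1.65, 1.70],
    "val_accuracy": [0.30, 0.38, 0.42, 0.41, 0.40],
}
summary = summarize_history(toy_history)
print(summary)
```

Calling it on `history.history`, `history_with_bn.history`, and so on would show how many epochs each model needed and how good its best validation score was, which is what step c asks about.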



Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
(Training log; the per-epoch metrics are not preserved in this export.)
Epoch 1/100 to Epoch 29/100 (first fit call)
Epoch 1/100 to Epoch 23/100 (second fit call)
Epoch 1/100 to Epoch 21/100 (third fit call)
Epoch 29/100 (presumably the fourth fit call; its earlier lines are missing)


# Done all 8 questions 

# Regards, Yashwant