### 1. Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?


### This is not recommended, even if the shared value is selected randomly. Weight initialization techniques like He initialization exist to break symmetry: if every weight in a layer starts at the same value, every neuron in that layer computes the same output and receives the same gradient, so the neurons stay identical throughout training and the layer behaves like a single neuron.
### He initialization, which is designed for ReLU (Rectified Linear Unit) and its variants, samples each weight independently from a Gaussian distribution with mean 0 and variance 2/n, where "n" is the number of input units to the neuron. This variance is chosen to keep the scale of the signals roughly constant from layer to layer, which helps avoid problems like vanishing or exploding gradients during training.
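
As a minimal sketch of the symmetry argument, the snippet below compares a layer whose weights are sampled independently with standard deviation sqrt(2/n) against a layer where one randomly drawn value is shared by every weight. The helper names (`he_normal`, `layer_output`) and the layer sizes are made up for illustration; a real model would use the framework's own initializer.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix with mean 0 and variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def layer_output(x, weights):
    """Pre-activation of each neuron: the weighted sum of the inputs."""
    fan_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(fan_out)]

rng = random.Random(42)
fan_in, fan_out = 200, 3  # illustrative sizes
x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]

# Independent He-initialized weights: each neuron computes something different
he_out = layer_output(x, he_normal(fan_in, fan_out, rng))

# One randomly drawn value shared by all weights: every neuron is identical
shared_value = rng.gauss(0.0, math.sqrt(2.0 / fan_in))
const_out = layer_output(x, [[shared_value] * fan_out for _ in range(fan_in)])

print("He init outputs:      ", he_out)
print("Shared-value outputs: ", const_out)
```

With the shared value, all three neurons produce the same number, so they would also receive the same gradient update.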

------------

### 2. Is it OK to initialize the bias terms to 0?


### Initializing bias terms to 0 is a common practice in neural network initialization, and it is generally considered acceptable. Symmetry only needs to be broken once, and the randomly initialized weights already do that, so the bias terms are not subject to the same concern. A bias term simply shifts the input of the activation function, and it can move away from 0 during training like any other parameter.

----------

### 3. Name three advantages of the SELU activation function over ReLU.



### 1. SELU has the property of self-normalization: when used in specific types of DNN (a plain stack of dense layers with standardized inputs and LeCun normal initialization), it helps keep the mean and variance of each layer's output stable throughout the network. This helps with issues such as vanishing or exploding gradients.

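### 2. SELU always has a nonzero derivative, including for negative inputs. This avoids the 'dying ReLU' problem, where a neuron gets stuck outputting 0 and stops updating its weights. (Sparse activations are a property of ReLU, not of SELU.)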

### 3. SELU can take negative values, so the average output of the neurons in a layer is typically closer to 0 than with ReLU, which never outputs a negative value. This also helps alleviate the vanishing gradients problem. Unlike ReLU, SELU is not piecewise linear: it is linear for positive inputs and a scaled exponential for negative inputs.
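
To make the comparison with ReLU concrete, here is a small sketch of both functions and of the SELU derivative. The two constants are the standard SELU values (truncated to double precision); the function names are just for this example.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(z):
    return max(0.0, z)

def selu(z):
    """scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def selu_grad(z):
    """Derivative of SELU: never exactly zero, even for negative inputs."""
    return SELU_SCALE * (1.0 if z > 0 else SELU_ALPHA * math.exp(z))

for z in (-3.0, -1.0, 0.5, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.4f}  selu={selu(z):+.4f}  selu'={selu_grad(z):.4f}")
```

For negative inputs ReLU outputs 0 with zero gradient, while SELU outputs a negative value (bounded below by about -1.758) and keeps a small positive gradient.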


----------

### 4. In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

### The use of activation functions depends on the problem, architecture, and data characteristics. 

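1. SELU
- Use SELU when working with deep neural networks made of a plain stack of dense layers. Self-normalization is not guaranteed for other architectures, such as recurrent networks (e.g., LSTMs) or networks with skip connections.
- When its conditions are met (standardized inputs, LeCun normal initialization), SELU can help address the vanishing gradient problem and encourage self-normalization in deep networks, leading to faster convergence.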

2. Leaky ReLU
- Use leaky ReLU when you want to mitigate the 'dying ReLU' problem, where neurons become inactive and stop updating their weights during training.
- Parametric ReLU and randomized leaky ReLU can be considered when you want to learn the leak slope or introduce randomness, respectively, to improve training.

3. ReLU
- Use ReLU when you want a simple, computationally efficient activation function that is effective in most cases.

4. tanh
- Use tanh when you need an activation function that squashes input values between -1 and 1, which helps with zero centered activations.

5. Logistic (Sigmoid)
- Use the logistic function when you need to output a probability between 0 and 1, as in binary classification.
- It is commonly used in the output layer of binary classifiers or networks designed for probability estimation.

6. Softmax
- Use softmax in the output layer when you have a multiclass classification problem. 
- It converts raw scores or logits into a probability distribution over multiple classes.
- It is used in conjunction with cross-entropy loss for training and evaluation.
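
The two output-layer functions in items 5 and 6 can be sketched in a few lines. This is an illustrative standalone version (the logits are made-up numbers); in practice the framework's built-in activations would be used.

```python
import math

def sigmoid(z):
    """Logistic function: squashes a single score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(logits)
print("class probabilities:", probs, "sum =", sum(probs))
print("binary probability for score 0.0:", sigmoid(0.0))
```

The class with the largest logit always receives the largest probability, and the probabilities sum to 1, which is what cross-entropy loss expects.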

-------------

### 5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

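### Momentum is a technique used to accelerate convergence and stabilize training, but when it's set too high, it can have several unintended consequences.

1. Overshooting the minimum - with very high momentum the optimizer keeps moving in its current direction even when it should be slowing down or changing direction. It picks up a lot of speed, shoots past the minimum of the loss function, comes back, and overshoots again, so training oscillates or even diverges.
2. Slow convergence - these oscillations can repeat many times before they die out, so training takes much longer than with a smaller momentum value.
3. Difficulty settling - the same inertia that helps the optimizer roll past shallow local minima also carries it past good ones.
4. Sensitivity to the learning rate - at a constant gradient the step size grows toward the learning rate times the gradient divided by (1 - momentum), so with momentum 0.99999 even a small learning rate can produce huge steps.
5. Loss fluctuations - the training loss rises and falls instead of decreasing steadily.

To avoid these issues, choose a reasonable momentum value, typically in the range of 0.5 to 0.9, with 0.9 being a common default.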
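
The overshooting behaviour can be reproduced on a toy problem. The sketch below runs plain momentum SGD on the one-dimensional loss f(x) = x²/2 (an assumption chosen only because its gradient is simply x) and measures how far from the minimum the optimizer still swings during the last 50 of 300 steps. The learning rate, step count, and function names are illustrative.

```python
def run_momentum_sgd(momentum, learning_rate=0.1, n_steps=300, x0=1.0):
    """Minimize f(x) = x**2 / 2 with momentum SGD and return every visited x."""
    x, velocity = x0, 0.0
    trajectory = []
    for _ in range(n_steps):
        gradient = x  # derivative of x**2 / 2
        velocity = momentum * velocity - learning_rate * gradient
        x += velocity
        trajectory.append(x)
    return trajectory

def tail_amplitude(trajectory, n_last=50):
    """Largest distance from the minimum (x = 0) over the last n_last steps."""
    return max(abs(x) for x in trajectory[-n_last:])

moderate = tail_amplitude(run_momentum_sgd(momentum=0.9))
extreme = tail_amplitude(run_momentum_sgd(momentum=0.99999))
print(f"momentum=0.9     -> still swinging by {moderate:.2e}")
print(f"momentum=0.99999 -> still swinging by {extreme:.2e}")
```

With momentum 0.9 the oscillations have essentially died out, whereas with 0.99999 the optimizer is still swinging across the minimum with almost its initial amplitude.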

------------

### 6. Name three ways you can produce a sparse model.

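Producing a sparse model means ending up with a neural network in which most of the weights (connections) are exactly zero.

1. Pruning - train the model normally, then remove the less important weights or connections, for example by zeroing out the weights with the smallest magnitude.
2. L1 regularization - apply strong L1 regularization during training, which pushes the optimizer to drive as many weights as it can to zero.
3. Pruning during training - iteratively remove connections based on their magnitude while the model trains, which is the approach taken by the TensorFlow Model Optimization Toolkit.

Quantization (reducing the precision of weights and activations, typically from 32-bit floating point values to lower-bit representations) and knowledge distillation (a large, complex teacher model transfers its knowledge to a smaller, simpler student model) are related compression techniques, but they make a model smaller or cheaper rather than sparse.

These methods can be used individually or in combination to produce models that are suitable for deployment on resource-constrained devices or for reducing the computational cost of large neural networks while preserving their performance to a reasonable extent.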
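
As a rough illustration of pruning (item 1), the sketch below zeroes out the fraction of weights with the smallest absolute value. The function name, the weight values, and the 50% target are made up for this example; real toolkits prune tensors layer by layer and usually fine-tune afterwards.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.002, 0.4]  # hypothetical layer weights
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)
print("fraction of zeros:", pruned.count(0.0) / len(pruned))
```

The large weights survive untouched while the four smallest ones become exact zeros, which is what makes the result sparse rather than merely small.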

--------------

### 7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC Dropout?

### Dropout is a regularization technique commonly used in neural networks to prevent overfitting. It randomly deactivates (drops out) a fraction of neurons at each training iteration. Dropout does slow down training: the extra computation per step is small, but the network generally needs more epochs to converge (a slowdown of roughly a factor of two is often cited). It does not slow down inference (making predictions on new instances), because dropout is switched off outside of training.

### During training, dropout randomly sets a fraction of neuron activations to zero. This introduces a form of stochasticity and encourages the network to be more robust by preventing it from relying too heavily on any single neuron or feature.

### MC (Monte Carlo) Dropout is identical to dropout during training, but it keeps dropout active at inference: you run several stochastic forward passes (for example 10 or more) on the same inputs and average the predictions to obtain more reliable predictions and uncertainty estimates.

### MC Dropout therefore slows down inference roughly in proportion to the number of forward passes. In exchange, it provides valuable uncertainty estimates for the model's predictions, which can be useful in applications like Bayesian deep learning or active learning.
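
The difference between the three modes can be sketched on a toy vector of activations. This is not a Keras implementation: `dropout` and `mc_dropout_mean` are hypothetical helpers that use standard "inverted" dropout (survivors are rescaled by 1 / (1 - rate)), and the inputs are made-up numbers.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: active only when training=True, a no-op otherwise."""
    if not training:
        return list(activations)
    keep = 1.0 - rate
    # Each unit survives with probability `keep` and is rescaled to preserve the mean
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def mc_dropout_mean(activations, rate, n_passes, rng):
    """MC Dropout: keep dropout on and average several stochastic passes."""
    sums = [0.0] * len(activations)
    for _ in range(n_passes):
        for i, a in enumerate(dropout(activations, rate, training=True, rng=rng)):
            sums[i] += a
    return [s / n_passes for s in sums]

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
inference_out = dropout(x, 0.5, training=False, rng=rng)  # one cheap pass
mc_out = mc_dropout_mean(x, 0.5, n_passes=2000, rng=rng)  # 2000 passes
print("standard inference:", inference_out)
print("MC Dropout average:", mc_out)
```

Standard inference costs a single deterministic pass, while the MC average costs `n_passes` passes, which is where the extra inference time comes from.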

-----------

### 8. Practice training a deep neural network on the CIFAR10 image dataset:

### a. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the point of this exercise). Use He initialization and the ELU activation function.

You can load it with keras.datasets.cifar10.load_data(). The dataset is
composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for
testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons.
Remember to search for the right learning rate each time you change the model’s
architecture or hyperparameters.

### b. Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. 

In [None]:
# Import the libraries

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, BatchNormalization, Activation
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.callbacks import EarlyStopping

# Load and unpack the CIFAR-10 dataset

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Normalize pixel values to between 0 and 1
X_train, X_test = X_train / 255.0, X_test / 255.0

# Build the deep neural network
model = Sequential()
model.add(InputLayer(input_shape=(32, 32, 3)))  # Input layer
model.add(tf.keras.layers.Flatten())  # Flatten each image into a 3,072-value vector

# Add 20 hidden layers with He initialization and ELU activation
for _ in range(20):
    model.add(Dense(100, kernel_initializer='he_normal'))
    model.add(Activation('elu'))

# Output layer with softmax activation for 10 classes
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer=Nadam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the model
# Note: the test set doubles as the validation set here; holding out part of
# the training set instead would keep the test set untouched.
history = model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test), callbacks=[early_stopping])

# history.history maps each metric name to its list of per-epoch values
print(history.history.keys())

### c. Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?

In [None]:
def build_dnn_with_batch_norm():
    model = Sequential()
    model.add(InputLayer(input_shape=(32, 32, 3)))  # Input layer
    model.add(tf.keras.layers.Flatten())  # Flatten layer to flatten input images
    
    # Add 20 hidden layers with He initialization, ELU activation, and Batch Normalization
    for _ in range(20):
        model.add(Dense(100, kernel_initializer='he_normal'))
        model.add(BatchNormalization()) #add batchnormalization
        model.add(Activation('elu'))
    
    # Output layer with softmax activation for 10 classes
    model.add(Dense(10, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer=Nadam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model

# Build and train the model with Batch Normalization
dnn_model_with_batch_norm = build_dnn_with_batch_norm()
history_with_batch_norm = dnn_model_with_batch_norm.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test), callbacks=[early_stopping])


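### Batch Normalization typically makes the network converge in fewer epochs, although each epoch takes longer because of the extra computations in every layer. Whether it produces a better model depends on many factors, so compare the validation curves of the two runs.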

### d. Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).


In [None]:
# Modify the build_dnn function to use SELU activation
def build_dnn_with_selu():
    model = Sequential()
    model.add(InputLayer(input_shape=(32, 32, 3)))  # Input layer
    model.add(tf.keras.layers.Flatten())  # Flatten layer to flatten input images
    
    # Add 20 hidden layers with LeCun initialization and SELU activation
    for _ in range(20):
        model.add(Dense(100, kernel_initializer='lecun_normal', activation='selu'))
    
    # Output layer with softmax activation for 10 classes
    model.add(Dense(10, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer=Nadam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model

# SELU only self-normalizes on standardized inputs (mean 0, standard deviation 1),
# so standardize each feature using statistics from the training set
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

# Build and train the model with SELU activation
# Tried with fewer epochs; 100 epochs took a lot of time.

dnn_model_with_selu = build_dnn_with_selu()
history_with_selu = dnn_model_with_selu.fit(X_train_scaled, y_train, epochs=50, validation_data=(X_test_scaled, y_test), callbacks=[early_stopping])


### e. Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.

In [None]:
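import numpy as np
from tensorflow.keras.layers import AlphaDropout

# Modify the SELU build function to include AlphaDropout
def build_dnn_with_alpha_dropout():
    model = Sequential()
    model.add(InputLayer(input_shape=(32, 32, 3)))  # Input layer
    model.add(tf.keras.layers.Flatten())  # Flatten layer to flatten input images
    
    # Add 20 hidden layers with LeCun initialization and SELU activation, each
    # followed by AlphaDropout (which is designed to preserve self-normalization)
    for _ in range(20):
        model.add(Dense(100, kernel_initializer='lecun_normal', activation='selu'))
        model.add(AlphaDropout(0.2))  # Adjust dropout rate as needed
    
    # Output layer with softmax activation for 10 classes
    model.add(Dense(10, activation='softmax'))
    
    # Compile the model
    model.compile(optimizer=Nadam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model

# Standardize the inputs for SELU (recomputed here so this cell runs on its own)
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

# Build and train the model with AlphaDropout
dnn_model_with_alpha_dropout = build_dnn_with_alpha_dropout()
history_with_alpha_dropout = dnn_model_with_alpha_dropout.fit(X_train_scaled, y_train, epochs=100, validation_data=(X_test_scaled, y_test), callbacks=[early_stopping])

# Apply MC Dropout for predictions: training=True keeps dropout active
# (model.predict() would switch it off), then average the stochastic passes
mc_predictions = np.stack([dnn_model_with_alpha_dropout(X_test_scaled, training=True).numpy() for _ in range(100)])
mc_pred_mean = mc_predictions.mean(axis=0)

# Compare the averaged predictions with the true labels, without retraining
mc_accuracy = (mc_pred_mean.argmax(axis=1) == y_test.flatten()).mean()
print("MC Dropout accuracy:", mc_accuracy)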
