# Assignment 06 Solutions

Submitted By: ANSARI PARVEJ

#### 1.	What are the advantages of a CNN over a fully connected DNN for image classification?

**Ans:**

CNNs have several advantages over fully connected DNNs for image classification:

- Parameter sharing: In a CNN, the same filter is applied to multiple parts of the input image, allowing the network to detect similar features in different parts of the image. This significantly reduces the number of parameters in the network, which can speed up training and reduce overfitting.

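- Translation invariance: Because the same filters are applied across the whole image, a CNN that has learned to detect a feature in one location can detect it in any other location. Strictly speaking, convolution is translation-equivariant, and pooling adds a degree of invariance to small shifts. CNNs are not inherently robust to rotation or scaling; that robustness usually has to come from data augmentation.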

- Hierarchical representation: CNNs are typically composed of multiple layers, each of which learns to recognize increasingly complex patterns. The first layer might learn to detect edges, the second layer might learn to detect shapes, and so on. This hierarchical representation allows CNNs to capture complex patterns and relationships between features.

- Local connectivity: Each neuron in a convolutional layer is connected only to a small receptive field in the previous layer, so CNNs capture local features and the spatial relationships between them. A fully connected DNN, on the other hand, takes a flattened image as input and has no built-in knowledge of which pixels are neighbors.
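To make the parameter-sharing point concrete, the sketch below compares parameter counts for a dense layer and a convolutional layer applied to the same input. The image size (100 × 100 RGB), the 1,000-unit dense layer, and the 32-filter 3 × 3 convolution are illustrative assumptions, not values taken from the question.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter; independent of the image size."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# Hypothetical setup: a 100 x 100 RGB image.
height, width, channels = 100, 100, 3

dense_count = dense_params(height * width * channels, 1000)
conv_count = conv2d_params(3, 3, channels, 32)

print(f"Dense layer (1000 units):    {dense_count:,} parameters")
print(f"Conv layer (32 3x3 filters): {conv_count:,} parameters")
```

The dense layer's count grows with the number of pixels, while the convolutional layer's count depends only on the kernel size and the channel counts.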


#### 2.	Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and "same" padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels.
What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?

**Ans:**

The total number of parameters in the CNN can be calculated as follows:

First Convolutional layer:

    Number of parameters in the kernel: 3 * 3 * 3 * 100 = 2700
    Number of biases: 100
    Total parameters in the layer: 2800

Second Convolutional layer:

    Number of parameters in the kernel: 3 * 3 * 100 * 200 = 180,000
    Number of biases: 200
    Total parameters in the layer: 180,200

Third Convolutional layer:

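    Number of parameters in the kernel: 3 * 3 * 200 * 400 = 720,000
    Number of biases: 400
    Total parameters in the layer: 720,400

Total number of parameters in the CNN = 2,800 + 180,200 + 720,400 = 903,400

With 32-bit floats (4 bytes each), the parameters occupy 903,400 * 4 = 3,613,600 bytes (about 3.6 MB). With a stride of 2 and "same" padding, each layer halves the height and width of its input (rounding up), so the feature maps for one image require:

    First layer:  100 * 150 * 100 * 4 bytes = 6,000,000 bytes (6 MB)
    Second layer: 50 * 75 * 200 * 4 bytes = 3,000,000 bytes (3 MB)
    Third layer:  25 * 38 * 400 * 4 bytes = 1,520,000 bytes (about 1.5 MB)

The network will therefore require at least the following amount of RAM:

    Prediction for a single instance: 6 MB + 3 MB + 3.6 MB = about 12.6 MB
    Training on a mini-batch of 50 images: 50 * (6 + 3 + 1.52) MB + 36 MB + 3.6 MB = about 565.6 MB

For a single prediction, the memory of a layer can be released as soon as the next layer has been computed, so the peak is the two largest consecutive layers (6 MB + 3 MB) plus the parameters. During training, every feature map must be kept for the backward pass, and the 50 input images add 50 * 200 * 300 * 3 * 4 bytes = 36 MB. The training figure is a lower bound, because gradients and optimizer state need additional memory.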
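The arithmetic for this exercise can be checked with a short script. The sketch below makes the following assumptions: "same" padding with stride 2 gives an output size of ceil(input size / 2); the peak memory for a single prediction is the largest pair of consecutive feature-map layers plus the parameters; training keeps every feature map and the input for each of the 50 images. Gradients and optimizer state are ignored, so the training figure is a lower bound.

```python
import math

BYTES_PER_FLOAT = 4  # 32-bit floats

height, width, in_channels = 200, 300, 3
filters_per_layer = [100, 200, 400]
kernel, stride = 3, 2

total_params = 0
feature_map_bytes = []
h, w, c = height, width, in_channels
for n_filters in filters_per_layer:
    # Kernel weights span all input channels; one bias per feature map.
    total_params += kernel * kernel * c * n_filters + n_filters
    # "same" padding with stride s gives ceil(size / s) outputs per dimension.
    h, w = math.ceil(h / stride), math.ceil(w / stride)
    feature_map_bytes.append(h * w * n_filters * BYTES_PER_FLOAT)
    c = n_filters

param_bytes = total_params * BYTES_PER_FLOAT
input_bytes = height * width * in_channels * BYTES_PER_FLOAT

# Inference: only two consecutive layers need to be in memory at once.
peak_pair = max(a + b for a, b in zip(feature_map_bytes, feature_map_bytes[1:]))
inference_bytes = peak_pair + param_bytes

# Training: all feature maps are kept for the backward pass.
batch_size = 50
training_bytes = batch_size * (sum(feature_map_bytes) + input_bytes) + param_bytes

print(f"Parameters: {total_params:,}")
print(f"Inference:  {inference_bytes / 1e6:.1f} MB")
print(f"Training:   {training_bytes / 1e6:.1f} MB")
```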

#### 3.	If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?

**Ans:**

Here are five things you could try to solve the problem of running out of GPU memory while training a CNN:

- Decrease the batch size: reducing the number of instances processed in one batch will lower the amount of memory required.
- Use 16-bit floats: training with half precision (float16), or with mixed precision, instead of single precision (float32) roughly halves the memory needed for activations.
- Reduce the size of the model: removing one or more layers or decreasing the number of filters in each layer lowers both the number of parameters and the number of activations that must be stored.
- Reduce the size of the feature maps: using a larger stride in one or more layers, or downscaling the input images, shrinks every feature map that follows.
- Distribute the CNN across multiple devices: splitting the model over several GPUs spreads the memory load, at the cost of extra communication between devices.

#### 4.	Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

**Ans:**

Max pooling layers are typically used after convolutional layers in CNNs to reduce the spatial dimensionality of the feature maps. The pooling operation involves partitioning the feature maps into non-overlapping rectangular subregions and outputting the maximum value within each subregion. This reduces the number of parameters and computation required in subsequent layers while also providing some degree of translation invariance.

A convolutional layer with the same stride would shrink the feature maps by the same factor, so the difference is not in the output size. The difference is that a max pooling layer has no parameters at all, whereas a convolutional layer has kernel height × kernel width × input channels × output channels weights, plus biases, that must be stored and trained. The convolutional layer therefore requires more computation and memory, and its extra capacity makes the network more prone to overfitting.
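As a minimal illustration of the pooling operation described above, the sketch below applies 2 × 2 max pooling with stride 2 to a single feature map stored as a list of lists, and contrasts its parameter count with that of a convolutional layer. The 4 × 4 input values and the 64-channel, 3 × 3 convolution used for the comparison are made-up examples.

```python
def max_pool_2d(feature_map, pool_size=2, stride=2):
    """Max pooling over one 2D feature map (list of lists), no padding."""
    rows = (len(feature_map) - pool_size) // stride + 1
    cols = (len(feature_map[0]) - pool_size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(pool_size)
                for j in range(pool_size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [7, 1, 3, 8],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[6, 2], [7, 9]]

# Pooling has nothing to learn; a conv layer does (hypothetical 64 -> 64 channels).
pool_params = 0
conv_params = 3 * 3 * 64 * 64 + 64
print(pool_params, conv_params)  # 0 36928
```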

#### 5.	When would you want to add a local response normalization layer?

**Ans:**

Local response normalization (LRN) is a technique used in some convolutional neural network (CNN) architectures, notably AlexNet. Each activation is divided by a term that grows with the sum of the squares of the activations at the same location in a fixed number of neighboring feature maps. Strongly activated neurons therefore inhibit neurons at the same position in neighboring feature maps, which encourages the feature maps of a layer to specialize and can improve generalization performance.

However, local response normalization layers do not always improve performance and can even harm it in some cases. They are therefore used far less frequently than other types of layers such as convolutional, pooling, and batch normalization layers.

In general, it may be useful to experiment with adding local response normalization layers to a CNN to see if they improve performance on a particular task, but they should not be added by default without careful consideration of their potential benefits and drawbacks.
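The normalization can be sketched in a few lines of plain Python. The function below works on the activations of all feature maps at a single spatial position and divides each one by `(bias + alpha * sum_of_squares) ** beta`, where the sum runs over neighboring feature maps within `depth_radius`. The parameter names and the example values are assumptions chosen for illustration, not the settings of any particular network.

```python
def local_response_normalization(activations, depth_radius=1,
                                 bias=1.0, alpha=1.0, beta=0.5):
    """Normalize activations at one spatial position across feature maps.

    activations[i] is the output of feature map i at that position.
    """
    n_maps = len(activations)
    normalized = []
    for i, value in enumerate(activations):
        low = max(0, i - depth_radius)
        high = min(n_maps - 1, i + depth_radius)
        sum_of_squares = sum(activations[j] ** 2 for j in range(low, high + 1))
        normalized.append(value / (bias + alpha * sum_of_squares) ** beta)
    return normalized


# Hypothetical activations: the middle feature map fires strongly.
activations = [1.0, 10.0, 1.0]
normalized = local_response_normalization(activations)
print(normalized)
```

In this example the weakly activated neighbors are scaled down much more, relative to their original values, than the strongly activated feature map, which is the competition effect described above.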

#### 6.	Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet, ResNet, SENet, and Xception?

**Ans:**

The main innovations in these popular CNN architectures are:

- AlexNet (2012): It was much larger and deeper than LeNet-5, stacked convolutional layers directly on top of one another instead of following each one with a pooling layer, and used ReLU activation functions instead of tanh. It also used dropout regularization to prevent overfitting, data augmentation to artificially increase the size of the training set, and local response normalization after some layers. AlexNet was trained on two GPUs in parallel to speed up training.

- GoogLeNet (2014): One of the main innovations in GoogLeNet was the Inception module, which applies convolutional layers with several kernel sizes and a pooling layer in parallel, then concatenates their outputs along the channel dimension, so the network can capture patterns at multiple scales. Convolutions with 1 × 1 kernels act as bottleneck layers that reduce the number of channels, which makes the network much more parameter-efficient than earlier architectures. GoogLeNet also used global average pooling instead of large fully connected layers, which helped reduce overfitting.

- ResNet (2015): ResNet introduced residual units with skip (shortcut) connections, which add the input of a unit to its output so that the layers only need to learn a residual function. Gradients can flow through the skip connections, which mitigates the vanishing gradient problem and makes it possible to train much deeper networks. ResNet also used batch normalization, which further helped with gradient flow and improved the performance of deep networks.

- SENet (2017): SENet used a technique called "squeeze-and-excitation" to allow the network to adaptively recalibrate channel-wise feature responses. A small SE block is added to each unit of an existing architecture; it helps the network focus on the most informative feature maps while suppressing irrelevant ones, which improved performance at the cost of only a small number of additional parameters.

- Xception (2017): Xception stands for "Extreme Inception", and it takes the Inception module to the extreme by replacing the traditional convolutional layers with depthwise separable convolutions. This greatly reduces the number of parameters while maintaining high accuracy. Xception also used residual connections and batch normalization, like ResNet.
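The parameter saving from depthwise separable convolutions can be checked with simple arithmetic. The sketch below compares a regular convolution with a depthwise separable one (a depthwise spatial filter per input channel followed by a 1 × 1 pointwise convolution). The layer sizes (3 × 3 kernel, 128 input channels, 256 output channels) are hypothetical, and biases are ignored.

```python
def regular_conv_weights(kernel_size, in_channels, out_channels):
    """Each output channel has a kernel spanning every input channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_weights(kernel_size, in_channels, out_channels):
    """Depthwise step (one spatial filter per input channel) + 1x1 pointwise step."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels.
regular = regular_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)

print(f"Regular convolution:   {regular:,} weights")
print(f"Separable convolution: {separable:,} weights")
print(f"Reduction factor:      {regular / separable:.1f}x")
```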

#### 7.	What is a fully convolutional network? How can you convert a dense layer into a convolutional layer?

**Ans:**

A fully convolutional network (FCN) is a type of neural network composed only of convolutional and pooling layers, without any fully connected layers. Because it has no dense layers, it can process images of any size. FCNs are often used for image segmentation tasks, where the output is a pixel-wise classification of the input image.

To convert a dense layer into a convolutional layer, replace it with a convolutional layer that has one filter per neuron of the dense layer, a kernel size equal to the size of the dense layer's input feature maps, and "valid" padding, then reshape the dense layer's weights into the convolutional kernels. For example, if we have a dense layer with 100 neurons that takes a 5x5 single-channel input, its weight matrix of shape (25, 100) can be reshaped to (5, 5, 1, 100), giving 100 kernels of size 5x5. On a 5x5 input, the convolutional layer outputs a 1x1 feature map with 100 channels, whose values are identical to the outputs of the dense layer. The number of parameters is unchanged, but the convolutional version can also be applied to larger images, in which case it slides over the input and produces one set of 100 outputs per position.
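The equivalence can be verified numerically. The sketch below is a plain-Python illustration under simplifying assumptions: a 5x5 single-channel input, only 3 neurons (instead of 100) to keep it short, random weights, and helper functions (`dense_forward`, `reshape_to_kernels`, `conv_valid`) that are written here for the example rather than taken from any library.

```python
import random

random.seed(0)
SIZE, N_NEURONS = 5, 3  # 5x5 single-channel input; 3 neurons for brevity

image = [[random.random() for _ in range(SIZE)] for _ in range(SIZE)]
# Dense weights of shape (25, N_NEURONS); row index = flattened pixel index.
dense_weights = [[random.uniform(-1, 1) for _ in range(N_NEURONS)]
                 for _ in range(SIZE * SIZE)]
biases = [random.uniform(-1, 1) for _ in range(N_NEURONS)]


def dense_forward(image, weights, biases):
    """Flatten the image row by row and apply the dense layer."""
    flat = [pixel for row in image for pixel in row]
    return [sum(x * weights[k][n] for k, x in enumerate(flat)) + biases[n]
            for n in range(len(biases))]


def reshape_to_kernels(weights, size):
    """Reshape (size*size, n) dense weights into (size, size, n) kernels."""
    return [[weights[r * size + c] for c in range(size)] for r in range(size)]


def conv_valid(image, kernels, biases):
    """Stride-1 convolution without padding ('valid'), one output channel per filter."""
    k = len(kernels)
    out_rows, out_cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[[sum(image[i + r][j + c] * kernels[r][c][n]
                  for r in range(k) for c in range(k)) + biases[n]
              for n in range(len(biases))]
             for j in range(out_cols)]
            for i in range(out_rows)]


kernels = reshape_to_kernels(dense_weights, SIZE)
dense_out = dense_forward(image, dense_weights, biases)
conv_out = conv_valid(image, kernels, biases)  # shape (1, 1, N_NEURONS)

print(all(abs(d - c) < 1e-9 for d, c in zip(dense_out, conv_out[0][0])))

# The same kernels also work on a larger 7x7 image, giving a 3x3 output map.
larger = [[random.random() for _ in range(7)] for _ in range(7)]
larger_out = conv_valid(larger, kernels, biases)
print(len(larger_out), len(larger_out[0]))
```

The dense layer would reject the 7x7 image outright, while the convolutional version simply evaluates the same weights at every valid position.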

#### 8.	What is the main technical difficulty of semantic segmentation?

**Ans:**

The main technical difficulty of semantic segmentation is that a lot of spatial information is lost as the signal flows through a CNN, especially in pooling layers and in layers with a stride greater than 1, yet the output must assign a class to every pixel. The model needs a large receptive field to capture the global context of the image, while the lost resolution has to be restored somehow (for example with upsampling layers and skip connections from lower layers) to recover fine details. Additionally, handling class imbalance and confusion between similar classes can also be challenging.

#### 9.	Build your own CNN from scratch and try to achieve the highest possible accuracy on MNIST.

    import tensorflow as tf

    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.utils import to_categorical


    def create_model():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D((2, 2)),
            tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D((2, 2)),
            tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(10, activation='softmax')
        ])
        return model

    model = create_model()
    model.summary()

    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()

    # Normalize pixel values to be between 0 and 1
    train_images = train_images.reshape((60000, 28, 28, 1)) / 255.0
    test_images = test_images.reshape((10000, 28, 28, 1)) / 255.0

    # One-hot encode labels
    train_labels = to_categorical(train_labels)
    test_labels = to_categorical(test_labels)

    # Compile the model
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # Train the model
    model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_data=(test_images, test_labels))


Output:

    Model: "sequential_1"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #
     conv2d_3 (Conv2D)           (None, 26, 26, 32)        320

     max_pooling2d_2 (MaxPooling  (None, 13, 13, 32)       0
     2D)

     conv2d_4 (Conv2D)           (None, 11, 11, 64)        18496

     max_pooling2d_3 (MaxPooling  (None, 5, 5, 64)         0
     2D)

     conv2d_5 (Conv2D)           (None, 3, 3, 64)          36928

     flatten_1 (Flatten)         (None, 576)

    2023-05-02 17:45:24.056775: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 188160000 exceeds 10% of free system memory.

    Epoch 1/5
    Epoch 2/5
    Epoch 3/5
    Epoch 4/5
    Epoch 5/5

    <keras.callbacks.History at 0x7f7fe699c9a0>

#### 10.	Use transfer learning for large image classification, going through these steps:
- a.	Create a training set containing at least 100 images per class. For example, you could classify your own pictures based on the location (beach, mountain, city, etc.), or alternatively you can use an existing dataset (e.g., from TensorFlow Datasets).
- b.	Split it into a training set, a validation set, and a test set.
- c.	Build the input pipeline, including the appropriate preprocessing operations, and optionally add data augmentation.
- d.	Fine-tune a pretrained model on this dataset.

**Ans:** ??????????