### TOPIC: Understanding Pooling and Padding in CNN

#### 1. Describe the purpose and brnefits of pooling in CNN.

Pooling, or max-pooling, is a crucial operation in Convolutional Neural Networks (CNNs) that plays a significant role in their architecture. Its primary purpose is to reduce the spatial dimensions of the input data while retaining its essential features. This process offers several benefits in the context of CNNs:

1. **Dimensionality Reduction**: Pooling reduces the size of feature maps by taking the maximum (max-pooling) or average (average-pooling) value within a local region. This reduction in dimensionality helps in decreasing the computational complexity of the network and can prevent overfitting by reducing the number of parameters.

2. **Translation Invariance**: Pooling helps CNNs achieve translation invariance, meaning that the network can recognize patterns in an image regardless of their exact position. By aggregating information in small local regions, the network becomes less sensitive to small shifts and translations in the input.

3. **Information Retention**: Pooling retains the most important information while discarding less relevant details. Max-pooling, in particular, keeps the maximum value from each region, which tends to capture the most salient features, making it useful for feature selection.

4. **Reduced Computational Load**: Smaller feature maps resulting from pooling operations require fewer computations, making the CNN more computationally efficient, especially in deep networks.

5. **Parameter Sharing**: Pooling involves no learnable parameters, which reduces the risk of overfitting and simplifies the training process compared to convolutional layers that require weight learning.

6. **Improved Generalization**: Pooling contributes to better generalization by promoting feature detection. It focuses on local features that are useful for classification and reduces the impact of noise or small variations.

7. **Effective Feature Hierarchies**: In CNN architectures, pooling is often interleaved with convolutional layers. This hierarchy of convolutional layers followed by pooling layers allows the network to learn increasingly abstract and complex features as the depth of the network increases.

8. **Faster Training**: Smaller feature maps require less memory and computational resources, which can speed up training, particularly in cases where memory is a constraint.

However, it's essential to note that while pooling has many advantages, it can also result in some loss of spatial information. In some situations, this may not be desirable, and alternative techniques, such as dilated convolutions or global average pooling, are used to achieve different trade-offs between spatial resolution and feature reduction. The choice of pooling method and parameters depends on the specific task and the architecture of the CNN.

#### 2. Explain the difference between min pooling and max pooling.

Min pooling and max pooling are two common types of pooling operations in Convolutional Neural Networks (CNNs). They are both used to reduce the spatial dimensions of feature maps while retaining essential features, but they differ in how they select and propagate information from the input.

1. **Max Pooling**:

   - **Operation**: Max pooling involves selecting the maximum value from a local region of the input. The most prominent feature within the region is retained, while the others are discarded.
   - **Advantages**:
     - Max pooling is particularly effective at capturing and preserving the most salient features in an image. It is robust to small variations and noise.
     - It helps in achieving translation invariance by focusing on the most dominant features.
   - **Common Use**: Max pooling is commonly used in CNN architectures, especially in image classification tasks.

2. **Min Pooling**:

   - **Operation**: Min pooling, on the other hand, selects the minimum value from a local region of the input. The smallest feature within the region is retained, while the others are discarded.
   - **Advantages**:
     - Min pooling can be useful in specific situations where the smallest features are more informative. For instance, it might be applied in certain edge detection scenarios.
     - It can provide a complementary approach to feature selection, emphasizing different aspects of the data compared to max pooling.
   - **Less Common**: Min pooling is less common in CNNs compared to max pooling, as max pooling is generally more effective for preserving dominant features.

In summary, the main difference between min pooling and max pooling lies in the type of information they retain from the local regions of the input. Max pooling selects the most significant features by keeping the maximum value, while min pooling selects the smallest features by keeping the minimum value. The choice between the two depends on the specific task and the nature of the features you want to emphasize in the CNN's architecture. Max pooling is more commonly used in practice, but min pooling may find application in specialized scenarios where it is more appropriate.

#### 3. Discuss the concept of padding in CNN and its significance.

Padding in Convolutional Neural Networks (CNNs) is a technique used to control the spatial dimensions of the feature maps produced during the convolutional and pooling operations. It involves adding extra rows and columns of zeros (or other values) around the input data before applying convolutions or pooling. Padding is significant for several reasons:

1. **Preserving Spatial Information**:
   - Padding allows the network to preserve the spatial dimensions of the feature maps, ensuring that the output has the same spatial dimensions as the input. This can be important for retaining positional information and ensuring that the network captures features near the edges of the input.

2. **Avoiding Edge Information Loss**:
   - In convolutional operations, without padding, the size of the feature maps decreases as you move deeper into the network layers. This can result in the loss of information at the edges of the input, which may be essential for detecting features near the boundaries of objects in an image.

3. **Controlling Output Size**:
   - Padding enables control over the size of the output feature maps. By adjusting the amount of padding, you can increase or decrease the spatial resolution of the feature maps. This is particularly important when designing architectures and managing memory requirements.

4. **Striding Compatibility**:
   - Padding can be used to ensure that the convolutional or pooling operation aligns with the stride. When the stride is applied, it specifies how far the convolutional filter or pooling window moves horizontally and vertically. Padding can help in cases where the stride does not evenly divide the input size.

5. **Mitigating Information Loss**:
   - Pooling layers, which reduce the spatial dimensions, can lead to a loss of information. Padding can mitigate this loss by extending the input with zeros, making it less likely to lose critical features in the pooling process.

6. **Enabling Different Network Architectures**:
   - Padding flexibility allows for the design of various network architectures with different spatial resolutions. This flexibility is essential for adapting to different tasks and datasets.

There are two common types of padding:

1. **Valid (No Padding)**:
   - In this mode, no padding is added to the input. As a result, the spatial dimensions of the feature maps decrease with each convolution or pooling operation, which is typical in many deep CNN architectures.

2. **Same (Zero Padding)**:
   - In this mode, padding is added such that the output feature maps have the same spatial dimensions as the input. Zero padding is often used because it doesn't introduce additional values and maintains the overall scale of the input data.

The choice of padding and its amount (i.e., the number of rows and columns of padding added) should be carefully considered when designing a CNN architecture, taking into account the specific requirements of the task and the desired balance between spatial resolution and computational efficiency.

#### 4. Compare and contrast zero-padding and valid-padding in terms of their effects on the output feature map size.

Zero-padding and valid-padding are two common techniques used to control the size of the output feature maps in Convolutional Neural Networks (CNNs). They have contrasting effects on the output feature map size:

1. **Zero-padding**:

   - **Effect on Output Size**: Zero-padding increases the size of the output feature map compared to the input size.
   - **Preservation of Spatial Dimensions**: With zero-padding, the spatial dimensions of the output feature map are typically preserved, meaning that the output has the same height and width as the input.
   - **Usage**: Zero-padding is commonly used when you want to maintain the spatial information and ensure that the output feature map has the same spatial dimensions as the input. This is useful in cases where maintaining positional information and spatial resolution is crucial, such as in image segmentation tasks.
   - **Padding Values**: Zero-padding adds rows and columns of zeros around the input data.

2. **Valid-padding** (No Padding):

   - **Effect on Output Size**: Valid-padding reduces the size of the output feature map compared to the input size.
   - **Reduction in Spatial Dimensions**: With valid-padding (or no padding), the spatial dimensions of the output feature map decrease as compared to the input. Each convolution operation reduces the size of the feature map.
   - **Usage**: Valid-padding is commonly used in deep CNN architectures where the objective is to progressively reduce the spatial dimensions to extract hierarchical and abstract features. This reduction in size can help control computational complexity and memory requirements.
   - **Padding Values**: Valid-padding does not add any padding values; it directly applies the convolution operation, leading to a reduction in spatial dimensions.

In summary, zero-padding and valid-padding are used to control the size of the output feature maps in CNNs, but they have opposite effects. Zero-padding increases the size of the output, maintaining spatial dimensions, while valid-padding reduces the size, progressively downsizing the feature maps as you go deeper into the network. The choice between these padding techniques depends on the specific requirements of the task, the desired balance between spatial resolution and computational efficiency, and the architectural design of the CNN.

### TOPIC: Exploring LeNet

#### 1. Provide a brief overview oj LeNet-5 architecture.

LeNet-5 is a classic and pioneering Convolutional Neural Network (CNN) architecture developed by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner in the 1990s. It is one of the earliest CNNs and played a crucial role in advancing the field of deep learning and computer vision. LeNet-5 was primarily designed for handwritten digit recognition, specifically for recognizing digits in postal codes on mail envelopes. Here is a brief overview of the LeNet-5 architecture:

1. **Input Layer**:
   - LeNet-5 accepts grayscale images as input.
   - The original architecture was designed for 32x32 pixel input images, which was a common size for hand-written digits at the time.

2. **Convolutional Layers**:
   - LeNet-5 consists of two convolutional layers followed by max-pooling layers.
   - The first convolutional layer applies six 5x5 filters, producing six feature maps.
   - The second convolutional layer applies sixteen 5x5 filters, producing sixteen feature maps.
   - The convolutional layers use the tanh activation function.

3. **Max-Pooling Layers**:
   - After each convolutional layer, LeNet-5 employs max-pooling layers.
   - The first max-pooling layer uses 2x2 windows with a stride of 2.
   - The second max-pooling layer uses 2x2 windows with a stride of 2.

4. **Fully Connected Layers**:
   - LeNet-5 has three fully connected layers.
   - The first fully connected layer has 120 neurons.
   - The second fully connected layer has 84 neurons.
   - The final output layer has 10 neurons, corresponding to the 10 possible digits (0-9).
   - The fully connected layers use the tanh activation function.

5. **Output Layer**:
   - The output layer uses the softmax activation function to produce class probabilities.

6. **Training**:
   - LeNet-5 was trained using the gradient-based optimization technique known as stochastic gradient descent (SGD).

LeNet-5's architecture reflects some of the fundamental concepts in CNN design, such as the use of convolutional and pooling layers for feature extraction and spatial hierarchy, followed by fully connected layers for classification. While it was originally designed for digit recognition, it laid the foundation for more advanced CNN architectures used in a wide range of computer vision tasks.

It's important to note that over the years, CNN architectures have evolved significantly, with deeper networks, improved activation functions, and more sophisticated techniques, but LeNet-5 remains a historically important and influential model in the field of deep learning.

#### 2. Describe the key components of LeNet-5 and their respective purposes.

LeNet-5 is a pioneering Convolutional Neural Network (CNN) architecture developed by Yann LeCun and his colleagues in the 1990s. It was designed for handwritten digit recognition and played a crucial role in the development of deep learning and computer vision. Here are the key components of LeNet-5 and their respective purposes:

1. **Input Layer**:
   - **Purpose**: The input layer of LeNet-5 accepts grayscale images as input.
   - **Details**: The original architecture was designed for 32x32 pixel input images. It serves as the entry point for the image data.

2. **Convolutional Layers**:
   - **Purpose**: Convolutional layers are responsible for feature extraction. They apply learnable filters to the input image to detect features.
   - **Details**:
     - The first convolutional layer applies six 5x5 filters, producing six feature maps. These filters capture low-level features like edges and simple textures.
     - The second convolutional layer applies sixteen 5x5 filters, producing sixteen feature maps. These filters capture more complex and abstract features.

3. **Max-Pooling Layers**:
   - **Purpose**: Max-pooling layers downsample the feature maps, reducing their spatial dimensions while retaining the most salient information. This aids in translation invariance and reduces computational complexity.
   - **Details**:
     - The first max-pooling layer uses 2x2 windows with a stride of 2. It reduces the size of the feature maps by half.
     - The second max-pooling layer uses 2x2 windows with a stride of 2, similar to the first layer.

4. **Fully Connected Layers**:
   - **Purpose**: Fully connected layers are responsible for classification and decision-making. They take the high-level features extracted by the convolutional and pooling layers and map them to class labels.
   - **Details**:
     - The first fully connected layer has 120 neurons. It further abstracts the features.
     - The second fully connected layer has 84 neurons, capturing more complex patterns.
     - The final output layer has 10 neurons, corresponding to the 10 possible digits (0-9). It uses the softmax activation function to produce class probabilities.

5. **Activation Functions**:
   - **Purpose**: Activation functions introduce non-linearity to the network, allowing it to model complex relationships in the data.
   - **Details**: LeNet-5 uses the hyperbolic tangent (tanh) activation function in the convolutional and fully connected layers. The output layer uses softmax to produce class probabilities.

6. **Training**:
   - **Purpose**: Training is the process of updating the model's weights to minimize the prediction error. It involves an optimization algorithm like stochastic gradient descent (SGD).
   - **Details**: LeNet-5 was trained using SGD and backpropagation to adjust the weights of the network's parameters.

LeNet-5's architecture represents some of the foundational principles of CNN design, such as the use of convolutional and pooling layers for feature extraction, followed by fully connected layers for classification. It demonstrated the effectiveness of deep learning in computer vision tasks, laying the groundwork for the development of more advanced CNN architectures.

#### 3. Discuss the advantages and limitations of LeNet-5 in the context of image classification tasks.

LeNet-5, one of the earliest Convolutional Neural Network (CNN) architectures, has been influential in the field of deep learning and computer vision. However, it has both advantages and limitations when used in the context of image classification tasks:

**Advantages**:

1. **Pioneering Architecture**: LeNet-5 was a pioneering architecture that demonstrated the effectiveness of CNNs for image classification. It laid the foundation for subsequent, more advanced CNN architectures.

2. **Feature Extraction**: LeNet-5 effectively captures features through its convolutional layers, which are designed to recognize edges and textures. This makes it suitable for tasks where these lower-level features are essential.

3. **Translation Invariance**: The use of max-pooling layers helps LeNet-5 achieve translation invariance. It can recognize features regardless of their exact position in the input image.

4. **Simple and Elegant**: LeNet-5 is a relatively simple and easy-to-understand architecture. This simplicity can be advantageous for educational purposes and as a starting point for understanding CNNs.

5. **Low Memory and Computational Requirements**: The architecture is computationally efficient compared to more modern deep CNNs. It requires less memory and computational power, which can be advantageous in resource-constrained environments.

**Limitations**:

1. **Limited Depth**: LeNet-5 is relatively shallow compared to modern CNNs. With only two convolutional layers and a simple architecture, it may struggle with more complex and deep feature hierarchies needed for intricate image classification tasks.

2. **Small Input Size**: The original LeNet-5 was designed for small 32x32 pixel input images. This restricts its applicability to modern high-resolution images, where larger networks are often necessary.

3. **Activation Functions**: LeNet-5 uses the hyperbolic tangent (tanh) activation function, which has vanishing gradient problems. Modern architectures often use more advanced activation functions like ReLU to mitigate this issue.

4. **Lack of Regularization**: LeNet-5 does not incorporate modern regularization techniques such as dropout or batch normalization, which can help prevent overfitting.

5. **Limited Applicability**: While it's suitable for digit recognition, LeNet-5 may not perform as well on more complex and diverse image classification tasks such as object detection or scene recognition.

6. **Training Data**: LeNet-5 was originally designed for handwritten digit recognition, and its performance can vary when applied to different types of data. Modern CNN architectures are typically more versatile and transferable across various tasks.

In summary, LeNet-5 was groundbreaking for its time and served as a critical step in the evolution of deep learning for image classification. However, its limitations, especially its shallowness, limited input size, and outdated architectural choices, make it less suitable for contemporary image classification tasks, which often demand deeper, more complex, and adaptable CNN architectures.

#### 4. Implement LeNet-5 using a deep learning framework of your choice (e.g., TensorFlow, PyTorch) and train it on a publicly available dataset (e.g., MNIST). Evaluate its performance and provide insights.

In [3]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

(train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

model = models.Sequential()
model.add(layers.Conv2D(6, (5, 5), activation='tanh', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(16, (5, 5), activation='tanh'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(120, activation='tanh'))
model.add(layers.Dense(84, activation='tanh'))
model.add(layers.Dense(10, activation='softmax'))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=10, validation_data=(test_images, test_labels))

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(f"Test accuracy: {test_acc}")


2023-10-18 14:55:18.637152: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-18 14:55:18.703397: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-18 14:55:18.703470: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-18 14:55:18.703525: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-18 14:55:18.716254: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-18 14:55:18.717582: I tensorflow/core/platform/cpu_feature_guard.cc:182] This Tens

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
313/313 - 1s - loss: 0.0454 - accuracy: 0.9875 - 987ms/epoch - 3ms/step
Test accuracy: 0.987500011920929


### TOPIC: Analyzing AlexNet

#### 1. Present an overview of the AlexNet architecture.

AlexNet is a groundbreaking convolutional neural network (CNN) architecture that played a pivotal role in advancing the field of deep learning, particularly in image recognition. Developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, it won the ImageNet Large Scale Visual Recognition Challenge in 2012, significantly reducing the error rate and demonstrating the power of deep neural networks. Here's an overview of the AlexNet architecture:

1. **Input Layer**:
   - AlexNet takes color images as input. The original architecture was designed for images of size 224x224 pixels with three color channels (RGB).

2. **Convolutional Layers**:
   - AlexNet consists of five convolutional layers.
   - The first convolutional layer applies 96 filters of size 11x11 with a stride of 4. It captures low-level features like edges and simple textures.
   - The subsequent convolutional layers apply a variety of filter sizes, including 5x5 and 3x3.
   - These convolutional layers use the ReLU (Rectified Linear Unit) activation function, which introduced non-linearity into the network.

3. **Max-Pooling Layers**:
   - After some of the convolutional layers, max-pooling layers are applied to downsample the feature maps and reduce their spatial dimensions.

4. **Normalization Layers**:
   - Local Response Normalization (LRN) layers are used to normalize the responses of neurons within the same feature map. This helps enhance contrast and response to small features.

5. **Fully Connected Layers**:
   - AlexNet has three fully connected layers.
   - The first fully connected layer has 4096 neurons.
   - The second fully connected layer also has 4096 neurons.
   - The final output layer has 1000 neurons, corresponding to the 1000 classes in the ImageNet dataset.

6. **Dropout**:
   - Dropout is applied before the fully connected layers to prevent overfitting. It randomly drops out a portion of neurons during training.

7. **Output Layer**:
   - The output layer uses the softmax activation function to produce class probabilities for image classification.

8. **Training**:
   - AlexNet was trained using stochastic gradient descent (SGD) with a relatively large learning rate. Data augmentation techniques were also employed during training to improve generalization.

9. **Overlapping Pooling**:
   - In some cases, AlexNet uses overlapping pooling, where the pooling regions overlap, instead of non-overlapping max-pooling.

10. **Parallel Processing**:
    - AlexNet was designed to take advantage of parallel processing by using two GPUs, which significantly accelerated training.

AlexNet was a significant leap forward in the field of deep learning. It demonstrated that deep neural networks, with the right architecture and training techniques, could achieve remarkable performance in image classification tasks. While it's been surpassed by more recent architectures in terms of accuracy and efficiency, it remains a landmark model that paved the way for the development of deeper and more complex CNNs.

#### 2. Explain the architectural innovations introduced in AlexNet that contributed to its brceakthrough performance.

AlexNet, introduced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, was a groundbreaking Convolutional Neural Network (CNN) architecture that significantly contributed to the resurgence of deep learning in computer vision. Several architectural innovations in AlexNet played a crucial role in its breakthrough performance:

1. **Deep Architecture**:
   - AlexNet was one of the first CNN architectures to have a deep structure. It consisted of eight layers: five convolutional layers and three fully connected layers. This depth allowed the network to capture hierarchical features and abstract representations of the input data.

2. **Rectified Linear Units (ReLU)**:
   - AlexNet used Rectified Linear Units (ReLU) as the activation function, which replaced traditional sigmoid or hyperbolic tangent functions. ReLU is computationally efficient and mitigates the vanishing gradient problem, allowing for faster training and improved convergence.

3. **Local Response Normalization (LRN)**:
   - AlexNet introduced a local response normalization (LRN) layer after some of the convolutional layers. This layer helps neurons to respond to a broader range of inputs and enhances generalization.

4. **Overlapping Max-Pooling**:
   - In the pooling layers, AlexNet used overlapping max-pooling, which means the pooling windows had an overlap. This allowed the network to capture spatial hierarchies and retain more spatial information.

5. **Data Augmentation**:
   - Data augmentation techniques, such as random cropping and horizontal flipping, were used during training. These techniques increased the diversity of the training data and improved the model's ability to handle variations in the input.

6. **Dropout Regularization**:
   - AlexNet employed dropout in the fully connected layers. Dropout randomly deactivates a fraction of neurons during training, which acts as a form of regularization, preventing overfitting and improving generalization.

7. **Large Training Dataset**:
   - AlexNet was trained on a massive dataset, ImageNet, which contained over a million images with 1,000 object categories. The large and diverse training dataset played a significant role in the model's ability to generalize well to a wide range of images.

8. **Parallelization**:
   - AlexNet was one of the first models to effectively utilize GPU hardware for deep learning. It employed a dual-GPU setup to distribute the computational load, reducing training time.

9. **Softmax Cross-Entropy Loss**:
   - AlexNet used the softmax activation function followed by a cross-entropy loss function for classification tasks. This combination provided more discriminative class probabilities and improved the model's classification performance.

The combination of these architectural innovations allowed AlexNet to achieve state-of-the-art performance in the ImageNet Large Scale Visual Recognition Challenge in 2012, significantly reducing the error rate compared to previous methods. This success paved the way for the development of deeper and more sophisticated CNN architectures, such as VGG, GoogLeNet, and ResNet, and played a vital role in the resurgence of deep learning and its widespread adoption in computer vision and beyond.

#### 3. Discuss the role of convolutional layers, pooling layers, and fully connected layers in AlexNet.

In AlexNet, convolutional layers, pooling layers, and fully connected layers play distinct and complementary roles in the architecture, contributing to its success in image classification tasks. Here's an overview of the roles of each layer type in AlexNet:

1. **Convolutional Layers**:
   - **Role**: Convolutional layers are responsible for feature extraction. They apply convolution operations to the input image, which involve learning and applying filters (kernels) to detect local features like edges, textures, and more complex patterns.
   - **Details in AlexNet**:
     - AlexNet consists of five convolutional layers, each followed by an activation function (Rectified Linear Unit - ReLU) and local response normalization (LRN) in some layers.
     - These layers learn a hierarchy of features, starting with simpler features like edges and gradually moving to more complex features, which are crucial for image understanding.
   - **Convolutional Filter Sizes**: In AlexNet, the convolutional filter sizes vary from 11x11 to 3x3, with varying depths (number of filters) in each layer.

2. **Pooling Layers**:
   - **Role**: Pooling layers, specifically max-pooling in AlexNet, serve the purpose of reducing the spatial dimensions of the feature maps. This downsampling operation helps in decreasing the computational load and enhancing the network's translation invariance.
   - **Details in AlexNet**:
     - AlexNet employs max-pooling layers after the convolutional layers, reducing the feature map size by selecting the maximum value from local regions.
     - Overlapping max-pooling with a stride of 2 was used to capture spatial hierarchies.
   - **Pooling Windows**: The pooling windows in AlexNet were typically 3x3 with overlapping regions.

3. **Fully Connected Layers**:
   - **Role**: Fully connected layers in AlexNet are responsible for classification and decision-making. They take the high-level features extracted by the convolutional and pooling layers and map them to class labels.
   - **Details in AlexNet**:
     - AlexNet has three fully connected layers. The first fully connected layer consists of 120 neurons, followed by a layer with 84 neurons, and the final output layer with 10 neurons (corresponding to the 10 possible object categories in the ImageNet dataset).
     - These layers perform a series of linear transformations and nonlinear activations, mapping the features to class probabilities via the softmax activation function in the output layer.
   - **High-Level Features**: The fully connected layers aggregate high-level and abstract representations of the input image, making them suitable for classification.

In summary, convolutional layers in AlexNet capture hierarchical image features, including edges, textures, and more complex patterns. Pooling layers reduce the spatial dimensions of the feature maps and help the network maintain translation invariance. Fully connected layers make classification decisions based on the high-level features extracted by the previous layers. The combined use of these layer types enables AlexNet to effectively extract and learn discriminative features from images and classify them into different object categories, which was a key factor in its breakthrough performance in image classification tasks.

#### 4. Implement AlexNet using a deep learning framework of your choice and evaluate its performance on a dataset of your choice.

In [2]:
pip install torch torchvision

Collecting torch
  Downloading torch-2.1.0-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m670.2/670.2 MB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting torchvision
  Downloading torchvision-0.16.0-cp310-cp310-manylinux1_x86_64.whl (6.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.9/6.9 MB[0m [31m71.9 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[?25hCollecting filelock
  Downloading filelock-3.13.0-py3-none-any.whl (11 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m731.7/731.7 MB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [9]:
import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler


# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [10]:
def get_train_valid_loader(data_dir,
                           batch_size,
                           augment,
                           random_seed,
                           valid_size=0.1,
                           shuffle=True):
    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010],
    )

    # define transforms
    valid_transform = transforms.Compose([
            transforms.Resize((227,227)),
            transforms.ToTensor(),
            normalize,
    ])
    if augment:
        train_transform = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ])
    else:
        train_transform = transforms.Compose([
            transforms.Resize((227,227)),
            transforms.ToTensor(),
            normalize,
        ])

    # load the dataset
    train_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=train_transform,
    )

    valid_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=valid_transform,
    )

    num_train = len(train_dataset)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))

    if shuffle:
        np.random.seed(random_seed)
        np.random.shuffle(indices)

    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler)
 
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler)

    return (train_loader, valid_loader)


def get_test_loader(data_dir,
                    batch_size,
                    shuffle=True):
    normalize = transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    )

    # define transform
    transform = transforms.Compose([
        transforms.Resize((227,227)),
        transforms.ToTensor(),
        normalize,
    ])

    dataset = datasets.CIFAR10(
        root=data_dir, train=False,
        download=True, transform=transform,
    )

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle
    )

    return data_loader


# CIFAR10 dataset 
train_loader, valid_loader = get_train_valid_loader(data_dir = './data',                                      batch_size = 64,
                       augment = False,                             		     random_seed = 1)

test_loader = get_test_loader(data_dir = './data',
                              batch_size = 64)

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified


In [11]:
class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0),
            nn.BatchNorm2d(96),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.layer3 = nn.Sequential(
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(384),
            nn.ReLU())
        self.layer4 = nn.Sequential(
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(384),
            nn.ReLU())
        self.layer5 = nn.Sequential(
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(9216, 4096),
            nn.ReLU())
        self.fc1 = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU())
        self.fc2= nn.Sequential(
            nn.Linear(4096, num_classes))
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.layer5(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        out = self.fc1(out)
        out = self.fc2(out)
        return out

In [12]:
num_classes = 10
num_epochs = 20
batch_size = 64
learning_rate = 0.005

model = AlexNet(num_classes).to(device)


# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay = 0.005, momentum = 0.9)  


# Train the model
total_step = len(train_loader)

In [None]:

total_step = len(train_loader)

for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):  
        # Move tensors to the configured device
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
            
    # Validation
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in valid_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            del images, labels, outputs
    
        print('Accuracy of the network on the {} validation images: {} %'.format(5000, 100 * correct / total)) 

Epoch [1/20], Step [704/704], Loss: 0.4558
Accuracy of the network on the 5000 validation images: 60.92 %
