# **Image Classification using CNN Architectures Assignment**

Question 1: What is a Convolutional Neural Network (CNN), and how does it differ from
traditional fully connected neural networks in terms of architecture and performance on
image data?
sol) A Convolutional Neural Network (CNN) is a specialized deep learning architecture designed primarily for data with a grid-like topology, such as images. While traditional neural networks (often called Multi-Layer Perceptrons or Fully Connected Networks) treat every input pixel as an independent variable, CNNs leverage the spatial arrangement of pixels to extract meaningful patterns like edges, textures, and shapes.

**1. Architectural Differences**

The fundamental difference lies in how neurons in one layer connect to the next.

Traditional Fully Connected (FC) Networks

In an FC network, every neuron in layer $i$ is connected to every neuron in layer $i+1$.

- Input Handling: To process an image (e.g., $64 \times 64$ grayscale pixels), the image must be "flattened" into a single 1D vector of 4,096 values.
- Global Connectivity: Each neuron learns a separate weight for every single pixel in the image.
- Spatial Blindness: Because the image is flattened, the network has no built-in notion of which pixels were neighbors. It treats the pixels at $(1, 1)$ and $(1, 2)$ the same way it treats pixels on opposite corners.

Convolutional Neural Networks (CNNs)

CNNs introduce three architectural concepts that make them better suited to images:

- Local Receptive Fields: Instead of looking at the whole image at once, a neuron in a convolutional layer only "sees" a small local patch (e.g., $3 \times 3$ or $5 \times 5$ pixels). This loosely mirrors the receptive fields found in the visual cortex.
- Shared Weights (Parameter Sharing): A single "filter" (a small matrix of weights) slides across the entire image. The same filter that detects an edge in the top-left corner is used to detect an edge in the bottom-right.
- Pooling Layers: These layers downsample the data (e.g., Max Pooling), reducing the resolution while keeping the strongest responses. This makes the network robust to small shifts and minor distortions of the object in the image.

**2. Performance on Image Data**

CNNs significantly outperform traditional networks on image tasks due to two main factors:

Efficiency (The Parameter Explosion)

In a traditional network, the number of weights grows with the number of input pixels.

- FC Example: For a 4K image ($3840 \times 2160$ pixels), a single neuron in the first hidden layer would require about 8.3 million weights per color channel (roughly 24.9 million for RGB). A layer with hundreds of neurons would have billions of parameters, making it impractical to train.
- CNN Example: A $3 \times 3$ convolutional filter has only 9 weights per input channel, regardless of how large the input image is. This allows CNNs to be much deeper without exhausting memory.

Feature Hierarchy

CNNs naturally learn to see the world in layers:

- Lower Layers: Detect basic edges and lines.
- Middle Layers: Combine edges into textures and simple shapes (circles, squares).
- Higher Layers: Recognize complex objects (faces, cars, dogs).

Traditional networks struggle to build this hierarchy because they lack the "inductive bias" that nearby pixels are related.

| Feature | Traditional (Fully Connected) | Convolutional Neural Network (CNN) |
|---|---|---|
| Input | Flattened 1D vector | 2D/3D tensors (height, width, color) |
| Connectivity | Global (every node to every node) | Local (small receptive fields) |
| Weights | Unique for every connection | Shared across the image (filters) |
| Spatial info | Lost during flattening | Preserved and leveraged |
| Complexity | High (huge number of parameters) | Efficient (fewer parameters) |
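The parameter gap described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the layer width of 100 neurons or filters is an assumed value, and both helper functions are hypothetical names introduced for this example.

```python
def fc_layer_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output neuron of a dense layer."""
    return in_features * out_features + out_features


def conv_layer_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus one bias per filter; independent of image height/width."""
    return out_channels * (in_channels * kernel * kernel + 1)


# Assumed first layer for a 64x64 grayscale image.
fc_small = fc_layer_params(64 * 64, 100)     # 100 dense neurons
conv_small = conv_layer_params(1, 100, 3)    # 100 filters of size 3x3

# The same two layers applied to a 4K grayscale frame.
fc_4k = fc_layer_params(3840 * 2160, 100)
conv_4k = conv_layer_params(1, 100, 3)       # unchanged: no dependence on image size

print(f"64x64  dense: {fc_small:,}  conv: {conv_small:,}")
print(f"4K     dense: {fc_4k:,}  conv: {conv_4k:,}")
```

The dense layer grows in proportion to the pixel count, while the convolutional layer stays at the same size for both inputs.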

Question 2: Discuss the architecture of LeNet-5 and explain how it laid the foundation
for modern deep learning models in computer vision. Include references to its original
research paper.
sol) The LeNet-5 architecture, introduced in the 1998 landmark paper "Gradient-Based Learning Applied to Document Recognition" by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, is widely considered the "birth" of modern Convolutional Neural Networks (CNNs). Earlier versions (LeNet-1 through LeNet-4) existed, but LeNet-5 is the version that became the standard reference for a network that learns to extract features and classify patterns directly from raw pixels using backpropagation.

**1. The Architecture of LeNet-5**

LeNet-5 is a 7-layer architecture (excluding the input) designed for handwritten digit recognition and evaluated on MNIST. It processes grayscale images of size $32 \times 32$.

Detailed Layer Breakdown

- Input Layer: $32 \times 32$ grayscale images (normalized so that the background level corresponds to -0.1 and the foreground to 1.175).
- C1 (Convolutional Layer): Uses 6 filters of size $5 \times 5$. This produces 6 feature maps of size $28 \times 28$.
- S2 (Subsampling/Pooling Layer): Uses Average Pooling with a $2 \times 2$ window and stride 2. This reduces the size to $14 \times 14 \times 6$.
- C3 (Convolutional Layer): Uses 16 filters of size $5 \times 5$, producing $10 \times 10 \times 16$. A key innovation here was partial connectivity: not all 6 input maps were connected to all 16 output maps, which forced different filters to learn different (complementary) features.
- S4 (Subsampling Layer): Another $2 \times 2$ Average Pooling layer, reducing the volume to $5 \times 5 \times 16$.
- C5 (Convolutional Layer): Uses 120 filters of size $5 \times 5$. Since the input is $5 \times 5$, each output map is $1 \times 1$, making this effectively a fully connected layer with 120 outputs.
- F6 (Fully Connected Layer): A dense layer with 84 neurons using the $\tanh$ activation function.
- Output Layer: 10 nodes (one for each digit 0-9) using Euclidean Radial Basis Function (RBF) units, usually replaced by a linear layer with Softmax in modern implementations.

**2. Foundations for Modern Computer Vision**

LeNet-5 laid the groundwork for almost every modern vision model (like AlexNet, VGG, and ResNet) by establishing three fundamental principles:

A. Local Receptive Fields

LeNet-5 showed that neurons only need to look at small local areas of an image. This allows the network to capture "local" features like edges and corners before aggregating them into complex objects, a process that loosely mirrors the visual cortex.

B. Shared Weights (Parameter Sharing)

By using filters that slide across the image, LeNet-5 ensured that a feature detector (e.g., a "vertical line" detector) learned in one part of the image could be applied everywhere else. This drastically reduced the number of trainable parameters compared to fully connected networks.

C. Spatial Subsampling (Pooling)

The pooling layers provide a degree of translation invariance. If a digit is shifted slightly to the left or right, the network can still identify it because the pooling layers "blur" the exact location of features, focusing instead on their relative presence.

Original Paper Citation:

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
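The layer sizes in the breakdown follow from the standard output-size formula for a window of size $k$ with stride $s$ and no padding: $\lfloor (n - k)/s \rfloor + 1$. The sketch below walks a $32 \times 32$ input through the convolutional part and counts parameters. It makes three simplifying assumptions that differ from the 1998 paper: C3 is fully connected to all 6 input maps, the pooling layers have no trainable parameters, and the output is a plain linear layer instead of RBF units. The helper `conv_out` is a name introduced here.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kind, out_channels, kernel, stride) for the convolutional part of LeNet-5
layers = [
    ("C1", "conv", 6, 5, 1),
    ("S2", "pool", 6, 2, 2),
    ("C3", "conv", 16, 5, 1),
    ("S4", "pool", 16, 2, 2),
    ("C5", "conv", 120, 5, 1),
]

size, channels, total_params = 32, 1, 0
shapes = {}
for name, kind, out_ch, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    if kind == "conv":
        # Full connectivity assumed: every filter sees every input map.
        total_params += out_ch * (channels * kernel * kernel + 1)
    channels = out_ch
    shapes[name] = (channels, size, size)
    print(f"{name}: {channels} x {size} x {size}")

# Dense layers F6 (120 -> 84) and output (84 -> 10), weights plus biases.
total_params += (120 * 84 + 84) + (84 * 10 + 10)
print(f"Trainable parameters under these assumptions: {total_params:,}")
```

Under these assumptions the count comes to 61,706 parameters, most of them in C5 and F6.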

Question 3: Compare and contrast AlexNet and VGGNet in terms of design principles,
number of parameters, and performance. Highlight key innovations and limitations of
each.
sol) While AlexNet (2012) was the "proof of concept" that deep learning could dominate computer vision, VGGNet (2014) was the "refined manual" that established standard design patterns still used in modern CNNs.

**1. Design Principles**

AlexNet (Heterogeneous Design):

- Used a variety of filter sizes ($11 \times 11$, $5 \times 5$, and $3 \times 3$) to capture features at different scales.
- Featured a split-stream architecture (two parallel paths) because the model was too large to fit on a single GPU (NVIDIA GTX 580) at the time.

VGGNet (Homogeneous Design):

- Introduced the concept of blocks. It strictly used $3 \times 3$ filters with a stride of 1 and $2 \times 2$ Max Pooling throughout.
- Philosophy: Increasing depth is more important than increasing filter size. VGG showed that three stacked $3 \times 3$ filters have the same "receptive field" as one $7 \times 7$ filter but with fewer parameters and more non-linearity (ReLU layers).

**2. Comparison Table**

| Feature | AlexNet (2012) | VGGNet (2014) |
|---|---|---|
| Depth | 8 layers (5 conv, 3 FC) | 16-19 layers (13-16 conv, 3 FC) |
| Filter sizes | Large ($11 \times 11$, $5 \times 5$) | Small (strictly $3 \times 3$) |
| Parameters | ~60 million | ~138 million (VGG-16) |
| Top-5 error (ImageNet) | ~15.3% | ~7.3% |
| Activation | ReLU | ReLU |
| Memory usage | Moderate | Very high (due to depth/width) |

**3. Key Innovations**

AlexNet:

- ReLU Activation: First major use of ReLU instead of Tanh in a large-scale CNN, which mitigates the vanishing gradient problem and speeds up training.
- Dropout: Used to reduce overfitting in the huge fully connected layers.
- Local Response Normalization (LRN): A technique to mimic "lateral inhibition" in biological neurons (largely unused today).

VGGNet:

- Simplicity & Modularity: Established that you don't need fancy filter sizes; just stack $3 \times 3$ conv layers.
- Pre-training/Weight Initialization: Demonstrated that training a shallow version of the network first helps initialize deeper versions.

**4. Limitations**

AlexNet:

- Irregularity: The specific choices of filter sizes were somewhat arbitrary and did not follow a clear pattern.
- Spatial Resolution: Rapidly downsampled the image in early layers, which can lose fine-grained detail.

VGGNet:

- Massive Parameter Count: Despite having $3 \times 3$ filters, the network gets very wide in the later layers (up to 512 channels). VGG-16 is more than twice as large as AlexNet (about 138M vs. 60M parameters), making it slow to train and difficult to deploy on mobile devices.
- Fully Connected Bottleneck: Roughly 124M of its 138M parameters are in the last three FC layers, which inflates model size and memory use.
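Two of the claims above lend themselves to a quick calculation: the receptive field and weight count of stacked $3 \times 3$ layers versus one $7 \times 7$ layer, and the split of VGG-16's parameters between convolutional and fully connected layers. The sketch assumes the usual VGG-16 layout (configuration D, $224 \times 224$ input, 1000 classes); the channel width `C = 256` and the helper names are illustrative choices, not taken from the text.

```python
def stack_receptive_field(kernel: int, num_layers: int) -> int:
    """Receptive field of stride-1 convolutions stacked num_layers deep."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weight count of one conv layer (biases omitted)."""
    return in_ch * out_ch * kernel * kernel


C = 256  # illustrative channel width, kept constant through the stack
three_3x3 = 3 * conv_weights(C, C, 3)
one_7x7 = conv_weights(C, C, 7)
print(stack_receptive_field(3, 3), three_3x3, one_7x7)

# VGG-16 (configuration D): conv widths per block, "M" marks 2x2 max pooling.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

conv_params, in_ch = 0, 3
for v in cfg:
    if v != "M":
        conv_params += conv_weights(in_ch, v, 3) + v  # weights + biases
        in_ch = v

# Classifier: 512*7*7 -> 4096 -> 4096 -> 1000, weights + biases.
fc_dims = [512 * 7 * 7, 4096, 4096, 1000]
fc_params = sum(a * b + b for a, b in zip(fc_dims, fc_dims[1:]))

total = conv_params + fc_params
print(f"conv: {conv_params:,}  fc: {fc_params:,}  total: {total:,}")
print(f"share of parameters in FC layers: {fc_params / total:.1%}")
```

The stack of three $3 \times 3$ layers covers a $7 \times 7$ window with $27C^2$ weights instead of $49C^2$, and the fully connected layers account for the bulk of the roughly 138M parameters.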

Question 4: What is transfer learning in the context of image classification? Explain
how it helps in reducing computational costs and improving model performance with
limited data.
sol) Transfer learning is a machine learning technique where a model developed for one task (the "source" task) is reused as the starting point for a model on a second, related task (the "target" task). In computer vision, this typically involves taking a powerful model (like ResNet or VGG) that has already been trained on a massive dataset (like ImageNet, whose full collection contains about 14 million images) and adapting it to a specific, smaller dataset (such as identifying specific plant diseases or medical anomalies).

**1. How Transfer Learning Works**

A deep CNN can be thought of as two distinct parts:

- Feature Extractor (Base): The early layers that learn to recognize general features (edges, textures, shapes). These features are useful across most natural images.
- Classifier (Head): The final layers that map those features to specific categories (e.g., "Golden Retriever" vs. "German Shepherd").

To apply transfer learning, you:

- Keep the "Base": Reuse the pre-trained feature extractor.
- Replace the "Head": Swap the final classification layer with a new one tailored to your specific classes.
- Freeze Layers: "Lock" the weights of the early layers so they don't change during training, ensuring the model retains its general knowledge.

**2. Reducing Computational Costs**

Training a deep neural network from scratch is expensive in terms of time and hardware. Transfer learning reduces these costs in three ways:

- Fewer Parameters to Train: By "freezing" the majority of the network, you only calculate gradients and update weights for the final few layers. This can reduce the number of trainable parameters from millions to as few as several thousand, depending on the size of the new head.
- Faster Convergence: Because the model already "knows" what an edge or a curve looks like, it doesn't spend thousands of iterations moving away from random weight values. It often reaches peak accuracy in a fraction of the time (e.g., minutes instead of days).
- Lower Hardware Requirements: You don't need a massive cluster of GPUs to train the final layers of a pre-trained model; a single consumer-grade GPU or even a capable CPU can often handle it.

**3. Improving Performance with Limited Data**

Deep learning is notorious for being "data-hungry." If you try to train a 50-layer ResNet from scratch with only 100 images, the model will simply overfit: it will memorize those 100 images but fail to recognize anything else.

Transfer learning addresses this through Knowledge Transfer:

- Feature Richness: Even if your dataset only has 100 images of rare birds, the model starts with the "wisdom" gained from seeing millions of other objects. It already encodes lighting, shadows, and complex textures.
- Regularization Effect: The frozen weights act as a form of regularization. Because the feature extraction layers are already optimized and fixed, the model has far less capacity to overfit to the noise in your small dataset.

| Feature | Training from Scratch | Transfer Learning |
|---|---|---|
| Data needed | Massive (10,000+ images) | Small (100-1,000 images) |
| Training time | Days or weeks | Minutes or hours |
| Weight initialization | Random | Pre-trained (ImageNet) |
| Overfitting risk | High | Lower (better generalization) |
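The keep, replace, and freeze steps can be sketched in PyTorch without downloading any weights. The tiny `backbone` below is only a stand-in for a pre-trained network, and the 5-class head is a hypothetical target task; in real use the base would come from a pre-trained torchvision model.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained feature extractor (assumption: real use loads weights).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(32, 5)  # new head for a hypothetical 5-class target task
model = nn.Sequential(backbone, head)

# Freeze the base so only the head receives gradient updates.
for param in backbone.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable: {trainable:,}  frozen: {frozen:,}")

# One optimization step on random data: the frozen weights must not change.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
before = backbone[0].weight.clone()
logits = model(torch.randn(4, 3, 32, 32))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (4,)))
loss.backward()
optimizer.step()
print("backbone unchanged:", torch.equal(before, backbone[0].weight))
```

Passing only `head.parameters()` to the optimizer is a second safeguard: even if a base layer were left unfrozen by mistake, it would not be updated.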

Question 5: Describe the role of residual connections in ResNet architecture. How do
they address the vanishing gradient problem in deep CNNs?
sol) In the 2015 paper "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun introduced Residual Connections (also known as skip connections) to solve a counterintuitive problem: as neural networks became deeper, their accuracy began to saturate and then degrade rapidly.

**1. What are Residual Connections?**

In a traditional "plain" neural network, each stack of layers tries to learn a direct mapping, $H(x)$. If the network has 50 layers, the output of layer 1 must pass through 49 subsequent transformations to reach the end.

In a ResNet (Residual Network), the architecture is broken into Residual Blocks. Instead of asking a stack of layers to learn the full mapping $H(x)$, we ask them to learn the residual (the difference), defined as:

$$F(x) = H(x) - x$$

The actual output of the block is then calculated by adding the original input $x$ back to the learned result:

$$\text{Output} = F(x) + x$$

**2. Solving the Vanishing Gradient Problem**

The "vanishing gradient" occurs during backpropagation. To update the weights of early layers, the error signal (gradient) must be multiplied by the local derivative of every layer it passes through. If those factors are smaller than 1 in magnitude, the gradient shrinks exponentially as it travels backward, eventually becoming effectively zero before reaching the start of the network.

Residual connections address this in two key ways:

A. The "Gradient Superhighway"

Because the output is $F(x) + x$, the derivative during backpropagation always contains a "1" from the $x$ term:

$$\frac{\partial(F(x) + x)}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$$

This additive +1 acts as a shortcut. Even if the gradient through the convolutional layers ($F(x)$) vanishes or becomes noisy, the signal can still flow through the skip connection. This allows ResNet to train models with over a hundred layers (such as ResNet-101 and ResNet-152), and the paper also experiments with a network of more than a thousand layers.

B. Identity Mapping as a Default

In deep networks, it is surprisingly hard for a layer to learn to do "nothing" (an identity function where $\text{Output} = \text{Input}$). In a plain network, the weights must be precisely tuned to pass the signal through unchanged.

In a Residual Block, learning the identity function is easy: the network simply pushes the weights of $F(x)$ toward zero. If a layer is redundant, the network can effectively "skip" it by relying on the $x$ path, so that a deeper model should be at least as good as a shallower one.

**3. Impact on Performance**

| Feature | Plain Deep Network | ResNet (Residual) |
|---|---|---|
| Training error | Increases with extreme depth | Decreases with depth |
| Convergence | Slow/unstable | Fast and stable |
| Gradient flow | Multiplicative (shrinks) | Additive (preserved) |
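The effect of the "+1" term can be seen in a scalar toy model, where each layer is reduced to a single local derivative $\partial F / \partial x$. This is only an illustration of the chain-rule arithmetic, not a real network: the depth of 50 and the per-layer derivative of 0.01 are assumed values chosen to make the contrast visible.

```python
def plain_gradient(local_grads):
    """d(output)/d(input) through a chain of plain layers: a pure product."""
    grad = 1.0
    for g in local_grads:
        grad *= g
    return grad


def residual_gradient(local_grads):
    """Same chain with skip connections: each block contributes (dF/dx + 1)."""
    grad = 1.0
    for g in local_grads:
        grad *= (g + 1.0)
    return grad


# Assumed depth and per-layer derivative dF/dx, chosen small on purpose.
depth = 50
local_grads = [0.01] * depth

print(f"plain:    {plain_gradient(local_grads):.3e}")
print(f"residual: {residual_gradient(local_grads):.3f}")
```

With these values the plain product collapses to about $10^{-100}$, while the residual product stays near 1.6. When every $\partial F / \partial x$ is exactly zero, the residual chain passes the gradient through unchanged.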

Question 6: Implement the LeNet-5 architecture using Tensorflow or PyTorch to
classify the MNIST dataset. Report the accuracy and training time.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import time

# 1. Prepare Data (MNIST is 28x28, LeNet-5 expects 32x32)
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_set = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_set = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)

# 2. Define LeNet-5 Architecture
class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh()
        )
        self.classifier = nn.Sequential(
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, 10)
        )

    def forward(self, x):
        x = self.feature_extractor(x)
        x = torch.flatten(x, 1)
        logits = self.classifier(x)
        return logits

# 3. Training Loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LeNet5().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

start_time = time.time()
for epoch in range(5): # 5 Epochs is usually enough for MNIST
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

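training_time = time.time() - start_time

# 4. Evaluate on the test set and report accuracy and training time
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        predicted = model(images).argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Test accuracy: {100 * correct / total:.2f}%")
print(f"Training time: {training_time:.1f} s")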


Question 7: Use a pre-trained VGG16 model (via transfer learning) on a small custom
dataset (e.g., flowers or animals). Replace the top layers and fine-tune the model.
Include your code and result discussion.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, models, transforms
import time

# 1. Data Augmentation and Loading
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Using Flowers102 dataset (PyTorch built-in)
train_dataset = datasets.Flowers102(root='./data', split='train', download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)

# 2. Load Pre-trained VGG16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# 3. Freeze all convolutional layers
for param in model.features.parameters():
    param.requires_grad = False

# 4. Replace the classifier (Top layers)
# VGG16.classifier[6] is the final linear layer (4096 -> 1000)
num_features = model.classifier[6].in_features
model.classifier[6] = nn.Linear(num_features, 102) # 102 flower classes

model = model.to(device)

# 5. Define Loss and Optimizer (Only optimize the classifier)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.classifier.parameters(), lr=0.001)

# 6. Training (Short example: 2 Epochs)
model.train()
start_time = time.time()
for epoch in range(2):
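    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()   # Gradients reach only the unfrozen classifier layers
        optimizer.step()
    print(f"Epoch {epoch + 1}: last batch loss = {loss.item():.4f}")

print(f"Training time: {time.time() - start_time:.1f} s")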

Question 8: Write a program to visualize the filters and feature maps of the first
convolutional layer of AlexNet on an example input image.


In [None]:
import torch
import matplotlib.pyplot as plt
from torchvision import models, transforms
from PIL import Image
import requests
from io import BytesIO

# 1. Load Pre-trained AlexNet
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

# 2. Get the First Convolutional Layer
# AlexNet.features[0] is the first Conv2d layer
first_layer_weights = model.features[0].weight.data.cpu()

# 3. Visualization: Filters (Weights)
def plot_filters(weights):
    # Normalize weights to [0, 1] for visualization
    w_min, w_max = weights.min(), weights.max()
    weights = (weights - w_min) / (w_max - w_min)

    fig, axes = plt.subplots(8, 8, figsize=(10, 10))
    for i, ax in enumerate(axes.flat):
        if i < weights.shape[0]:
            # Show RGB filters
            ax.imshow(weights[i].permute(1, 2, 0))
        ax.axis('off')
    plt.suptitle("First Layer Convolutional Filters (AlexNet)")
    plt.show()

# 4. Visualization: Feature Maps (Activations)
def plot_feature_maps(model, img_tensor):
    # Pass image through the first layer only
    with torch.no_grad():
        feature_maps = model.features[0](img_tensor)

    fig, axes = plt.subplots(8, 8, figsize=(10, 10))
    for i, ax in enumerate(axes.flat):
        if i < feature_maps.shape[1]:
            ax.imshow(feature_maps[0, i].cpu(), cmap='gray')
        ax.axis('off')
    plt.suptitle("Feature Maps of First Conv Layer")
    plt.show()

# --- Execution ---
# Load a sample image (e.g., a dog)
url = "https://images.dog.ceo/breeds/retriever-golden/n02099601_3004.jpg"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Fail early if the download did not succeed
img = Image.open(BytesIO(response.content)).convert('RGB')

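preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # ImageNet statistics, matching the inputs the pre-trained weights expect
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])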
img_tensor = preprocess(img).unsqueeze(0)

plot_filters(first_layer_weights)
plot_feature_maps(model, img_tensor)

Question 9: Train a GoogLeNet (Inception v1) or its variant using a standard dataset
like CIFAR-10. Plot the training and validation accuracy over epochs and analyze
overfitting or underfitting.

In [None]:

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision.models import googlenet
import matplotlib.pyplot as plt

# 1. Data Augmentation (Crucial to prevent overfitting)
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True)

testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)
test_loader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False)

# 2. Load and Modify GoogLeNet
# We set aux_logits=False for simplicity in this example
model = googlenet(num_classes=10, aux_logits=False, init_weights=True)

# Optimization for CIFAR-10: The original first layer has 7x7 conv and stride 2.
# For 32x32 images, this downsamples too aggressively. We replace it.
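# conv1 is a BasicConv2d (conv + BatchNorm + ReLU); swap only its inner conv to keep the BatchNorm and ReLU
model.conv1.conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)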
model.maxpool1 = nn.Identity() # Skip early maxpool to keep spatial resolution

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# 3. Training Config
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)  # Match the 50 training epochs

# (Simplified Training Loop Tracking)
history = {'train_acc': [], 'val_acc': []}

for epoch in range(50):
    model.train()
    correct, total = 0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    history['train_acc'].append(100 * correct / total)
    scheduler.step()
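    # Validation pass (the CIFAR-10 test split is used as the validation set here)
    model.eval()
    val_correct, val_total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            predicted = model(images).argmax(dim=1)
            val_total += labels.size(0)
            val_correct += predicted.eq(labels).sum().item()
    history['val_acc'].append(100 * val_correct / val_total)

# 4. Plot training vs. validation accuracy over epochs
plt.plot(history['train_acc'], label='Training accuracy')
plt.plot(history['val_acc'], label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.legend()
plt.show()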
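For the overfitting/underfitting analysis, one rough approach is to compare the final training and validation accuracies. The helper below is a hypothetical sketch: `diagnose_fit` and both thresholds are illustrative choices rather than established cut-offs, and the accuracy lists are made-up numbers used only to exercise each branch.

```python
def diagnose_fit(train_acc, val_acc, gap_threshold=10.0, low_acc_threshold=70.0):
    """Label a run from its final-epoch accuracies (both in percent).

    The thresholds are illustrative defaults, not established cut-offs.
    """
    final_train, final_val = train_acc[-1], val_acc[-1]
    gap = final_train - final_val
    if final_train < low_acc_threshold:
        return "underfitting"   # the model has not even fit the training set
    if gap > gap_threshold:
        return "overfitting"    # training accuracy far ahead of validation
    return "reasonable fit"


# Made-up curves, used only to exercise the three branches.
print(diagnose_fit([40.0, 55.0, 62.0], [38.0, 52.0, 60.0]))
print(diagnose_fit([70.0, 90.0, 99.0], [65.0, 72.0, 74.0]))
print(diagnose_fit([70.0, 85.0, 92.0], [68.0, 82.0, 88.0]))
```

In practice the full curves matter as well: a validation curve that flattens or drops while training accuracy keeps rising is the usual visual sign of overfitting.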

Question 10: You are working in a healthcare AI startup. Your team is tasked with
developing a system that automatically classifies medical X-ray images into normal,
pneumonia, and COVID-19. Due to limited labeled data, what approach would you
suggest using among CNN architectures discussed (e.g., transfer learning with ResNet
or Inception variants)? Justify your approach and outline a deployment strategy for
production use.
sol) In a healthcare setting, the stakes for accuracy and reliability are exceptionally high, while the availability of high-quality labeled medical data is typically low.

**1. Suggested Approach: Transfer Learning with ResNet-50**

Among the architectures discussed, I suggest using Transfer Learning with ResNet-50 (or ResNet-101).

Justification

- Addressing Data Scarcity: Medical datasets are rarely as large as ImageNet. Transfer learning allows us to leverage pre-trained weights that already encode shapes, densities, and textures. We only need to "fine-tune" the model to distinguish between clinical pathologies like opacities (pneumonia) and ground-glass patterns (COVID-19).
- Vanishing Gradient Protection: X-ray images are high-resolution, but the differences between classes are subtle. To capture minute differences in lung tissue, we need a deep network. ResNet's residual connections ensure that gradients flow back to early layers, preventing the model from becoming "untrainable" during fine-tuning.
- Feature Re-usability: Research (e.g., CheXNet, a DenseNet-121 initialized from ImageNet weights) has shown that features learned on natural images (dogs, cars) transfer surprisingly well to medical imaging once the final layers are retrained.
- Inception vs. ResNet: While Inception variants (GoogLeNet) are efficient, ResNet tends to be more stable during training and is often the standard "baseline" for medical imaging research, making it easier to troubleshoot and benchmark.

**2. Outline of the Training Strategy**

- Step 1: Feature Extraction: Freeze the ResNet-50 backbone and train only the final classification head (Softmax layer with 3 outputs: Normal, Pneumonia, COVID-19).
- Step 2: Fine-Tuning: Once the head is stable, unfreeze the top-level residual blocks and retrain with a very low learning rate ($10^{-5}$ or $10^{-6}$) to adapt the filters to the specific "haze" and "shadows" of chest X-rays.
- Step 3: Class Imbalance Handling: COVID-19 cases may be fewer than pneumonia cases. I would use Weighted Cross-Entropy Loss or Oversampling to ensure the model doesn't become biased toward the more common classes.

**3. Deployment Strategy for Production**

Deploying in a clinical environment requires more than just a high-accuracy model; it requires trust and integration.

A. Inference Pipeline (Cloud or On-Prem)

- API Service: Wrap the model in a FastAPI or Flask service, containerized with Docker.
- Edge/Local Deployment: Given patient privacy (HIPAA/GDPR), deploying the model on a local server within the hospital's firewall (using NVIDIA Triton Inference Server) is often preferred over the public cloud.

B. Model Explainability (Crucial for Doctors)

Radiologists will not trust a "Black Box." I would implement Grad-CAM (Gradient-weighted Class Activation Mapping). This generates a heatmap over the X-ray, showing the doctor which area of the lung most influenced a prediction such as "Pneumonia."

C. Human-in-the-loop (HITL) Monitoring

The system should act as a triage tool, not a final diagnostic.

Strategy: High-confidence predictions can be flagged for immediate review, while low-confidence predictions are routed to a senior radiologist for manual verification. This creates a feedback loop where the model can be retrained on "hard" cases periodically.

Summary Table

| Component | Choice | Reason |
|---|---|---|
| Model | ResNet-50 | Robust, handles deep feature extraction, mitigates vanishing gradients. |
| Technique | Transfer Learning | Overcomes limited medical data; faster convergence. |
| Explainability | Grad-CAM | Provides visual "proof" for clinical trust. |
| Monitoring | Drift Detection | Detects accuracy loss if X-ray machines or acquisition settings change. |
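For the class imbalance step, one common way to derive weights for a weighted cross-entropy loss is inverse class frequency, $w_c = N / (K \cdot n_c)$, where $N$ is the number of samples, $K$ the number of classes, and $n_c$ the count for class $c$. The sketch below assumes this scheme; the function name and the label counts are made up for illustration and do not come from any real dataset.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * n_c), so rare classes count more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Made-up label distribution for illustration only.
labels = ["normal"] * 600 + ["pneumonia"] * 300 + ["covid19"] * 100
weights = inverse_frequency_weights(labels)
for cls in ("normal", "pneumonia", "covid19"):
    print(f"{cls}: {weights[cls]:.3f}")
```

In PyTorch, values like these could be converted to a tensor in class-index order and passed as the `weight` argument of `nn.CrossEntropyLoss`.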