This notebook can be [run on Google Colab using this link](https://colab.research.google.com/github/CS7150/CS7150-Homework-2/blob/main/HW2_3_CIFAR_classifier.ipynb).
# CIFAR-10 Classification (Fully-Connected vs. Convolutional)

In this notebook, we will:
1. Download **CIFAR-10** (a dataset of 32×32 color images in 10 classes).
2. Demonstrate a working classifier using **fully-connected (FC) layers** (a simple MLP).
3. **Exercise**: You will create a **convolutional** version for better efficiency.
4. Compare **parameter counts** and performance.

This exercise is an opportunity to understand the power of weight sharing and to experiment with a standard classification setting that was a focus of machine learning research for decades.

Try to improve the test performance of the network without making it more expensive to train. You will be graded only on the experimental findings you report at the end.

**Key Points**:
- CIFAR-10 has 60,000 images (50k train, 10k test).
- Each image is 3×32×32 (3 color channels).
- We’ll flatten each image into 3×32×32 = 3072 values as input to a fully-connected MLP.
- Then we’ll invite you to use convolutional layers, which drastically reduce parameters by sharing weights.
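
To make the last point concrete, here is a rough parameter count for one layer of each kind. The fully-connected layer uses 100 hidden units (the size used by the MLP in Section 2); the convolution with 8 filters of size 3×3 is only an illustrative choice, not a recommended design.

$$
\begin{aligned}
P_{\text{FC}} &= d_{\text{in}}\, d_{\text{out}} + d_{\text{out}} = 3072 \cdot 100 + 100 = 307{,}300,\\
P_{\text{conv}} &= C_{\text{in}}\, C_{\text{out}}\, k^2 + C_{\text{out}} = 3 \cdot 8 \cdot 3^2 + 8 = 224.
\end{aligned}
$$

The convolution reuses the same 224 parameters at every spatial position of the 32×32 image, whereas the fully-connected layer learns a separate weight for every pair of input value and hidden unit.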

---

## 1. Setup
We'll import **PyTorch** and **torchvision**, then load CIFAR-10. We’ll apply a minimal transformation pipeline (convert to tensors, and optionally normalize).

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)

# Basic transforms: ToTensor (range [0,1]), optional normalization.
transform = T.Compose([
    T.ToTensor(),
    # Optionally normalize: T.Normalize((0.5,0.5,0.5), (0.5,0.5,0.5))
])

# Download and create datasets
train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# Dataloaders
batch_size = 64
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=2)


Using device: cpu
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170M/170M [00:02<00:00, 81.6MB/s]


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
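
As a sketch of what the optional normalization in the commented-out line would do, the snippet below applies the same per-channel arithmetic, `(x - mean) / std`, to a random stand-in image. The random tensor is an assumption made so the example runs without downloading data. With a mean and std of 0.5, values in [0, 1] map to [-1, 1].

```python
import torch

# Stand-in for one image after T.ToTensor(): shape (3, 32, 32), values in [0, 1).
torch.manual_seed(0)
image = torch.rand(3, 32, 32)

# Per-channel mean and std, reshaped to broadcast over height and width.
mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

# The arithmetic of per-channel normalization: (x - mean) / std.
normalized = (image - mean) / std

print("before:", image.min().item(), image.max().item())
print("after: ", normalized.min().item(), normalized.max().item())
```

Whether normalization helps here is something to check empirically; it changes the input scale but not the parameter count.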


## 2. A Simple Fully-Connected (MLP) Classifier
We’ll define a basic MLP:
1. Flatten the 3×32×32 image (3072 dims).
2. Several **fully connected layers**, then 10 outputs (one per CIFAR-10 class).

We can train it for a few epochs—**this won't achieve high accuracy** (CNNs do much better), but it demonstrates the approach.

In [4]:
class SimpleMLP(nn.Module):
    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10):
        super().__init__()
        # A small 2-layer MLP:
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)
    def forward(self, x):
        # x: shape (batch, 3, 32, 32)
        batch_size = x.size(0)
        x = x.view(batch_size, -1)  # flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

mlp = SimpleMLP().to(device)
print("MLP parameter count:", sum(p.numel() for p in mlp.parameters() if p.requires_grad))


MLP parameter count: 308310
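
To see where that total comes from, the sketch below lists the parameters tensor by tensor. It uses an `nn.Sequential` stand-in with the same two linear layers as `SimpleMLP` (an assumption made so the block is self-contained); the same loop works on any `nn.Module`.

```python
import torch.nn as nn

# Stand-in with the same two linear layers as SimpleMLP (3072 -> 100 -> 10).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
)

total = 0
for name, p in model.named_parameters():
    if p.requires_grad:
        # Name, shape, and element count of each trainable tensor.
        print(f"{name:10s} {tuple(p.shape)!s:14s} {p.numel():>8,d}")
        total += p.numel()
print("total:", total)  # total: 308310
```

The first weight matrix (3072×100 = 307,200 entries) accounts for almost all of the parameters.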


### 2.1 Training Loop
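We define two simple functions: `train_epoch`, which trains the model for one pass over the data and returns the mean batch loss, and `test_accuracy`, which measures classification accuracy (in percent) on a dataset.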

In [5]:
import torch.optim as optim

def train_epoch(model, loader, optimizer, loss_fn=nn.CrossEntropyLoss()):
    model.train()
    total_loss = 0.
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        preds = model(images)
        loss = loss_fn(preds, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

def test_accuracy(model, loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images)
            predicted = preds.argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total
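
The accuracy computation inside `test_accuracy` can be traced by hand on a tiny example. The logits and labels below are made up for illustration (3 classes instead of 10).

```python
import torch

# Hypothetical logits for 4 images and 3 classes (CIFAR-10 would have 10 columns).
logits = torch.tensor([
    [2.0, 0.1, -1.0],   # argmax -> class 0
    [0.3, 0.2, 1.5],    # argmax -> class 2
    [-0.5, 0.9, 0.0],   # argmax -> class 1
    [1.2, 1.1, 0.4],    # argmax -> class 0
])
labels = torch.tensor([0, 2, 1, 1])

predicted = logits.argmax(dim=1)              # predicted class per row
correct = (predicted == labels).sum().item()  # number of matches
accuracy = 100.0 * correct / labels.size(0)
print(predicted.tolist(), accuracy)  # [0, 2, 1, 0] 75.0
```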


Now let's do a short training run on the MLP. **Note** that this won't get anywhere close to state-of-the-art accuracy on CIFAR-10, but it demonstrates the pipeline. We'll train for **3** epochs, just enough to see that it learns something (chance level for 10 balanced classes is 10%).

In [6]:
mlp = SimpleMLP().to(device)
optimizer = optim.Adam(mlp.parameters(), lr=1e-3)

epochs = 3  # can increase if you want
for epoch in range(1, epochs+1):
    train_loss = train_epoch(mlp, train_loader, optimizer)
    test_acc = test_accuracy(mlp, test_loader)
    print(f"Epoch {epoch}/{epochs}, train loss={train_loss:.4f}, test acc={test_acc:.2f}%")

Epoch 1/3, train loss=1.8878, test acc=35.29%
Epoch 2/3, train loss=1.7429, test acc=39.77%
Epoch 3/3, train loss=1.6839, test acc=40.44%


## 3. Exercise: Use a Stack of Convolutions

CIFAR-10 consists of **2D images**, so we can do **far better** with **convolutional** layers, which apply the same small set of weights at every spatial location.

### Your Tasks
1. **Construct** a new network (say `ConvNet`) with multiple convolutional layers, optional pooling, etc.
2. **Count** the number of parameters. *(Hint: `sum(p.numel() for p in model.parameters() if p.requires_grad)`.)*
3. **Train** this model on CIFAR-10. Try to achieve comparable or better accuracy than the MLP **with fewer parameters**.

### Suggested Skeleton Code
Below is a minimal skeleton. Feel free to modify layer dimensions, add pooling, or add more conv layers. We provide the class structure for you to fill in.

In [13]:
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x  # Preserve input
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity  # Add residual
        return F.relu(out)


class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # TODO: define your convolutional layers here.
        # e.g.
        # self.conv1 = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
        # self.pool = nn.MaxPool2d(2,2)
        # etc.
        # Then define a final linear layer.
        # You have to figure out the shape after the conv layers.


        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.res1 = ResidualBlock(32)
        self.res2 = ResidualBlock(32)
        self.pool1 = nn.MaxPool2d(2,2)  # 32x32 → 16x16

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.res3 = ResidualBlock(64)
        self.res4 = ResidualBlock(64)
        self.pool2 = nn.MaxPool2d(2,2)  # 16x16 → 8x8

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.res5 = ResidualBlock(128)
        self.res6 = ResidualBlock(128)
        self.pool3 = nn.MaxPool2d(2,2)  # 8x8 → 4x4

        self.fc1 = nn.Linear(128*4*4, 256)
        self.fc2 = nn.Linear(256, num_classes)


        # self.conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        # self.pool = nn.MaxPool2d(2,2)
        # self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        # self.pool2 = nn.MaxPool2d(2,2)
        # self.conv3 = nn.Conv2d(16, 16, kernel_size=3, padding=1)
        # self.pool3 = nn.MaxPool2d(2,2)
        # # self.conv5 = nn.Conv2d(32, 16, kernel_size=3, padding=1)
        # # self.conv6 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        # # self.pool3 = nn.MaxPool2d(2,2)
        # # after 2 conv+pool steps, etc...
        # # But let's suppose we do only 1 pool, etc.

        # self.fc = nn.Linear(16*4*4, 128)  # Just a guess of dimensions.
        # self.fc3 = nn.Linear(128, 64)
        # self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):
        # # x: (batch, 3, 32, 32)
        # x = F.relu(self.conv1(x))  # (batch,8,32,32)
        # x = self.pool(x)  # (batch,8,16,16)
        # x = F.relu(self.conv2(x))  # (batch,16,16,16)
        # x = self.pool2(x)  # (batch,16,8,8)
        # x = F.relu(self.conv3(x))  # (batch,16,8,8)
        # x = self.pool3(x)  # (batch,16,4,4)
        # batch_size = x.size(0)
        # x = x.view(batch_size, -1)
        # x = self.fc(x)
        # x = self.fc3(x)
        # x = self.fc2(x)
        # return x
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.res1(x)
        x = self.res2(x)
        x = self.pool1(x)

        x = F.relu(self.bn2(self.conv2(x)))
        x = self.res3(x)
        x = self.res4(x)
        x = self.pool2(x)

        x = F.relu(self.bn3(self.conv3(x)))
        x = self.res5(x)
        x = self.res6(x)
        x = self.pool3(x)

        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)

        return x
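
The `128*4*4` input size of `fc1` follows from the standard output-size formula, floor((n + 2p - k) / s) + 1, for a layer with kernel size k, stride s, and padding p (dilation 1). The sketch below applies it to the three conv-plus-pool stages; the helper name `conv_out` is hypothetical and not part of PyTorch. The residual blocks also use 3×3 kernels with padding 1, so they leave the spatial size unchanged.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32
for stage in range(3):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv, padding=1 keeps the size
    size = conv_out(size, kernel=2, stride=2)             # 2x2 max pool halves it
    print(f"after stage {stage + 1}: {size}x{size}")

flat_features = 128 * size * size  # 128 channels in the last stage
print("fc1 input features:", flat_features)  # fc1 input features: 2048
```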

### 3.1 Code: Train Your ConvNet
**Exercise**: Implement the training loop (similar to the MLP), measure test accuracy, and see how you can reduce or increase parameters to trade off accuracy vs. model size.

Examples:
- Add more conv layers or channels.
- Add more or fewer pooling layers.
- Print out the param count.
- Play with other architectural tricks such as residual connections.
- Tweak the learning rate or optimizer.

Try to see how low you can go in param count while maintaining a decent accuracy!

In [None]:
# STUDENT EXERCISE:
convnet = ConvNet().to(device)
print("ConvNet param count:", sum(p.numel() for p in convnet.parameters() if p.requires_grad))

optimizer_conv = optim.Adam(convnet.parameters(), lr=1e-3)
epochs_conv = 3
for epoch in range(1, epochs_conv+1):
    train_loss = train_epoch(convnet, train_loader, optimizer_conv)
    test_acc = test_accuracy(convnet, test_loader)
    print(f"[ConvNet] Epoch {epoch}/{epochs_conv}, train loss={train_loss:.4f}, test acc={test_acc:.2f}%")

print("\nNow consider adjusting your ConvNet architecture, parameter count, etc. for better results.")

ConvNet param count: 1396746
[ConvNet] Epoch 1/3, train loss=1.3916, test acc=63.92%
[ConvNet] Epoch 2/3, train loss=0.8237, test acc=68.74%
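
To see where the 1,396,746 parameters sit, the count can be reproduced by hand from the layer definitions in `ConvNet`. This is a sketch in plain Python; the helper names are hypothetical, and each `BatchNorm2d` is counted as one learnable scale and one shift per channel.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    return c_in * c_out * k * k + (c_out if bias else 0)

def bn_params(c):
    return 2 * c  # learnable scale and shift

def res_block_params(c):
    # Two bias-free 3x3 convs, each followed by BatchNorm2d.
    return 2 * (conv_params(c, c, bias=False) + bn_params(c))

def linear_params(d_in, d_out):
    return d_in * d_out + d_out

breakdown = {
    "stage 1 (3->32)": conv_params(3, 32) + bn_params(32) + 2 * res_block_params(32),
    "stage 2 (32->64)": conv_params(32, 64) + bn_params(64) + 2 * res_block_params(64),
    "stage 3 (64->128)": conv_params(64, 128) + bn_params(128) + 2 * res_block_params(128),
    "fc1 (2048->256)": linear_params(128 * 4 * 4, 256),
    "fc2 (256->10)": linear_params(256, 10),
}
total = sum(breakdown.values())
for name, n in breakdown.items():
    print(f"{name:18s} {n:>9,d}  ({100 * n / total:.1f}%)")
print("total:", total)  # total: 1396746
```

Under this accounting, stage 3 and `fc1` together hold roughly 85% of the parameters, so those are the natural places to look when shrinking the model.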


## 4. Report Your Findings

Points to understand:

1. A **fully-connected** approach to image classification on a dataset such as CIFAR-10 can work, but it tends to have **many** parameters (e.g., 3,072×100 = 307,200 weights in a single layer, even on tiny images) and typically yields lower accuracy than modern **convolutional** architectures.
2. **Convolution** drastically reduces parameter counts via **weight sharing**, can often achieve much higher accuracy on image tasks, and is typically *translation-equivariant*.
3. Your goal is to **experiment** with different conv net designs to minimize param count while maximizing accuracy.
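
The translation-equivariance mentioned in point 2 can be checked numerically. The sketch below compares "shift, then convolve" with "convolve, then shift" on a random input. It uses circular padding, an assumption made so that the property holds exactly; with the zero padding used elsewhere in this notebook it holds only away from the image border.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Circular padding makes the shift wrap around, so equivariance is exact.
conv = nn.Conv2d(3, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.rand(1, 3, 32, 32)
shift = 5  # shift along the width dimension

with torch.no_grad():
    shift_then_conv = conv(torch.roll(x, shifts=shift, dims=3))
    conv_then_shift = torch.roll(conv(x), shifts=shift, dims=3)

# The two results agree up to floating-point tolerance.
equivariant = torch.allclose(shift_then_conv, conv_then_shift, atol=1e-5)
print("translation-equivariant:", equivariant)
```

A fully-connected layer has no such property: shifting the input generally changes its output in an unrelated way.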

Report here at least two iterations of your architectural experiments:

1. Using an architecture consisting of $\fbox{your answer}$, I was able to reduce the parameterization to $\fbox{your answer}$ parameters and achieve test accuracy of $\fbox{your answer}$ after three epochs of training.

2. In a second test, I tried an architecture consisting of $\fbox{your answer}$.  That used an even smaller parameterization, with only $\fbox{your answer}$ parameters, and it achieved test accuracy of $\fbox{your answer}$ after three epochs of training.

Good luck!