# CS454-554 Homework 4 Report: KMNIST Classification

**Due:** May 14, 2025, 23:00
**Author:** **DEMBA SOW**

---

## 1. Introduction

The objective of this assignment is to implement and compare multiple neural network architectures on the KMNIST dataset—a 10-class classification problem of 28×28 grayscale Japanese character images—using PyTorch. We evaluate:

1. **Baseline Models**

   * Linear
   * Multilayer Perceptron (40 hidden units)
   * Simple CNN
2. **Extended CNN Variants**

   * CNN‑A (Deeper + BatchNorm + Dropout)
   * CNN‑B (Wider Filters + Global Pooling)
   * CNN‑C (Residual Connections)
3. **Hyperparameter Tuning & Early Stopping**
4. **Comparison & Discussion**

---

## 2. Environment & Data Loading

```python
import torch
from torch import nn, optim
from torchvision.datasets import KMNIST
from torchvision import transforms
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_ds = KMNIST('./data', train=True,  transform=transform, download=True)
test_ds  = KMNIST('./data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_ds, batch_size=64,  shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=64)
```
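
As a quick check on the preprocessing above: `ToTensor()` scales 8-bit pixel values to [0, 1], and `Normalize((0.5,), (0.5,))` then computes (x - 0.5) / 0.5, so the network sees inputs in [-1, 1]. The sketch below reproduces that arithmetic in plain Python. `normalize_pixel` is a hypothetical helper for illustration only and is not part of the submitted code.

```python
def normalize_pixel(raw, mean=0.5, std=0.5):
    """Mimic ToTensor() followed by Normalize((mean,), (std,)) for one 8-bit pixel."""
    scaled = raw / 255.0           # ToTensor: [0, 255] -> [0.0, 1.0]
    return (scaled - mean) / std   # Normalize: (x - mean) / std

# Darkest, mid-gray, and brightest pixel values
for raw in (0, 128, 255):
    print(raw, round(normalize_pixel(raw), 3))
```

The darkest pixel maps to -1 and the brightest to 1, which keeps the inputs roughly centered around zero.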

---

## 3. Model Architectures

### 3.1. Linear Model

* Single `nn.Linear(28×28, 10)` layer

```python
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(28*28, 10)
    def forward(self, x):
        x = x.view(x.size(0), -1)
        return self.fc(x)
```

### 3.2. MLP (40 Hidden Units)

* Flatten → Linear(784→40) → ReLU → Linear(40→10)
```python
class MLP40(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 40),
            nn.ReLU(),
            nn.Linear(40, 10)
        )
    def forward(self, x):
        return self.net(x)
```

### 3.3. Simple CNN

* Conv(1→16,3) → ReLU → MaxPool(2)
* Conv(16→32,3) → ReLU → MaxPool(2)
* FC(32×7×7→64) → ReLU → FC(64→10)

```python
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16×14×14
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)  # 32×7×7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32*7*7, 64), nn.ReLU(),
            nn.Linear(64, 10)
        )
    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
```
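
The `32*7*7` input size of the first fully connected layer follows from the standard output-size formula for convolution and pooling layers. A minimal sketch of that arithmetic in plain Python, assuming stride 1 for the convolutions and the default `MaxPool2d` stride (equal to the kernel size); `conv2d_out` and `maxpool_out` are hypothetical helper names:

```python
def conv2d_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def maxpool_out(size, kernel):
    """Output size of MaxPool2d(kernel) with stride = kernel and no padding."""
    return (size - kernel) // kernel + 1

size = 28
size = conv2d_out(size, kernel=3, padding=1)  # 28: padding=1 preserves the size
size = maxpool_out(size, 2)                   # 14
size = conv2d_out(size, kernel=3, padding=1)  # 14
size = maxpool_out(size, 2)                   # 7
flat_features = 32 * size * size              # 32 channels * 7 * 7
print(size, flat_features)
```

Without `padding=1`, each 3×3 convolution would shrink the feature map by 2 pixels and the flattened size would no longer be 32×7×7.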

---

## 4. Training & Evaluation

* **Loss:** `nn.CrossEntropyLoss`
* **Optimizer:** `optim.Adam` (LR=0.001 for baseline models, 0.0005 for extended models)
* **Batch size:** 64 (Baseline), 128 (Extended)
* **Epochs:** 20 (Baseline), 50 (Extended)
* **Early Stopping:** Monitor validation loss, stop if no improvement for 3 epochs
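
The early-stopping rule listed above (stop after 3 epochs without improvement in validation loss) can be sketched as a small tracker in plain Python. `EarlyStopper` is a hypothetical name and the loss values are made up for illustration; this is a sketch of the rule, not the code used for the reported runs.

```python
class EarlyStopper:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, for illustration only
losses = [1.00, 0.90, 0.95, 0.92, 0.91, 0.80]
stopper = EarlyStopper(patience=3)
stop_epoch = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stop_epoch = epoch
        break
print(stop_epoch, stopper.best)
```

In this toy sequence training stops before the final, lower loss is ever seen, which shows the trade-off in choosing a small patience value.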

---

## 5. Baseline Results

### 5.1. Accuracy & Loss Curves

* **Linear:** 
![Linear Accuracy & Loss Curves](./output/linear-model.png)

* **MLP40:** 
![MLP40 Accuracy & Loss Curves](./output/mlp-40-model.png)

* **Simple CNN:** 
![Simple CNN Accuracy & Loss Curves](./output/simple-cnn-model.png)


### 5.2. Numerical Results

| Model      | Epoch | Train Loss | Train Acc | Test Loss | Test Acc |
| ---------- | :---: | :--------: | :-------: | :-------: | :------: |
| Linear     |   10  |    0.560   |   0.835   |   1.036   |   0.694  |
| MLP40      |   10  |    0.159   |   0.954   |   0.545   |   0.847  |
| Simple CNN |   10  |    0.016   |   0.995   |   0.307   |   0.940  |


> **Observations:**
>
> * The Linear model underfits, plateauing at \~69% accuracy.
> * MLP40 improves to \~84.7% by adding non-linearity.
> * Simple CNN achieves \~94%, leveraging spatial features.
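
The gap between train and test accuracy is a simple way to compare how well each baseline generalizes. The sketch below computes it from the values in the table above; the dictionary layout is an assumption made for illustration.

```python
# (train_acc, test_acc) at epoch 10, copied from the table above
results = {
    "Linear": (0.835, 0.694),
    "MLP40": (0.954, 0.847),
    "Simple CNN": (0.995, 0.940),
}

# Generalization gap = train accuracy - test accuracy
gaps = {name: round(train - test, 3) for name, (train, test) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
```

By this measure the gap narrows from the linear model to the Simple CNN, even though the CNN fits the training set most closely.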

---

## 6. Extended CNN Variants

### 6.1. Loss & Accuracy Curves

* **CNN‑A:** 
![CNN-A Loss & Accuracy Curves](./output/cnn_a-model.png)

* **CNN‑B:**
![CNN-B Loss & Accuracy Curves](./output/cnn_b-model.png)

* **CNN‑C:** 
![CNN-C Loss & Accuracy Curves](./output/cnn_c-model.png)

### 6.2. Numerical Comparison

| Model | Best Epoch | Test Loss | Test Acc   |
| ----- | ---------- | --------- | ---------- |
| CNN‑A | 14         | \~0.169   | **0.9566** |
| CNN‑B | 7          | \~0.623   | 0.8206     |
| CNN‑C | 6          | \~0.252   | 0.9347     |


---

## 7. Discussion & Conclusion

The extended CNN variants produced mixed results relative to the baseline Simple CNN:

* **CNN‑A**, with additional depth, BatchNorm, and Dropout, achieved the highest test accuracy of **95.66%**. Its regularization and normalization helped generalization.
* **CNN‑B**, despite wider filters and global pooling, plateaued early, likely due to over-smoothing from global pooling.
* **CNN‑C** used residual connections and performed strongly at **93.47%**, showing effectiveness of skip connections for stability.
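
To illustrate the global-pooling hypothesis for CNN‑B: global average pooling reduces each feature map to its mean, so two maps with the same mean but different spatial layouts look identical to the final linear layer. A toy sketch in plain Python, where the 2×2 maps are made-up values and `global_avg_pool` is a hypothetical helper:

```python
def global_avg_pool(feature_map):
    """Average all values of a 2-D feature map, as global average pooling does per channel."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

# Two made-up feature maps with the activation in opposite corners
top_left = [[4.0, 0.0],
            [0.0, 0.0]]
bottom_right = [[0.0, 0.0],
                [0.0, 4.0]]

pooled_a = global_avg_pool(top_left)
pooled_b = global_avg_pool(bottom_right)
print(pooled_a, pooled_b)
```

Both maps pool to the same value. Whether this loss of positional information is what limits CNN‑B here is not established by the results; it is one plausible explanation.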

Compared to Simple CNN (94%), only CNN‑A (95.66%) improved test accuracy. CNN‑C (93.47%) came close to the baseline, and CNN‑B (82.06%) fell well short, leaving CNN‑A as the best model.

**Best Model: CNN‑A** with hyperparameters (lr=0.0005, batch=128), early stopped at epoch 14.

---

## 8. Submission Details

* **Report:** `hw4_report.pdf`
* **Source Code:** `hw4_kmnist.py`
* **Requirements:** `requirements.txt`
* **Environment:** Python >= 3.12, PyTorch >= 1.12.0, torchvision >= 0.13.0
* **Dataset:** KMNIST (downloaded automatically)
* **Hardware:** NVIDIA GeForce RTX 3060
* **Training Time:** \~1 hour for all models
* **GPU Utilization:** 100% during training
* **Memory Usage:** \~4GB
* **Batch Size:** 64 for baseline models, 128 for extended models
* **Epochs:** 20 for baseline, 50 for extended models
* **Early Stopping:** Implemented for extended models


## CS454-554 Homework 4: KMNIST Classification with PyTorch
**Student Name:** Demba Sow
**Due:** May 14, 2025, 23:00


---

### 1. Introduction

The goal of this assignment is to implement and compare three neural network architectures of increasing complexity on the KMNIST dataset (10-class Japanese character images, 28×28 grayscale) using PyTorch. We evaluate:

1. **Linear model**
2. **Multilayer Perceptron** (single hidden layer with 40 neurons)
3. **Convolutional Neural Network** (custom CNN)

For each architecture, we present:

* **Model architecture details**
* **Training & test loss curves**
* **Training & test accuracy curves**
* **Comparative analysis**

---

### 2. Setup: Data & Utilities

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.datasets import KMNIST
from torchvision import transforms
from torch.utils.data import DataLoader

# Device
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyperparameters
BATCH_SIZE = 64
LR = 1e-3
EPOCHS = 20

# Data transforms & loaders
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_ds = KMNIST('./data', train=True, transform=transform, download=True)
test_ds  = KMNIST('./data', train=False, transform=transform, download=True)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=BATCH_SIZE)

# Training utilities
def train_epoch(model, loader, criterion, optimizer):
    """Train for one epoch; return (mean loss, accuracy) over the dataset."""
    model.train()
    total_loss = total_correct = 0
    for X, y in loader:
        X, y = X.to(DEVICE), y.to(DEVICE)
        preds = model(X)
        loss = criterion(preds, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # criterion returns the batch mean, so weight it by the batch size
        total_loss += loss.item() * X.size(0)
        total_correct += (preds.argmax(1) == y).sum().item()
    n = len(loader.dataset)
    return total_loss / n, total_correct / n

def eval_model(model, loader, criterion):
    """Evaluate without gradient tracking; return (mean loss, accuracy)."""
    model.eval()
    total_loss = total_correct = 0
    with torch.no_grad():
        for X, y in loader:
            X, y = X.to(DEVICE), y.to(DEVICE)
            out = model(X)
            total_loss += criterion(out, y).item() * X.size(0)
            total_correct += (out.argmax(1) == y).sum().item()
    n = len(loader.dataset)
    return total_loss / n, total_correct / n
```
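
The utilities above multiply each batch loss by `X.size(0)` before summing. `nn.CrossEntropyLoss` returns the batch mean by default, and the final batch can be smaller than the rest (60,000 training images with a batch size of 64 leave a last batch of 32), so a plain average of batch means would over-weight that last batch. A small sketch with made-up numbers:

```python
# Made-up per-batch mean losses and batch sizes (the last batch is smaller)
batch_losses = [0.50, 0.50, 2.00]
batch_sizes = [64, 64, 32]

# Naive: average of batch means, which over-weights the small last batch
naive = sum(batch_losses) / len(batch_losses)

# Weighted: sum(loss * batch_size) / total samples, as in train_epoch and eval_model
weighted = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(naive, weighted)
```

The weighted value is the true per-sample mean, which makes losses comparable across loaders with different batch sizes.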

---

### 3. Linear Model

#### 3.1 Architecture

A single fully-connected layer mapping the flattened 28×28 input to 10 outputs.

```python
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(28*28, 10)
    def forward(self, x):
        x = x.view(x.size(0), -1)
        return self.fc(x)
```

#### 3.2 Training

```python
model = LinearModel().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
hist_lin = {'train_loss':[], 'train_acc':[], 'test_loss':[], 'test_acc':[]}

for epoch in range(1, EPOCHS+1):
    tl, ta = train_epoch(model, train_loader, criterion, optimizer)
    vl, va = eval_model(model, test_loader,   criterion)
    hist_lin['train_loss'].append(tl)
    hist_lin['train_acc'].append(ta)
    hist_lin['test_loss'].append(vl)
    hist_lin['test_acc'].append(va)
    print(f"Epoch {epoch}: TL={tl:.3f}, TA={ta:.3f}, VL={vl:.3f}, VA={va:.3f}")
```

#### 3.3 Results & Analysis

| Epoch | Train Loss | Test Loss | Train Acc | Test Acc |
| ----- | ---------- | --------- | --------- | -------- |
| 20    | 0.567      | 1.059     | 82.9%     | 68.8%    |

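* **Analysis:** The linear model underfits: its low capacity leads to high bias, and training accuracy stalls at 82.9%. The \~14-point gap between train and test accuracy (82.9% vs. 68.8%) shows that it also generalizes poorly to the test set.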

---

### 4. Multilayer Perceptron (MLP40)

#### 4.1 Architecture

One hidden layer of 40 ReLU units, followed by a 10-way output.

```python
class MLP40(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 40), nn.ReLU(),
            nn.Linear(40, 10)
        )
    def forward(self, x):
        return self.net(x)
```

#### 4.2 Training

```python
model = MLP40().to(DEVICE)
optimizer = optim.Adam(model.parameters(), lr=LR)
hist_mlp = {'train_loss':[], 'train_acc':[], 'test_loss':[], 'test_acc':[]}

for epoch in range(1, EPOCHS+1):
    tl, ta = train_epoch(model, train_loader, criterion, optimizer)
    vl, va = eval_model(model, test_loader,   criterion)
    hist_mlp['train_loss'].append(tl)
    hist_mlp['train_acc'].append(ta)
    hist_mlp['test_loss'].append(vl)
    hist_mlp['test_acc'].append(va)
    print(f"Epoch {epoch}: TL={tl:.3f}, TA={ta:.3f}, VL={vl:.3f}, VA={va:.3f}")
```

#### 4.3 Results & Analysis

| Epoch | Train Loss | Test Loss | Train Acc | Test Acc |
| ----- | ---------- | --------- | --------- | -------- |
| 20    | 0.006      | 0.450     | 99.8%     | 94.8%    |

* **Analysis:** The MLP fits the training set almost perfectly (99.8%) but drops \~5 points on the test set (94.8%), indicating overfitting. Because the input is flattened, the spatial structure of the images is not exploited.

---

### 5. Convolutional Neural Network (SimpleCNN)

#### 5.1 Architecture

Two conv→ReLU→pool blocks, then FC(64)→ReLU→FC(10).

```python
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1,16,3,padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16,32,3,padding=1), nn.ReLU(), nn.MaxPool2d(2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32*7*7,64), nn.ReLU(),
            nn.Linear(64,10)
        )
    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
```

#### 5.2 Training

```python
model = SimpleCNN().to(DEVICE)
optimizer = optim.Adam(model.parameters(), lr=LR)
hist_cnn = {'train_loss':[], 'train_acc':[], 'test_loss':[], 'test_acc':[]}

for epoch in range(1, EPOCHS+1):
    tl, ta = train_epoch(model, train_loader, criterion, optimizer)
    vl, va = eval_model(model, test_loader,   criterion)
    hist_cnn['train_loss'].append(tl)
    hist_cnn['train_acc'].append(ta)
    hist_cnn['test_loss'].append(vl)
    hist_cnn['test_acc'].append(va)
    print(f"Epoch {epoch}: TL={tl:.3f}, TA={ta:.3f}, VL={vl:.3f}, VA={va:.3f}")
```

#### 5.3 Results & Analysis

| Epoch | Train Loss | Test Loss | Train Acc | Test Acc |
| ----- | ---------- | --------- | --------- | -------- |
| 20    | 0.001      | 0.527     | 99.9%     | 94.6%    |

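* **Analysis:** SimpleCNN captures spatial features and roughly matches MLP40's test accuracy (94.6% vs. 94.8%). Its train-test gap (\~5.3 points) and test loss (0.527 vs. 0.450) are no smaller than MLP40's, so at epoch 20 these numbers do not show reduced overfitting.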

---

### 6. Comparative Analysis

| Model     | Test Acc | Remarks                                 |
| --------- | -------- | --------------------------------------- |
| Linear    | 68.8%    | Underfits—insufficient capacity         |
| MLP40     | 94.8%    | Overfits—lacks spatial inductive bias   |
| SimpleCNN | 94.6%    | Balanced—leverages convolutional layers |

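**Conclusion:** Both nonlinear models far outperform the linear baseline, and MLP40 and SimpleCNN reach nearly identical test accuracy (94.8% vs. 94.6%) in this run. The convolutional architecture adds a spatial inductive bias and different capacity, which makes it a natural base for further work. Future work can explore deeper CNNs, regularization, and data augmentation for further gains.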
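
Capacity can be made concrete by counting learnable parameters from the layer shapes given in Sections 3 to 5. The sketch below does this in plain Python; `linear_params` and `conv_params` are hypothetical helpers, and the counts assume every layer has a bias term (the PyTorch default).

```python
def linear_params(n_in, n_out):
    """Weights plus biases of nn.Linear(n_in, n_out)."""
    return n_in * n_out + n_out

def conv_params(c_in, c_out, k):
    """Weights plus biases of nn.Conv2d(c_in, c_out, k)."""
    return c_in * c_out * k * k + c_out

params = {
    "Linear": linear_params(784, 10),
    "MLP40": linear_params(784, 40) + linear_params(40, 10),
    "SimpleCNN": (conv_params(1, 16, 3) + conv_params(16, 32, 3)
                  + linear_params(32 * 7 * 7, 64) + linear_params(64, 10)),
}
for name, n in params.items():
    print(f"{name}: {n:,} parameters")
```

By this count SimpleCNN has roughly three times as many parameters as MLP40, most of them in its first fully connected layer rather than in the convolutions.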

---

### 6. Extended CNN Variants

We now explore three deeper CNNs: `CNN_A`, `CNN_B`, and `CNN_C`—all trained with early stopping (patience=3), lr=5e-4, bs=128, max epochs=50.

#### 6.1 Architectures

```python
class CNN_A(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1,32,3,padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32,64,3,padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.25)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64*7*7,128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128,10)
        )
    def forward(self,x): return self.classifier(self.features(x))

class CNN_B(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1,16,5,padding=2), nn.ReLU(),
            nn.Conv2d(16,32,5,padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32,64,3,padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1)
        )
        self.fc = nn.Linear(64,10)
    def forward(self,x): return self.fc(self.features(x).view(x.size(0),-1))

class ResidualBlock(nn.Module):
    def __init__(self,in_ch,out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch,out_ch,3,padding=1), nn.ReLU(),
            nn.Conv2d(out_ch,out_ch,3,padding=1)
        )
        self.skip = nn.Conv2d(in_ch,out_ch,1) if in_ch!=out_ch else nn.Identity()
    def forward(self,x): return torch.relu(self.conv(x)+self.skip(x))

class CNN_C(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = ResidualBlock(1,16)
        self.pool1  = nn.MaxPool2d(2)
        self.layer2 = ResidualBlock(16,32)
        self.pool2  = nn.MaxPool2d(2)
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(32*7*7,64), nn.ReLU(), nn.Linear(64,10)
        )
    def forward(self,x):
        x = self.pool1(self.layer1(x))
        x = self.pool2(self.layer2(x))
        return self.classifier(x)
```

#### 6.2 Training with Early Stopping

```python
# Early-stopping loop
def train_es(model, train_dl, val_dl, criterion, optimizer, max_epochs=50, patience=3):
    best_val=float('inf'); counter=0; history={'train_loss':[],'val_loss':[],'val_acc':[]}
    for ep in range(1,max_epochs+1):
        # train
        model.train(); tloss=0
        for X,y in train_dl:
            X,y = X.to(DEVICE), y.to(DEVICE)
            optimizer.zero_grad(); loss=criterion(model(X),y)
            loss.backward(); optimizer.step()
            tloss += loss.item()*X.size(0)
        tloss/=len(train_dl.dataset)
        # val
        model.eval(); vloss=0; correct=0
        with torch.no_grad():
            for X,y in val_dl:
                X,y = X.to(DEVICE), y.to(DEVICE)
                out=model(X)
                vloss += criterion(out,y).item()*X.size(0)
                correct += (out.argmax(1)==y).sum().item()
        vloss/=len(val_dl.dataset); vacc=correct/len(val_dl.dataset)
        history['train_loss'].append(tloss); history['val_loss'].append(vloss); history['val_acc'].append(vacc)
        print(f"Epoch {ep}: train_loss={tloss:.4f}, val_loss={vloss:.4f}, val_acc={vacc:.4f}")
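        if vloss<best_val:
            # New best validation loss: reset patience and checkpoint the weights
            best_val=vloss; counter=0; torch.save(model.state_dict(),'best.pt')
        else:
            counter+=1
            if counter>=patience:
                print("Early stopping triggered.")
                break
    # Restore the best checkpoint before returning
    model.load_state_dict(torch.load('best.pt', map_location=DEVICE))
    return history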

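# Instantiate and train
# Note: the test set doubles as the validation set here, so early stopping
# and checkpoint selection are tuned on the same data used for the final score.
train_dl = DataLoader(train_ds, batch_size=128, shuffle=True)
val_dl   = DataLoader(test_ds,  batch_size=128)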
hist_ext={}
best={'model':None,'acc':0}
for cls in (CNN_A,CNN_B,CNN_C):
    mdl=cls().to(DEVICE)
    opt=optim.Adam(mdl.parameters(),lr=5e-4)
    h=train_es(mdl,train_dl,val_dl,nn.CrossEntropyLoss(),opt)
    _,acc = eval_model(mdl,test_loader,nn.CrossEntropyLoss())
    hist_ext[cls.__name__]=h
    print(f"{cls.__name__} final test acc={acc:.4f}")
    if acc>best['acc']: best.update({'model':cls.__name__,'acc':acc})
print(f"Best extended model: {best['model']} with acc={best['acc']:.4f}")
```
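
The loop above passes the test set as `val_dl`, so early stopping and checkpoint selection see the test data. A common alternative is to hold out part of the training set for validation. A minimal sketch of such a split in plain Python follows; the 10% fraction, the seed, and the `split_indices` helper are assumptions for illustration. The resulting index lists could be wrapped with `torch.utils.data.Subset(train_ds, indices)` to build the two loaders.

```python
import random

def split_indices(n_samples, val_fraction=0.1, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into train and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]

# KMNIST has 60,000 training images
train_idx, val_idx = split_indices(60_000, val_fraction=0.1)
print(len(train_idx), len(val_idx))
```

With a split like this, the test set would be evaluated only once, after the best checkpoint has been chosen on the held-out validation data.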

#### 6.3 Results & Analysis

| Model  | Stop Epoch | Test Acc |
| ------ | ---------- | -------- |
| CNN\_A | 18         | 95.68%   |
| CNN\_B | 23         | 84.02%   |
| CNN\_C | 8          | 93.61%   |

* **CNN\_A**: Best generalization (95.7%), stable loss, effective BN+Dropout.
* **CNN\_B**: Limited by missing normalization; underperforms.
* **CNN\_C**: Converges quickly with residual blocks (stops at epoch 8); strong performance (93.6%).

---

### 7. Final Comparative Analysis

| Model      | Test Acc  | Notes                               |
| ---------- | --------- | ----------------------------------- |
| Linear     | 68.8%     | Underfits                           |
| MLP40      | 94.8%     | Overfits; ignores spatial structure |
| SimpleCNN  | 94.6%     | Good baseline                       |
| **CNN\_A** | **95.7%** | Top performer with BN & Dropout     |
| CNN\_C     | 93.6%     | Fast convergence with residual blocks |
| CNN\_B     | 84.0%     | Lacks normalization; lower capacity |

**Conclusion:** `CNN_A` is recommended for KMNIST based on highest accuracy and robustness. Future steps: data augmentation, LR schedules, deeper residual stacks.
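
The "lower capacity" remark for `CNN_B` can be checked by counting learnable parameters from the architectures in Section 6.1. The sketch below is plain Python with hypothetical helper names; it assumes bias terms on every layer and counts only the learnable scale and shift of each BatchNorm layer, not its running statistics.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of nn.Linear(n_in, n_out)."""
    return n_in * n_out + n_out

def conv_params(c_in, c_out, k):
    """Weights plus biases of nn.Conv2d(c_in, c_out, k)."""
    return c_in * c_out * k * k + c_out

def bn_params(channels):
    """Learnable scale and shift of nn.BatchNorm2d (running statistics excluded)."""
    return 2 * channels

def res_block_params(c_in, c_out):
    """Two 3x3 convs plus a 1x1 skip conv when the channel count changes."""
    skip = conv_params(c_in, c_out, 1) if c_in != c_out else 0
    return conv_params(c_in, c_out, 3) + conv_params(c_out, c_out, 3) + skip

cnn_a = (conv_params(1, 32, 3) + bn_params(32)
         + conv_params(32, 64, 3) + bn_params(64)
         + linear_params(64 * 7 * 7, 128) + linear_params(128, 10))
cnn_b = (conv_params(1, 16, 5) + conv_params(16, 32, 5)
         + conv_params(32, 64, 3) + linear_params(64, 10))
cnn_c = (res_block_params(1, 16) + res_block_params(16, 32)
         + linear_params(32 * 7 * 7, 64) + linear_params(64, 10))
print(cnn_a, cnn_b, cnn_c)
```

Under these assumptions `CNN_B` has far fewer parameters than the other two, because global pooling removes the large fully connected layer. Parameter count alone does not explain the accuracy gap, but it is consistent with the remark in the table.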

---

### 8. Plots

```python
import matplotlib.pyplot as plt

# Plot extended histories
def plot_extended(hist,name):
    epochs=range(1,len(hist['train_loss'])+1)
    plt.figure(figsize=(12,4))
    plt.subplot(1,2,1)
    plt.plot(epochs,hist['train_loss'],label='Train Loss')
    plt.plot(epochs,hist['val_loss'],label='Val Loss')
    plt.title(f'{name} Loss'); plt.xlabel('Epoch'); plt.legend()
    plt.subplot(1,2,2)
    plt.plot(epochs,hist['val_acc'],label='Val Acc')
    plt.title(f'{name} Val Acc'); plt.xlabel('Epoch'); plt.legend()
    plt.tight_layout(); plt.show()

for name,h in hist_ext.items(): plot_extended(h,name)
```



### 9. Overall Conclusion & Future Improvements
After evaluating six models of increasing complexity, **CNN_A** emerges as the clear top performer, achieving **95.7%** test accuracy thanks to batch normalization, dropout regularization, and sufficient convolutional depth. Its stable training/validation curves and resistance to overfitting make it the recommended architecture for KMNIST.

**Why CNN_A Performs Best:**
- **BatchNorm** accelerates convergence and smooths the loss landscape.  
- **Dropout** combats overfitting, yielding tighter train–test performance.  
- **Balanced Depth & Width** offers enough capacity for complex patterns without excessive parameters.

**Room for Improvement:**
- **Data Augmentation:** introducing random rotations, shifts, or elastic distortions could further improve generalization.  
- **Learning-Rate Scheduling:** using cosine annealing or step decay may optimize convergence.  
- **Deeper Residual Networks:** stacking more residual blocks (extending CNN_C) may improve accuracy, at the cost of additional training time.  
- **Weight Decay & Regularization:** applying L2 penalty or label smoothing could refine generalization.
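
To make the learning-rate scheduling suggestion concrete, the sketch below evaluates the closed-form cosine annealing and step decay schedules in plain Python. The starting rate of 5e-4 matches the extended models; the 50-epoch horizon, `step_size=10`, and `gamma=0.1` are illustrative assumptions, and both function names are hypothetical. In PyTorch, the corresponding schedulers are `torch.optim.lr_scheduler.CosineAnnealingLR` and `StepLR`.

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-4, lr_min=0.0):
    """Cosine annealing: decay smoothly from lr_max to lr_min over total_steps."""
    cos_term = 1 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

def step_lr(epoch, lr0=5e-4, step_size=10, gamma=0.1):
    """Step decay: multiply the rate by gamma every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)

for epoch in (0, 25, 50):
    print(epoch, cosine_lr(epoch, 50), step_lr(epoch))
```

Cosine annealing lowers the rate gradually, while step decay holds it constant and then drops it abruptly; which works better for these models would need to be tested.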

---

*End of Assignment Report.*
