# Part 5:  Vibe-Coding Emerging Optimizers on CIFAR Classification

As large language models (LLMs) reshape software engineering, we are entering a new era of AI-assisted programming where human creativity and machine execution merge. Instead of focusing on every low-level implementation detail, developers can prioritize **ideas, exploration, and iteration speed**, while the model handles much of the routine coding. This is the essence of [**vibe coding**](https://en.wikipedia.org/wiki/Vibe_coding?utm_source=chatgpt.com): coding by “vibes” and letting AI accelerate the path from concept to working system. In this homework, you will practice vibe coding by using AI assistants such as [ChatGPT](https://chat.openai.com), [Gemini](https://gemini.google.com/app), or [Claude](https://www.anthropic.com/claude-code) to help design and implement optimizers for CIFAR image classification.  

---

## 1. What is *Vibe Coding*?

**Vibe coding** is a paradigm popularized by Andrej Karpathy in 2025. It emphasizes generating most of the code through LLMs while the human programmer acts as a **guide, tester, and refiner**. Instead of carefully reviewing every line, you iterate based on execution results, keeping the creative process at the center.  

Karpathy put it simply:  
> *“Fully giving in to the vibes, embracing exponentials, and forgetting that the code even exists.”*  

**Learn more :**
- [IBM: What is Vibe Coding?](https://www.ibm.com/think/topics/vibe-coding?utm_source=chatgpt.com)  
- [Replit Blog: What is Vibe Coding?](https://blog.replit.com/what-is-vibe-coding?utm_source=chatgpt.com)  

---

## 2. Task Overview

You will implement and compare four optimizers on **CIFAR-10 classification** with two architectures: a **Transformer**  and a **ResNet** .  

- **Optimizers**: Muon , Scion , Dion , Adam (baseline)  
- **Models**: Transformer, ResNet  
- **Comparison metrics**: convergence speed, final test accuracy , training stability  

---

## 3. Optimizers to Implement

### (1) Muon
Muon applies **Newton–Schulz orthonormalization** to gradient updates of 2D weight matrices, making them invariant to input conditioning.  
References:  
- [Muon Blog (Keller Jordan)](https://kellerjordan.github.io/posts/muon/?utm_source=chatgpt.com)  
- [Deriving Muon (Jeremy Bernstein)](https://jeremybernste.in/writing/deriving-muon?utm_source=chatgpt.com)  
- [Muon GitHub Repo](https://github.com/KellerJordan/Muon?utm_source=chatgpt.com)  
- [Convergence Bound (arXiv)](https://arxiv.org/abs/2507.01598?utm_source=chatgpt.com)  

---

### (2) Scion
Scion constrains updates differently for hidden vs input/output layers, using **spectral norm** for hidden layers and **ℓ∞ norm** for others. This improves stability and hyperparameter transfer.  
References:  
- [Scion Paper](https://arxiv.org/abs/2502.07529)  
- [Scion Official Code](https://github.com/LIONS-EPFL/scion)  

---

### (3) Dion
Dion extends Muon-like orthonormal updates to **distributed training**. It reduces communication overhead while preserving synchronous semantics, making it efficient at large scale.  
References:  
- [Microsoft Research Blog](https://www.microsoft.com/en-us/research/blog/dion-the-distributed-orthonormal-update-revolution-is-here/?utm_source=chatgpt.com)  
- [Dion Paper (arXiv)](https://arxiv.org/html/2504.05295v1?utm_source=chatgpt.com)  
- [Dion GitHub Repo](https://github.com/microsoft/dion?utm_source=chatgpt.com)  

---

### (4) Adam
Adam is the standard baseline optimizer combining momentum and adaptive learning rates. Use either `Adam` or `AdamW` from PyTorch.

---

## 4. Steps & Deliverables

1. **Model Implementation (20 pts)**  
   - Build a Transformer for CIFAR classification.  
   - Build a ResNet (ResNet-18 or similar).  

2. **Optimizer Integration (30 pts)**  
   - Implement Muon, Scion, Dion optimizers using your AI assistant in a form which is compatible with `torch.optimizer`.  
   - Use Adam as baseline.  

3. **Training & Evaluation (30 pts)**  
   - Train both models with all optimizers.  
   - Collect metrics: training loss, validation accuracy, time-to-accuracy.  
   - Present results with plots and a summary table.  

4. **Discussion & Reflection (20 pts)**  
   - Compare optimizers in terms of convergence speed, stability, and accuracy.  
   - Reflect on your experience using **vibe coding** with AI assistants.  
   - What worked well? What challenges did you face?  

---

## 5. Objectives

- Understand and implement **novel optimizers** (Muon, Scion, Dion).  
- Practice **vibe coding** as a workflow with LLMs.  
- Compare optimizer performance on **CIFAR-10** across Transformer and ResNet architectures.  
- Analyze results critically and reflect on the coding process.  

---

Good luck and enjoy vibe-coding your way through optimizers!!


In [3]:
#### Coding starts here.......
import copy
import torch
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
import numpy as np

import torch.nn as nn
import torch.nn.functional as F

from torchvision.datasets import CIFAR10

from torchvision.transforms import v2 as T, transforms

from tqdm import tqdm
from torch.utils.data import DataLoader, Dataset, random_split
device = 'cuda' if torch.cuda.is_available() else 'cpu'

import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split, Subset
from torchvision.models import resnet18, vit_b_16

In [4]:
"""
cifar10_models.py

Two PyTorch model wrappers for the assignment:
 - ResNetClassifier: wrapper around torchvision.models.resnet18
 - ViTClassifier: wrapper around torchvision.models.vit_b_16 (fallback-safe adjustments)

IMPORTANT: Per your instruction, this file DOES NOT define or instantiate any optimizer.
Plug in your professor's custom optimizer/algorithm when training.

Usage:
    from cifar10_models import ResNetClassifier, ViTClassifier
    model = ResNetClassifier(num_classes=10, pretrained=False)
    vit = ViTClassifier(num_classes=10, pretrained=False)

"""

from typing import Optional
import torch
import torch.nn as nn
import torchvision.models as models


class ResNetClassifier(nn.Module):
    """Wrapper for torchvision.models.resnet18 with custom output classes.

    Args:
        num_classes: number of output classes (CIFAR-10 -> 10)
        pretrained: whether to load ImageNet pretrained weights
        in_channels: number of input channels (default 3). If different, a Conv layer is
            prepended to adapt channels.
    """

    def __init__(self, num_classes: int = 10, pretrained: bool = False, in_channels: int = 3):
        super().__init__()
        # load base resnet18
        self.model = models.resnet18(pretrained=pretrained)

        # adapt input channels if necessary
        if in_channels != 3:
            # replace the first conv layer to handle different input channel count
            orig_conv = self.model.conv1
            self.model.conv1 = nn.Conv2d(
                in_channels,
                orig_conv.out_channels,
                kernel_size=orig_conv.kernel_size,
                stride=orig_conv.stride,
                padding=orig_conv.padding,
                bias=orig_conv.bias is not None,
            )

        # replace the final fully-connected layer
        in_features = self.model.fc.in_features
        self.model.fc = nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Input x shape: (B, C, H, W). For CIFAR-10, H=W=32 normally; if using pretrained weights
        and larger resolution, consider upsampling or using appropriate transforms.
        """
        return self.model(x)


class ViTClassifier(nn.Module):
    """Wrapper for a Vision Transformer from torchvision with adjustable head.

    This class tries to be robust to different torchvision versions by checking common
    attribute names for the classification head and replacing them with a new Linear.

    Args:
        num_classes: number of output classes
        pretrained: whether to load ImageNet pretrained weights
        image_size: expected input image size (used only for doc; transforms should match).
    """

    def __init__(self, num_classes: int = 10, pretrained: bool = False, image_size: int = 224):
        super().__init__()
        # create ViT from torchvision; function name depends on torchvision version
        # We use vit_b_16 (base patch 16). If unavailable in your torchvision version, swap in
        # the appropriate import or use timm.
        try:
            self.model = models.vit_b_16(pretrained=pretrained)
        except Exception as e:
            # Provide a helpful error message if user's environment doesn't have vit_b_16
            raise RuntimeError(
                "Could not load torchvision.models.vit_b_16. Make sure torchvision >= 0.13 "
                "or use an alternative ViT provider (e.g., timm). Original error: {}".format(e)
            )

        # Now patch the classification head depending on attribute names
        replaced = False
        # Common structure: model.heads.head (torchvision >= 0.13+)
        if hasattr(self.model, "heads") and hasattr(self.model.heads, "head"):
            in_features = self.model.heads.head.in_features
            self.model.heads.head = nn.Linear(in_features, num_classes)
            replaced = True

        # Some versions may have model.classifier or model.head
        if not replaced:
            if hasattr(self.model, "classifier") and isinstance(self.model.classifier, nn.Linear):
                in_features = self.model.classifier.in_features
                self.model.classifier = nn.Linear(in_features, num_classes)
                replaced = True

        if not replaced and hasattr(self.model, "head") and isinstance(self.model.head, nn.Linear):
            in_features = self.model.head.in_features
            self.model.head = nn.Linear(in_features, num_classes)
            replaced = True

        if not replaced:
            # Last resort: attach a new attribute `head` and rely on forward passthrough.
            # This may require editing the model.forward in extreme cases; raise a helpful error
            raise RuntimeError(
                "Unable to locate a replaceable classification head on the loaded ViT model."
                " Please inspect the model structure or provide a torchvision version that supports vit_b_16."
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Input x shape: (B, C, H, W). For torchvision ViT default pretrained weights, use H=W=224.
        For CIFAR-10 (32x32) you may need to upsample in transforms or adapt patch embedding.
        """
        return self.model(x)

In [11]:
import tqdm
"""
CIFAR-10 training setup with custom Muon optimizer.

This script:
  - Loads CIFAR-10 DataLoaders (same configuration as before)
  - Defines a custom Muon optimizer compatible with torch.optim
  - Demonstrates how to train a model using it (you can plug in ResNetClassifier or ViTClassifier)
"""

import torch
from torch import nn
from torch.optim import Optimizer, Adam
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
import torchvision.datasets as datasets

# ==============================
# Custom Muon Optimizer
# ==============================
class Muon(Optimizer):
    """Muon optimizer (a momentum-based optimizer with normalized updates)."""

    def __init__(self, params, lr=0.01, momentum=0.9, weight_decay=0.0):
        if lr <= 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if momentum < 0.0:
            raise ValueError(f"Invalid momentum value: {momentum}")

        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay)
        super(Muon, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            momentum = group['momentum']
            weight_decay = group['weight_decay']

            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad

                # Apply weight decay
                if weight_decay != 0:
                    grad = grad.add(p, alpha=weight_decay)

                # Get state variables
                state = self.state[p]
                if len(state) == 0:
                    state['velocity'] = torch.zeros_like(p)

                velocity = state['velocity']

                # Momentum update
                velocity.mul_(momentum).add_(grad)

                # Normalized update (Muon-specific idea)
                norm_v = torch.norm(velocity)
                norm_g = torch.norm(grad)
                if norm_v > 0 and norm_g > 0:
                    velocity.mul_(norm_g / norm_v)

                p.add_(velocity, alpha=-lr)

        return loss

In [12]:
class Scion(Optimizer):
    """
    Implements Scion optimizer.

    This is a conceptual optimizer based on the user's 'Muon' example.

    The core idea of this version is to normalize the momentum-based
    velocity vector to a unit vector, so the learning rate
    directly controls the magnitude of the step in that direction.
    """

    def __init__(self, params, lr=0.01, momentum=0.9, weight_decay=0.0, eps=1e-8):
        if lr <= 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if momentum < 0.0:
            raise ValueError(f"Invalid momentum value: {momentum}")
        if eps < 0.0:
            raise ValueError(f"Invalid epsilon value: {eps}")

        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay, eps=eps)
        super(Scion, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            momentum = group['momentum']
            weight_decay = group['weight_decay']
            eps = group['eps']

            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad

                # Apply weight decay
                if weight_decay != 0:
                    grad = grad.add(p, alpha=weight_decay)

                # Get state variables
                state = self.state[p]
                if len(state) == 0:
                    state['velocity'] = torch.zeros_like(p)

                velocity = state['velocity']

                # Momentum update
                velocity.mul_(momentum).add_(grad)

                # Normalized update (Scion-specific idea)
                # We normalize the velocity vector to a unit vector.
                # The learning rate then dictates the step size.

                # addcdiv_ is an efficient way to do: p + alpha * (tensor1 / tensor2)
                # Here: p = p + (-lr) * (velocity / (norm(velocity) + eps))
                norm_v = torch.norm(velocity)

                if norm_v > 0:
                    p.addcdiv_(velocity, norm_v.add(eps), alpha=-lr)

        return loss


In [13]:
class Dion(Optimizer):
    """
    Implements Dion optimizer.

    This is a conceptual optimizer based on the user's 'Muon' example.

    The core idea of this version is to scale the momentum-based
    velocity by an exponential moving average of the *squared gradient norm*,
    similar to RMSprop or Adam, but applied to the entire tensor's norm
    rather than per-parameter.
    """

    def __init__(self, params, lr=0.01, momentum=0.9, weight_decay=0.0, beta2=0.999, eps=1e-8):
        if lr <= 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if momentum < 0.0:
            raise ValueError(f"Invalid momentum value: {momentum}")
        if beta2 < 0.0 or beta2 >= 1.0:
            raise ValueError(f"Invalid beta2 value: {beta2}")
        if eps < 0.0:
            raise ValueError(f"Invalid epsilon value: {eps}")

        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay, beta2=beta2, eps=eps)
        super(Dion, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            momentum = group['momentum']
            weight_decay = group['weight_decay']
            beta2 = group['beta2']
            eps = group['eps']

            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad

                # Apply weight decay
                if weight_decay != 0:
                    grad = grad.add(p, alpha=weight_decay)

                # Get state variables
                state = self.state[p]
                if len(state) == 0:
                    state['velocity'] = torch.zeros_like(p)
                    # exp_avg_sq_norm is a scalar (tensor(0.))
                    state['exp_avg_sq_norm'] = torch.zeros((), device=p.device)

                velocity = state['velocity']
                exp_avg_sq_norm = state['exp_avg_sq_norm']

                # Momentum update
                velocity.mul_(momentum).add_(grad)

                # Get L2 norm of the current gradient, then square it
                grad_norm_sq = torch.norm(grad).pow(2)

                # Update exponential moving average of squared grad norms
                exp_avg_sq_norm.mul_(beta2).add_(grad_norm_sq, alpha=1 - beta2)

                # Normalized update (Dion-specific idea)
                # We scale the velocity by the root-mean-square of past grad norms

                # Denominator is sqrt(exp_avg_sq_norm) + eps
                denom = exp_avg_sq_norm.sqrt().add(eps)

                # p = p + (-lr) * (velocity / denom)
                p.addcdiv_(velocity, denom, alpha=-lr)

        return loss


In [14]:
# ==============================
# Data setup (same as before)
# ==============================
BATCH_SIZE = 128
NUM_WORKERS = 4
DATA_DIR = './data'

CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

train_dataset = datasets.CIFAR10(root=DATA_DIR, train=True, download=True, transform=train_transform)
test_dataset = datasets.CIFAR10(root=DATA_DIR, train=False, download=True, transform=test_transform)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda




In [16]:
# ==============================
# Example training function
# ==============================
def train_model(model, train_loader, test_loader, epochs=10, lr=0.01, algo=None):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    if algo is None or algo == "Adam":
      optimizer = Adam(model.parameters(), lr=lr)
    elif algo == "Muon":
      optimizer = Muon(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    elif algo == "Scion":
      optimizer = Scion(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    elif algo == "Dion":
      optimizer = Dion(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    else:
      raise ValueError(f"Invalid algorithm: {algo}, please choose: Adam, Muon, Scion or Dion")

    for epoch in tqdm.tqdm(range(epochs)):
        model.train()
        total_loss = 0
        correct = 0
        total = 0

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item() * labels.size(0)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

        train_acc = 100. * correct / total
        avg_loss = total_loss / total

        print(f"Epoch [{epoch+1}/{epochs}] | Loss: {avg_loss:.4f} | Train Acc: {train_acc:.2f}%")

    print("Training completed.")

# ==============================
# Example usage (in notebook)
# ==============================
model = ResNetClassifier(num_classes=10, pretrained=False)
train_model(model, train_loader, test_loader, epochs=10, lr=0.01)


 10%|█         | 1/10 [00:23<03:28, 23.22s/it]

Epoch [1/10] | Loss: 1.8474 | Train Acc: 33.38%


 20%|██        | 2/10 [00:46<03:08, 23.53s/it]

Epoch [2/10] | Loss: 1.4188 | Train Acc: 48.52%


 30%|███       | 3/10 [01:09<02:42, 23.26s/it]

Epoch [3/10] | Loss: 1.1855 | Train Acc: 57.63%


 40%|████      | 4/10 [01:32<02:16, 22.81s/it]

Epoch [4/10] | Loss: 1.0453 | Train Acc: 63.05%


 50%|█████     | 5/10 [01:54<01:53, 22.79s/it]

Epoch [5/10] | Loss: 0.9476 | Train Acc: 66.56%


 60%|██████    | 6/10 [02:17<01:31, 22.81s/it]

Epoch [6/10] | Loss: 0.8717 | Train Acc: 69.25%


 70%|███████   | 7/10 [02:40<01:08, 22.83s/it]

Epoch [7/10] | Loss: 0.8085 | Train Acc: 71.83%


 80%|████████  | 8/10 [03:02<00:45, 22.69s/it]

Epoch [8/10] | Loss: 0.7677 | Train Acc: 73.20%


 90%|█████████ | 9/10 [03:25<00:22, 22.69s/it]

Epoch [9/10] | Loss: 0.7187 | Train Acc: 75.16%


100%|██████████| 10/10 [03:48<00:00, 22.84s/it]

Epoch [10/10] | Loss: 0.6855 | Train Acc: 76.32%
Training completed.





In [19]:
model = ResNetClassifier(num_classes=10, pretrained=False)
train_model(model, train_loader, test_loader, epochs=10, lr=0.01, algo="Muon")

 10%|█         | 1/10 [00:27<04:07, 27.51s/it]

Epoch [1/10] | Loss: 1.7193 | Train Acc: 36.78%


 20%|██        | 2/10 [00:55<03:41, 27.63s/it]

Epoch [2/10] | Loss: 1.4305 | Train Acc: 47.77%


 30%|███       | 3/10 [01:23<03:14, 27.80s/it]

Epoch [3/10] | Loss: 1.2987 | Train Acc: 53.23%


 40%|████      | 4/10 [01:50<02:44, 27.43s/it]

Epoch [4/10] | Loss: 1.1957 | Train Acc: 57.04%


 50%|█████     | 5/10 [02:17<02:17, 27.49s/it]

Epoch [5/10] | Loss: 1.1171 | Train Acc: 59.82%


 60%|██████    | 6/10 [02:45<01:50, 27.59s/it]

Epoch [6/10] | Loss: 1.0493 | Train Acc: 62.48%


 70%|███████   | 7/10 [03:12<01:22, 27.35s/it]

Epoch [7/10] | Loss: 0.9972 | Train Acc: 64.37%


 80%|████████  | 8/10 [03:39<00:54, 27.20s/it]

Epoch [8/10] | Loss: 0.9481 | Train Acc: 66.11%


 90%|█████████ | 9/10 [04:06<00:27, 27.24s/it]

Epoch [9/10] | Loss: 0.9099 | Train Acc: 67.30%


100%|██████████| 10/10 [04:33<00:00, 27.37s/it]

Epoch [10/10] | Loss: 0.8768 | Train Acc: 68.86%
Training completed.





In [20]:
model = ResNetClassifier(num_classes=10, pretrained=False)
train_model(model, train_loader, test_loader, epochs=10, lr=0.01, algo="Scion")

  0%|          | 0/10 [00:00<?, ?it/s]


TypeError: addcdiv_() got an unexpected keyword argument 'alpha'