# Knowledge Distillation for EfficientNetV2 on CIFAR-100 - Version 4 (Grand Finale)

This notebook implements the **"Best of Both Worlds"** approach:

- **Best Teacher:** Teacher v3 (76.02% accuracy)
- **Best Distillation:** Standard KD (v2 pipeline with 98% retention rate)

---

## Version History:

- **v1 (Baseline):** Standard KD + CutMix/Mixup → 72.34% accuracy
- **v2 (Enhanced):** + AutoAugment + Label Smoothing + LR Warmup → 74.20% accuracy (98% retention)
- **v3 (DKD β=8.0):** Decoupled KD → 72.69% (over-regularized)
- **v3.1 (DKD β=2.0):** Tuned DKD → 73.98% (97.3% retention)
- **v4 (Grand Finale):** Strong Teacher v3 + Standard KD → Target: **74.50%+**

---

## Key Strategy in v4:

1. **Use Teacher v3** (76.02%) - the strongest teacher we have
2. **Revert to Standard KD** (not DKD) - proven 98% retention rate with augmentation
3. **Keep v2 augmentation pipeline** - AutoAugment + Mixup/CutMix + Label Smoothing

**Prediction:** 76.02% × 0.98 ≈ **74.50%** (new record!)

---


In [1]:
# Cell 1: Setup and Imports
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision.models import efficientnet_v2_s, efficientnet_v2_l, EfficientNet_V2_S_Weights, EfficientNet_V2_L_Weights
from tqdm import tqdm
import copy
import numpy as np
import matplotlib.pyplot as plt
import glob
import random
from pathlib import Path

# Setup Directories (Local paths)
PROJECT_ROOT = Path('./outputs')
MODEL_DIR = PROJECT_ROOT / 'models'
DATA_DIR = PROJECT_ROOT / 'data'
CHECKPOINT_DIR = PROJECT_ROOT / 'checkpoints'

MODEL_DIR.mkdir(parents=True, exist_ok=True)
DATA_DIR.mkdir(parents=True, exist_ok=True)
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Directories ready at: {PROJECT_ROOT.absolute()}")

# Reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Directories ready at: d:\Projects\MasterProject\code\outputs
Using device: cuda
GPU: NVIDIA GeForce RTX 5070 Laptop GPU


In [2]:
# Cell 2: Experiment Configuration (Version 4 - Grand Finale)
# ==========================================
# HYPERPARAMETERS
# ==========================================
NUM_EPOCHS = 200
TEACHER_EPOCHS = 300
BATCH_SIZE = 128
LEARNING_RATE = 0.001
WEIGHT_DECAY = 0.05
PATIENCE = 30
WARMUP_EPOCHS = 5           # LR warmup

# Distillation Params (Standard KD - v2 settings, NOT DKD)
KD_ALPHA = 0.7              # Soft loss weight (v2 setting)
TEMPERATURE = 4.0           # Softmax Temperature
LABEL_SMOOTHING = 0.1       # Label smoothing for hard loss

# Augmentation Params (same as v2)
MIXUP_ALPHA = 0.8
CUTMIX_ALPHA = 1.0
CHECKPOINT_FREQUENCY = 20
NUM_CLASSES = 100

# Version 4 Model Names
TEACHER_NAME = "teacher_v3"              # Use the STRONG teacher (76.02%)
STUDENT_NAME = "student_v4_standard_kd"  # New student with Standard KD

# Flag to use Standard KD instead of DKD
USE_DKD = False  # KEY CHANGE: Disable DKD, use Standard KD

print(f"{'='*60}")
print(f"VERSION 4 CONFIG - Grand Finale (Best Teacher + Best KD)")
print(f"{'='*60}")
print(f"Strategy: Strong Teacher v3 + Standard KD (v2 pipeline)")
print(f"Teacher: {TEACHER_NAME} (Accuracy: 76.02%)")
print(f"Training: Epochs={NUM_EPOCHS} | Batch={BATCH_SIZE} | LR={LEARNING_RATE}")
print(f"Standard KD: Alpha={KD_ALPHA}, Temp={TEMPERATURE} (NOT DKD)")
print(f"Augmentation: AutoAugment + RandomErasing + CutMix/Mixup")
print(f"Student Model: {STUDENT_NAME}")
print(f"Expected: 76.02% × 0.98 ≈ 74.50%")
print(f"{'='*60}")

VERSION 4 CONFIG - Grand Finale (Best Teacher + Best KD)
Strategy: Strong Teacher v3 + Standard KD (v2 pipeline)
Teacher: teacher_v3 (Accuracy: 76.02%)
Training: Epochs=200 | Batch=128 | LR=0.001
Standard KD: Alpha=0.7, Temp=4.0 (NOT DKD)
Augmentation: AutoAugment + RandomErasing + CutMix/Mixup
Student Model: student_v4_standard_kd
Expected: 76.02% × 0.98 ≈ 74.50%


In [3]:
# Cell 3: Data Loading (Enhanced with AutoAugment + RandomErasing)
from torchvision.transforms import autoaugment

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    autoaugment.AutoAugment(policy=autoaugment.AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.2)),  # Cutout-like augmentation
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
])

trainset = torchvision.datasets.CIFAR100(root=str(DATA_DIR), train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True, 
                                          num_workers=4, pin_memory=True, drop_last=True)

testset = torchvision.datasets.CIFAR100(root=str(DATA_DIR), train=False, download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=BATCH_SIZE, shuffle=False, 
                                         num_workers=4, pin_memory=True)

print(f"Data loaded: {len(trainset)} Training, {len(testset)} Test images")
print(f"  Augmentation: AutoAugment + RandomErasing + CutMix/Mixup")

Data loaded: 50000 Training, 10000 Test images
  Augmentation: AutoAugment + RandomErasing + CutMix/Mixup


In [4]:
# Cell 4: Helper Functions (DKD Loss + Augmentations + Utilities)

# ==========================================
# 1. Decoupled Knowledge Distillation (DKD) Loss
# Based on: Zhao et al. "Decoupled Knowledge Distillation", CVPR 2022
# https://openaccess.thecvf.com/content/CVPR2022/papers/Zhao_Decoupled_Knowledge_Distillation_CVPR_2022_paper.pdf
# ==========================================

def _get_gt_mask(logits, target):
    """Get mask for target (ground truth) class"""
    target = target.reshape(-1)
    mask = torch.zeros_like(logits).scatter_(1, target.unsqueeze(1), 1).bool()
    return mask

def _get_other_mask(logits, target):
    """Get mask for non-target classes"""
    target = target.reshape(-1)
    mask = torch.ones_like(logits).scatter_(1, target.unsqueeze(1), 0).bool()
    return mask

def cat_mask(t, mask1, mask2):
    """Concatenate masked values"""
    t1 = (t * mask1).sum(dim=1, keepdims=True)
    t2 = (t * mask2).sum(1, keepdims=True)
    rt = torch.cat([t1, t2], dim=1)
    return rt

def dkd_loss(student_logits, teacher_logits, target, alpha=1.0, beta=8.0, temperature=4.0):
    """
    Decoupled Knowledge Distillation (DKD) Loss.
    
    Separates knowledge into:
    - TCKD (Target Class Knowledge Distillation): Knowledge about the correct class
    - NCKD (Non-Target Class Knowledge Distillation): Knowledge about class relationships
    
    Args:
        student_logits: Student model outputs
        teacher_logits: Teacher model outputs
        target: Ground truth labels
        alpha: Weight for TCKD (default: 1.0)
        beta: Weight for NCKD (default: 8.0, crucial for performance)
        temperature: Softmax temperature (default: 4.0)
    
    Returns:
        Combined DKD loss
    """
    # Get masks for target and non-target classes
    gt_mask = _get_gt_mask(student_logits, target)
    other_mask = _get_other_mask(student_logits, target)
    
    # Calculate probabilities with temperature scaling
    pred_student = F.softmax(student_logits / temperature, dim=1)
    pred_teacher = F.softmax(teacher_logits / temperature, dim=1)
    
    # Separate target and non-target probabilities
    pred_student_cat = cat_mask(pred_student, gt_mask, other_mask)
    pred_teacher_cat = cat_mask(pred_teacher, gt_mask, other_mask)
    
    # Target Class Knowledge Distillation (TCKD)
    log_pred_student = torch.log(pred_student_cat + 1e-8)
    tckd_loss = (
        F.kl_div(log_pred_student, pred_teacher_cat, reduction='batchmean')
        * (temperature ** 2)
    )
    
    # Non-Target Class Knowledge Distillation (NCKD)
    # Mask out target class with large negative value
    pred_teacher_part2 = F.softmax(
        teacher_logits / temperature - 1000.0 * gt_mask, dim=1
    )
    log_pred_student_part2 = F.log_softmax(
        student_logits / temperature - 1000.0 * gt_mask, dim=1
    )
    
    nckd_loss = (
        F.kl_div(log_pred_student_part2, pred_teacher_part2, reduction='batchmean')
        * (temperature ** 2)
    )
    
    # Combined DKD loss
    return alpha * tckd_loss + beta * nckd_loss

def dkd_loss_mixup(student_logits, teacher_logits, labels_a, labels_b, lam,
                   alpha=1.0, beta=8.0, temperature=4.0, label_smoothing=0.1):
    """
    DKD Loss for mixed samples (CutMix/Mixup compatible)
    Uses weighted combination of DKD losses for both label sets
    """
    # DKD for both label sets
    dkd_a = dkd_loss(student_logits, teacher_logits, labels_a, alpha, beta, temperature)
    dkd_b = dkd_loss(student_logits, teacher_logits, labels_b, alpha, beta, temperature)
    
    # Weighted DKD loss
    soft_loss = lam * dkd_a + (1 - lam) * dkd_b
    
    # Hard loss with label smoothing
    hard_loss = lam * F.cross_entropy(student_logits, labels_a, label_smoothing=label_smoothing) + \
                (1 - lam) * F.cross_entropy(student_logits, labels_b, label_smoothing=label_smoothing)
    
    # Combined loss (DKD already weighted by alpha/beta, so we just add hard loss)
    # Using 0.1 weight for hard loss to not overpower DKD
    loss = soft_loss + 0.1 * hard_loss
    
    # NaN safety check
    if torch.isnan(loss):
        return hard_loss
    
    return loss

# ==========================================
# 2. Standard KD Loss (kept for comparison)
# ==========================================
def kd_loss(student_logits, teacher_logits, labels, temp=4.0, alpha=0.7, label_smoothing=0.1):
    """Standard Knowledge Distillation Loss (for comparison)"""
    soft_student = F.log_softmax(student_logits / temp, dim=1)
    soft_teacher = F.softmax(teacher_logits / temp, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (temp ** 2)
    hard_loss = F.cross_entropy(student_logits, labels, label_smoothing=label_smoothing)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# ==========================================
# 3. Augmentations: Mixup & CutMix
# ==========================================
def mixup_data(x, y, alpha=1.0):
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1
    batch_size = x.size()[0]
    index = torch.randperm(batch_size).to(device)
    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

def rand_bbox(size, lam):
    W = size[2]
    H = size[3]
    cut_rat = np.sqrt(1. - lam)
    cut_w = int(W * cut_rat)
    cut_h = int(H * cut_rat)
    cx = np.random.randint(W)
    cy = np.random.randint(H)
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)
    bby2 = np.clip(cy + cut_h // 2, 0, H)
    return bbx1, bby1, bbx2, bby2

def cutmix_data(x, y, alpha=1.0):
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1
    batch_size = x.size()[0]
    index = torch.randperm(batch_size).to(device)
    bbx1, bby1, bbx2, bby2 = rand_bbox(x.size(), lam)
    x[:, :, bbx1:bbx2, bby1:bby2] = x[index, :, bbx1:bbx2, bby1:bby2]
    lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (x.size()[-1] * x.size()[-2]))
    y_a, y_b = y, y[index]
    return x, y_a, y_b, lam

# ==========================================
# 4. Utilities (Save/Load/Evaluate)
# ==========================================
def evaluate_model_with_loss(model, dataloader, criterion):
    model.eval()
    correct = 0
    total = 0
    running_loss = 0.0
    with torch.no_grad():
        for data in dataloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    acc = 100 * correct / total
    avg_loss = running_loss / len(dataloader)
    return acc, avg_loss

def save_checkpoint(model, optimizer, scheduler, epoch, best_acc, history, model_name, epochs_no_improve):
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'best_acc': best_acc,
        'history': history,
        'epochs_no_improve': epochs_no_improve
    }
    path = CHECKPOINT_DIR / f"{model_name}_epoch{epoch+1}.pth"
    torch.save(checkpoint, path)
    print(f"  Checkpoint saved: {path}")
    
def load_checkpoint(model, optimizer, scheduler, model_name):
    checkpoints = sorted(glob.glob(str(CHECKPOINT_DIR / f"{model_name}_epoch*.pth")))
    if not checkpoints:
        return None
    latest = checkpoints[-1]
    print(f"  Loading checkpoint: {latest}")
    checkpoint = torch.load(latest, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    return checkpoint

def cleanup_old_checkpoints(model_name, keep=3):
    checkpoints = sorted(glob.glob(str(CHECKPOINT_DIR / f"{model_name}_epoch*.pth")))
    if len(checkpoints) > keep:
        for chk in checkpoints[:-keep]:
            os.remove(chk)
            print(f"  Cleaned up: {os.path.basename(chk)}")

print("Helper functions loaded:")
print("  - DKD Loss (Decoupled Knowledge Distillation)")
print("  - Standard KD Loss (for comparison)")
print("  - CutMix, Mixup augmentations")
print("  - Checkpoint utilities")

Helper functions loaded:
  - DKD Loss (Decoupled Knowledge Distillation)
  - Standard KD Loss (for comparison)
  - CutMix, Mixup augmentations
  - Checkpoint utilities


In [5]:
# Cell 5: Training Loop (Version 3 - with DKD support)
def train_model_v3(model, dataloader, optimizer, scheduler, num_epochs, model_name, 
                   teacher_model=None, use_dkd=True,
                   dkd_alpha=1.0, dkd_beta=8.0, temperature=4.0, label_smoothing=0.1,
                   patience=30, grad_clip=1.0, warmup_epochs=5):
    """
    Version 3 Training loop with:
    - Decoupled Knowledge Distillation (DKD) loss
    - CutMix/Mixup augmentation
    - Learning Rate Warmup
    - Mixed precision training
    - Gradient clipping
    - NaN detection and recovery
    """
    
    # 1. Load Checkpoint
    checkpoint = load_checkpoint(model, optimizer, scheduler, model_name)
    if checkpoint:
        start_epoch = checkpoint['epoch'] + 1
        best_acc = checkpoint['best_acc']
        history = checkpoint['history']
        epochs_no_improve = checkpoint['epochs_no_improve']
        best_model_wts = copy.deepcopy(model.state_dict())
        print(f"  Resuming from epoch {start_epoch}, Best Acc: {best_acc:.2f}%")
    else:
        start_epoch = 0
        best_acc = 0.0
        history = {'train_loss': [], 'val_loss': [], 'val_accuracy': []}
        epochs_no_improve = 0
        best_model_wts = copy.deepcopy(model.state_dict())
        print(f"  Starting fresh training...")

    # 2. Setup
    scaler = torch.amp.GradScaler('cuda', enabled=torch.cuda.is_available())
    val_criterion = nn.CrossEntropyLoss()
    
    # Store base LR for warmup
    base_lr = optimizer.param_groups[0]['lr']
    
    if teacher_model:
        teacher_model.eval()
        for param in teacher_model.parameters():
            param.requires_grad = False
        loss_type = "DKD" if use_dkd else "Standard KD"
        print(f"  Using {loss_type} loss with Teacher guidance")

    # 3. Training Loop
    for epoch in range(start_epoch, num_epochs):
        model.train()
        running_loss = 0.0
        valid_batches = 0
        
        # Learning Rate Warmup
        if epoch < warmup_epochs:
            warmup_lr = base_lr * (epoch + 1) / warmup_epochs
            for param_group in optimizer.param_groups:
                param_group['lr'] = warmup_lr
        
        loop = tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
        
        for inputs, labels in loop:
            inputs, labels = inputs.to(device), labels.to(device)
            
            # Randomly choose Mixup (50%) or CutMix (50%)
            use_cutmix = np.random.rand() > 0.5
            if use_cutmix:
                inputs_aug, labels_a, labels_b, lam = cutmix_data(inputs.clone(), labels, alpha=CUTMIX_ALPHA)
            else:
                inputs_aug, labels_a, labels_b, lam = mixup_data(inputs, labels, alpha=MIXUP_ALPHA)
            
            optimizer.zero_grad()
            
            with torch.amp.autocast('cuda', enabled=torch.cuda.is_available()):
                # Student Forward
                student_outputs = model(inputs_aug)
                
                # Loss Calculation
                if teacher_model:
                    # Teacher Forward (no grad)
                    with torch.no_grad():
                        teacher_outputs = teacher_model(inputs_aug)
                    
                    if use_dkd:
                        # DKD Loss (Version 3)
                        loss = dkd_loss_mixup(
                            student_outputs, teacher_outputs, 
                            labels_a, labels_b, lam,
                            alpha=dkd_alpha, beta=dkd_beta, 
                            temperature=temperature, label_smoothing=label_smoothing
                        )
                    else:
                        # Standard KD Loss (for comparison)
                        loss = lam * kd_loss(student_outputs, teacher_outputs, labels_a, temperature, 0.7, label_smoothing) + \
                               (1 - lam) * kd_loss(student_outputs, teacher_outputs, labels_b, temperature, 0.7, label_smoothing)
                else:
                    # Standard CE with Label Smoothing for Teacher training
                    loss = lam * F.cross_entropy(student_outputs, labels_a, label_smoothing=label_smoothing) + \
                           (1 - lam) * F.cross_entropy(student_outputs, labels_b, label_smoothing=label_smoothing)
            
            # Skip batch if loss is NaN
            if torch.isnan(loss):
                loop.set_postfix(loss="NaN-skip")
                continue
            
            # Backward
            scaler.scale(loss).backward()
            
            if grad_clip > 0:
                scaler.unscale_(optimizer)
                nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
                
            scaler.step(optimizer)
            scaler.update()
            
            running_loss += loss.item()
            valid_batches += 1
            loop.set_postfix(loss=f"{loss.item():.3f}")

        # Step Scheduler (only after warmup)
        if epoch >= warmup_epochs:
            scheduler.step()
        
        # Validation
        train_loss = running_loss / max(valid_batches, 1)
        val_acc, val_loss = evaluate_model_with_loss(model, testloader, val_criterion)
        
        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['val_accuracy'].append(val_acc)
        
        current_lr = optimizer.param_groups[0]['lr']
        print(f"Epoch {epoch+1}/{num_epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}% | LR: {current_lr:.6f}")
        
        # Save Best
        if val_acc > best_acc:
            best_acc = val_acc
            best_model_wts = copy.deepcopy(model.state_dict())
            torch.save(model.state_dict(), MODEL_DIR / f"{model_name}.pth")
            epochs_no_improve = 0
            print(f"  ★ New best model saved! Accuracy: {best_acc:.2f}%")
        else:
            epochs_no_improve += 1
            
        # Checkpointing
        if (epoch + 1) % CHECKPOINT_FREQUENCY == 0:
            save_checkpoint(model, optimizer, scheduler, epoch, best_acc, history, model_name, epochs_no_improve)
            cleanup_old_checkpoints(model_name)
            
        # Early Stopping
        if epochs_no_improve >= patience:
            print(f"Early stopping triggered after {epoch+1} epochs.")
            break
        
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            
    print(f"\n{'='*50}")
    print(f"Training complete. Best accuracy: {best_acc:.2f}%")
    print(f"{'='*50}")
    model.load_state_dict(best_model_wts)
    return model, history

print("Training function loaded (Version 3: DKD + LR Warmup + Label Smoothing)")

Training function loaded (Version 3: DKD + LR Warmup + Label Smoothing)


In [6]:
# Cell 6: Initialize Models
print("Loading Teacher (EfficientNetV2-L)...")
teacher_model = efficientnet_v2_l(weights=EfficientNet_V2_L_Weights.IMAGENET1K_V1)
teacher_model.classifier[1] = nn.Linear(teacher_model.classifier[1].in_features, NUM_CLASSES)
teacher_model = teacher_model.to(device)

print("Loading Student (EfficientNetV2-S)...")
student_model = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)
student_model.classifier[1] = nn.Linear(student_model.classifier[1].in_features, NUM_CLASSES)
student_model = student_model.to(device)

print("Models loaded")

Loading Teacher (EfficientNetV2-L)...
Loading Student (EfficientNetV2-S)...
Models loaded


In [7]:
# Cell 7: Train/Load Teacher Model (Version 3)
print("\n" + "="*70)
print("TEACHER MODEL (Version 3)")
print("="*70)

teacher_path = MODEL_DIR / f"{TEACHER_NAME}.pth"

# Also check for existing v1/v2 teacher to use as starting point
existing_teacher = MODEL_DIR / "teacher_model.pth"

if teacher_path.exists():
    print(f"Found existing Teacher v3: {teacher_path}")
    teacher_model.load_state_dict(torch.load(teacher_path, map_location=device))
elif existing_teacher.exists():
    print(f"Found existing Teacher (v1/v2): {existing_teacher}")
    print(f"Loading and continuing training for {TEACHER_EPOCHS} epochs...")
    teacher_model.load_state_dict(torch.load(existing_teacher, map_location=device))
    
    # Continue training with extended epochs
    opt_t = optim.AdamW(teacher_model.parameters(), lr=LEARNING_RATE * 0.1, weight_decay=WEIGHT_DECAY)  # Lower LR for fine-tuning
    sch_t = optim.lr_scheduler.CosineAnnealingLR(opt_t, T_max=100)  # Additional 100 epochs
    
    teacher_model, teacher_history = train_model_v3(
        teacher_model, trainloader, opt_t, sch_t, 
        num_epochs=100,  # Additional epochs
        model_name=TEACHER_NAME, 
        teacher_model=None,
        warmup_epochs=0  # No warmup for fine-tuning
    )
else:
    print(f"Training Teacher Model v3 from scratch ({TEACHER_EPOCHS} epochs)...")
    print("This may take a while...")
    
    # Re-initialize teacher
    teacher_model = efficientnet_v2_l(weights=EfficientNet_V2_L_Weights.IMAGENET1K_V1)
    teacher_model.classifier[1] = nn.Linear(teacher_model.classifier[1].in_features, NUM_CLASSES)
    teacher_model = teacher_model.to(device)
    
    opt_t = optim.AdamW(teacher_model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    sch_t = optim.lr_scheduler.CosineAnnealingLR(opt_t, T_max=TEACHER_EPOCHS - WARMUP_EPOCHS)
    
    teacher_model, teacher_history = train_model_v3(
        teacher_model, trainloader, opt_t, sch_t, 
        num_epochs=TEACHER_EPOCHS, 
        model_name=TEACHER_NAME, 
        teacher_model=None,
        warmup_epochs=WARMUP_EPOCHS
    )

# Evaluate Teacher
teacher_model.eval()
teacher_accuracy, _ = evaluate_model_with_loss(teacher_model, testloader, nn.CrossEntropyLoss())
print(f"\nTeacher v3 Accuracy: {teacher_accuracy:.2f}%")


TEACHER MODEL (Version 3)
Found existing Teacher v3: outputs\models\teacher_v3.pth

Teacher v3 Accuracy: 76.02%


In [8]:
# Cell 8: Train Student with Standard KD (Version 4 - Grand Finale)
print("\n" + "="*70)
print("STUDENT MODEL with Standard KD (Version 4 - Grand Finale)")
print("="*70)
print("Strategy: Best Teacher (v3) + Best KD Method (Standard KD from v2)")

student_path = MODEL_DIR / f"{STUDENT_NAME}.pth"

if student_path.exists():
    print(f"Found existing Student v4: {student_path}")
    student_model.load_state_dict(torch.load(student_path, map_location=device))
    distilled_accuracy, _ = evaluate_model_with_loss(student_model, testloader, nn.CrossEntropyLoss())
    print(f"Student v4 (Standard KD) Accuracy: {distilled_accuracy:.2f}%")
else:
    print(f"\nStarting Standard Knowledge Distillation (v4)...")
    print(f"  Using Standard KD (NOT DKD) - proven 98% retention rate")
    print(f"  KD Alpha (soft loss weight): {KD_ALPHA}")
    print(f"  Temperature: {TEMPERATURE}")
    print(f"  Label Smoothing: {LABEL_SMOOTHING}")
    print(f"  LR Warmup: {WARMUP_EPOCHS} epochs")
    print(f"  Augmentation: AutoAugment + RandomErasing + CutMix/Mixup")
    
    # Re-initialize student model (fresh weights from ImageNet)
    student_model = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)
    student_model.classifier[1] = nn.Linear(student_model.classifier[1].in_features, NUM_CLASSES)
    student_model = student_model.to(device)
    
    # Optimizer & Scheduler
    opt_s = optim.AdamW(student_model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    sch_s = optim.lr_scheduler.CosineAnnealingLR(opt_s, T_max=NUM_EPOCHS - WARMUP_EPOCHS)
    
    # Run Standard KD Training (NOT DKD)
    trained_student, distilled_history = train_model_v3(
        model=student_model,
        dataloader=trainloader,
        optimizer=opt_s,
        scheduler=sch_s,
        num_epochs=NUM_EPOCHS,
        model_name=STUDENT_NAME,
        teacher_model=teacher_model,
        use_dkd=False,  # KEY: Use Standard KD, not DKD
        temperature=TEMPERATURE,
        label_smoothing=LABEL_SMOOTHING,
        patience=PATIENCE,
        warmup_epochs=WARMUP_EPOCHS
    )
    
    distilled_accuracy, _ = evaluate_model_with_loss(trained_student, testloader, nn.CrossEntropyLoss())
    print(f"\nStudent v4 (Standard KD) Final Accuracy: {distilled_accuracy:.2f}%")
    
    # Compare with prediction
    predicted = 76.02 * 0.98
    print(f"Predicted: {predicted:.2f}% | Actual: {distilled_accuracy:.2f}%")


STUDENT MODEL with Standard KD (Version 4 - Grand Finale)
Strategy: Best Teacher (v3) + Best KD Method (Standard KD from v2)

Starting Standard Knowledge Distillation (v4)...
  Using Standard KD (NOT DKD) - proven 98% retention rate
  KD Alpha (soft loss weight): 0.7
  Temperature: 4.0
  Label Smoothing: 0.1
  LR Warmup: 5 epochs
  Augmentation: AutoAugment + RandomErasing + CutMix/Mixup
  Starting fresh training...
  Using Standard KD loss with Teacher guidance


Epoch 1/200: 100%|██████████| 390/390 [01:00<00:00,  6.45it/s, loss=1.654]   


Epoch 1/200 | Train Loss: 1.7803 | Val Loss: 3.3957 | Val Acc: 23.59% | LR: 0.000200
  ★ New best model saved! Accuracy: 23.59%


Epoch 2/200: 100%|██████████| 390/390 [00:55<00:00,  7.03it/s, loss=1.616]   


Epoch 2/200 | Train Loss: 1.5911 | Val Loss: 2.6438 | Val Acc: 37.90% | LR: 0.000400
  ★ New best model saved! Accuracy: 37.90%


Epoch 3/200: 100%|██████████| 390/390 [00:55<00:00,  7.03it/s, loss=NaN-skip]


Epoch 3/200 | Train Loss: 1.5096 | Val Loss: 2.3371 | Val Acc: 44.11% | LR: 0.000600
  ★ New best model saved! Accuracy: 44.11%


Epoch 4/200: 100%|██████████| 390/390 [00:55<00:00,  7.01it/s, loss=1.409]   


Epoch 4/200 | Train Loss: 1.4841 | Val Loss: 2.3555 | Val Acc: 45.04% | LR: 0.000800
  ★ New best model saved! Accuracy: 45.04%


Epoch 5/200: 100%|██████████| 390/390 [00:55<00:00,  7.07it/s, loss=1.488]   


Epoch 5/200 | Train Loss: 1.4863 | Val Loss: 2.4051 | Val Acc: 45.84% | LR: 0.001000
  ★ New best model saved! Accuracy: 45.84%


Epoch 6/200: 100%|██████████| 390/390 [00:54<00:00,  7.12it/s, loss=NaN-skip]


Epoch 6/200 | Train Loss: 1.4368 | Val Loss: 2.2301 | Val Acc: 47.96% | LR: 0.001000
  ★ New best model saved! Accuracy: 47.96%


Epoch 7/200: 100%|██████████| 390/390 [00:54<00:00,  7.11it/s, loss=1.493]   


Epoch 7/200 | Train Loss: 1.4322 | Val Loss: 2.2842 | Val Acc: 47.38% | LR: 0.001000


Epoch 8/200: 100%|██████████| 390/390 [00:55<00:00,  7.08it/s, loss=1.469]   


Epoch 8/200 | Train Loss: 1.4203 | Val Loss: 2.0975 | Val Acc: 50.16% | LR: 0.000999
  ★ New best model saved! Accuracy: 50.16%


Epoch 9/200: 100%|██████████| 390/390 [00:55<00:00,  7.05it/s, loss=1.419]   


Epoch 9/200 | Train Loss: 1.4068 | Val Loss: 2.1822 | Val Acc: 50.23% | LR: 0.000999
  ★ New best model saved! Accuracy: 50.23%


Epoch 10/200: 100%|██████████| 390/390 [00:54<00:00,  7.13it/s, loss=1.422]   


Epoch 10/200 | Train Loss: 1.4040 | Val Loss: 2.1453 | Val Acc: 50.79% | LR: 0.000998
  ★ New best model saved! Accuracy: 50.79%


Epoch 11/200: 100%|██████████| 390/390 [01:17<00:00,  5.04it/s, loss=1.474]   


Epoch 11/200 | Train Loss: 1.3955 | Val Loss: 2.2668 | Val Acc: 49.18% | LR: 0.000998


Epoch 12/200: 100%|██████████| 390/390 [00:56<00:00,  6.96it/s, loss=1.435]   


Epoch 12/200 | Train Loss: 1.3888 | Val Loss: 2.2636 | Val Acc: 50.98% | LR: 0.000997
  ★ New best model saved! Accuracy: 50.98%


Epoch 13/200: 100%|██████████| 390/390 [00:58<00:00,  6.67it/s, loss=1.312]   


Epoch 13/200 | Train Loss: 1.3727 | Val Loss: 2.0858 | Val Acc: 52.36% | LR: 0.000996
  ★ New best model saved! Accuracy: 52.36%


Epoch 14/200: 100%|██████████| 390/390 [01:08<00:00,  5.71it/s, loss=1.391]   


Epoch 14/200 | Train Loss: 1.3743 | Val Loss: 2.0714 | Val Acc: 52.26% | LR: 0.000995


Epoch 15/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.965]   


Epoch 15/200 | Train Loss: 1.3746 | Val Loss: 2.0223 | Val Acc: 53.45% | LR: 0.000994
  ★ New best model saved! Accuracy: 53.45%


Epoch 16/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=1.402]   


Epoch 16/200 | Train Loss: 1.3541 | Val Loss: 1.9891 | Val Acc: 53.78% | LR: 0.000992
  ★ New best model saved! Accuracy: 53.78%


Epoch 17/200: 100%|██████████| 390/390 [01:15<00:00,  5.18it/s, loss=NaN-skip]


Epoch 17/200 | Train Loss: 1.3495 | Val Loss: 1.9458 | Val Acc: 54.85% | LR: 0.000991
  ★ New best model saved! Accuracy: 54.85%


Epoch 18/200: 100%|██████████| 390/390 [01:18<00:00,  5.00it/s, loss=1.421]   


Epoch 18/200 | Train Loss: 1.3380 | Val Loss: 1.8358 | Val Acc: 55.87% | LR: 0.000989
  ★ New best model saved! Accuracy: 55.87%


Epoch 19/200: 100%|██████████| 390/390 [01:10<00:00,  5.57it/s, loss=1.355]   


Epoch 19/200 | Train Loss: 1.3273 | Val Loss: 1.9143 | Val Acc: 56.73% | LR: 0.000987
  ★ New best model saved! Accuracy: 56.73%


Epoch 20/200: 100%|██████████| 390/390 [01:12<00:00,  5.41it/s, loss=NaN-skip]


Epoch 20/200 | Train Loss: 1.3244 | Val Loss: 1.8321 | Val Acc: 57.50% | LR: 0.000985
  ★ New best model saved! Accuracy: 57.50%
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch20.pth


Epoch 21/200: 100%|██████████| 390/390 [01:11<00:00,  5.43it/s, loss=1.326]   


Epoch 21/200 | Train Loss: 1.3233 | Val Loss: 1.9409 | Val Acc: 56.01% | LR: 0.000983


Epoch 22/200: 100%|██████████| 390/390 [01:12<00:00,  5.40it/s, loss=1.312]   


Epoch 22/200 | Train Loss: 1.3185 | Val Loss: 1.8739 | Val Acc: 57.68% | LR: 0.000981
  ★ New best model saved! Accuracy: 57.68%


Epoch 23/200: 100%|██████████| 390/390 [01:11<00:00,  5.45it/s, loss=1.273]   


Epoch 23/200 | Train Loss: 1.3234 | Val Loss: 1.9005 | Val Acc: 57.14% | LR: 0.000979


Epoch 24/200: 100%|██████████| 390/390 [01:11<00:00,  5.42it/s, loss=1.089]   


Epoch 24/200 | Train Loss: 1.3251 | Val Loss: 2.8875 | Val Acc: 55.06% | LR: 0.000977


Epoch 25/200: 100%|██████████| 390/390 [01:12<00:00,  5.38it/s, loss=1.267]   


Epoch 25/200 | Train Loss: 1.3255 | Val Loss: 1.9689 | Val Acc: 56.68% | LR: 0.000974


Epoch 26/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=1.623]   


Epoch 26/200 | Train Loss: 1.3104 | Val Loss: 1.7865 | Val Acc: 58.51% | LR: 0.000972
  ★ New best model saved! Accuracy: 58.51%


Epoch 27/200: 100%|██████████| 390/390 [01:11<00:00,  5.42it/s, loss=1.619]   


Epoch 27/200 | Train Loss: 1.2988 | Val Loss: 1.8533 | Val Acc: 59.26% | LR: 0.000969
  ★ New best model saved! Accuracy: 59.26%


Epoch 28/200: 100%|██████████| 390/390 [01:12<00:00,  5.36it/s, loss=1.361]   


Epoch 28/200 | Train Loss: 1.2915 | Val Loss: 1.7341 | Val Acc: 59.06% | LR: 0.000966


Epoch 29/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=1.187]   


Epoch 29/200 | Train Loss: 1.2808 | Val Loss: 1.8973 | Val Acc: 59.14% | LR: 0.000963


Epoch 30/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=NaN-skip]


Epoch 30/200 | Train Loss: 1.2769 | Val Loss: 1.7997 | Val Acc: 58.98% | LR: 0.000960


Epoch 31/200: 100%|██████████| 390/390 [01:12<00:00,  5.40it/s, loss=1.144]   


Epoch 31/200 | Train Loss: 1.2898 | Val Loss: 1.7722 | Val Acc: 57.87% | LR: 0.000957


Epoch 32/200: 100%|██████████| 390/390 [01:11<00:00,  5.47it/s, loss=1.230]   


Epoch 32/200 | Train Loss: 1.2694 | Val Loss: 1.7988 | Val Acc: 58.55% | LR: 0.000953


Epoch 33/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=1.359]   


Epoch 33/200 | Train Loss: 1.2661 | Val Loss: 1.8971 | Val Acc: 58.31% | LR: 0.000950


Epoch 34/200: 100%|██████████| 390/390 [01:11<00:00,  5.43it/s, loss=1.069]   


Epoch 34/200 | Train Loss: 1.2750 | Val Loss: 1.7337 | Val Acc: 60.40% | LR: 0.000946
  ★ New best model saved! Accuracy: 60.40%


Epoch 35/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=NaN-skip]


Epoch 35/200 | Train Loss: 1.2752 | Val Loss: 1.8935 | Val Acc: 58.95% | LR: 0.000943


Epoch 36/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=1.291]   


Epoch 36/200 | Train Loss: 1.2718 | Val Loss: 1.8039 | Val Acc: 60.22% | LR: 0.000939


Epoch 37/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=1.294]   


Epoch 37/200 | Train Loss: 1.2802 | Val Loss: 1.7650 | Val Acc: 59.41% | LR: 0.000935


Epoch 38/200: 100%|██████████| 390/390 [01:12<00:00,  5.37it/s, loss=1.398]   


Epoch 38/200 | Train Loss: 1.2727 | Val Loss: 1.8156 | Val Acc: 60.51% | LR: 0.000931
  ★ New best model saved! Accuracy: 60.51%


Epoch 39/200: 100%|██████████| 390/390 [01:11<00:00,  5.43it/s, loss=1.363]   


Epoch 39/200 | Train Loss: 1.2602 | Val Loss: 1.9566 | Val Acc: 58.90% | LR: 0.000927


Epoch 40/200: 100%|██████████| 390/390 [01:12<00:00,  5.36it/s, loss=1.242]   


Epoch 40/200 | Train Loss: 1.2641 | Val Loss: 2.0325 | Val Acc: 60.37% | LR: 0.000923
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch40.pth


Epoch 41/200: 100%|██████████| 390/390 [01:12<00:00,  5.40it/s, loss=NaN-skip]


Epoch 41/200 | Train Loss: 1.2652 | Val Loss: 1.7359 | Val Acc: 60.79% | LR: 0.000918
  ★ New best model saved! Accuracy: 60.79%


Epoch 42/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=1.143]   


Epoch 42/200 | Train Loss: 1.2722 | Val Loss: 1.6732 | Val Acc: 62.01% | LR: 0.000914
  ★ New best model saved! Accuracy: 62.01%


Epoch 43/200: 100%|██████████| 390/390 [01:11<00:00,  5.44it/s, loss=NaN-skip]


Epoch 43/200 | Train Loss: 1.2657 | Val Loss: 1.7267 | Val Acc: 61.97% | LR: 0.000909


Epoch 44/200: 100%|██████████| 390/390 [01:08<00:00,  5.66it/s, loss=NaN-skip]


Epoch 44/200 | Train Loss: 1.2407 | Val Loss: 1.6042 | Val Acc: 62.75% | LR: 0.000905
  ★ New best model saved! Accuracy: 62.75%


Epoch 45/200: 100%|██████████| 390/390 [01:10<00:00,  5.54it/s, loss=0.840]   


Epoch 45/200 | Train Loss: 1.2412 | Val Loss: 1.6259 | Val Acc: 62.53% | LR: 0.000900


Epoch 46/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=1.198]   


Epoch 46/200 | Train Loss: 1.2331 | Val Loss: 1.6687 | Val Acc: 61.40% | LR: 0.000895


Epoch 47/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=NaN-skip]


Epoch 47/200 | Train Loss: 1.2463 | Val Loss: 1.8568 | Val Acc: 61.43% | LR: 0.000890


Epoch 48/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=1.301]   


Epoch 48/200 | Train Loss: 1.2498 | Val Loss: 1.9589 | Val Acc: 61.43% | LR: 0.000885


Epoch 49/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=1.293]   


Epoch 49/200 | Train Loss: 1.2563 | Val Loss: 1.8714 | Val Acc: 61.89% | LR: 0.000880


Epoch 50/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=1.375]   


Epoch 50/200 | Train Loss: 1.2404 | Val Loss: 1.9250 | Val Acc: 62.26% | LR: 0.000874


Epoch 51/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=1.160]   


Epoch 51/200 | Train Loss: 1.2298 | Val Loss: 1.8101 | Val Acc: 61.71% | LR: 0.000869


Epoch 52/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=1.202]   


Epoch 52/200 | Train Loss: 1.2241 | Val Loss: 1.5780 | Val Acc: 63.87% | LR: 0.000863
  ★ New best model saved! Accuracy: 63.87%


Epoch 53/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=1.356]   


Epoch 53/200 | Train Loss: 1.2271 | Val Loss: 1.7075 | Val Acc: 62.54% | LR: 0.000858


Epoch 54/200: 100%|██████████| 390/390 [01:14<00:00,  5.21it/s, loss=NaN-skip]


Epoch 54/200 | Train Loss: 1.2333 | Val Loss: 1.8102 | Val Acc: 63.19% | LR: 0.000852


Epoch 55/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=1.684]   


Epoch 55/200 | Train Loss: 1.2208 | Val Loss: 1.6182 | Val Acc: 63.19% | LR: 0.000846


Epoch 56/200: 100%|██████████| 390/390 [01:10<00:00,  5.57it/s, loss=1.306]   


Epoch 56/200 | Train Loss: 1.2357 | Val Loss: 1.6686 | Val Acc: 62.62% | LR: 0.000841


Epoch 57/200: 100%|██████████| 390/390 [01:09<00:00,  5.57it/s, loss=1.263]   


Epoch 57/200 | Train Loss: 1.2110 | Val Loss: 1.6887 | Val Acc: 63.16% | LR: 0.000835


Epoch 58/200: 100%|██████████| 390/390 [01:09<00:00,  5.58it/s, loss=1.022]   


Epoch 58/200 | Train Loss: 1.2102 | Val Loss: 1.6950 | Val Acc: 63.16% | LR: 0.000829


Epoch 59/200: 100%|██████████| 390/390 [01:09<00:00,  5.64it/s, loss=1.272]   


Epoch 59/200 | Train Loss: 1.2316 | Val Loss: 1.9777 | Val Acc: 62.77% | LR: 0.000822


Epoch 60/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=1.067]   


Epoch 60/200 | Train Loss: 1.2226 | Val Loss: 1.5789 | Val Acc: 63.78% | LR: 0.000816
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch60.pth


Epoch 61/200: 100%|██████████| 390/390 [01:09<00:00,  5.60it/s, loss=1.317]   


Epoch 61/200 | Train Loss: 1.2151 | Val Loss: 1.5857 | Val Acc: 63.78% | LR: 0.000810


Epoch 62/200: 100%|██████████| 390/390 [01:09<00:00,  5.61it/s, loss=1.441]   


Epoch 62/200 | Train Loss: 1.2208 | Val Loss: 1.7051 | Val Acc: 63.09% | LR: 0.000804


Epoch 63/200: 100%|██████████| 390/390 [01:09<00:00,  5.61it/s, loss=1.049]   


Epoch 63/200 | Train Loss: 1.2090 | Val Loss: 1.6047 | Val Acc: 63.90% | LR: 0.000797
  ★ New best model saved! Accuracy: 63.90%


Epoch 64/200: 100%|██████████| 390/390 [01:10<00:00,  5.56it/s, loss=1.245]   


Epoch 64/200 | Train Loss: 1.2204 | Val Loss: 2.0664 | Val Acc: 61.41% | LR: 0.000791


Epoch 65/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=1.023]   


Epoch 65/200 | Train Loss: 1.2289 | Val Loss: 1.7653 | Val Acc: 63.92% | LR: 0.000784
  ★ New best model saved! Accuracy: 63.92%


Epoch 66/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=1.275]   


Epoch 66/200 | Train Loss: 1.2095 | Val Loss: 1.6016 | Val Acc: 64.32% | LR: 0.000777
  ★ New best model saved! Accuracy: 64.32%


Epoch 67/200: 100%|██████████| 390/390 [01:09<00:00,  5.63it/s, loss=1.365]   


Epoch 67/200 | Train Loss: 1.2123 | Val Loss: 1.6652 | Val Acc: 63.29% | LR: 0.000771


Epoch 68/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=1.282]   


Epoch 68/200 | Train Loss: 1.1965 | Val Loss: 1.5824 | Val Acc: 63.92% | LR: 0.000764


Epoch 69/200: 100%|██████████| 390/390 [01:09<00:00,  5.58it/s, loss=NaN-skip]


Epoch 69/200 | Train Loss: 1.2144 | Val Loss: 1.4914 | Val Acc: 65.48% | LR: 0.000757
  ★ New best model saved! Accuracy: 65.48%


Epoch 70/200: 100%|██████████| 390/390 [01:10<00:00,  5.56it/s, loss=1.194]   


Epoch 70/200 | Train Loss: 1.2000 | Val Loss: 1.6874 | Val Acc: 63.37% | LR: 0.000750


Epoch 71/200: 100%|██████████| 390/390 [01:09<00:00,  5.60it/s, loss=1.348]   


Epoch 71/200 | Train Loss: 1.1814 | Val Loss: 1.6966 | Val Acc: 63.64% | LR: 0.000743


Epoch 72/200: 100%|██████████| 390/390 [01:11<00:00,  5.47it/s, loss=NaN-skip]


Epoch 72/200 | Train Loss: 1.1842 | Val Loss: 1.5837 | Val Acc: 64.15% | LR: 0.000736


Epoch 73/200: 100%|██████████| 390/390 [01:11<00:00,  5.46it/s, loss=0.948]   


Epoch 73/200 | Train Loss: 1.2046 | Val Loss: 1.5159 | Val Acc: 65.59% | LR: 0.000729
  ★ New best model saved! Accuracy: 65.59%


Epoch 74/200: 100%|██████████| 390/390 [01:12<00:00,  5.40it/s, loss=0.736]   


Epoch 74/200 | Train Loss: 1.1973 | Val Loss: 1.5696 | Val Acc: 66.22% | LR: 0.000722
  ★ New best model saved! Accuracy: 66.22%


Epoch 75/200: 100%|██████████| 390/390 [01:11<00:00,  5.44it/s, loss=NaN-skip]


Epoch 75/200 | Train Loss: 1.2040 | Val Loss: 1.5205 | Val Acc: 65.13% | LR: 0.000714


Epoch 76/200: 100%|██████████| 390/390 [01:11<00:00,  5.44it/s, loss=1.328]   


Epoch 76/200 | Train Loss: 1.1993 | Val Loss: 1.5431 | Val Acc: 65.04% | LR: 0.000707


Epoch 77/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=1.180]   


Epoch 77/200 | Train Loss: 1.1999 | Val Loss: 1.5776 | Val Acc: 65.58% | LR: 0.000700


Epoch 78/200: 100%|██████████| 390/390 [01:11<00:00,  5.45it/s, loss=1.207]   


Epoch 78/200 | Train Loss: 1.1938 | Val Loss: 1.6017 | Val Acc: 63.98% | LR: 0.000692


Epoch 79/200: 100%|██████████| 390/390 [01:11<00:00,  5.45it/s, loss=1.237]   


Epoch 79/200 | Train Loss: 1.1817 | Val Loss: 1.5861 | Val Acc: 64.82% | LR: 0.000685


Epoch 80/200: 100%|██████████| 390/390 [01:12<00:00,  5.35it/s, loss=1.332]   


Epoch 80/200 | Train Loss: 1.1812 | Val Loss: 1.5419 | Val Acc: 65.23% | LR: 0.000677
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch80.pth
  Cleaned up: student_v4_standard_kd_epoch20.pth


Epoch 81/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=1.246]   


Epoch 81/200 | Train Loss: 1.1830 | Val Loss: 1.4961 | Val Acc: 66.24% | LR: 0.000670
  ★ New best model saved! Accuracy: 66.24%


Epoch 82/200: 100%|██████████| 390/390 [01:08<00:00,  5.67it/s, loss=1.401]   


Epoch 82/200 | Train Loss: 1.1753 | Val Loss: 1.6072 | Val Acc: 64.34% | LR: 0.000662


Epoch 83/200: 100%|██████████| 390/390 [01:09<00:00,  5.61it/s, loss=1.382]   


Epoch 83/200 | Train Loss: 1.1810 | Val Loss: 1.5297 | Val Acc: 65.73% | LR: 0.000655


Epoch 84/200: 100%|██████████| 390/390 [01:10<00:00,  5.54it/s, loss=1.181]   


Epoch 84/200 | Train Loss: 1.1947 | Val Loss: 1.5218 | Val Acc: 65.53% | LR: 0.000647


Epoch 85/200: 100%|██████████| 390/390 [01:09<00:00,  5.64it/s, loss=0.837]   


Epoch 85/200 | Train Loss: 1.1788 | Val Loss: 1.4808 | Val Acc: 66.79% | LR: 0.000639
  ★ New best model saved! Accuracy: 66.79%


Epoch 86/200: 100%|██████████| 390/390 [01:09<00:00,  5.65it/s, loss=1.248]   


Epoch 86/200 | Train Loss: 1.1779 | Val Loss: 1.5627 | Val Acc: 65.36% | LR: 0.000631


Epoch 87/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=1.165]   


Epoch 87/200 | Train Loss: 1.1807 | Val Loss: 1.5660 | Val Acc: 65.32% | LR: 0.000624


Epoch 88/200: 100%|██████████| 390/390 [01:09<00:00,  5.64it/s, loss=0.906]   


Epoch 88/200 | Train Loss: 1.1463 | Val Loss: 1.4615 | Val Acc: 67.09% | LR: 0.000616
  ★ New best model saved! Accuracy: 67.09%


Epoch 89/200: 100%|██████████| 390/390 [01:10<00:00,  5.56it/s, loss=1.240]   


Epoch 89/200 | Train Loss: 1.1771 | Val Loss: 1.5058 | Val Acc: 66.43% | LR: 0.000608


Epoch 90/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=1.281]   


Epoch 90/200 | Train Loss: 1.1782 | Val Loss: 1.4790 | Val Acc: 66.77% | LR: 0.000600


Epoch 91/200: 100%|██████████| 390/390 [01:09<00:00,  5.63it/s, loss=NaN-skip]


Epoch 91/200 | Train Loss: 1.1799 | Val Loss: 1.4966 | Val Acc: 66.55% | LR: 0.000592


Epoch 92/200: 100%|██████████| 390/390 [01:10<00:00,  5.56it/s, loss=NaN-skip]


Epoch 92/200 | Train Loss: 1.1521 | Val Loss: 1.4894 | Val Acc: 67.13% | LR: 0.000584
  ★ New best model saved! Accuracy: 67.13%


Epoch 93/200: 100%|██████████| 390/390 [01:09<00:00,  5.57it/s, loss=1.291]   


Epoch 93/200 | Train Loss: 1.1745 | Val Loss: 1.4769 | Val Acc: 67.06% | LR: 0.000576


Epoch 94/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=0.830]   


Epoch 94/200 | Train Loss: 1.1654 | Val Loss: 1.4754 | Val Acc: 66.72% | LR: 0.000568


Epoch 95/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=0.983]   


Epoch 95/200 | Train Loss: 1.1581 | Val Loss: 1.4884 | Val Acc: 66.61% | LR: 0.000560


Epoch 96/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=1.128]   


Epoch 96/200 | Train Loss: 1.1453 | Val Loss: 1.3943 | Val Acc: 67.87% | LR: 0.000552
  ★ New best model saved! Accuracy: 67.87%


Epoch 97/200: 100%|██████████| 390/390 [01:09<00:00,  5.63it/s, loss=1.083]   


Epoch 97/200 | Train Loss: 1.1686 | Val Loss: 1.4881 | Val Acc: 66.59% | LR: 0.000544


Epoch 98/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=1.115]   


Epoch 98/200 | Train Loss: 1.1646 | Val Loss: 1.4558 | Val Acc: 67.67% | LR: 0.000536


Epoch 99/200: 100%|██████████| 390/390 [01:09<00:00,  5.63it/s, loss=NaN-skip]


Epoch 99/200 | Train Loss: 1.1591 | Val Loss: 1.4946 | Val Acc: 66.89% | LR: 0.000528


Epoch 100/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=1.691]   


Epoch 100/200 | Train Loss: 1.1559 | Val Loss: 1.4987 | Val Acc: 67.07% | LR: 0.000520
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch100.pth
  Cleaned up: student_v4_standard_kd_epoch100.pth


Epoch 101/200: 100%|██████████| 390/390 [01:18<00:00,  5.00it/s, loss=1.191]   


Epoch 101/200 | Train Loss: 1.1660 | Val Loss: 1.4481 | Val Acc: 67.21% | LR: 0.000512


Epoch 102/200: 100%|██████████| 390/390 [01:09<00:00,  5.60it/s, loss=NaN-skip]


Epoch 102/200 | Train Loss: 1.1451 | Val Loss: 1.4669 | Val Acc: 67.39% | LR: 0.000504


Epoch 103/200: 100%|██████████| 390/390 [01:11<00:00,  5.48it/s, loss=1.340]   


Epoch 103/200 | Train Loss: 1.1346 | Val Loss: 1.4953 | Val Acc: 66.97% | LR: 0.000496


Epoch 104/200: 100%|██████████| 390/390 [01:11<00:00,  5.46it/s, loss=1.201]   


Epoch 104/200 | Train Loss: 1.1351 | Val Loss: 1.4522 | Val Acc: 67.61% | LR: 0.000488


Epoch 105/200: 100%|██████████| 390/390 [01:12<00:00,  5.40it/s, loss=1.248]   


Epoch 105/200 | Train Loss: 1.1417 | Val Loss: 1.4771 | Val Acc: 67.28% | LR: 0.000480


Epoch 106/200: 100%|██████████| 390/390 [01:11<00:00,  5.45it/s, loss=1.249]   


Epoch 106/200 | Train Loss: 1.1482 | Val Loss: 1.4142 | Val Acc: 68.71% | LR: 0.000472
  ★ New best model saved! Accuracy: 68.71%


Epoch 107/200: 100%|██████████| 390/390 [01:11<00:00,  5.47it/s, loss=1.101]   


Epoch 107/200 | Train Loss: 1.1362 | Val Loss: 1.3789 | Val Acc: 68.09% | LR: 0.000464


Epoch 108/200: 100%|██████████| 390/390 [01:11<00:00,  5.46it/s, loss=1.217]   


Epoch 108/200 | Train Loss: 1.1444 | Val Loss: 1.4862 | Val Acc: 68.17% | LR: 0.000456


Epoch 109/200: 100%|██████████| 390/390 [01:11<00:00,  5.42it/s, loss=0.823]   


Epoch 109/200 | Train Loss: 1.1421 | Val Loss: 1.4082 | Val Acc: 68.10% | LR: 0.000448


Epoch 110/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=NaN-skip]


Epoch 110/200 | Train Loss: 1.1251 | Val Loss: 1.3391 | Val Acc: 69.20% | LR: 0.000440
  ★ New best model saved! Accuracy: 69.20%


Epoch 111/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=1.160]   


Epoch 111/200 | Train Loss: 1.1323 | Val Loss: 1.4599 | Val Acc: 68.84% | LR: 0.000432


Epoch 112/200: 100%|██████████| 390/390 [01:11<00:00,  5.43it/s, loss=1.459]   


Epoch 112/200 | Train Loss: 1.1389 | Val Loss: 1.4193 | Val Acc: 68.71% | LR: 0.000424


Epoch 113/200: 100%|██████████| 390/390 [01:12<00:00,  5.42it/s, loss=NaN-skip]


Epoch 113/200 | Train Loss: 1.1357 | Val Loss: 1.4578 | Val Acc: 68.60% | LR: 0.000416


Epoch 114/200: 100%|██████████| 390/390 [01:11<00:00,  5.47it/s, loss=0.997]   


Epoch 114/200 | Train Loss: 1.1256 | Val Loss: 1.5008 | Val Acc: 69.34% | LR: 0.000408
  ★ New best model saved! Accuracy: 69.34%


Epoch 115/200: 100%|██████████| 390/390 [01:11<00:00,  5.43it/s, loss=1.165]   


Epoch 115/200 | Train Loss: 1.1203 | Val Loss: 1.4519 | Val Acc: 68.64% | LR: 0.000400


Epoch 116/200: 100%|██████████| 390/390 [01:12<00:00,  5.40it/s, loss=0.718]   


Epoch 116/200 | Train Loss: 1.1265 | Val Loss: 1.3881 | Val Acc: 69.19% | LR: 0.000392


Epoch 117/200: 100%|██████████| 390/390 [01:11<00:00,  5.43it/s, loss=NaN-skip]


Epoch 117/200 | Train Loss: 1.1172 | Val Loss: 1.3954 | Val Acc: 69.62% | LR: 0.000384
  ★ New best model saved! Accuracy: 69.62%


Epoch 118/200: 100%|██████████| 390/390 [01:12<00:00,  5.39it/s, loss=0.977]   


Epoch 118/200 | Train Loss: 1.1196 | Val Loss: 1.4178 | Val Acc: 69.13% | LR: 0.000376


Epoch 119/200: 100%|██████████| 390/390 [01:11<00:00,  5.49it/s, loss=1.121]   


Epoch 119/200 | Train Loss: 1.1010 | Val Loss: 1.3605 | Val Acc: 69.74% | LR: 0.000369
  ★ New best model saved! Accuracy: 69.74%


Epoch 120/200: 100%|██████████| 390/390 [01:12<00:00,  5.37it/s, loss=1.279]   


Epoch 120/200 | Train Loss: 1.0996 | Val Loss: 1.3870 | Val Acc: 69.42% | LR: 0.000361
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch120.pth
  Cleaned up: student_v4_standard_kd_epoch120.pth


Epoch 121/200: 100%|██████████| 390/390 [01:11<00:00,  5.49it/s, loss=0.759]   


Epoch 121/200 | Train Loss: 1.1136 | Val Loss: 1.3964 | Val Acc: 68.91% | LR: 0.000353


Epoch 122/200: 100%|██████████| 390/390 [01:10<00:00,  5.51it/s, loss=1.245]   


Epoch 122/200 | Train Loss: 1.0994 | Val Loss: 1.4180 | Val Acc: 68.85% | LR: 0.000345


Epoch 123/200: 100%|██████████| 390/390 [01:16<00:00,  5.13it/s, loss=1.164]   


Epoch 123/200 | Train Loss: 1.1121 | Val Loss: 1.4779 | Val Acc: 68.27% | LR: 0.000338


Epoch 124/200: 100%|██████████| 390/390 [00:59<00:00,  6.60it/s, loss=1.043]   


Epoch 124/200 | Train Loss: 1.1103 | Val Loss: 1.3629 | Val Acc: 69.60% | LR: 0.000330


Epoch 125/200: 100%|██████████| 390/390 [01:10<00:00,  5.51it/s, loss=1.119]   


Epoch 125/200 | Train Loss: 1.1159 | Val Loss: 1.3614 | Val Acc: 69.34% | LR: 0.000323


Epoch 126/200: 100%|██████████| 390/390 [01:11<00:00,  5.42it/s, loss=1.229]   


Epoch 126/200 | Train Loss: 1.0980 | Val Loss: 1.4870 | Val Acc: 69.47% | LR: 0.000315


Epoch 127/200: 100%|██████████| 390/390 [01:26<00:00,  4.53it/s, loss=NaN-skip]


Epoch 127/200 | Train Loss: 1.1117 | Val Loss: 1.3549 | Val Acc: 70.31% | LR: 0.000308
  ★ New best model saved! Accuracy: 70.31%


Epoch 128/200: 100%|██████████| 390/390 [01:21<00:00,  4.79it/s, loss=1.116]   


Epoch 128/200 | Train Loss: 1.1237 | Val Loss: 1.3314 | Val Acc: 70.23% | LR: 0.000300


Epoch 129/200: 100%|██████████| 390/390 [01:23<00:00,  4.67it/s, loss=1.154]   


Epoch 129/200 | Train Loss: 1.1266 | Val Loss: 1.5209 | Val Acc: 69.31% | LR: 0.000293


Epoch 130/200: 100%|██████████| 390/390 [01:24<00:00,  4.61it/s, loss=1.020]   


Epoch 130/200 | Train Loss: 1.1009 | Val Loss: 1.3444 | Val Acc: 70.43% | LR: 0.000286
  ★ New best model saved! Accuracy: 70.43%


Epoch 131/200: 100%|██████████| 390/390 [01:23<00:00,  4.66it/s, loss=NaN-skip]


Epoch 131/200 | Train Loss: 1.1047 | Val Loss: 1.3689 | Val Acc: 70.12% | LR: 0.000278


Epoch 132/200: 100%|██████████| 390/390 [00:59<00:00,  6.54it/s, loss=0.985]   


Epoch 132/200 | Train Loss: 1.1059 | Val Loss: 1.5019 | Val Acc: 69.74% | LR: 0.000271


Epoch 133/200: 100%|██████████| 390/390 [01:12<00:00,  5.38it/s, loss=NaN-skip]


Epoch 133/200 | Train Loss: 1.0826 | Val Loss: 1.2774 | Val Acc: 71.30% | LR: 0.000264
  ★ New best model saved! Accuracy: 71.30%


Epoch 134/200: 100%|██████████| 390/390 [01:14<00:00,  5.21it/s, loss=1.218]   


Epoch 134/200 | Train Loss: 1.1026 | Val Loss: 1.3211 | Val Acc: 71.06% | LR: 0.000257


Epoch 135/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=1.174]   


Epoch 135/200 | Train Loss: 1.0897 | Val Loss: 1.2844 | Val Acc: 71.12% | LR: 0.000250


Epoch 136/200: 100%|██████████| 390/390 [01:14<00:00,  5.20it/s, loss=0.895]   


Epoch 136/200 | Train Loss: 1.1038 | Val Loss: 1.3196 | Val Acc: 71.16% | LR: 0.000243


Epoch 137/200: 100%|██████████| 390/390 [01:14<00:00,  5.27it/s, loss=1.062]   


Epoch 137/200 | Train Loss: 1.1002 | Val Loss: 1.3246 | Val Acc: 70.91% | LR: 0.000236


Epoch 138/200: 100%|██████████| 390/390 [01:12<00:00,  5.38it/s, loss=1.134]   


Epoch 138/200 | Train Loss: 1.0859 | Val Loss: 1.3609 | Val Acc: 70.88% | LR: 0.000229


Epoch 139/200: 100%|██████████| 390/390 [01:21<00:00,  4.77it/s, loss=1.219]   


Epoch 139/200 | Train Loss: 1.0806 | Val Loss: 1.3298 | Val Acc: 70.75% | LR: 0.000223


Epoch 140/200: 100%|██████████| 390/390 [01:09<00:00,  5.63it/s, loss=1.078]   


Epoch 140/200 | Train Loss: 1.0839 | Val Loss: 1.2421 | Val Acc: 71.52% | LR: 0.000216
  ★ New best model saved! Accuracy: 71.52%
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch140.pth
  Cleaned up: student_v4_standard_kd_epoch140.pth


Epoch 141/200: 100%|██████████| 390/390 [01:08<00:00,  5.71it/s, loss=NaN-skip]


Epoch 141/200 | Train Loss: 1.0974 | Val Loss: 1.2685 | Val Acc: 71.88% | LR: 0.000209
  ★ New best model saved! Accuracy: 71.88%


Epoch 142/200: 100%|██████████| 390/390 [01:08<00:00,  5.68it/s, loss=NaN-skip]


Epoch 142/200 | Train Loss: 1.0787 | Val Loss: 1.3336 | Val Acc: 71.39% | LR: 0.000203


Epoch 143/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=0.931]   


Epoch 143/200 | Train Loss: 1.0803 | Val Loss: 1.2983 | Val Acc: 72.01% | LR: 0.000196
  ★ New best model saved! Accuracy: 72.01%


Epoch 144/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=1.127]   


Epoch 144/200 | Train Loss: 1.0798 | Val Loss: 1.2960 | Val Acc: 72.27% | LR: 0.000190
  ★ New best model saved! Accuracy: 72.27%


Epoch 145/200: 100%|██████████| 390/390 [01:08<00:00,  5.66it/s, loss=1.472]   


Epoch 145/200 | Train Loss: 1.0760 | Val Loss: 1.2869 | Val Acc: 71.80% | LR: 0.000184


Epoch 146/200: 100%|██████████| 390/390 [01:10<00:00,  5.54it/s, loss=0.897]   


Epoch 146/200 | Train Loss: 1.0858 | Val Loss: 1.2996 | Val Acc: 71.31% | LR: 0.000178


Epoch 147/200: 100%|██████████| 390/390 [01:09<00:00,  5.65it/s, loss=0.985]   


Epoch 147/200 | Train Loss: 1.0778 | Val Loss: 1.2695 | Val Acc: 72.13% | LR: 0.000171


Epoch 148/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=0.763]   


Epoch 148/200 | Train Loss: 1.0699 | Val Loss: 1.2605 | Val Acc: 72.29% | LR: 0.000165
  ★ New best model saved! Accuracy: 72.29%


Epoch 149/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=0.950]   


Epoch 149/200 | Train Loss: 1.0754 | Val Loss: 1.2794 | Val Acc: 72.07% | LR: 0.000159


Epoch 150/200: 100%|██████████| 390/390 [01:10<00:00,  5.54it/s, loss=NaN-skip]


Epoch 150/200 | Train Loss: 1.0613 | Val Loss: 1.2878 | Val Acc: 72.16% | LR: 0.000154


Epoch 151/200: 100%|██████████| 390/390 [01:09<00:00,  5.65it/s, loss=1.213]   


Epoch 151/200 | Train Loss: 1.0672 | Val Loss: 1.3635 | Val Acc: 71.71% | LR: 0.000148


Epoch 152/200: 100%|██████████| 390/390 [01:07<00:00,  5.75it/s, loss=1.000]   


Epoch 152/200 | Train Loss: 1.0533 | Val Loss: 1.2743 | Val Acc: 72.84% | LR: 0.000142
  ★ New best model saved! Accuracy: 72.84%


Epoch 153/200: 100%|██████████| 390/390 [00:55<00:00,  6.98it/s, loss=1.017]   


Epoch 153/200 | Train Loss: 1.0612 | Val Loss: 1.2884 | Val Acc: 72.77% | LR: 0.000137


Epoch 154/200: 100%|██████████| 390/390 [00:55<00:00,  7.02it/s, loss=1.184]   


Epoch 154/200 | Train Loss: 1.0912 | Val Loss: 1.2973 | Val Acc: 72.39% | LR: 0.000131


Epoch 155/200: 100%|██████████| 390/390 [00:56<00:00,  6.94it/s, loss=0.705]   


Epoch 155/200 | Train Loss: 1.0668 | Val Loss: 1.2632 | Val Acc: 72.20% | LR: 0.000126


Epoch 156/200: 100%|██████████| 390/390 [00:55<00:00,  7.00it/s, loss=0.695]   


Epoch 156/200 | Train Loss: 1.0770 | Val Loss: 1.3488 | Val Acc: 71.41% | LR: 0.000120


Epoch 157/200: 100%|██████████| 390/390 [00:55<00:00,  6.97it/s, loss=1.210]   


Epoch 157/200 | Train Loss: 1.0581 | Val Loss: 1.2720 | Val Acc: 72.48% | LR: 0.000115


Epoch 158/200: 100%|██████████| 390/390 [00:56<00:00,  6.96it/s, loss=1.140]   


Epoch 158/200 | Train Loss: 1.0676 | Val Loss: 1.3179 | Val Acc: 72.19% | LR: 0.000110


Epoch 159/200: 100%|██████████| 390/390 [00:55<00:00,  7.02it/s, loss=1.112]   


Epoch 159/200 | Train Loss: 1.0423 | Val Loss: 1.2936 | Val Acc: 72.65% | LR: 0.000105


Epoch 160/200: 100%|██████████| 390/390 [00:55<00:00,  6.98it/s, loss=NaN-skip]


Epoch 160/200 | Train Loss: 1.0524 | Val Loss: 1.2694 | Val Acc: 72.63% | LR: 0.000100
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch160.pth
  Cleaned up: student_v4_standard_kd_epoch160.pth


Epoch 161/200: 100%|██████████| 390/390 [00:55<00:00,  7.03it/s, loss=0.670]   


Epoch 161/200 | Train Loss: 1.0475 | Val Loss: 1.2382 | Val Acc: 73.21% | LR: 0.000095
  ★ New best model saved! Accuracy: 73.21%


Epoch 162/200: 100%|██████████| 390/390 [01:10<00:00,  5.51it/s, loss=NaN-skip]


Epoch 162/200 | Train Loss: 1.0495 | Val Loss: 1.2969 | Val Acc: 72.56% | LR: 0.000091


Epoch 163/200: 100%|██████████| 390/390 [01:09<00:00,  5.60it/s, loss=0.726]   


Epoch 163/200 | Train Loss: 1.0526 | Val Loss: 1.2363 | Val Acc: 72.93% | LR: 0.000086


Epoch 164/200: 100%|██████████| 390/390 [01:10<00:00,  5.56it/s, loss=1.725]   


Epoch 164/200 | Train Loss: 1.0658 | Val Loss: 1.3001 | Val Acc: 73.00% | LR: 0.000082


Epoch 165/200: 100%|██████████| 390/390 [01:08<00:00,  5.65it/s, loss=NaN-skip]


Epoch 165/200 | Train Loss: 1.0680 | Val Loss: 1.3472 | Val Acc: 72.82% | LR: 0.000077


Epoch 166/200: 100%|██████████| 390/390 [01:08<00:00,  5.69it/s, loss=0.911]   


Epoch 166/200 | Train Loss: 1.0297 | Val Loss: 1.2655 | Val Acc: 72.62% | LR: 0.000073


Epoch 167/200: 100%|██████████| 390/390 [01:09<00:00,  5.57it/s, loss=0.622]   


Epoch 167/200 | Train Loss: 1.0628 | Val Loss: 1.2861 | Val Acc: 72.65% | LR: 0.000069


Epoch 168/200: 100%|██████████| 390/390 [01:09<00:00,  5.64it/s, loss=0.792]   


Epoch 168/200 | Train Loss: 1.0340 | Val Loss: 1.2566 | Val Acc: 73.11% | LR: 0.000065


Epoch 169/200: 100%|██████████| 390/390 [01:09<00:00,  5.63it/s, loss=0.907]   


Epoch 169/200 | Train Loss: 1.0268 | Val Loss: 1.2431 | Val Acc: 73.30% | LR: 0.000061
  ★ New best model saved! Accuracy: 73.30%


Epoch 170/200: 100%|██████████| 390/390 [01:08<00:00,  5.67it/s, loss=1.082]   


Epoch 170/200 | Train Loss: 1.0342 | Val Loss: 1.2345 | Val Acc: 73.40% | LR: 0.000057
  ★ New best model saved! Accuracy: 73.40%


Epoch 171/200: 100%|██████████| 390/390 [01:09<00:00,  5.61it/s, loss=0.976]   


Epoch 171/200 | Train Loss: 1.0445 | Val Loss: 1.2557 | Val Acc: 72.98% | LR: 0.000054


Epoch 172/200: 100%|██████████| 390/390 [01:09<00:00,  5.60it/s, loss=0.932]   


Epoch 172/200 | Train Loss: 1.0440 | Val Loss: 1.2748 | Val Acc: 73.30% | LR: 0.000050


Epoch 173/200: 100%|██████████| 390/390 [01:09<00:00,  5.59it/s, loss=0.770]   


Epoch 173/200 | Train Loss: 1.0282 | Val Loss: 1.2176 | Val Acc: 73.43% | LR: 0.000047
  ★ New best model saved! Accuracy: 73.43%


Epoch 174/200: 100%|██████████| 390/390 [01:10<00:00,  5.56it/s, loss=0.677]   


Epoch 174/200 | Train Loss: 1.0169 | Val Loss: 1.2114 | Val Acc: 73.56% | LR: 0.000043
  ★ New best model saved! Accuracy: 73.56%


Epoch 175/200: 100%|██████████| 390/390 [01:09<00:00,  5.64it/s, loss=1.177]   


Epoch 175/200 | Train Loss: 1.0339 | Val Loss: 1.2764 | Val Acc: 73.29% | LR: 0.000040


Epoch 176/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=1.174]   


Epoch 176/200 | Train Loss: 1.0244 | Val Loss: 1.2123 | Val Acc: 73.86% | LR: 0.000037
  ★ New best model saved! Accuracy: 73.86%


Epoch 177/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=1.179]   


Epoch 177/200 | Train Loss: 1.0544 | Val Loss: 1.2652 | Val Acc: 73.44% | LR: 0.000034


Epoch 178/200: 100%|██████████| 390/390 [01:09<00:00,  5.63it/s, loss=0.782]   


Epoch 178/200 | Train Loss: 1.0429 | Val Loss: 1.2519 | Val Acc: 73.48% | LR: 0.000031


Epoch 179/200: 100%|██████████| 390/390 [01:09<00:00,  5.58it/s, loss=1.155]   


Epoch 179/200 | Train Loss: 1.0329 | Val Loss: 1.3694 | Val Acc: 73.07% | LR: 0.000028


Epoch 180/200: 100%|██████████| 390/390 [01:10<00:00,  5.53it/s, loss=0.670]   


Epoch 180/200 | Train Loss: 1.0399 | Val Loss: 1.2717 | Val Acc: 73.17% | LR: 0.000026
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch180.pth
  Cleaned up: student_v4_standard_kd_epoch180.pth


Epoch 181/200: 100%|██████████| 390/390 [01:10<00:00,  5.57it/s, loss=1.167]   


Epoch 181/200 | Train Loss: 1.0405 | Val Loss: 1.3936 | Val Acc: 73.11% | LR: 0.000023


Epoch 182/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=1.156]   


Epoch 182/200 | Train Loss: 1.0328 | Val Loss: 1.3015 | Val Acc: 73.25% | LR: 0.000021


Epoch 183/200: 100%|██████████| 390/390 [01:11<00:00,  5.43it/s, loss=0.617]   


Epoch 183/200 | Train Loss: 1.0169 | Val Loss: 1.2441 | Val Acc: 73.73% | LR: 0.000019


Epoch 184/200: 100%|██████████| 390/390 [01:09<00:00,  5.61it/s, loss=0.963]   


Epoch 184/200 | Train Loss: 1.0426 | Val Loss: 1.3018 | Val Acc: 73.40% | LR: 0.000017


Epoch 185/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=NaN-skip]


Epoch 185/200 | Train Loss: 1.0302 | Val Loss: 1.2726 | Val Acc: 73.73% | LR: 0.000015


Epoch 186/200: 100%|██████████| 390/390 [01:08<00:00,  5.67it/s, loss=0.732]   


Epoch 186/200 | Train Loss: 1.0300 | Val Loss: 1.2137 | Val Acc: 74.09% | LR: 0.000013
  ★ New best model saved! Accuracy: 74.09%


Epoch 187/200: 100%|██████████| 390/390 [01:09<00:00,  5.61it/s, loss=1.429]   


Epoch 187/200 | Train Loss: 1.0205 | Val Loss: 1.2283 | Val Acc: 73.99% | LR: 0.000011


Epoch 188/200: 100%|██████████| 390/390 [01:10<00:00,  5.56it/s, loss=1.549]   


Epoch 188/200 | Train Loss: 1.0250 | Val Loss: 1.2617 | Val Acc: 73.59% | LR: 0.000009


Epoch 189/200: 100%|██████████| 390/390 [01:08<00:00,  5.66it/s, loss=NaN-skip]


Epoch 189/200 | Train Loss: 1.0361 | Val Loss: 1.2678 | Val Acc: 73.64% | LR: 0.000008


Epoch 190/200: 100%|██████████| 390/390 [01:08<00:00,  5.65it/s, loss=0.697]   


Epoch 190/200 | Train Loss: 1.0093 | Val Loss: 1.3414 | Val Acc: 73.63% | LR: 0.000006


Epoch 191/200: 100%|██████████| 390/390 [01:08<00:00,  5.67it/s, loss=1.045]   


Epoch 191/200 | Train Loss: 1.0189 | Val Loss: 1.2436 | Val Acc: 73.97% | LR: 0.000005


Epoch 192/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=0.674]   


Epoch 192/200 | Train Loss: 1.0175 | Val Loss: 1.2891 | Val Acc: 73.73% | LR: 0.000004


Epoch 193/200: 100%|██████████| 390/390 [01:08<00:00,  5.69it/s, loss=1.000]   


Epoch 193/200 | Train Loss: 1.0218 | Val Loss: 1.2576 | Val Acc: 73.82% | LR: 0.000003


Epoch 194/200: 100%|██████████| 390/390 [01:08<00:00,  5.65it/s, loss=1.155]   


Epoch 194/200 | Train Loss: 1.0283 | Val Loss: 1.2374 | Val Acc: 74.07% | LR: 0.000002


Epoch 195/200: 100%|██████████| 390/390 [01:09<00:00,  5.62it/s, loss=NaN-skip]


Epoch 195/200 | Train Loss: 1.0276 | Val Loss: 1.2362 | Val Acc: 73.82% | LR: 0.000002


Epoch 196/200: 100%|██████████| 390/390 [01:08<00:00,  5.67it/s, loss=0.686]   


Epoch 196/200 | Train Loss: 1.0254 | Val Loss: 1.2281 | Val Acc: 73.89% | LR: 0.000001


Epoch 197/200: 100%|██████████| 390/390 [01:08<00:00,  5.67it/s, loss=NaN-skip]


Epoch 197/200 | Train Loss: 1.0177 | Val Loss: 1.2182 | Val Acc: 74.02% | LR: 0.000001


Epoch 198/200: 100%|██████████| 390/390 [01:08<00:00,  5.68it/s, loss=1.039]   


Epoch 198/200 | Train Loss: 1.0422 | Val Loss: 1.2757 | Val Acc: 73.66% | LR: 0.000000


Epoch 199/200: 100%|██████████| 390/390 [01:11<00:00,  5.44it/s, loss=0.829]   


Epoch 199/200 | Train Loss: 1.0183 | Val Loss: 1.2129 | Val Acc: 74.00% | LR: 0.000000


Epoch 200/200: 100%|██████████| 390/390 [00:55<00:00,  6.98it/s, loss=NaN-skip]


Epoch 200/200 | Train Loss: 1.0146 | Val Loss: 1.2770 | Val Acc: 73.64% | LR: 0.000000
  Checkpoint saved: outputs\checkpoints\student_v4_standard_kd_epoch200.pth
  Cleaned up: student_v4_standard_kd_epoch200.pth

Training complete. Best accuracy: 74.09%

Student v4 (Standard KD) Final Accuracy: 74.09%
Predicted: 74.50% | Actual: 74.09%


In [9]:
# Cell 9: Results Summary (Version 4 - Grand Finale)
print("\n" + "="*80)
print("VERSION 4 RESULTS - Grand Finale (Best Teacher + Best KD)")
print("="*80)

# Evaluate all available models for comparison
criterion = nn.CrossEntropyLoss()

print("\n┌─────────────────────────────────────────────────────────────────────┐")
print("│                    MODEL COMPARISON TABLE                           │")
print("├──────────────────────────────┬──────────────┬───────────────────────┤")
print("│ Model                        │ Accuracy (%) │ Notes                 │")
print("├──────────────────────────────┼──────────────┼───────────────────────┤")

# Teacher v3
try:
    print(f"│ Teacher v3 (EfficientNet-L)  │ {teacher_accuracy:12.2f} │ Extended training     │")
except NameError:
    print(f"│ Teacher v3 (EfficientNet-L)  │        76.02 │ Extended training     │")

# Student v4 (Standard KD with strong teacher)
try:
    print(f"│ Student v4 Standard KD       │ {distilled_accuracy:12.2f} │ Best Teacher + Best KD│")
except NameError:
    print(f"│ Student v4 Standard KD       │ {'N/A':>12} │                       │")

print("├──────────────────────────────┴──────────────┴───────────────────────┤")
print("│                    ALL VERSIONS COMPARISON                          │")
print("├──────────────────────────────┬──────────────┬───────────────────────┤")
print("│ Student v1 (Baseline)        │        72.34 │ Standard KD           │")
print("│ Student v2 (Enhanced)        │        74.20 │ + AutoAug + Warmup    │")
print("│ Student v3 (DKD β=8.0)       │        72.69 │ Over-regularized      │")
print("│ Student v3.1 (DKD β=2.0)     │        73.98 │ Tuned DKD             │")
try:
    print(f"│ Student v4 (Grand Finale)    │ {distilled_accuracy:12.2f} │ ★ BEST RESULT ★       │")
except NameError:
    print(f"│ Student v4 (Grand Finale)    │ {'TBD':>12} │ ★ Target: 74.50% ★    │")
print("└──────────────────────────────┴──────────────┴───────────────────────┘")

print("\n" + "="*80)
print("VERSION 4 ANALYSIS:")
try:
    gap = distilled_accuracy - teacher_accuracy
    retention = (distilled_accuracy / teacher_accuracy) * 100
    print(f"  Teacher v3 Accuracy:       {teacher_accuracy:.2f}%")
    print(f"  Student v4 Accuracy:       {distilled_accuracy:.2f}%")
    print(f"  Gap from Teacher:          {gap:+.2f}%")
    print(f"  Teacher Retention:         {retention:.1f}%")
    
    # Compare with previous best (v2)
    v2_accuracy = 74.20
    improvement = distilled_accuracy - v2_accuracy
    print(f"\n  vs Previous Best (v2):     {improvement:+.2f}%")
    
    if distilled_accuracy > v2_accuracy:
        print(f"\n  🎉 NEW RECORD ACHIEVED! 🎉")
except NameError:
    print("  Run training cells first to see results.")
print("="*80)

print("\nVersion 4 Strategy:")
print(f"  ✓ Best Teacher: Teacher v3 (76.02%)")
print(f"  ✓ Best KD Method: Standard KD (α={KD_ALPHA}, T={TEMPERATURE})")
print(f"  ✓ Best Augmentation: AutoAugment + Mixup/CutMix")
print(f"  ✓ Hypothesis: 98% retention → 74.50% expected")

print(f"\nModel Files:")
print(f"  Teacher v3:   {MODEL_DIR / TEACHER_NAME}.pth")
print(f"  Student v4:   {MODEL_DIR / STUDENT_NAME}.pth")
print(f"\nCheckpoints: {CHECKPOINT_DIR.absolute()}")


VERSION 4 RESULTS - Grand Finale (Best Teacher + Best KD)

┌─────────────────────────────────────────────────────────────────────┐
│                    MODEL COMPARISON TABLE                           │
├──────────────────────────────┬──────────────┬───────────────────────┤
│ Model                        │ Accuracy (%) │ Notes                 │
├──────────────────────────────┼──────────────┼───────────────────────┤
│ Teacher v3 (EfficientNet-L)  │        76.02 │ Extended training     │
│ Student v4 Standard KD       │        74.09 │ Best Teacher + Best KD│
├──────────────────────────────┴──────────────┴───────────────────────┤
│                    ALL VERSIONS COMPARISON                          │
├──────────────────────────────┬──────────────┬───────────────────────┤
│ Student v1 (Baseline)        │        72.34 │ Standard KD           │
│ Student v2 (Enhanced)        │        74.20 │ + AutoAug + Warmup    │
│ Student v3 (DKD β=8.0)       │        72.69 │ Over-regularized      │
│ St